练习与作业1:linear regression
一元回归分析
用 readr
包的函数将 Excercises and homework/data/talk11/
目录下的 income.data_.zip
文件装入到 income.dat
变量中,进行以下分析:
- 用线性回归分析
income
与happiness
的关系; - 用点线图画出
income
与happiness
的关系,将推导出来的公式写在图上; - 用得到的线性模型,以
income
为输入,预测happiness
的值; - 用点线图画出预测值与真实
happiness
的关系,并在图上写出 R2 值。
```{r}
## 代码写这里,并运行;
require(tidyverse)
income.dat <- read_csv(
file="../Exercises and homework/data/talk11/income.data_.zip",col_names = T)
model <- lm(happiness~income,data = income.dat)
paras <- coef(model)
eq <- substitute(
paste(italic(happiness),"=",b + a %.% italic(income)),
list(
b=as.vector(format(paras[1])),
a=as.vector(format(paras["income"]))
)
)
eq <- as.character(as.expression(eq))
ggplot(income.dat,aes(x=income,y=happiness))+
geom_point() +
geom_smooth(method = "lm")+
geom_text(data=NULL,aes(x=2,y=7,label=eq,hjust=0,vjust=1),
size=4,parse=T,inherit.aes = F)
newdata <- income.dat %>% dplyr::select(income)
happiness.predicted <- predict(model,newdata)
dat <- data.frame(reference=income.dat$happiness,prediction=happiness.predicted)
eq2 <- substitute(
paste(italic(r)^2,"=",c),
list(
c=as.vector(format(with(dat,cor.test(prediction,reference))$estimate ^2))
)
)
eq2 <- as.character(as.expression(eq2))
ggplot(dat,aes(x=reference,y=prediction))+geom_point()+
geom_smooth(method = "lm") +
geom_text(data=NULL,aes(x=2,y=6,label=eq2,hjust=0,vjust=1),
size=4,parse=T,inherit.aes = F)
```
多元回归分析
用 readr
包的函数将 Excercises and homework/data/talk11/
目录下的 heart.data_.zip
文件装入到 heart.dat
变量中,进行以下分析:
- 用线性回归分析
heart.disease
与biking
和smoking
的关系; - 写出三者间关系的线性公式;
- 解释
biking
和smoking
的影响(方向和程度); biking
和smoking
能解释多少heart.disease
的variance?这个值从哪里获得?- 用
relaimpo
包的函数计算biking
和smoking
对heart.disease
的重要性。哪个更重要? - 用得到的线性模型预测
heart.disease
,用点线图画出预测值与真实值的关系,并在图上写出 R2 值。 - 在建模时考虑
biking
和smoking
的互作关系,会提高模型的 R2 值吗?如果是,意味着什么?如果不是,又意味着什么?
```{r}
## 代码写这里,并运行;
require(relaimpo)
heart.dat <- read_csv(
file="../Exercises and homework/data/talk11/heart.data_.zip",col_names = T)
model2<-lm(heart.disease~biking+smoking,data=heart.dat)
parase2 <- coef(model2)
eq3 <- paste("heart.disease","=",parase2[1],"+",parase2["biking"],"*",
"biking")
eq3<-substitute(
paste(italic(heart.disease),"=",c + a %.% italic(biking)+b%.%italic(smoking)),
list(
c=as.vector(format(parase2[1])),
b=as.vector(format(parase2["smoking"])),
a=as.vector(format(parase2["biking"]))
)
)
eq3<-as.character(as.expression(eq3))
ggplot(data = NULL)+geom_text(data=NULL,aes(x=0,y=0,label=eq3,hjust=0,vjust=0),
size=1.9,parse=T,inherit.aes = F)
```
3.答:由线性公式可知,biking对heart.disease有负影响(起抑制作用),smoking则对
heart.disease起正影响(起促进作用),其中,biking对heart.disease的抑制作用大于
smoking对heart.disease的促进作用。
```{r}
anova(model2)
```
4.答:结果如上所示。variance table可通过anova函数获得。
```{r}
calc.relimp(heart.disease~biking+smoking,data=heart.dat)
```
5.答:由上可知,biking更重要。
```{r}
newdata2 <- heart.dat%>%dplyr::select(biking,smoking)
heart.predicted<-predict (model2,newdata2)
data2 <- data.frame(reference=heart.dat$heart.disease,prediction=heart.predicted)
eq3 <- substitute(
paste(italic(r)^2,"=",c),
list(
c=as.vector(format(with(data2,cor.test(prediction,reference))$estimate ^2))
)
)
eq3 <- as.character(as.expression(eq3))
ggplot(data2,aes(x=reference,y=prediction))+geom_point()+geom_smooth(method = "lm")+geom_text(data=NULL,aes(x=2,y=15,label=eq3,hjust=0,vjust=1),
size=4,parse=T,inherit.aes = F)
```
```{r}
model3 <- lm(heart.disease~biking*smoking,data=heart.dat)
data.frame(
no_interaction=summary(model2)$r.squared,
with_interaction=summary(model3)$r.squared
)
```
7.答:可见考虑互作关系时,R2值略微提高,意味着同时biking和smoking可能会提
高heart.disease的可能性。
glm
相关问题
用 glm
建模时使用family=binomial
;在预测时, type=
参数可取值 link
(默认)和 response
。请问,两者的区别是什么?请写代码举例说明。
```{r}
## 代码写这里,并运行;
dat <- iris %>% filter( Species %in% c("setosa", "virginica") );
bm <- glm( Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = dat, family = binomial );
```
type="link"时给出线性预测的概率
```{r}
data.frame( predicted = bm %>% predict( dat, type = "link" ),
original = dat$Species ) %>% sample_n(6) %>% arrange( original );
```
type =" response" 给出具体的预测概率
```{r}
data.frame( predicted = bm %>% predict( dat, type = "response" ),
original = dat$Species ) %>% sample_n(6) %>% arrange( original );
```
练习与作业2:non-linear regression
分析 swiss
,用其它列的数据预测Fertility
- 使用
earth
包建模,并做 10 times 10-fold cross validation; - 使用
lm
方法建模,同样做 10 times 10-fold cross validation; - 用
RMSE
和R2
两个指标比较两种方法,挑选出较好一个; - 用
vip
包的函数查看两种方法中 feature 的重要性,并画图(如下图所示):
```{r}
## 代码写这里,并运行;
require(earth)
require(vip)
require(caret)
cv_earth <- train(
Fertility ~ .,
data = swiss,
method = "earth",
trControl = trainControl(method = "repeatedcv", number = 10,repeats = 10)
);
cv_lm <- train(
Fertility ~ .,
data = swiss,
method = "lm",
trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10)
)
cv_earth$results$RMSE
cv_lm$results$RMSE
cv_earth$results$Rsquared
cv_lm$results$Rsquared
```
故“lm”方法更好
```{r}
p1 <- vip(cv_lm, num_features = 40, geom = "point", value = "gcv") +
ggtitle("LM:GCV")
p2 <- vip(cv_earth, num_features = 40, geom = "point", value = "gcv") +
ggtitle("MARS:GCV")
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
COMMENTS | NOTHING