[R语言]Talk11 练习与作业

发表于 2022-01-06  703 次阅读


文章目录

练习与作业1:linear regression


一元回归分析

readr 包的函数将 Excercises and homework/data/talk11/ 目录下的 income.data_.zip 文件装入到 income.dat 变量中,进行以下分析:

  1. 用线性回归分析 incomehappiness 的关系;
  2. 用点线图画出incomehappiness 的关系,将推导出来的公式写在图上;
  3. 用得到的线性模型,以income为输入,预测happiness的值;
  4. 用点线图画出预测值与真实happiness的关系,并在图上写出 R2 值。
```{r}
## 代码写这里,并运行;
require(tidyverse)
income.dat <- read_csv(
  file="../Exercises and homework/data/talk11/income.data_.zip",col_names = T)
model <- lm(happiness~income,data = income.dat)
paras <- coef(model)
eq <- substitute(
  paste(italic(happiness),"=",b + a %.% italic(income)),
  list(
    b=as.vector(format(paras[1])),
    a=as.vector(format(paras["income"]))
  )
)
eq <- as.character(as.expression(eq))
ggplot(income.dat,aes(x=income,y=happiness))+
  geom_point() +
  geom_smooth(method = "lm")+
  geom_text(data=NULL,aes(x=2,y=7,label=eq,hjust=0,vjust=1),
            size=4,parse=T,inherit.aes = F)
newdata <- income.dat %>% dplyr::select(income)
happiness.predicted <- predict(model,newdata)
dat <- data.frame(reference=income.dat$happiness,prediction=happiness.predicted)
eq2 <- substitute(
  paste(italic(r)^2,"=",c),
  list(
    c=as.vector(format(with(dat,cor.test(prediction,reference))$estimate ^2))
  )
)
eq2 <- as.character(as.expression(eq2))
ggplot(dat,aes(x=reference,y=prediction))+geom_point()+
  geom_smooth(method = "lm") +
  geom_text(data=NULL,aes(x=2,y=6,label=eq2,hjust=0,vjust=1),
            size=4,parse=T,inherit.aes = F)
```

多元回归分析
readr 包的函数将 Excercises and homework/data/talk11/ 目录下的 heart.data_.zip 文件装入到 heart.dat 变量中,进行以下分析:

  1. 用线性回归分析 heart.diseasebikingsmoking 的关系;
  2. 写出三者间关系的线性公式;
  3. 解释bikingsmoking的影响(方向和程度);
  4. bikingsmoking能解释多少heart.disease的variance?这个值从哪里获得?
  5. relaimpo包的函数计算bikingsmokingheart.disease的重要性。哪个更重要?
  6. 用得到的线性模型预测heart.disease,用点线图画出预测值与真实值的关系,并在图上写出 R2 值。
  7. 在建模时考虑 bikingsmoking的互作关系,会提高模型的 R2 值吗?如果是,意味着什么?如果不是,又意味着什么?
```{r}
## 代码写这里,并运行;
require(relaimpo)
heart.dat <- read_csv(
  file="../Exercises and homework/data/talk11/heart.data_.zip",col_names = T)
model2<-lm(heart.disease~biking+smoking,data=heart.dat)
parase2 <- coef(model2)
eq3 <- paste("heart.disease","=",parase2[1],"+",parase2["biking"],"*",
             "biking")
eq3<-substitute(
  paste(italic(heart.disease),"=",c + a %.% italic(biking)+b%.%italic(smoking)),
  list(
    c=as.vector(format(parase2[1])),
    b=as.vector(format(parase2["smoking"])),
    a=as.vector(format(parase2["biking"]))
  )
)
eq3<-as.character(as.expression(eq3))
ggplot(data = NULL)+geom_text(data=NULL,aes(x=0,y=0,label=eq3,hjust=0,vjust=0),
            size=1.9,parse=T,inherit.aes = F)
```

3.答:由线性公式可知,biking对heart.disease有负影响(起抑制作用),smoking则对
heart.disease起正影响(起促进作用),其中,biking对heart.disease的抑制作用大于
smoking对heart.disease的促进作用。

```{r}
anova(model2)
```

4.答:结果如上所示。variance table可通过anova函数获得。

```{r}
calc.relimp(heart.disease~biking+smoking,data=heart.dat)
```

5.答:由上可知,biking更重要。

```{r}
newdata2 <- heart.dat%>%dplyr::select(biking,smoking)
heart.predicted<-predict (model2,newdata2)
data2 <- data.frame(reference=heart.dat$heart.disease,prediction=heart.predicted)
eq3 <- substitute(
  paste(italic(r)^2,"=",c),
  list(
    c=as.vector(format(with(data2,cor.test(prediction,reference))$estimate ^2))
  )
)
eq3 <- as.character(as.expression(eq3))
ggplot(data2,aes(x=reference,y=prediction))+geom_point()+geom_smooth(method = "lm")+geom_text(data=NULL,aes(x=2,y=15,label=eq3,hjust=0,vjust=1),
            size=4,parse=T,inherit.aes = F)
```
```{r}
model3 <- lm(heart.disease~biking*smoking,data=heart.dat)
data.frame(
  no_interaction=summary(model2)$r.squared,
  with_interaction=summary(model3)$r.squared
)
```

7.答:可见考虑互作关系时,R2值略微提高,意味着同时biking和smoking可能会提
高heart.disease的可能性。


glm 相关问题

glm 建模时使用family=binomial;在预测时, type=参数可取值 link(默认)和 response。请问,两者的区别是什么?请写代码举例说明。

```{r}
## 代码写这里,并运行;
dat <- iris %>% filter( Species %in% c("setosa", "virginica") );
bm <- glm( Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = dat, family = binomial );
```

type="link"时给出线性预测的概率

```{r}
data.frame( predicted = bm %>% predict( dat, type = "link" ),
original = dat$Species ) %>% sample_n(6) %>% arrange( original );
```

type =" response" 给出具体的预测概率

```{r}
data.frame( predicted = bm %>% predict( dat, type = "response" ),
original = dat$Species ) %>% sample_n(6) %>% arrange( original );
```

练习与作业2:non-linear regression


分析 swiss ,用其它列的数据预测Fertility

  1. 使用earth包建模,并做 10 times 10-fold cross validation;
  2. 使用lm方法建模,同样做 10 times 10-fold cross validation;
  3. RMSER2 两个指标比较两种方法,挑选出较好一个;
  4. vip 包的函数查看两种方法中 feature 的重要性,并画图(如下图所示):
```{r}
## 代码写这里,并运行;
require(earth)
require(vip)
require(caret)
cv_earth <- train(
Fertility ~ .,
data = swiss,
method = "earth",
trControl = trainControl(method = "repeatedcv", number = 10,repeats = 10)
);
cv_lm <- train(
Fertility ~ .,
data = swiss,
method = "lm",
trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10)
)
cv_earth$results$RMSE
cv_lm$results$RMSE
cv_earth$results$Rsquared
cv_lm$results$Rsquared
```

故“lm”方法更好

```{r}
p1 <- vip(cv_lm, num_features = 40, geom = "point", value = "gcv") + 
  ggtitle("LM:GCV")
p2 <- vip(cv_earth, num_features = 40, geom = "point", value = "gcv") + 
  ggtitle("MARS:GCV")
gridExtra::grid.arrange(p1, p2, ncol = 2)
```

本站文章基于国际协议BY-NC-SA 4.0协议共享;
如未特殊说明,本站文章皆为原创文章,请规范转载。

0