[R语言]Talk10 练习与作业

文章目录

练习与作业1：数据查看
练习与作业2：作图
练习与作业3：线性模型与预测

练习与作业1：数据查看

正态分布

随机生成一个数字（numberic）组成的vector，长度为10万，其值符合正态分布；
用 ggplot2 的 density plot 画出其分布情况；
检查 mean +- 1 * sd，mean +- 2 * sd和mean +- 3 * sd 范围内的取值占总值数量的百分比。

```{r}
## 代码写这里，并运行；
require(tidyverse)
x <- rnorm(100000,mean=0,sd=2);

ggplot( data.frame( data = x ), aes( data ) ) + geom_density( );
i<-1
num1<-0
for (i in x) {
  if(i<=(1+2)&i>=(1-2)){num1=num1+1 }
}
(num1/100000)
num2<-0
for (i2 in x) {
  if(i2<=(1+4)&i2>=(1-4)){num2=num2+1 }
}
(num2/100000)
num3<-0
for (i3 in x) {
  if(i3<=(1+6)&i3>=(1-6)){num3=num3+1 }
}
(num3/100000)
```

用函数生成符合以下分布的数值，并做图：

另外，在英文名后给出对应的中文名：

-   Uniform Distribution 均匀分布

-   Normal Distribution 正态分布

-   Binomial Distribution 二项分布

-   Poisson Distribution 泊松分布

-   Exponential Distribution 指数分布

-   Gamma Distribution 伽玛分布

```{r}
## 代码写这里，并运行；
unidata <- runif( 10000 );
ggplot(data.frame(dat=unidata),aes(unidata))+geom_histogram()
nordata <- rnorm(10000)
ggplot(data.frame(data=nordata),aes(nordata))+geom_density()
binodata <- rbinom(1000,5,0.5)
ggplot(data.frame(dat=binodata),aes(binodata))+geom_histogram()
poisdata <- rpois(1000,50)
ggplot(data.frame(data=poisdata),aes(poisdata))+geom_density()
expdata <- rexp(10000,rate=1)
ggplot(data.frame(data=expdata),aes(expdata))+geom_density()
gammadata <- rgamma(10,1,rate=2)
ggplot(data.frame(data=gammadata),aes(gammadata))+geom_density()
```

分组的问题
- 什么是equal-sized bin和equal-distance bin？以mtcars为例，将wt列按两种方法分组，并显示结果。

```{r}
## 代码写这里，并运行；
mtcars2 <- mtcars %>%
mutate( equal_size_bin = ntile( wt, 4 ), 
equal_distance_bin = cut( wt,
breaks = seq( from = min(wt), to = max(wt),
by = (max(wt) - min(wt)) / 4 ),
include.lowest = T ) 
) 
table( mtcars2$equal_size_bin );
table(mtcars2$equal_distance_bin)
```

boxplot中outlier值的鉴定
- 以 swiss$Infant.Mortality 为例，找到它的 outlier 并打印出来；

```{r}
## 代码写这里，并运行；
s <- summary( swiss$Infant.Mortality );
irq <- IQR(swiss$Infant.Mortality);
ggplot( swiss, aes( y = Infant.Mortality ) ) + geom_boxplot() + coord_flip()+
geom_hline( yintercept = s, colour = sample( colors(), length(s) ) )+
  geom_hline(
    yintercept = c( s["1st Qu."] - 1.5 * irq, s["3rd Qu."] + 1.5 * irq ),
colour = "red", size = 2, linetype = 2);
s[1]
```

以男女生步数数据为例，进行以下计算：

首先用以下代码装入Data:

```{r}
source("../data/talk10/input_data1.R"); ## 装入 Data data.frame ... 
head(Data);
```

分别用t.test和wilcox.test比较男女生步数是否有显著差异；打印出p.value

```{r}
## 代码写这里，并运行；
res1<-with( Data, t.test( Steps ~ Sex ) )
res1$p.value
res2<-with(Data,wilcox.test( Steps ~ Sex ))
res2$p.value
```

两种检测方法的p.value哪个更显著？为什么？

答：t.test 的 p.value更显著:t检验是参数方法，需要资料满足正态性和方差齐性的假设，而Wilcoxon检验是非参数方法。

以下是学生参加辅导班前后的成绩情况，请计算同学们的成绩是否有普遍提高？

注：先用以下代码装入数据：

```{r}
source("../data/talk10/input_data2.R");
head(scores);
```

注：计算时请使用 paired = T 参数；

```{r}
## 代码写这里，并运行；
scores.wide <- scores %>% spread( Time, Score );
res3 <- with( scores.wide, t.test( After, Before, paired = T ) )
res3$p.value
```

练习与作业2：作图

利用talk10中的data.fig3a作图
- 首先用以下命令装入数据：

```{r}
library(tidyverse);
data.fig3a <- read_csv( file = "../data/talk10/nc2015_data_for_fig3a.csv" );
```

-   利用两列数据：`tai` `zAA1.at` 做`talk10`中的`boxplot`（详见：`fig3a`的制作）；

-   用`ggsignif`为相邻的两组做统计分析（如用 `wilcox.test` 函数），并画出`p.value`；

```{r}
## 代码写这里，并运行；
library(ggsignif);
fig3a <-
ggplot( data.fig3a, aes( factor(tai), zAA1.at ) ) +
geom_boxplot( fill = "#22AD5C", linetype = 1 ,outlier.size = 1, width = 0.6) +
xlab( "tAI group" ) +
ylab( expression( paste( italic(S[RNA]) ) ) ) +
scale_x_discrete(breaks= 1:5 , labels= paste("Q", 1:5, sep = "") ) +
geom_hline( yintercept = 0, colour = "red", linetype = 2);
fig3a + geom_signif( comparisons = list(1:2, 2:3, 3:4, 4:5), test = wilcox.test,
step_increase = 0.1 )
```

问：这组数据可以用t.test吗？为什么？

答：不能。t.test适合用于检验正态分布的数据。

用系统自带变量mtcars做图
- 用散点图表示 wt（x-轴）与 mpg（y-轴）的关系
- 添加线性回归直线图层
- 计算wt）与 mpg的相关性，并将结果以公式添加到图上。其最终效果如下图所示（注：相关代码可在talk09中找到）：

```{r}
## 代码写这里，并运行；
ggplot( swiss, aes( x = Education, y = Fertility ) ) +
geom_point( shape = 20 );
with( swiss, cor.test( Education, Fertility )$estimate );
m = lm(Fertility ~ Education, swiss);
c = cor.test( swiss$Fertility, swiss$Education );
eq <- substitute(  atop(
  paste( italic(y), " = ", a + b %.% italic(x), sep = ""),
  paste( italic(r)^2, " = ", r2, ", ", italic(p)==pvalue, sep = "" )) ,
list(a = as.vector( format(coef(m)[1], digits = 2) ),
     b = as.vector( format(coef(m)[2], digits = 2) ),
     r2 = as.vector( format(summary(m)$r.squared, digits = 2) ),
     pvalue = as.vector( format( c$p.value , digits = 2))
     ))
eq <- as.character(as.expression(eq));
ggplot(swiss, aes( x = Education, y = Fertility ) ) +
geom_point( shape = 20 ) +
geom_smooth( se = T ) +
geom_text( data = NULL,
aes( x = 30, y = 80, label= eq, hjust = 0, vjust = 1), 
size = 4, parse = TRUE, inherit.aes=FALSE); 
```

练习与作业3：线性模型与预测

使用以下代码产生数据进行分析

```{r}
wts2 <- bind_rows( 
   tibble( class = 1, age = sample( 13:15, 20, replace = T ), 
           wt = sample( seq(50, 60, by = 0.1), 20 ) ),
   tibble( class = 2, age = sample( 14:16, 20, replace = T ),
           wt = sample( seq(55, 65, by = 0.1), 20 ) ),
   tibble( class = 3, age = sample( 15:17, 20, replace = T ),
           wt = sample( seq(60, 70, by = 0.1), 20 ) )
)

ggplot(wts2, aes( factor( age ), wt ) ) + geom_boxplot() + coord_flip();
```

-   用线性回归检查`age`, `class` 与 `wt` 的关系，构建线性回归模型；

-   以`age`, `class`为输入，用得到的模型预测`wt`；

-   计算预测的`wt`和实际`wt`的相关性；

-   用线性公式显示如何用`age`, `class`计算`wt`的值。

```{r}
## 代码写这里，并运行；
require(dplyr)
model2 <- lm( wt ~ class + age, data = wts2)
newdata <- wts2 %>% dplyr::select( class, age );
(wt.predicted <- predict( model2, newdata ))
dat <- data.frame( reference = wts2$wt, prediction = wt.predicted );
with( dat, cor.test( prediction, reference ) )$estimate

paras <- coef( model2 )
eq <- substitute(  
  paste( italic(wt), " = ", c + a %.% italic(class)+b %.% italic(age), sep = ""),
  list(
    c=as.vector(format(paras[1])),
    a=as.vector(format(paras["class"])),
    b=as.vector(format(paras["age"]))
  )
  )
eq <- as.character(as.expression(eq))
ggplot( dat , aes( x = reference, y = prediction ) ) + geom_point() +
geom_smooth( method = "lm", se = F )+geom_text(data=NULL,aes(x=30,y=60,label=eq,
                    hjust=0,vjust=1),size=4,parse=T,inherit.aes = F)
```