SDA practice test

5 练习

5.1 CO2 实例

CO2数据是R包内置的数据集，数据集收集了不同的材料和处理水平下植物吸收二氧化碳的速率或量，其中：

plant是一个有序因子（ordered factor），每个植物都可以通过其在该有序因子中的位置来唯一地识别。
Type代表样本的来源地。
Treatment代表否进行冷藏处理。
conc代表二氧化碳在实验室中的浓度水平。
uptake代表植物从环境中吸收二氧化碳的速率或量（单位为umol/m2/sec）。

问题1 关于因子和变量

哪些列是因子？哪些是变量？变量属于哪种类型？

因子是解释现象的原因或要素，而变量则是在研究中用来测量、观察或操作的特征或特性。

我们可以使用str()函数查看数据的类型、维度和变量的摘要。在CO2原始导入的数据中，str(CO2)显示：plant,type, treatment为factor,即因子；conc和uptake为num，即定量变量（number）。然而，conc列指的是实验中设置的CO2浓度的处理水平，而非测量结果。因此在分析中我们应该将其考虑为因子，而不是变量。所以在分析之前，首先需要更改conc列的属性。

data("CO2")
str(CO2)

## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
##  - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Ambient carbon dioxide concentration"
##   ..$ y: chr "CO2 uptake rate"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(uL/L)"
##   ..$ y: chr "(umol/m^2 s)"

问题2 数据集中列的属性转变

把CO2浓度处理水平（conc列）的属性改为factor,并打印每个浓度处理水平设置了多少个样本？

CO2$conc <- as.factor(CO2$conc)
summary(CO2$conc)

##   95  175  250  350  500  675 1000 
##   12   12   12   12   12   12   12

问题3 关于实验设计

根据数据推测该研究的实验设计,请用具体的文字进行描述。

首先理解第一列plant，它包含列Qn1,Qn2,Qn3,Qc1,Qc2,Qc3,Mn1,Mn2,Mn3,Mc1,Mc2,Mc3因子名称。根据CO2原始数据表格设计推测：名称中的Q和M分别代表Quebec和Mississippi； n和c代表nonchilled和chilled；1，2，3可能代表同基因型的生物学重复，也可能代表不同基因型，在此我将其考虑为同基因型的重复。

然后，从summary(CO2) 的输出的数据摘要中，可以推测实验设计：该实验设置有1个基因型，共繁殖了84株，分别在Quebec和Mississippi两个基地各栽种了42株。然后两个基地的材料进行了相同的处理，即：将42株分为21株非冷处理（对照）和21株冷处理（处理）；进一步的21株下各设置了7个C02浓度水平，每个处理水平设置了3个重复。

更严格的实验设计，可以在此基础上再添加至少两个基因型，即实验设计中设置至少三个基因型，每个基因型的每个处理组合有三个重复，也就是在此样本量的基础上拓展到3倍。当然，在你能完成的情况下（平衡合理的工作量和实验成本），重复越多越好，三个重复是最小设置。该实验设计可作为你们后续论文研究中的一个参考。

问题4 关于变异范围和数据摘要

summary(CO2)

##      Plant             Type         Treatment    conc        uptake     
##  Qn1    : 7   Quebec     :42   nonchilled:42   95  :12   Min.   : 7.70  
##  Qn2    : 7   Mississippi:42   chilled   :42   175 :12   1st Qu.:17.90  
##  Qn3    : 7                                    250 :12   Median :28.30  
##  Qc1    : 7                                    350 :12   Mean   :27.21  
##  Qc3    : 7                                    500 :12   3rd Qu.:37.12  
##  Qc2    : 7                                    675 :12   Max.   :45.50  
##  (Other):42                                    1000:12

变异范围指最小值到最大值，从summary(CO2) 结果中，可以轻松的观察到变异范围为7.7～45.5。也可以使用，range()函数查看，如下：

range(CO2$uptake)

## [1]  7.7 45.5

summary(CO2)

##      Plant             Type         Treatment    conc        uptake     
##  Qn1    : 7   Quebec     :42   nonchilled:42   95  :12   Min.   : 7.70  
##  Qn2    : 7   Mississippi:42   chilled   :42   175 :12   1st Qu.:17.90  
##  Qn3    : 7                                    250 :12   Median :28.30  
##  Qc1    : 7                                    350 :12   Mean   :27.21  
##  Qc3    : 7                                    500 :12   3rd Qu.:37.12  
##  Qc2    : 7                                    675 :12   Max.   :45.50  
##  (Other):42                                    1000:12

问题5 可视化探索

1)可视化全部植株的CO2吸收速率分布图

library(ggplot2)
ggplot(data = CO2) +
  geom_histogram(mapping = aes(x = uptake), binwidth = 3) +
  ggtitle("Fig.1")

2）可视化Quebec和Mississippi基地各植株的CO2吸收速率分布图对应于前面介绍的探索问题6-以定性变量进行分组，每组中的定量变量是如何分布的？

ggplot(data = CO2, mapping = aes(x = uptake, colour = Type)) +
  geom_freqpoly(binwidth = 3) +
  ggtitle("Fig.2")

3）可视化nonchilled和chilled处理下各植株的CO2吸收速率分布图

ggplot(data = CO2, mapping = aes(x = uptake, colour = Treatment)) +
  geom_freqpoly(binwidth = 3) +
  ggtitle("Fig.3")

对于比较某定性变量组间uptake的变异情况，有一种更好的方法是使用箱线图，比如：

比较Quebec和Mississippi基地各植株的CO2吸收速率

ggplot(data = CO2, mapping = aes(x = Type, y = uptake)) +
  geom_boxplot() +
  ggtitle("Fig.4")

比较两个基地中各自不同处理下的植株的CO2吸收速率

ggplot(data = CO2, mapping = aes(x = Plant, y = uptake)) +
  geom_boxplot() +
  ggtitle("Fig.5")

问题6 是否存在异常值？

检测异常值的方法有许多，如统计方法可使用标准差法、百分位数法、Z-score法等，可视化方法可使用直方图、箱线图、散点图等，模型方法可使用孤立森林（isolation forest）、离群因子（outlier factor）等。

我们前面主要介绍了可视化方法，该方法更直观和容易理解，一般直方图、箱线图、散点图等配合即可满足我们普通实验数据的异常值检测，。

在CO2数据中，从直方分布图fig.1中，我们并没有观察到异常值，而在箱线图fig.5中，可以清楚的观察到在小组内有一些离群点，需要进一步的分析原因。

#打印plant列下每组中的最小值
CO2[ave(CO2$uptake, CO2$Plant, FUN = function(x) x == min(x)) == 1, ]

##    Plant        Type  Treatment conc uptake
## 1    Qn1      Quebec nonchilled   95   16.0
## 8    Qn2      Quebec nonchilled   95   13.6
## 15   Qn3      Quebec nonchilled   95   16.2
## 22   Qc1      Quebec    chilled   95   14.2
## 29   Qc2      Quebec    chilled   95    9.3
## 36   Qc3      Quebec    chilled   95   15.1
## 43   Mn1 Mississippi nonchilled   95   10.6
## 50   Mn2 Mississippi nonchilled   95   12.0
## 57   Mn3 Mississippi nonchilled   95   11.3
## 64   Mc1 Mississippi    chilled   95   10.5
## 71   Mc2 Mississippi    chilled   95    7.7
## 78   Mc3 Mississippi    chilled   95   10.6

从结果可以看出，最小值与Type和Treatment都无关，而是由于co2浓度处理（conc=95）引起的，因此，我们在箱线图fig.5中观测到的离群点，可以解释为当环境中的CO2浓度偏低时，导致植物吸收CO2的速率明显下降。不是因为数据收集时产生的错误无需删除。

问题7 不同CO2浓度如何影响植物CO2吸收速率？

比较不同CO2浓度处理下的植株CO2吸收速率

ggplot(data = CO2, mapping = aes(x = conc, y = uptake)) +
  geom_boxplot() +
  ggtitle("Fig.7")

问题8 冷处理是否影响植物CO2吸收速率？

比较nonchilled和chilled处理下各植株的CO2吸收速率

ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
  geom_boxplot() +
  ggtitle("Fig.8")

问题9 检验多因子间的交互作用或协作关系

1）多个因子与变量关系的可视化

冷处理和CO2浓度处理在影响植物CO2吸收率时是怎样的协作关系？

ggplot(CO2, aes(x = conc, y = uptake, color = Treatment, size = Treatment)) +
  geom_point() + 
  scale_size_manual(values = c(3,3)) + 
  ggtitle("Fig.10")

种植地区和CO2浓度处理在影响植物CO2吸收率时是怎样的协作关系？

ggplot(CO2, aes(x = conc, y = uptake, color = Type, size = Type)) +
  geom_point() +
  scale_color_manual(values = c("Quebec" = "blue", "Mississippi" = "red")) +
  scale_size_manual(values = c(3,3)) + 
  ggtitle("Fig.11")

三个因子对uptake的影响**

ggplot(CO2, aes(x = conc, y = uptake, color = Type , shape = Treatment, size = Treatment)) +
  geom_point() + 
  scale_color_manual(values = c("red", "blue")) +
  scale_size_manual(values = c(3,3)) + 
  ggtitle("Fig.10")

Quebec省处于西经57°_{80°，北纬45°}63°之间，位于加拿大东南部，北濒北冰洋。魁北克省气候多样，从南到北，夏季平均气温25～5℃，冬季平均气温-10~-25℃，北部最低气温低至-60℃。

Mississippi属亚热带气候，夏季炎热，有时可超过35℃。1月平均气温6～10℃，很少下雪。

2）多元回归查看各因子和因子间的交互作用

lm2 <- lm(uptake ~ conc + Treatment, data = CO2)
summary(lm2)

## 
## Call:
## lm(formula = uptake ~ conc + Treatment, data = CO2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7536  -6.2256   0.3917   6.7060  12.6548 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        15.688      2.414   6.499 7.60e-09 ***
## conc175            10.025      3.193   3.139 0.002411 ** 
## conc250            16.617      3.193   5.204 1.61e-06 ***
## conc350            18.408      3.193   5.765 1.66e-07 ***
## conc500            18.617      3.193   5.830 1.27e-07 ***
## conc675            19.692      3.193   6.166 3.11e-08 ***
## conc1000           21.325      3.193   6.678 3.53e-09 ***
## Treatmentchilled   -6.860      1.707  -4.019 0.000137 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.822 on 76 degrees of freedom
## Multiple R-squared:  0.521,  Adjusted R-squared:  0.4768 
## F-statistic: 11.81 on 7 and 76 DF,  p-value: 4.453e-10

lm3 <- lm(uptake ~ Type + conc + Treatment, data = CO2)
summary(lm3)

## 
## Call:
## lm(formula = uptake ~ Type + conc + Treatment, data = CO2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4238 -2.1327  0.1792  2.4363  8.1012 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.0179     1.3547  16.254  < 2e-16 ***
## TypeMississippi  -12.6595     0.9031 -14.018  < 2e-16 ***
## conc175           10.0250     1.6895   5.934 8.54e-08 ***
## conc250           16.6167     1.6895   9.835 3.91e-15 ***
## conc350           18.4083     1.6895  10.895  < 2e-16 ***
## conc500           18.6167     1.6895  11.019  < 2e-16 ***
## conc675           19.6917     1.6895  11.655  < 2e-16 ***
## conc1000          21.3250     1.6895  12.622  < 2e-16 ***
## Treatmentchilled  -6.8595     0.9031  -7.596 7.00e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.139 on 75 degrees of freedom
## Multiple R-squared:  0.8677, Adjusted R-squared:  0.8536 
## F-statistic: 61.47 on 8 and 75 DF,  p-value: < 2.2e-16

lmf <- lm(uptake ~ Type + conc + Treatment + Type*conc + Type*Treatment + conc*Treatment, data = CO2)
summary(lmf)

## 
## Call:
## lm(formula = uptake ~ Type + conc + Treatment + Type * conc + 
##     Type * Treatment + conc * Treatment, data = CO2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4190 -1.7018  0.1893  2.1351  4.9893 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       13.4524     1.5699   8.569 4.10e-12 ***
## TypeMississippi                   -0.3381     1.8933  -0.179  0.85886    
## conc175                           14.8250     2.1691   6.835 4.15e-09 ***
## conc250                           24.4333     2.1691  11.264  < 2e-16 ***
## conc350                           27.4583     2.1691  12.659  < 2e-16 ***
## conc500                           27.2667     2.1691  12.570  < 2e-16 ***
## conc675                           28.4750     2.1691  13.128  < 2e-16 ***
## conc1000                          30.7083     2.1691  14.157  < 2e-16 ***
## Treatmentchilled                   1.2286     1.8933   0.649  0.51881    
## TypeMississippi:conc175           -5.9833     2.5047  -2.389  0.01996 *  
## TypeMississippi:conc250          -10.5000     2.5047  -4.192 8.93e-05 ***
## TypeMississippi:conc350          -11.2167     2.5047  -4.478 3.29e-05 ***
## TypeMississippi:conc500          -10.9000     2.5047  -4.352 5.14e-05 ***
## TypeMississippi:conc675          -11.4833     2.5047  -4.585 2.26e-05 ***
## TypeMississippi:conc1000         -13.2167     2.5047  -5.277 1.78e-06 ***
## TypeMississippi:Treatmentchilled  -6.5571     1.3388  -4.898 7.27e-06 ***
## conc175:Treatmentchilled          -3.6167     2.5047  -1.444  0.15378    
## conc250:Treatmentchilled          -5.1333     2.5047  -2.050  0.04465 *  
## conc350:Treatmentchilled          -6.8833     2.5047  -2.748  0.00784 ** 
## conc500:Treatmentchilled          -6.4000     2.5047  -2.555  0.01308 *  
## conc675:Treatmentchilled          -6.0833     2.5047  -2.429  0.01806 *  
## conc1000:Treatmentchilled         -5.5500     2.5047  -2.216  0.03038 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.068 on 62 degrees of freedom
## Multiple R-squared:  0.9399, Adjusted R-squared:  0.9195 
## F-statistic: 46.17 on 21 and 62 DF,  p-value: < 2.2e-16

lmf <- lm(uptake ~ conc*Type*Treatment, data = CO2)
summary(lmf)

## 
## Call:
## lm(formula = uptake ~ conc * Type * Treatment, data = CO2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.567 -2.025  0.550  1.875  3.933 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                15.26667    1.67520   9.113 1.18e-12
## conc175                                    14.76667    2.36910   6.233 6.33e-08
## conc250                                    22.13333    2.36910   9.343 5.07e-13
## conc350                                    25.10000    2.36910  10.595 5.40e-15
## conc500                                    24.33333    2.36910  10.271 1.72e-14
## conc675                                    26.23333    2.36910  11.073 1.00e-15
## conc1000                                   27.90000    2.36910  11.777  < 2e-16
## TypeMississippi                            -3.96667    2.36910  -1.674   0.0996
## Treatmentchilled                           -2.40000    2.36910  -1.013   0.3154
## conc175:TypeMississippi                    -5.86667    3.35041  -1.751   0.0854
## conc250:TypeMississippi                    -5.90000    3.35041  -1.761   0.0837
## conc350:TypeMississippi                    -6.50000    3.35041  -1.940   0.0574
## conc500:TypeMississippi                    -5.03333    3.35041  -1.502   0.1386
## conc675:TypeMississippi                    -7.00000    3.35041  -2.089   0.0412
## conc1000:TypeMississippi                   -7.60000    3.35041  -2.268   0.0272
## conc175:Treatmentchilled                   -3.50000    3.35041  -1.045   0.3007
## conc250:Treatmentchilled                   -0.53333    3.35041  -0.159   0.8741
## conc350:Treatmentchilled                   -2.16667    3.35041  -0.647   0.5205
## conc500:Treatmentchilled                   -0.53333    3.35041  -0.159   0.8741
## conc675:Treatmentchilled                   -1.60000    3.35041  -0.478   0.6348
## conc1000:Treatmentchilled                   0.06667    3.35041   0.020   0.9842
## TypeMississippi:Treatmentchilled            0.70000    3.35041   0.209   0.8353
## conc175:TypeMississippi:Treatmentchilled   -0.23333    4.73819  -0.049   0.9609
## conc250:TypeMississippi:Treatmentchilled   -9.20000    4.73819  -1.942   0.0572
## conc350:TypeMississippi:Treatmentchilled   -9.43333    4.73819  -1.991   0.0514
## conc500:TypeMississippi:Treatmentchilled  -11.73333    4.73819  -2.476   0.0163
## conc675:TypeMississippi:Treatmentchilled   -8.96667    4.73819  -1.892   0.0636
## conc1000:TypeMississippi:Treatmentchilled -11.23333    4.73819  -2.371   0.0212
##                                              
## (Intercept)                               ***
## conc175                                   ***
## conc250                                   ***
## conc350                                   ***
## conc500                                   ***
## conc675                                   ***
## conc1000                                  ***
## TypeMississippi                           .  
## Treatmentchilled                             
## conc175:TypeMississippi                   .  
## conc250:TypeMississippi                   .  
## conc350:TypeMississippi                   .  
## conc500:TypeMississippi                      
## conc675:TypeMississippi                   *  
## conc1000:TypeMississippi                  *  
## conc175:Treatmentchilled                     
## conc250:Treatmentchilled                     
## conc350:Treatmentchilled                     
## conc500:Treatmentchilled                     
## conc675:Treatmentchilled                     
## conc1000:Treatmentchilled                    
## TypeMississippi:Treatmentchilled             
## conc175:TypeMississippi:Treatmentchilled     
## conc250:TypeMississippi:Treatmentchilled  .  
## conc350:TypeMississippi:Treatmentchilled  .  
## conc500:TypeMississippi:Treatmentchilled  *  
## conc675:TypeMississippi:Treatmentchilled  .  
## conc1000:TypeMississippi:Treatmentchilled *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.902 on 56 degrees of freedom
## Multiple R-squared:  0.9514, Adjusted R-squared:  0.928 
## F-statistic: 40.63 on 27 and 56 DF,  p-value: < 2.2e-16

残差（residuals）是指模型预测值与实际观测值之间的差异或偏差，即回归模型所估计的值与实际观测值之间的差异。

在提供的输出中，对残差进行了一些描述统计，包括最小值（Min）、第一四分位数（1Q）、中位数（Median）、第三四分位数（3Q）和最大值（Max）。这些统计量可以帮助我们了解残差的分布情况。

通常来说，一个好的回归模型应该使残差尽可能地小。如果残差的平均值接近于零，说明模型的拟合效果较好；如果残差的方差相对较小，说明模型的预测精度较高。因此，较小的残差表示模型能够更好地解释数据，而较大的残差可能表明模型存在一些问题需要进一步改进。

residuals <- resid(lmf)
print(residuals)

##             1             2             3             4             5 
##  7.333333e-01  3.666667e-01 -2.600000e+00 -3.166667e+00 -4.300000e+00 
##             6             7             8             9            10 
## -2.300000e+00 -3.466667e+00 -1.666667e+00 -2.733333e+00 -3.000000e-01 
##            11            12            13            14            15 
##  1.433333e+00  1.000000e+00 -1.000000e-01  1.133333e+00  9.333333e-01 
##            16            17            18            19            20 
##  2.366667e+00  2.900000e+00  1.733333e+00  3.300000e+00  2.400000e+00 
##            21            22            23            24            25 
##  2.333333e+00  1.333333e+00 -3.333333e-02 -4.166667e+00 -1.200000e+00 
##            26            27            28            29            30 
## -4.166667e+00 -2.100000e+00 -2.133333e+00 -3.566667e+00  3.166667e+00 
##            31            32            33            34            35 
##  5.333333e-01  3.000000e+00  1.933333e+00  7.137306e-16  1.566667e+00 
##            36            37            38            39            40 
##  2.233333e+00 -3.133333e+00  3.633333e+00 -1.800000e+00  2.233333e+00 
##            41            42            43            44            45 
##  2.100000e+00  5.666667e-01 -7.000000e-01 -1.000000e+00 -1.333333e+00 
##            46            47            48            49            50 
##  1.000000e-01  3.000000e-01  1.866667e+00  3.900000e+00  7.000000e-01 
##            51            52            53            54            55 
##  1.800000e+00  3.066667e+00  1.900000e+00  1.800000e+00  5.666667e-01 
##            56            57            58            59            60 
## -1.000000e-01 -2.616938e-15 -8.000000e-01 -1.733333e+00 -2.000000e+00 
##            61            62            63            64            65 
## -2.100000e+00 -2.433333e+00 -3.800000e+00  9.000000e-01  1.333333e-01 
##            66            67            68            69            70 
##  2.000000e+00  2.300000e+00  2.866667e+00  3.933333e+00  3.166667e+00 
##            71            72            73            74            75 
## -1.900000e+00 -3.366667e+00 -3.800000e+00 -3.600000e+00 -4.133333e+00 
##            76            77            78            79            80 
## -4.566667e+00 -4.333333e+00  1.000000e+00  3.233333e+00  1.800000e+00 
##            81            82            83            84 
##  1.300000e+00  1.266667e+00  6.333333e-01  1.166667e+00

hist(residuals, col="skyblue",main="Histogram of Residuals", xlab="Residual Values")

残差标准误差（Residual standard error）是指在线性回归模型中，残差的标准偏差。它代表了观测值与回归线之间的离散度，即模型预测值和实际观测值之间的平均偏差。

Residual standard error是评估回归模型的一种重要指标，它可以用来比较不同模型的拟合效果。当比较两个或多个不同的回归模型时，可以用其残差标准误差来决定哪个模型的拟合效果更好。

需要注意的是，Residual standard error只是一个基于样本数据的估计值，不能完全代表总体的残差标准误差。因此，在解释Residual standard error时，需要结合样本量、自变量的数量、模型的特征等因素进行综合考虑。 #### 自由发挥自由发挥你们自己感兴趣的问题并解答。思考实验设计中的重复在数据分析阶段该如何利用和处理？