为什么数据独立性那么重要?

独立性是指像随机抛硬币事件那样,前后两次事件结果之间互不干扰的性质.
因此,如果重复抛硬币10000次,出现概率较大的情况是,50%左右为正面,50%左右为反面.
但是如果数据之间不独立,比如同一次实验结果被连续重复的记录了2000次,这种重复记录的行为,势必会影响实验的结果,造成较大误差. 再比如,如果仅做了10次实验,每次实验结果却被重复记录了1000次数据,这样所得到的10000个数据的自由度实际上也只有10-1=9,而不是10000-1=9999,因此基于自由度为9999的各种检验将会毫无意义.

R语言: 多层次嵌套结构的数据

某省,抽取17个城市,99所学校,10000名学生,进行体能调查.
cit.fact为城市名称,sch.fact为学校名称,sex.fact为性别.
city.scores为城市综合实力,school.scores为学校综合实力,sex.boy是否为男生.
x0为入学时学生的体能测试成绩.
y0为毕业时学生的体能测试成绩.

set.seed(1); n_student=10000; n_school=99; n_cities=13
###
school.scores <- sample(x=rnorm(n=n_school, mean=0, sd = 20), size=n_student, replace=T)
sch.fact <- factor(school.scores);
levels(sch.fact) <- paste('sch', sample(1:n_school), sep='')
#
stud.sex.prob <- factor(sch.fact)
levels(stud.sex.prob) <- runif(n=n_school, min=0.4, max=1) 
stud.sex.prob <- as.numeric(as.character(stud.sex.prob))
FUN <- function(i){sample(x = c('boy', 'girl'), size = 1, replace = T, 
                  prob = c(stud.sex.prob[i], 1-stud.sex.prob[i]))}
sex.fact  <- sapply(1:n_student, FUN )
sex.boy <- as.numeric(sex.fact =='boy')
#
cit.fact <- factor(sch.fact)
levels(cit.fact) <- sample(rnorm(n=n_cities, mean=0, sd=20), 
                           size=n_school, replace=T,
                           prob =runif(n=n_cities, min=0.3, max=1));
city.scores <- as.numeric(as.character(cit.fact));
levels(cit.fact) <-  paste('cit',sample(1:n_cities),sep='')
###
school.scores <- scale(school.scores)
city.scores   <- scale(city.scores) 
sex.boy       <- scale(sex.boy)
sigma <- rnorm(n=n_student, mean = 0, sd =3)
x1 <- sex.fact; x2 <- sch.fact; x3 <- cit.fact
x0 <- origPhysicalExam <- rnorm(n=n_student, mean=0)
#
y0 <- 51+ 0.5 * sex.boy + (20 * city.scores) + (7 * school.scores) + 
      (3+ 5 * city.scores + 7 * school.scores) * x0 + sigma

最理想的数据来自于独立同分布

最理想的取样数据为,每个数据都来自同一随机分布总体且彼此间相互独立.

y0 <- 53 +3 * x0 + sigma
plot(x0, y0)
lm0 <- lm(y0~x0)
abline(lm0, col=2,lwd=3)
abline(v=0,col=3)

summary(lm0)
## 
## Call:
## lm(formula = y0 ~ x0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5567  -2.0121   0.0194   2.0465   9.9956 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.00050    0.02991  1772.1   <2e-16 ***
## x0           2.98534    0.02941   101.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.991 on 9998 degrees of freedom
## Multiple R-squared:  0.5076, Adjusted R-squared:  0.5076 
## F-statistic: 1.031e+04 on 1 and 9998 DF,  p-value: < 2.2e-16

数据的非独立性造成了结果的偏差

如果数据不能满足独立性的假设,将会对结果造成偏差.
比如,y1[1:3000]过分依赖y0[1],而y1[4001:7000]过分依赖y0[2].
最终导致了回归模型发生较大偏差.

y1 <- y0; 
y1[1:3000]    <- rnorm(n = 3000, mean = y0[1]); 
y1[4001:7000] <- rnorm(n = 3000, mean = y0[2])
lm1 <- lm(y1~x0)
plot(x0,y1)
abline(lm0, col=2,lwd=3)
abline(lm1, col=4,lwd=3)
abline(v=0, col=3)

summary(lm1)
## 
## Call:
## lm(formula = y1 ~ x0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4603  -3.5851   0.8565   3.8711  10.7410 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.75282    0.04811 1200.49   <2e-16 ***
## x0           1.07110    0.04730   22.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.811 on 9998 degrees of freedom
## Multiple R-squared:  0.04879,    Adjusted R-squared:  0.0487 
## F-statistic: 512.9 on 1 and 9998 DF,  p-value: < 2.2e-16

剔除非独立性的数据导致自由度降低

如果剔除掉不独立的数据,则回归模型回到理论水平.
但付出的代价是自由度由9998降低为3998.

y2 <- y1[-c(1:3000,4001:7000)]
x2 <- x0[-c(1:3000,4001:7000)]
lm2 <- lm(y2~x2)
plot(x2,y2)
abline(lm0, col=2,lwd=5)
abline(lm2, col=3,lwd=2)

summary(lm2)
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4489 -2.0082  0.0081  1.9965  9.9994 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.99657    0.04657 1138.05   <2e-16 ***
## x2           2.98523    0.04581   65.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.943 on 3998 degrees of freedom
## Multiple R-squared:  0.5151, Adjusted R-squared:  0.515 
## F-statistic:  4247 on 1 and 3998 DF,  p-value: < 2.2e-16

自由度降低可导致“弃真型”错误

如果自由度降低到数值过小,会导致一些实际有显著影响的因子无法被检测出来. 如自由度为99时,sex3影响因子的检测结果为不显著。

y0 <- 53 + 3 * x0 +  0.4 * sex.boy + sigma
y3 <- y0[1:100]
x3 <- x0[1:100]
sex3 <- sex.boy[1:100]
lm1 <- lm(y0~x0+sex.boy); summary(lm1)
## 
## Call:
## lm(formula = y0 ~ x0 + sex.boy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5568  -2.0120   0.0195   2.0464   9.9956 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.00050    0.02991 1771.96   <2e-16 ***
## x0           2.98534    0.02941  101.52   <2e-16 ***
## sex.boy      0.40007    0.02991   13.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.991 on 9997 degrees of freedom
## Multiple R-squared:  0.5122, Adjusted R-squared:  0.5121 
## F-statistic:  5248 on 2 and 9997 DF,  p-value: < 2.2e-16
lm2 <- lm(y3~x3+sex3); summary(lm2)
## 
## Call:
## lm(formula = y3 ~ x3 + sex3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1534  -1.7114  -0.0564   2.2869   7.4456 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.1098     0.3159 168.108   <2e-16 ***
## x3            3.0678     0.2973  10.320   <2e-16 ***
## sex3          0.4715     0.3048   1.547    0.125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.122 on 97 degrees of freedom
## Multiple R-squared:  0.529,  Adjusted R-squared:  0.5193 
## F-statistic: 54.48 on 2 and 97 DF,  p-value: < 2.2e-16

数据非独立性的情况很常见

如每个城市的综合实力严重影响了y0. 那么同一个城市内数据就是非独立性数据.

y0 <- 53 + 3 * x0 + 0.5*sex.boy + 50 * city.scores + sigma
#
index.cit <- as.numeric( gsub('cit([0-9]{1,2})','\\1',cit.fact) )
cit.order <- reorder(x=cit.fact, X=index.cit )
#
boxplot(y0~cit.order, xlab='cit.fact')

非独立性数据导致简单回归模型系数出现偏差

如果我们没有考虑着不同城市cit.fact之间的差异. 则回归结果可能会出现偏差.

plot(x0, y0)
lm0 <- lm(y0~x0+sex.boy);  summary(lm0)
## 
## Call:
## lm(formula = y0 ~ x0 + sex.boy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -77.16 -40.20 -17.10  36.59 110.09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.9988     0.5007 105.844  < 2e-16 ***
## x0            3.2871     0.4923   6.677 2.57e-11 ***
## sex.boy      -0.9307     0.5007  -1.859   0.0631 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.07 on 9997 degrees of freedom
## Multiple R-squared:  0.004772,   Adjusted R-squared:  0.004573 
## F-statistic: 23.97 on 2 and 9997 DF,  p-value: 4.129e-11
abline(lm0, col=2,lwd=4) 
## Warning in abline(lm0, col = 2, lwd = 4): only using the first two of 3
## regression coefficients

最好的理想的回归结果

当然了,如果我们能够精确量化每个城市的综合影响因子city.scores, 那么回归结果是再好不过了.

lm0 <- lm(y0~x0+sex.boy+city.scores);  summary(lm0)
## 
## Call:
## lm(formula = y0 ~ x0 + sex.boy + city.scores)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5587  -2.0122   0.0195   2.0465   9.9968 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.00050    0.02991 1771.87   <2e-16 ***
## x0           2.98534    0.02941  101.51   <2e-16 ***
## sex.boy      0.50003    0.02993   16.71   <2e-16 ***
## city.scores 49.99854    0.02993 1670.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.991 on 9996 degrees of freedom
## Multiple R-squared:  0.9964, Adjusted R-squared:  0.9964 
## F-statistic: 9.349e+05 on 3 and 9996 DF,  p-value: < 2.2e-16

补救措施之一,哑变量

但是,实际上要精确的量化每个城市的综合影响因子city.scores是太不可能的,如果把不同城市cit.fact作为哑变量加入模型,则回归结果将会好很多,但是模型参数太多,且截距误差太大(意义不明确).

lm1 <- lm(y0~x0+sex.boy+cit.fact);  summary(lm1)
## 
## Call:
## lm(formula = y0 ~ x0 + sex.boy + cit.fact)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6684  -2.0101   0.0172   2.0430   9.9778 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    115.47774    0.09491 1216.66   <2e-16 ***
## x0               2.98516    0.02943  101.42   <2e-16 ***
## sex.boy          0.50313    0.03010   16.72   <2e-16 ***
## cit.factcit12  -35.30401    0.13463 -262.24   <2e-16 ***
## cit.factcit13  -92.31805    0.14094 -655.00   <2e-16 ***
## cit.factcit9  -116.79594    0.14619 -798.92   <2e-16 ***
## cit.factcit7   -21.98860    0.14517 -151.47   <2e-16 ***
## cit.factcit1   -79.25616    0.12985 -610.35   <2e-16 ***
## cit.factcit10 -116.01696    0.13264 -874.71   <2e-16 ***
## cit.factcit2   -99.31623    0.14300 -694.54   <2e-16 ***
## cit.factcit6   -31.89959    0.14897 -214.14   <2e-16 ***
## cit.factcit3  -129.58887    0.15508 -835.64   <2e-16 ***
## cit.factcit4    38.23507    0.14132  270.56   <2e-16 ***
## cit.factcit8   -81.48936    0.15754 -517.26   <2e-16 ***
## cit.factcit11  -96.83834    0.31457 -307.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.992 on 9985 degrees of freedom
## Multiple R-squared:  0.9964, Adjusted R-squared:  0.9964 
## F-statistic: 2.002e+05 on 14 and 9985 DF,  p-value: < 2.2e-16

随机因子的意义

如果把不同城市cit.fact作为随机因子加入混合效应模型(linear mixed-effects models),则回归结果的意义将更加明确.
得到,当cit.fact随机因子位于平均值时,回归模型的固定及随机效应的结果.
固定部分: Intercept参数为51.45,标准误为5.464; x0参数为2.98,标准误为0.02; sex.boy参数为0.5,标准差为0.03.
随机部分: cit.fact能解释随机斜距的19.703个残差标准误,剩余残差Residual为2.992个残差标准误.
对比发现,模型fm0显著优于fm.

require(lme4)           
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: Rcpp
fm  <- lmer(y0~1+x0 +(1|cit.fact), REML=F)
fm0 <- lmer(y0~1+x0+sex.boy+(1|cit.fact), REML=F);  summary(fm0)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: y0 ~ 1 + x0 + sex.boy + (1 | cit.fact)
## 
##      AIC      BIC   logLik deviance df.resid 
##  50465.4  50501.4 -25227.7  50455.4     9995 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5658 -0.6718  0.0059  0.6826  3.3349 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  cit.fact (Intercept) 2427.390 49.269  
##  Residual                8.952  2.992  
## Number of obs: 10000, groups:  cit.fact, 13
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 49.12575   13.66469    3.60
## x0           2.98516    0.02943  101.43
## sex.boy      0.50312    0.03010   16.72
## 
## Correlation of Fixed Effects:
##         (Intr) x0    
## x0       0.000       
## sex.boy  0.000 -0.003
anova(fm,fm0)
## Data: 
## Models:
## fm: y0 ~ 1 + x0 + (1 | cit.fact)
## fm0: y0 ~ 1 + x0 + sex.boy + (1 | cit.fact)
##     Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## fm   4 50739 50768 -25366    50731                             
## fm0  5 50465 50501 -25228    50455 275.62      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

两个随机因子加入模型

把不同城市cit.fact,不同学校sch.fact作为随机因子加入混合效应模型中.

##
y0 <- 53 + 3 * x0 + 0.5*sex.boy + 50 * city.scores +
      7 * school.scores + sigma
##
index.sch <- as.numeric( gsub('sch([0-9]{1,2})','\\1',sch.fact) )
sch.order <- reorder( x=paste(sch.fact, sex.fact), 
               X=index.sch )
##
par(mar=c(5,4,1,1), mgp=c(4,0.5,0),tck=-0.02)
boxplot(y0 ~ sch.order, xlim=c(1,21.7), las=2, 
        col=c(2,4), xlab='school + sex')
title(ylab="graduate physical exam",line=2.5)

fm1: 双因子随机截距模型 VS fm0:单因子随机截距模型

对比发现,fm1显著优于fm.

fm0 <- lmer(y0~ 1 + (1|cit.fact)  , REML=F)      
fm1 <- lmer(y0~ 1 + (1|cit.fact) + (1|sch.fact), REML=F)      
anova(fm0, fm1)
## Data: 
## Models:
## fm0: y0 ~ 1 + (1 | cit.fact)
## fm1: y0 ~ 1 + (1 | cit.fact) + (1 | sch.fact)
##     Df   AIC   BIC logLik deviance Chisq Chi Df Pr(>Chisq)    
## fm0  3 69948 69969 -34971    69942                            
## fm1  4 58151 58180 -29072    58143 11798      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fm2: x0+双因子随机截距模型

对比发现,fm2要显著优于fm1.

fm2 <- lmer(y0~ 1 + x0 + (1|sch.fact) + (1|cit.fact), REML=F)                
anova(fm1,fm2)  
## Data: 
## Models:
## fm1: y0 ~ 1 + (1 | cit.fact) + (1 | sch.fact)
## fm2: y0 ~ 1 + x0 + (1 | sch.fact) + (1 | cit.fact)
##     Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## fm1  4 58151 58180 -29072    58143                             
## fm2  5 51269 51305 -25629    51259 6884.7      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

数据: x0+sex.boy+双因子随机截距+双因子随机斜率模型

y0 <- 53+ 0.5*sex.boy + (50 * city.scores) + (7 * school.scores) + 
      (3+ 5 * city.scores + 7 * school.scores) * x0  + sigma

fm3 VS fm4

固定+双因子随机截距+双因子随机斜率模型记为fm4. 对比发现,fm4要显著优于fm3.

fm3 <- lmer(y0~ 1 + x0 + (1|sch.fact) + (1|cit.fact), REML=F)    
fm4 <- lmer(y0~ 1 + x0 + (1+ x0|cit.fact) + (1+ x0|sch.fact), REML=F)                
anova(fm3, fm4)  
## Data: 
## Models:
## fm3: y0 ~ 1 + x0 + (1 | sch.fact) + (1 | cit.fact)
## fm4: y0 ~ 1 + x0 + (1 + x0 | cit.fact) + (1 + x0 | sch.fact)
##     Df   AIC   BIC logLik deviance Chisq Chi Df Pr(>Chisq)    
## fm3  5 72928 72964 -36459    72918                            
## fm4  9 51403 51468 -25692    51385 21533      4  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fm5: x0+sex.boy+双因子随机截距+双因子随机斜率模型

对比发现,fm5要显著优于fm4.

fm5 <- lmer(y0~ 1 + x0 + sex.boy + (1+ x0|cit.fact) + (1+ x0|sch.fact), REML=F)
anova(fm4, fm5)  
## Data: 
## Models:
## fm4: y0 ~ 1 + x0 + (1 + x0 | cit.fact) + (1 + x0 | sch.fact)
## fm5: y0 ~ 1 + x0 + sex.boy + (1 + x0 | cit.fact) + (1 + x0 | sch.fact)
##     Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## fm4  9 51403 51468 -25692    51385                             
## fm5 10 51143 51215 -25562    51123 261.22      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(fm5)的含义

summary(fm5)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: y0 ~ 1 + x0 + sex.boy + (1 + x0 | cit.fact) + (1 + x0 | sch.fact)
## 
##      AIC      BIC   logLik deviance df.resid 
##  51143.4  51215.5 -25561.7  51123.4     9990 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5145 -0.6734  0.0055  0.6803  3.3734 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  sch.fact (Intercept)   48.955  6.997       
##           x0            49.997  7.071   1.00
##  cit.fact (Intercept) 2374.226 48.726       
##           x0            19.432  4.408   1.00
##  Residual                8.932  2.989       
## Number of obs: 10000, groups:  sch.fact, 99; cit.fact, 13
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 49.21252   13.53252   3.637
## x0           2.69328    1.41446   1.904
## sex.boy      0.50820    0.03121  16.282
## 
## Correlation of Fixed Effects:
##         (Intr) x0   
## x0      0.889       
## sex.boy 0.000  0.000

固定效应:
(当sch.fact和cit.fact的影响力位于平均水平时,得到的模型参数.)
Intercept参数为49.212,参数的标准误为13.532;
x0参数为2.693,参数的标准误为1.414;
sex.boy参数为0.508,参数的标准误为0.031。
随机效应:
sch.fact和cit.fact分别解释了随机截距的6.997和48.726个残差标准误;
sch.fact和cit.fact分别解释了随机斜率的7.071和4.408个残差标准误;
未能解释的残差residual为2.989个残差标准误。

(完)