独立性是指像随机抛硬币事件那样,前后两次事件结果之间互不干扰的性质.
因此,如果重复抛硬币10000次,出现概率较大的情况是,50%左右为正面,50%左右为反面.
但是如果数据之间不独立,比如同一次实验结果被连续重复的记录了2000次,这种重复记录的行为,势必会影响实验的结果,造成较大误差. 再比如,如果仅做了10次实验,每次实验结果却被重复记录了1000次数据,这样所得到的10000个数据的自由度实际上也只有10-1=9,而不是10000-1=9999,因此基于自由度为9999的各种检验将会毫无意义.
某省,抽取17个城市,99所学校,10000名学生,进行体能调查.
cit.fact为城市名称,sch.fact为学校名称,sex.fact为性别.
city.scores为城市综合实力,school.scores为学校综合实力,sex.boy是否为男生.
x0为入学时学生的体能测试成绩.
y0为毕业时学生的体能测试成绩.
set.seed(1); n_student=10000; n_school=99; n_cities=13
###
school.scores <- sample(x=rnorm(n=n_school, mean=0, sd = 20), size=n_student, replace=T)
sch.fact <- factor(school.scores);
levels(sch.fact) <- paste('sch', sample(1:n_school), sep='')
#
stud.sex.prob <- factor(sch.fact)
levels(stud.sex.prob) <- runif(n=n_school, min=0.4, max=1)
stud.sex.prob <- as.numeric(as.character(stud.sex.prob))
FUN <- function(i){sample(x = c('boy', 'girl'), size = 1, replace = T,
prob = c(stud.sex.prob[i], 1-stud.sex.prob[i]))}
sex.fact <- sapply(1:n_student, FUN )
sex.boy <- as.numeric(sex.fact =='boy')
#
cit.fact <- factor(sch.fact)
levels(cit.fact) <- sample(rnorm(n=n_cities, mean=0, sd=20),
size=n_school, replace=T,
prob =runif(n=n_cities, min=0.3, max=1));
city.scores <- as.numeric(as.character(cit.fact));
levels(cit.fact) <- paste('cit',sample(1:n_cities),sep='')
###
school.scores <- scale(school.scores)
city.scores <- scale(city.scores)
sex.boy <- scale(sex.boy)
sigma <- rnorm(n=n_student, mean = 0, sd =3)
x1 <- sex.fact; x2 <- sch.fact; x3 <- cit.fact
x0 <- origPhysicalExam <- rnorm(n=n_student, mean=0)
#
y0 <- 51+ 0.5 * sex.boy + (20 * city.scores) + (7 * school.scores) +
(3+ 5 * city.scores + 7 * school.scores) * x0 + sigma
最理想的取样数据为,每个数据都来自同一随机分布总体且彼此间相互独立.
y0 <- 53 +3 * x0 + sigma
plot(x0, y0)
lm0 <- lm(y0~x0)
abline(lm0, col=2,lwd=3)
abline(v=0,col=3)
summary(lm0)
##
## Call:
## lm(formula = y0 ~ x0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5567 -2.0121 0.0194 2.0465 9.9956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.00050 0.02991 1772.1 <2e-16 ***
## x0 2.98534 0.02941 101.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.991 on 9998 degrees of freedom
## Multiple R-squared: 0.5076, Adjusted R-squared: 0.5076
## F-statistic: 1.031e+04 on 1 and 9998 DF, p-value: < 2.2e-16
如果数据不能满足独立性的假设,将会对结果造成偏差.
比如,y1[1:3000]过分依赖y0[1],而y1[4001:7000]过分依赖y0[2].
最终导致了回归模型发生较大偏差.
y1 <- y0;
y1[1:3000] <- rnorm(n = 3000, mean = y0[1]);
y1[4001:7000] <- rnorm(n = 3000, mean = y0[2])
lm1 <- lm(y1~x0)
plot(x0,y1)
abline(lm0, col=2,lwd=3)
abline(lm1, col=4,lwd=3)
abline(v=0, col=3)
summary(lm1)
##
## Call:
## lm(formula = y1 ~ x0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4603 -3.5851 0.8565 3.8711 10.7410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.75282 0.04811 1200.49 <2e-16 ***
## x0 1.07110 0.04730 22.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.811 on 9998 degrees of freedom
## Multiple R-squared: 0.04879, Adjusted R-squared: 0.0487
## F-statistic: 512.9 on 1 and 9998 DF, p-value: < 2.2e-16
如果剔除掉不独立的数据,则回归模型回到理论水平.
但付出的代价是自由度由9998降低为3998.
y2 <- y1[-c(1:3000,4001:7000)]
x2 <- x0[-c(1:3000,4001:7000)]
lm2 <- lm(y2~x2)
plot(x2,y2)
abline(lm0, col=2,lwd=5)
abline(lm2, col=3,lwd=2)
summary(lm2)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4489 -2.0082 0.0081 1.9965 9.9994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.99657 0.04657 1138.05 <2e-16 ***
## x2 2.98523 0.04581 65.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.943 on 3998 degrees of freedom
## Multiple R-squared: 0.5151, Adjusted R-squared: 0.515
## F-statistic: 4247 on 1 and 3998 DF, p-value: < 2.2e-16
如果自由度降低到数值过小,会导致一些实际有显著影响的因子无法被检测出来. 如自由度为99时,sex3影响因子的检测结果为不显著。
y0 <- 53 + 3 * x0 + 0.4 * sex.boy + sigma
y3 <- y0[1:100]
x3 <- x0[1:100]
sex3 <- sex.boy[1:100]
lm1 <- lm(y0~x0+sex.boy); summary(lm1)
##
## Call:
## lm(formula = y0 ~ x0 + sex.boy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5568 -2.0120 0.0195 2.0464 9.9956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.00050 0.02991 1771.96 <2e-16 ***
## x0 2.98534 0.02941 101.52 <2e-16 ***
## sex.boy 0.40007 0.02991 13.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.991 on 9997 degrees of freedom
## Multiple R-squared: 0.5122, Adjusted R-squared: 0.5121
## F-statistic: 5248 on 2 and 9997 DF, p-value: < 2.2e-16
lm2 <- lm(y3~x3+sex3); summary(lm2)
##
## Call:
## lm(formula = y3 ~ x3 + sex3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1534 -1.7114 -0.0564 2.2869 7.4456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.1098 0.3159 168.108 <2e-16 ***
## x3 3.0678 0.2973 10.320 <2e-16 ***
## sex3 0.4715 0.3048 1.547 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.122 on 97 degrees of freedom
## Multiple R-squared: 0.529, Adjusted R-squared: 0.5193
## F-statistic: 54.48 on 2 and 97 DF, p-value: < 2.2e-16
如每个城市的综合实力严重影响了y0. 那么同一个城市内数据就是非独立性数据.
y0 <- 53 + 3 * x0 + 0.5*sex.boy + 50 * city.scores + sigma
#
index.cit <- as.numeric( gsub('cit([0-9]{1,2})','\\1',cit.fact) )
cit.order <- reorder(x=cit.fact, X=index.cit )
#
boxplot(y0~cit.order, xlab='cit.fact')
如果我们没有考虑着不同城市cit.fact之间的差异. 则回归结果可能会出现偏差.
plot(x0, y0)
lm0 <- lm(y0~x0+sex.boy); summary(lm0)
##
## Call:
## lm(formula = y0 ~ x0 + sex.boy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.16 -40.20 -17.10 36.59 110.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.9988 0.5007 105.844 < 2e-16 ***
## x0 3.2871 0.4923 6.677 2.57e-11 ***
## sex.boy -0.9307 0.5007 -1.859 0.0631 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.07 on 9997 degrees of freedom
## Multiple R-squared: 0.004772, Adjusted R-squared: 0.004573
## F-statistic: 23.97 on 2 and 9997 DF, p-value: 4.129e-11
abline(lm0, col=2,lwd=4)
## Warning in abline(lm0, col = 2, lwd = 4): only using the first two of 3
## regression coefficients
当然了,如果我们能够精确量化每个城市的综合影响因子city.scores, 那么回归结果是再好不过了.
lm0 <- lm(y0~x0+sex.boy+city.scores); summary(lm0)
##
## Call:
## lm(formula = y0 ~ x0 + sex.boy + city.scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5587 -2.0122 0.0195 2.0465 9.9968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.00050 0.02991 1771.87 <2e-16 ***
## x0 2.98534 0.02941 101.51 <2e-16 ***
## sex.boy 0.50003 0.02993 16.71 <2e-16 ***
## city.scores 49.99854 0.02993 1670.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.991 on 9996 degrees of freedom
## Multiple R-squared: 0.9964, Adjusted R-squared: 0.9964
## F-statistic: 9.349e+05 on 3 and 9996 DF, p-value: < 2.2e-16
但是,实际上要精确的量化每个城市的综合影响因子city.scores是太不可能的,如果把不同城市cit.fact作为哑变量加入模型,则回归结果将会好很多,但是模型参数太多,且截距误差太大(意义不明确).
lm1 <- lm(y0~x0+sex.boy+cit.fact); summary(lm1)
##
## Call:
## lm(formula = y0 ~ x0 + sex.boy + cit.fact)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6684 -2.0101 0.0172 2.0430 9.9778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.47774 0.09491 1216.66 <2e-16 ***
## x0 2.98516 0.02943 101.42 <2e-16 ***
## sex.boy 0.50313 0.03010 16.72 <2e-16 ***
## cit.factcit12 -35.30401 0.13463 -262.24 <2e-16 ***
## cit.factcit13 -92.31805 0.14094 -655.00 <2e-16 ***
## cit.factcit9 -116.79594 0.14619 -798.92 <2e-16 ***
## cit.factcit7 -21.98860 0.14517 -151.47 <2e-16 ***
## cit.factcit1 -79.25616 0.12985 -610.35 <2e-16 ***
## cit.factcit10 -116.01696 0.13264 -874.71 <2e-16 ***
## cit.factcit2 -99.31623 0.14300 -694.54 <2e-16 ***
## cit.factcit6 -31.89959 0.14897 -214.14 <2e-16 ***
## cit.factcit3 -129.58887 0.15508 -835.64 <2e-16 ***
## cit.factcit4 38.23507 0.14132 270.56 <2e-16 ***
## cit.factcit8 -81.48936 0.15754 -517.26 <2e-16 ***
## cit.factcit11 -96.83834 0.31457 -307.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.992 on 9985 degrees of freedom
## Multiple R-squared: 0.9964, Adjusted R-squared: 0.9964
## F-statistic: 2.002e+05 on 14 and 9985 DF, p-value: < 2.2e-16
如果把不同城市cit.fact作为随机因子加入混合效应模型(linear mixed-effects models),则回归结果的意义将更加明确.
得到,当cit.fact随机因子位于平均值时,回归模型的固定及随机效应的结果.
固定部分: Intercept参数为51.45,标准误为5.464; x0参数为2.98,标准误为0.02; sex.boy参数为0.5,标准差为0.03.
随机部分: cit.fact能解释随机斜距的19.703个残差标准误,剩余残差Residual为2.992个残差标准误.
对比发现,模型fm0显著优于fm.
require(lme4)
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: Rcpp
fm <- lmer(y0~1+x0 +(1|cit.fact), REML=F)
fm0 <- lmer(y0~1+x0+sex.boy+(1|cit.fact), REML=F); summary(fm0)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: y0 ~ 1 + x0 + sex.boy + (1 | cit.fact)
##
## AIC BIC logLik deviance df.resid
## 50465.4 50501.4 -25227.7 50455.4 9995
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5658 -0.6718 0.0059 0.6826 3.3349
##
## Random effects:
## Groups Name Variance Std.Dev.
## cit.fact (Intercept) 2427.390 49.269
## Residual 8.952 2.992
## Number of obs: 10000, groups: cit.fact, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 49.12575 13.66469 3.60
## x0 2.98516 0.02943 101.43
## sex.boy 0.50312 0.03010 16.72
##
## Correlation of Fixed Effects:
## (Intr) x0
## x0 0.000
## sex.boy 0.000 -0.003
anova(fm,fm0)
## Data:
## Models:
## fm: y0 ~ 1 + x0 + (1 | cit.fact)
## fm0: y0 ~ 1 + x0 + sex.boy + (1 | cit.fact)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## fm 4 50739 50768 -25366 50731
## fm0 5 50465 50501 -25228 50455 275.62 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
把不同城市cit.fact,不同学校sch.fact作为随机因子加入混合效应模型中.
##
y0 <- 53 + 3 * x0 + 0.5*sex.boy + 50 * city.scores +
7 * school.scores + sigma
##
index.sch <- as.numeric( gsub('sch([0-9]{1,2})','\\1',sch.fact) )
sch.order <- reorder( x=paste(sch.fact, sex.fact),
X=index.sch )
##
par(mar=c(5,4,1,1), mgp=c(4,0.5,0),tck=-0.02)
boxplot(y0 ~ sch.order, xlim=c(1,21.7), las=2,
col=c(2,4), xlab='school + sex')
title(ylab="graduate physical exam",line=2.5)
对比发现,fm1显著优于fm.
fm0 <- lmer(y0~ 1 + (1|cit.fact) , REML=F)
fm1 <- lmer(y0~ 1 + (1|cit.fact) + (1|sch.fact), REML=F)
anova(fm0, fm1)
## Data:
## Models:
## fm0: y0 ~ 1 + (1 | cit.fact)
## fm1: y0 ~ 1 + (1 | cit.fact) + (1 | sch.fact)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## fm0 3 69948 69969 -34971 69942
## fm1 4 58151 58180 -29072 58143 11798 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
对比发现,fm2要显著优于fm1.
fm2 <- lmer(y0~ 1 + x0 + (1|sch.fact) + (1|cit.fact), REML=F)
anova(fm1,fm2)
## Data:
## Models:
## fm1: y0 ~ 1 + (1 | cit.fact) + (1 | sch.fact)
## fm2: y0 ~ 1 + x0 + (1 | sch.fact) + (1 | cit.fact)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## fm1 4 58151 58180 -29072 58143
## fm2 5 51269 51305 -25629 51259 6884.7 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
y0 <- 53+ 0.5*sex.boy + (50 * city.scores) + (7 * school.scores) +
(3+ 5 * city.scores + 7 * school.scores) * x0 + sigma
固定+双因子随机截距+双因子随机斜率模型记为fm4. 对比发现,fm4要显著优于fm3.
fm3 <- lmer(y0~ 1 + x0 + (1|sch.fact) + (1|cit.fact), REML=F)
fm4 <- lmer(y0~ 1 + x0 + (1+ x0|cit.fact) + (1+ x0|sch.fact), REML=F)
anova(fm3, fm4)
## Data:
## Models:
## fm3: y0 ~ 1 + x0 + (1 | sch.fact) + (1 | cit.fact)
## fm4: y0 ~ 1 + x0 + (1 + x0 | cit.fact) + (1 + x0 | sch.fact)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## fm3 5 72928 72964 -36459 72918
## fm4 9 51403 51468 -25692 51385 21533 4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
对比发现,fm5要显著优于fm4.
fm5 <- lmer(y0~ 1 + x0 + sex.boy + (1+ x0|cit.fact) + (1+ x0|sch.fact), REML=F)
anova(fm4, fm5)
## Data:
## Models:
## fm4: y0 ~ 1 + x0 + (1 + x0 | cit.fact) + (1 + x0 | sch.fact)
## fm5: y0 ~ 1 + x0 + sex.boy + (1 + x0 | cit.fact) + (1 + x0 | sch.fact)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## fm4 9 51403 51468 -25692 51385
## fm5 10 51143 51215 -25562 51123 261.22 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fm5)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: y0 ~ 1 + x0 + sex.boy + (1 + x0 | cit.fact) + (1 + x0 | sch.fact)
##
## AIC BIC logLik deviance df.resid
## 51143.4 51215.5 -25561.7 51123.4 9990
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5145 -0.6734 0.0055 0.6803 3.3734
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## sch.fact (Intercept) 48.955 6.997
## x0 49.997 7.071 1.00
## cit.fact (Intercept) 2374.226 48.726
## x0 19.432 4.408 1.00
## Residual 8.932 2.989
## Number of obs: 10000, groups: sch.fact, 99; cit.fact, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 49.21252 13.53252 3.637
## x0 2.69328 1.41446 1.904
## sex.boy 0.50820 0.03121 16.282
##
## Correlation of Fixed Effects:
## (Intr) x0
## x0 0.889
## sex.boy 0.000 0.000
固定效应:
(当sch.fact和cit.fact的影响力位于平均水平时,得到的模型参数.)
Intercept参数为49.212,参数的标准误为13.532;
x0参数为2.693,参数的标准误为1.414;
sex.boy参数为0.508,参数的标准误为0.031。
随机效应:
sch.fact和cit.fact分别解释了随机截距的6.997和48.726个残差标准误;
sch.fact和cit.fact分别解释了随机斜率的7.071和4.408个残差标准误;
未能解释的残差residual为2.989个残差标准误。
(完)