\[z_i=\frac{x_i-\bar{x}}{s_x},\ \ \ \ w_i=\frac{y_i-\bar{y}}{s_y}\] とすると,相関係数\(r_{xy}\)は
相関係数は,次のように定義される.
\[\begin{align} r_{xy}&=\frac{1}{n}\sum_{i=1}^n z_i w_i \\ &=\frac{1}{n}\sum_{i=1}^n \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) =\frac{s_{xy}}{s_x s_y} \end{align}\]
\[-1\leq r_{xy}\leq 1\]
この証明のために,常に非負となる式\((1/n)\sum_{i=1}^n (z_i \pm w_i)^2\)を展開する. \[\begin{align} \frac{1}{n}\sum_{i=1}^n (z_i \pm w_i)^2&=\frac{1}{n}\sum_{i=1}^n (z_i^2 \pm 2z_i w_i +w_i^2) \\ &=\frac{1}{n}\sum_{i=1}^n z_i^2 \pm \frac{2}{n}\sum_{i=1}^n z_i w_i +\frac{1}{n}\sum_{i=1}^n w_i^2 \\ &= 1\pm 2r_{xy}+1=2(1\pm r_{xy})\geq 0 \end{align}\] \(\pm r_{xy}\geq -1\)より,\(-1\leq r_{xy}\leq 1\)である.
\[すべてのiについて,y_i=ax_i+b \Longleftrightarrow |r_{xy}|=1\] 十分条件と必要条件に分けて証明する.
(\(\Longrightarrow\)) 標準化で見たように,\(y_i=ax_i+b\)のとき,\(s^2_y=a^2 s^2_x\),すなわち,\(s_y=|a|s_x\)である.また, \[\begin{align} s_{xy}&=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}) \\ &=\frac{a}{n}\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})=as^2_x \end{align}\] である.ゆえに, \[r_{xy}=\frac{a s^2_x}{s_x |a|s_x}=\frac{a}{|a|}.\] つまり,\(a>0\)のとき相関係数は1で,\(a<0\)のとき\(-1\)となる.
(\(\Longleftarrow\)) \(r_{xy}=\pm 1\)とする.このとき, \[s_{xy}=\pm s_x s_y=\pm \frac{s_y}{s_x}s_x^2\] が成り立つ.\(a=\pm s_y/s_x\)とすると,上式より \[\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})=a\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})\] が成り立つ.ゆえに,すべての\(i\)について \[(y_i-\bar{y})=a(x_i-\bar{x})\] でなければならない.\(b=\bar{y}-a\bar{x}\)とすると,すべての\(i\)について\(y_i=ax_i+b\)を得る.
library(MASS)
par(mfrow=c(1,4),pty ="s")
for (r in c(1,0.7,0.3,0)) {
data<-mvrnorm(500, c(0,0), matrix(c(1, r, r, 1), 2, 2))
rr<-round(cor(data)[[2,1]],2)
plot(data,asp=1,xlab="",ylab="",main=paste("r = ",rr), cex.main = 2)
lmdata<-lm(data[,2]~data[,1])
abline(coef=coef(lmdata),col="red",lwd =1)
}
par(mfrow=c(1,4),pty ="s")
for (r in c(0,-0.3,-0.7,-1)) {
data<-mvrnorm(500, c(0,0), matrix(c(1, r, r, 1), 2, 2))
rr<-round(cor(data)[[2,1]],2)
plot(data,asp=1,xlab="",ylab="",main=paste("r = ",rr), cex.main = 2)
lmdata<-lm(data[,2]~data[,1])
abline(coef=coef(lmdata),col="red",lwd =1)
}
psych
パッケージのpartial.r
関数を使って偏相関係数を計算する.次の,3変数仮想データ.
library(MASS)
library(psych)
data<-data.frame(mvrnorm(500, c(0,0,0), matrix(c(1, 0.7, 0.9, 0.9, 1, 0.9, 0.9, 0.7, 1), 3, 3)))
colnames(data)<-c("x","y","z")
plot(data)
cor(data)
## x y z
## x 1.0000000 0.7274721 0.9046409
## y 0.7274721 1.0000000 0.9115226
## z 0.9046409 0.9115226 1.0000000
\(z\)をコントロールした\(x\)と\(y\)の偏相関係数\(r_{xy\cdot z}\)は,
partial.r(data,c(1,2),3)
## partial correlations
## x y
## x 1.00 -0.55
## y -0.55 1.00
それぞれ,第3の変数をコントロールした偏相関係数行列は以下の通り.
partial.r(data)
## x y z
## x 1.0000000 -0.5541831 0.8559743
## y -0.5541831 1.0000000 0.8666524
## z 0.8559743 0.8666524 1.0000000
この偏相関係数を,定義に従って自力で計算してみる.まず,生の\(x\)と\(y\)の相関をみる.
plot(data$x,data$y)
abline(coef=coef(lm(y~x,data=data)),col="red",lwd =1)
\(y\)を\(z\)に回帰させる.つまり\(y=a+bz\)の回帰モデルを分析する.
lmyz<-lm(y~z,data=data)
summary(lmyz)
##
## Call:
## lm(formula = y ~ z, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06713 -0.26932 -0.00465 0.28935 1.08191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.00819 0.01816 0.451 0.652
## z 0.90641 0.01833 49.463 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.406 on 498 degrees of freedom
## Multiple R-squared: 0.8309, Adjusted R-squared: 0.8305
## F-statistic: 2447 on 1 and 498 DF, p-value: < 2.2e-16
plot(data$z,data$y)
abline(coef=coef(lmyz),col="red",lwd =1)
残差\(y-(\hat a +\hat b z)\)は\(z\)の影響を取り除いた\(y\)の値と考えられる.残差をプロットする.
plot(lmyz$residuals)
次に,\(x\)を\(z\)に回帰させる.つまり\(x=a+bz\)の回帰モデルを分析する.
lmxz<-lm(x~z,data=data)
summary(lmxz)
##
## Call:
## lm(formula = x ~ z, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22524 -0.27949 0.00653 0.30579 1.08840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01380 0.01904 -0.725 0.469
## z 0.90990 0.01921 47.370 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4256 on 498 degrees of freedom
## Multiple R-squared: 0.8184, Adjusted R-squared: 0.818
## F-statistic: 2244 on 1 and 498 DF, p-value: < 2.2e-16
plot(data$z,data$x)
abline(coef=coef(lmxz),col="red",lwd =1)
残差\(x-(\hat a +\hat b z)\)は\(z\)の影響を取り除いた\(x\)の値と考えられる.
plot(lmxz$residuals)
最後に残差同士をプロットする.
plot(lmxz$residuals,lmyz$residuals)
残差同士の相関は,\(z\)の影響を取り除いた偏相関になっている.
cor(lmxz$residuals,lmyz$residuals)
## [1] -0.5541831
各学校ごとの勉強時間と成績の仮想データ
library(MASS)
scatter_a<-mvrnorm(100, c(20,90),
matrix(c(25, 25*0.7, 25*0.7, 25), 2, 2))
scatter_b<-mvrnorm(100, c(30,80),
matrix(c(25, 25*0.7, 25*0.7, 25), 2, 2))
scatter_c<-mvrnorm(100, c(40,70),
matrix(c(25, 25*0.7, 25*0.7, 25), 2, 2))
scatter_d<-mvrnorm(100, c(50,60),
matrix(c(25, 25*0.7, 25*0.7, 25), 2, 2))
scatter_e<-mvrnorm(100, c(60,50),
matrix(c(25, 25*0.7, 25*0.7, 25), 2, 2))
schooldata<-data.frame(
school=factor(c(rep("A",100),rep("B",100),rep("C",100),rep("D",100),rep("E",100))),
time=rbind(scatter_a,scatter_b,scatter_c,scatter_d,scatter_e)[,1],
score=rbind(scatter_a,scatter_b,scatter_c,scatter_d,scatter_e)[,2]
)
集団レベルの相関
mean_score<-as.vector(tapply(schooldata$score,schooldata$school,mean))
mean_time<-as.vector(tapply(schooldata$time,schooldata$school,mean))
plot(mean_time,mean_score,cex = 1.5)#,pch=19
abline(coef=coef(lm(mean_score~mean_time)),col="red",lwd =2)
cor(mean_time,mean_score)
## [1] -0.9989568
個人レベル全体の相関
plot(schooldata[,c(2,3)])
abline(coef=coef(lm(score~time,data=schooldata)),col="red",lwd =2)
cor(schooldata[,c(2,3)])
## time score
## time 1.0000000 -0.7928511
## score -0.7928511 1.0000000
集団ごと個人レベルの相関
plot(schooldata[,c(2,3)])
lmresult<-lm(score~time+school,data=schooldata)
abline(coef=coef(lmresult),col="red",lwd =2)
abline(coef=c(sum(coef(lmresult)[c(1,3)]),coef(lmresult)[2]),col="red",lwd =2)
abline(coef=c(sum(coef(lmresult)[c(1,4)]),coef(lmresult)[2]),col="red",lwd =2)
abline(coef=c(sum(coef(lmresult)[c(1,5)]),coef(lmresult)[2]),col="red",lwd =2)
abline(coef=c(sum(coef(lmresult)[c(1,6)]),coef(lmresult)[2]),col="red",lwd =2)
cor_group<-c(cor(scatter_a)[1,2],cor(scatter_b)[1,2],cor(scatter_c)[1,2],
cor(scatter_d)[1,2],cor(scatter_e)[1,2])
cor_group
## [1] 0.7992755 0.7661888 0.6602677 0.6926830 0.6889649
各集団で標準化した個人レベル全体の相関
schooldata$adjtime<-c(
scale(schooldata$time[schooldata$school=="A"]),
scale(schooldata$time[schooldata$school=="B"]),
scale(schooldata$time[schooldata$school=="C"]),
scale(schooldata$time[schooldata$school=="D"]),
scale(schooldata$time[schooldata$school=="E"])
)
schooldata$adjscore<-c(
scale(schooldata$score[schooldata$school=="A"]),
scale(schooldata$score[schooldata$school=="B"]),
scale(schooldata$score[schooldata$school=="C"]),
scale(schooldata$score[schooldata$school=="D"]),
scale(schooldata$score[schooldata$school=="E"])
)
plot(schooldata$adjtime,schooldata$adjscore)
abline(coef=coef(lm(adjscore~adjtime,data=schooldata)),col="red",lwd =2)
cor(schooldata$adjtime,schooldata$adjscore)
## [1] 0.721476