This week's class introduces the basic principles of regression. Combined with the research designs discussed earlier, regression lets us describe the relationship between two variables. For example, a scatter plot of two variables with a fitted regression line:
library(UsingR)   # provides the nym.2002 New York marathon data
library(ggplot2)
ggplot(nym.2002, aes(age, time)) +
  geom_point(col="#808000", size=2) +
  geom_smooth(method="lm") +
  theme_bw() +
  theme(axis.title=element_text(size=12),
        axis.text = element_text(size=12))
Statistical learning includes many methods for estimating \(f\), that is, the conditional expectation of \(Y\) given \(X\). For example, the kernel-weighted average (the Nadaraya-Watson estimator):
\(\hat{E}[Y|X=x_{0}]=\frac{\sum_{i=1}^N K\left(\frac{X_{i}-x_{0}}{h}\right)Y_{i}}{\sum_{i=1}^N K\left(\frac{X_{i}-x_{0}}{h}\right)}\)
or a local polynomial regression around \(x_{0}\):
\(Y_{i}=\alpha_{0}+\beta_{1}(x_{i}-x_{0})+\cdots +\beta_{p}(x_{i}-x_{0})^{p}+\epsilon_{i}\)
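Both kinds of estimator are available in base R. A brief sketch on the marathon data plotted above (the bandwidth and polynomial degree here are arbitrary choices, not from the original notes):

# local polynomial regression (loess) and a kernel-weighted average (ksmooth)
fit.loess <- loess(time ~ age, data = nym.2002, degree = 2)
head(predict(fit.loess))   # fitted finishing times at the observed ages
ks <- with(nym.2002, ksmooth(age, time, kernel = "normal", bandwidth = 5))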
From the above, the mean squared error of a prediction decomposes into a variance term plus a squared bias term:
\(\mathrm{MSE}_{\mathrm{pred.}}=\mathrm{Var}_{\mathrm{pred.}}+\mathrm{Bias}^{2}\)
Linear regression assumes that the conditional expectation is a linear function of \(X\):
\(E[Y|X]=\beta_{0}+\beta_{1}X\)
and least squares chooses the coefficients that minimize the residual sum of squares:
\(RSS=e_{1}^2+e_{2}^2+\ldots +e_{n}^2\)
In the regression model \(\hat{Y}=\hat{\beta}_{0}+\hat{\beta}_{1}X\), what is the meaning of \(\hat{\beta}_{1}\)?
When the model is used for prediction, for example predicting sales from TV advertising:
sales \(\approx {\beta}_{0}+{\beta}_{1}\cdot\) TV
site="http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv"
Advertising<-read.csv(file = url(site), sep=',', header = TRUE)
m1<-lm(sales ~ TV, data=Advertising)
summary(m1)
##
## Call:
## lm(formula = sales ~ TV, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
Data source: James et al. (2013). Plotting the data:
library(ggplot2)
A1 <- ggplot(Advertising, aes(x=TV, y=sales)) +
  labs(y = 'Sales', x = 'TV') +
  geom_point(col="saddlebrown") +
  geom_smooth(method="lm", col='blue', se=FALSE, size=1.5)
A1
Next, add the predicted values, and segments showing the residuals between the observed and predicted values (following BLOGR):
Advertising$predicted <- predict(m1) # Save the predicted values
Advertising$residuals <- residuals(m1) # Save the residual values
# Quick look at the actual, predicted, and residual values
#library(dplyr)
#Advertising %>% select(predicted, residuals) %>% head()
ggplot(Advertising, aes(x=TV, y=sales)) +
geom_point(color="saddlebrown") +
geom_point(aes(y = predicted), shape = 1, color="red") +
geom_segment(aes(xend=TV, yend=predicted), color="lightgray") +
theme_bw()
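We now turn to the child test-score data (the kidiq data of Gelman and Hill, 2007), regressing children's test scores on whether the mother finished high school. The chunk that produced the output below is not shown in this extract; a minimal sketch, assuming the data sit in a data frame iq (the name used later in this section):

# kid_score: child's test score; mom_hs: 1 if the mother finished high school
fit1 <- with(iq, lm(kid_score ~ mom_hs))
summary(fit1)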
##
## Call:
## lm(formula = kid_score ~ mom_hs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.55 -13.32 2.68 14.68 58.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.548 2.059 37.670 < 2e-16 ***
## mom_hs 11.771 2.322 5.069 5.96e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.85 on 432 degrees of freedom
## Multiple R-squared: 0.05613, Adjusted R-squared: 0.05394
## F-statistic: 25.69 on 1 and 432 DF, p-value: 5.957e-07
The estimates can be written as:
kid_score = \(77.548 + 11.77\cdot\) mom_hs
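Mother's IQ can be used as a predictor in the same way. Again a sketch of the call behind the output below, under the same assumptions:

fit2 <- with(iq, lm(kid_score ~ mom_iq))
summary(fit2)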
##
## Call:
## lm(formula = kid_score ~ mom_iq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.753 -12.074 2.217 11.710 47.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.79978 5.91741 4.36 1.63e-05 ***
## mom_iq 0.60997 0.05852 10.42 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.27 on 432 degrees of freedom
## Multiple R-squared: 0.201, Adjusted R-squared: 0.1991
## F-statistic: 108.6 on 1 and 432 DF, p-value: < 2.2e-16
The estimates can be written as:
kid_score = \(25.799+ 0.609\cdot\) mom_iq
\(\widehat{\mathrm {Kid's\hspace{.25em} Scores}}=\hat{\beta}_{0}+\hat{\beta}_{1}(\mathrm {Mom\hspace{.25em} has\hspace{.25em} HS\hspace{.25em} degree})+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ})\)
When \(X_{1}=1\) (the mother has a high school degree), \[ \begin{eqnarray} \widehat{\mathrm {Kid's\hspace{.25em} Scores}} & = &\hat{\beta}_{0}+\hat{\beta}_{1}\cdot 1+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \\ & = &\hat{\beta}_{0}+\hat{\beta}_{1}+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \end{eqnarray} \]
When \(X_{1}=0\) (the mother does not have a high school degree),
\[
\begin{eqnarray}
\widehat{\mathrm {Kid's\hspace{.25em} Scores}} & = &\hat{\beta}_{0}+\hat{\beta}_{1}\cdot 0+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \\
& = &\hat{\beta}_{0}+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ})
\end{eqnarray}
\]
In other words, the two lines have the same slope, but when \(X_{1}=1\) the intercept is larger by \(\hat{\beta}_{1}\).
The model is estimated as follows:
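As before, the chunk itself is not shown in this extract; a sketch, assuming the data frame iq:

fit3 <- with(iq, lm(kid_score ~ mom_hs + mom_iq))
summary(fit3)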
##
## Call:
## lm(formula = kid_score ~ mom_hs + mom_iq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.873 -12.663 2.404 11.356 49.545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.73154 5.87521 4.380 1.49e-05 ***
## mom_hs 5.95012 2.21181 2.690 0.00742 **
## mom_iq 0.56391 0.06057 9.309 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.14 on 431 degrees of freedom
## Multiple R-squared: 0.2141, Adjusted R-squared: 0.2105
## F-statistic: 58.72 on 2 and 431 DF, p-value: < 2.2e-16
The estimated model is: \[\hat{Y}=25.73+5.95\cdot(\mathrm{Mom\hspace{.25em} has\hspace{.25em} HS\hspace{.25em} degree}) +0.56\cdot(\mathrm{Mom's\hspace{.25em} IQ})\]
The meaning of each coefficient can be seen by plotting the two parallel regression lines:
ggplot(iq, aes(x=mom_iq, y=kid_score, group=mom_hs)) +
geom_point(aes(color=factor(mom_hs)), shape=1, size=2) +
geom_abline( slope=0.56, intercept=25.73, col="indianred", size=1.2) +
geom_abline(slope=0.56, intercept=25.73+5.95, col="blue", size=1.2) +
scale_color_discrete(name = "Mother with High School Degree",
labels=c("No High School Degree","High School Degree")) +
theme_bw() +
theme(legend.position = "top") +
labs(x="Mother's IQ", y="Kid's Score")
##
## Call:
## lm(formula = kid_score ~ mom_iq + mom_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.941 -12.493 2.257 11.614 46.711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.59625 9.08397 1.937 0.0534 .
## mom_iq 0.60357 0.05874 10.275 <2e-16 ***
## mom_age 0.38813 0.32620 1.190 0.2348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.26 on 431 degrees of freedom
## Multiple R-squared: 0.2036, Adjusted R-squared: 0.1999
## F-statistic: 55.08 on 2 and 431 DF, p-value: < 2.2e-16
\[ \hat{Y}=17.6+0.6\cdot(\mathrm{Mom's\hspace{.25em} IQ}) +0.39\cdot(\mathrm{Mom's\hspace{.25em} age}) \]
With two continuous predictors, the regression coefficients are interpreted as follows:
Because
\[ \frac{\partial (\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2})}{\partial X_{1}}=\beta_{1}, \]
\(\beta_{1}\) represents the net effect of \(X_{1}\) on \(y\) when the other variables are held at the same value or level. To let the effect of one variable depend on another, we add an interaction term:
# interaction model, plus separate fits within each mom_hs group
fit5 <- with(iq, lm(kid_score ~ mom_iq + mom_hs + mom_iq:mom_hs))
fit5.1 <- with(iq, lm(kid_score[mom_hs==1] ~ mom_iq[mom_hs==1]))
fit5.2 <- with(iq, lm(kid_score[mom_hs==0] ~ mom_iq[mom_hs==0]))
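The comparison table below can be produced with, for example, the stargazer package (the package choice is an assumption; any regression-table function would do):

# side-by-side table of the two subset fits and the interaction model
library(stargazer)
stargazer(fit5.1, fit5.2, fit5, type = "text")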
The dependent variable in every column is kid_score; columns (1) and (2) are the subset fits, column (3) the interaction model.

| | (1) kid_score[mom_hs == 1] | (2) kid_score[mom_hs == 0] | (3) kid_score |
|---|---|---|---|
| mom_iq[mom_hs == 1] | 0.485*** (0.065) | | |
| mom_iq[mom_hs == 0] | | 0.969*** (0.157) | |
| mom_iq | | | 0.969*** (0.148) |
| mom_hs | | | 51.268*** (15.338) |
| mom_iq:mom_hs | | | -0.484*** (0.162) |
| Constant | 39.786*** (6.663) | -11.482 (14.601) | -11.482 (13.758) |
| Observations | 341 | 93 | 434 |
| R2 | 0.143 | 0.294 | 0.230 |
| Adjusted R2 | 0.140 | 0.286 | 0.225 |
| Residual Std. Error | 17.664 (df = 339) | 19.073 (df = 91) | 17.971 (df = 430) |
| F Statistic | 56.422*** (df = 1; 339) | 37.874*** (df = 1; 91) | 42.839*** (df = 3; 430) |

Note: *p<0.1; **p<0.05; ***p<0.01 (standard errors in parentheses).
Redrawing the plot with the two fitted lines:
ggplot(iq, aes(x=mom_iq, y=kid_score, group=mom_hs)) +
geom_point(aes(color=factor(mom_hs)), shape=1, size=2) +
geom_abline( slope=0.969, intercept=-11.482, col="indianred", size=1.2) +
geom_abline(slope=0.485, intercept=39.786, col="blue", size=1.2) +
scale_color_discrete(name = "Mother with High School Degree",
labels=c("No High School Degree","High School Degree")) +
theme_bw() +
theme(legend.position = "top") +
labs(x="Mother's IQ", y="Kid's Score")
Substituting the two values of mom_hs into the interaction model yields two lines with different intercepts and different slopes, as worked out below.
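Using the column (3) estimates: \[ \begin{eqnarray} \hat{Y}\big|_{\mathrm{mom\_hs}=0} & = & -11.482+0.969\cdot \mathrm{mom\_iq} \\ \hat{Y}\big|_{\mathrm{mom\_hs}=1} & = & (-11.482+51.268)+(0.969-0.484)\cdot \mathrm{mom\_iq} \\ & = & 39.786+0.485\cdot \mathrm{mom\_iq} \end{eqnarray} \] which reproduces the two separate fits in columns (1) and (2).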
Because the OLS residuals sum to zero, the fitted values have the same mean as the observed \(y\): \[\bar{\hat{y}}=\bar{y}=\frac{1}{n}\sum_{i=1}^n y_{i}\]
Explained Sum of Squares
\[
\begin{eqnarray}
\sum_{i=1}^n(\hat{y}_{i}-\bar{y})^2 & = & SSE \\
& \propto & Var[\hat{y}]
\end{eqnarray}
\]
The three sums of squares are related by SST = SSE + SSR, where \(SST=\sum_{i=1}^n(y_{i}-\bar{y})^2\) is the total sum of squares and \(SSR=\sum_{i=1}^n e_{i}^2\) is the residual sum of squares, so that \(R^2=SSE/SST\).
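A quick numerical check of the decomposition, using the Advertising model m1 fitted earlier (a sketch):

# verify SST = SSE + SSR for m1
ybar <- mean(Advertising$sales)
SST <- sum((Advertising$sales - ybar)^2)
SSE <- sum((fitted(m1) - ybar)^2)   # explained
SSR <- sum(residuals(m1)^2)         # residual
c(SST = SST, SSE.plus.SSR = SSE + SSR, R.squared = SSE/SST)

Note also that \(R^2\) does not depend on which variable sits on the left-hand side: in simple regression it equals the squared correlation, so regressing y on x and x on y give the same value: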
x <- seq(1, 10, length.out = 100)
y <- 2 + 1.2*x + rnorm(100, 0, sd = 2)
cat("Y~X, R-squared:", summary(lm(y ~ x))$r.squared, "\n")
cat("X~Y, R-squared:", summary(lm(x ~ y))$r.squared, "\n")
## Y~X, R-squared: 0.7042899
## X~Y, R-squared: 0.7042899
First, create two variables and compute their correlation:
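The generating chunk is not shown in this extract; a minimal sketch that yields two correlated predictors (the seed and coefficient are assumptions, so the exact correlation below will not be reproduced):

set.seed(2019)              # assumed seed
X1 <- rnorm(100)
X2 <- 0.7*X1 + rnorm(100)   # X2 is correlated with X1
cor(X1, X2)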
## [1] 0.6743386
Then create Y and estimate a regression model on each predictor:
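Continuing the sketch (building Y from X1 is an assumption, consistent with the larger X1 coefficient in the output below):

Y <- X1 + rnorm(100)
summary(lm(Y ~ X1))
summary(lm(Y ~ X2))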
##
## Call:
## lm(formula = Y ~ X1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7110 -0.6905 0.1052 0.7826 2.7231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006661 0.108008 0.062 0.951
## X1 1.079323 0.108350 9.961 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.076 on 98 degrees of freedom
## Multiple R-squared: 0.5031, Adjusted R-squared: 0.498
## F-statistic: 99.23 on 1 and 98 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Y ~ X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4921 -0.9750 0.0051 1.0920 2.9693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04759 0.13897 0.342 0.733
## X2 0.50463 0.10171 4.961 2.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.365 on 98 degrees of freedom
## Multiple R-squared: 0.2008, Adjusted R-squared: 0.1926
## F-statistic: 24.62 on 1 and 98 DF, p-value: 2.947e-06
Examine the scatter plot:
# graph
par(xpd=NA, mar=par()$mar+c(1, 0, 0, 0))
plot(X1, Y, type='n', xlab="X1, X2")
points(X1, Y, pch=16)
points(X2, Y, col="red", pch=16)
legend("bottomright", c("X1","X2"), col=c("black", "red"),
pch=c(16, 16), bty="n")
When the variability of the dependent variable increases while everything else stays the same, \(R^2\) becomes smaller (why?), yet the relationship between \(X\) and \(Y\) has not changed: the estimated regression coefficient stays close to the true slope of 1.2.
set.seed(3939889)
r2.0 <- function(sig){
x <- seq(1,10,length.out = 100) # our predictor
y <- 2 + 1.2*x + rnorm(100,0, sd = sig) # our response; a function of x plus some random noise
summary(lm(y ~ x))$r.squared # return the R-squared value
}
sigmas <- seq(1, 10,length.out = 20)
rout <- sapply(sigmas, r2.0) # apply our function to a series of sigma values
dt <- data.frame(sigmas, rout)
library(ggplot2)
ggplot(data=dt, aes(sigmas, rout)) +
geom_line(col='#FF6600', size=1.5) +
geom_point() +
labs(y=expression(R^2), x=expression(sigma)) +
theme_classic()
The figure shows that as \(\sigma\) grows, \(R^2\) falls from high to low. But if we inspect the regression coefficient instead, it stays around the true value of 1.2:
set.seed(3939889)
beta <- function(sig){
x <- seq(1,10,length.out = 100) # our predictor
y <- 2 + 1.2*x + rnorm(100,0, sd = sig) # our response; a function of x plus some random noise
summary(lm(y ~ x))$coefficients[2,1] # return the estimated slope
}
sigmas <- seq(1, 10,length.out = 20)
res <- sapply(sigmas, beta)
dt <- data.frame(sigmas, res)
ggplot(data=dt, aes(sigmas, res)) +
geom_line(col='#FF6600', size=1.5) +
geom_point() +
labs(y=expression(beta[1]), x=expression(sigma)) +
theme_classic()
What we really care about, then, is the regression coefficient, not \(R^2\): the latter can approach 0 as \(\sigma^2\) grows, while the regression coefficient remains stable. This shows that \(R^2\) by itself is not a measure of how appropriate the model is.
Why does \(R^2\) shrink when the variance of the dependent variable grows, other things equal? Because as the variance of \(Y\) grows, SSE (the explained part) may grow, but SST grows faster, so the ratio falls. Borrowing the code above, take two values of \(\sigma\) as an example:
set.seed(5525252)
# For a given sigma, simulate one sample, fit the regression, and return
# the explained (SSE) and total (SST) sums of squares and their ratio
ss <- function(sig){
  x <- seq(1, 10, length.out = 100)          # our predictor
  y <- 2 + 1.2*x + rnorm(100, 0, sd = sig)   # response: x plus random noise
  m <- lm(y ~ x)
  SSE <- sum((fitted(m) - mean(y))^2)        # explained sum of squares
  SST <- sum((y - mean(y))^2)                # total sum of squares
  c(sigma = sig, SSE = SSE, SST = SST, R.squared = SSE/SST)
}
sigmas <- c(1, 10)
library(dplyr); library(knitr); library(kableExtra)
as.data.frame(t(sapply(sigmas, ss))) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
(Table: SSE, SST, and \(R^2\) for \(\sigma=1\) and \(\sigma=10\); both sums of squares grow with \(\sigma\), but SST grows faster, so \(R^2\) is smaller at \(\sigma=10\).)
So when the variance of \(Y\) increases, the spread of the fitted values also increases, that is, the predictions no longer cluster near a single point; but their share of the total variation can fall, and \(R^2\) becomes smaller.
The adjusted \(R^2\) takes the sample size \(n\) and the number of predictors \(k\) into account: \[ R^2_{Adj}=1-\frac{(1-R^2)(n-1)}{n-k-1} \]
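A quick check of this formula against R's own output, using the Advertising model m1 from above (a sketch):

# adjusted R-squared by hand for m1 (n = 200, k = 1)
r2 <- summary(m1)$r.squared
n <- nobs(m1); k <- 1
1 - (1 - r2)*(n - 1)/(n - k - 1)   # compare with summary(m1)$adj.r.squared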
In several kinds of data, the regression model's \(R^2\) is large even though the scatter plot shows that a linear relationship between X and Y does not necessarily hold. In the first kind, most observations cluster in the lower left and a few sit in the upper right; a sketch of such data follows.
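The chunk building DT is not shown in this extract; a minimal sketch of this pattern (all values are assumptions, so the estimates below will differ somewhat):

set.seed(2019)   # assumed seed
DT <- data.frame(X = c(runif(12, 1, 3), 8, 9),   # 12 points lower left, 2 upper right
                 Y = c(runif(12, 0, 2), 7, 8))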
library(ggplot2)
ggplot(DT, aes(x=X, y=Y)) +
geom_point(size=1.5) +
geom_smooth(method='lm') +
theme_bw()
##
## Call:
## lm(formula = Y ~ X, data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1929 -0.3682 -0.1053 0.6318 1.6318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5434 0.3123 1.740 0.107
## X 0.8248 0.0918 8.984 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8653 on 12 degrees of freedom
## Multiple R-squared: 0.8706, Adjusted R-squared: 0.8598
## F-statistic: 80.72 on 1 and 12 DF, p-value: 1.124e-06
In the second kind, the observations follow a nonlinear relationship; a sketch of such data follows.
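Again DT is built off-stage; a sketch of a nonlinear pattern with n = 24, matching the degrees of freedom below (values are assumptions):

set.seed(2019)   # assumed seed
DT <- data.frame(X = seq(0.5, 6, length.out = 24))
DT$Y <- 0.8*DT$X^2 - 1 + rnorm(24, 0, 1)   # quadratic, not linear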
library(ggplot2)
ggplot(DT, aes(x=X, y=Y)) +
geom_point(size=1.5) +
geom_smooth(method='lm') +
geom_line(col='red3') +
theme_bw()
##
## Call:
## lm(formula = Y ~ X, data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3606 -1.9695 -0.9249 2.2037 3.2930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.9859 1.0558 -7.564 1.48e-07 ***
## X 6.7821 0.2905 23.346 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.641 on 22 degrees of freedom
## Multiple R-squared: 0.9612, Adjusted R-squared: 0.9594
## F-statistic: 545 on 1 and 22 DF, p-value: < 2.2e-16
In the third kind, most observations follow a linear pattern, but a few lie far from the regression line; a sketch of such data follows.
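One more sketch of such data, with n = 45 to match the output below (values are assumptions):

set.seed(2019)   # assumed seed
x0 <- runif(40, 0, 2)   # 40 points on a line...
DT <- data.frame(X = c(x0, runif(5, 0.5, 1.5)),
                 Y = c(1.5 + 3*x0 + rnorm(40, 0, 0.5), rep(8, 5)))   # ...plus 5 far off it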
library(ggplot2)
ggplot(DT, aes(x=X, y=Y)) +
geom_point(size=1.5) +
geom_smooth(method='lm') +
theme_bw()
##
## Call:
## lm(formula = Y ~ X, data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3820 -1.6374 -0.5803 0.8086 5.3187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4659 0.3212 4.563 4.17e-05 ***
## X 2.9021 0.2758 10.522 1.80e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.126 on 43 degrees of freedom
## Multiple R-squared: 0.7202, Adjusted R-squared: 0.7137
## F-statistic: 110.7 on 1 and 43 DF, p-value: 1.803e-13
These three plots show that \(R^2\) can only be interpreted as the strength of association between the independent and the dependent variable; it should not be read as the degree to which the model fits the data, nor used as the criterion for judging a model.
1. Using the mtcars data, draw a scatter plot of hp and mpg with a regression line, and compute the regression coefficient.
2. Using the babies data in UsingR, estimate the effect of the mother's age (age) and height (ht) on the baby's weight (wt). Note that these variables are only valid within certain ranges.
3. In the nym.2002 data, do gender and age affect the time needed to finish the marathon?
## Last updated 06/01/2019