本週上課將介紹迴歸的基本原理,配合之前討論的研究設計,可以描述兩個變數之間的關係。例如:
library(UsingR)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: HistData
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
##
## cancer
ggplot(nym.2002, aes(age, time)) +
geom_point(col="#808000", size=2) +
geom_smooth(method="lm") +
theme_bw() +
theme(axis.title=element_text(size=12),
axis.text = element_text(size=12))
\(E[Y|X]=\beta_{0}+X\beta_{1}\)
sales \(\approx {\beta}_{0}+{\beta}_{1}\cdot\) TV
site="http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv"
Advertising<-read.csv(file = url(site), sep=',', header = TRUE)
m1<-lm(sales ~ TV, data=Advertising)
summary(m1)
##
## Call:
## lm(formula = sales ~ TV, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
資料來源:James et al., 2013畫圖如下:
library(ggplot2)
A1<-ggplot(Advertising, aes(x=TV, y=sales)) +
labs(y = 'Sales', x = 'TV') +
geom_point(col="saddlebrown") +
geom_smooth(method="lm", col='blue', se=F, size=1.5) ; A1
加上預測值以及觀察值與預測值之間的殘差(參考BLOGR):
Advertising$predicted <- predict(m1) # Save the predicted values
Advertising$residuals <- residuals(m1) # Save the residual values
# Quick look at the actual, predicted, and residual values
#library(dplyr)
#Advertising %>% select(predicted, residuals) %>% head()
ggplot(Advertising, aes(x=TV, y=sales)) +
geom_point(color="saddlebrown") +
geom_point(aes(y = predicted), shape = 1, color="red") +
geom_segment(aes(xend=TV, yend=predicted), color="lightgray") +
theme_bw()
library(foreign)
iq<-read.dta("kidiq.dta")
f1 <- with(iq, lm(kid_score ~ mom_hs) )
summary(f1)
##
## Call:
## lm(formula = kid_score ~ mom_hs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.55 -13.32 2.68 14.68 58.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.548 2.059 37.670 < 2e-16 ***
## mom_hs 11.771 2.322 5.069 5.96e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.85 on 432 degrees of freedom
## Multiple R-squared: 0.05613, Adjusted R-squared: 0.05394
## F-statistic: 25.69 on 1 and 432 DF, p-value: 5.957e-07
估計結果寫成:
kid_score = \(77.548 + 11.77\cdot\) mom_hs
f2 <- with(iq, lm(kid_score ~ mom_iq))
summary(f2)
##
## Call:
## lm(formula = kid_score ~ mom_iq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.753 -12.074 2.217 11.710 47.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.79978 5.91741 4.36 1.63e-05 ***
## mom_iq 0.60997 0.05852 10.42 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.27 on 432 degrees of freedom
## Multiple R-squared: 0.201, Adjusted R-squared: 0.1991
## F-statistic: 108.6 on 1 and 432 DF, p-value: < 2.2e-16
估計結果寫成:
kid_score = \(25.799+ 0.609\cdot\) mom_iq
\(\widehat{\mathrm {Kid's\hspace{.25em} Scores}}=\hat{\beta}_{0}+\hat{\beta}_{1}(\mathrm {Mom\hspace{.25em} has\hspace{.25em} HS\hspace{.25em} degree})+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ})\)
當\(X_{1}=1\)(母親有高中學歷), \[ \begin{eqnarray} \widehat{\mathrm {Kid's\hspace{.25em} Scores}} & = &\hat{\beta}_{0}+\hat{\beta}_{1}\cdot 1+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \\ & = &\hat{\beta}_{0}+\hat{\beta}_{1}+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \end{eqnarray} \]
當\(X_{1}=0\)(母親沒有高中學歷),
\[
\begin{eqnarray}
\widehat{\mathrm {Kid's\hspace{.25em} Scores}} & = &\hat{\beta}_{0}+\hat{\beta}_{1}\cdot 0+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \\
& = &\hat{\beta}_{0}+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ})
\end{eqnarray}
\]
也就是說,這兩個模型的斜率一樣,但是當\(X_{1}=1\)時,截距多了 \(\hat{\beta}_{1}\)。
進行估計如下:
f3 <- with(iq, lm(kid_score ~ mom_hs + mom_iq))
summary(f3)
##
## Call:
## lm(formula = kid_score ~ mom_hs + mom_iq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.873 -12.663 2.404 11.356 49.545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.73154 5.87521 4.380 1.49e-05 ***
## mom_hs 5.95012 2.21181 2.690 0.00742 **
## mom_iq 0.56391 0.06057 9.309 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.14 on 431 degrees of freedom
## Multiple R-squared: 0.2141, Adjusted R-squared: 0.2105
## F-statistic: 58.72 on 2 and 431 DF, p-value: < 2.2e-16
估計得到的模型為:\[\hat{Y}=25.73+5.95\cdot 高中畢業與否 +0.56\cdot 母親智商\]
係數分別代表的意義為:
ggplot(iq, aes(x=mom_iq, y=kid_score, group=mom_hs)) +
geom_point(aes(color=factor(mom_hs)), shape=1, size=2) +
geom_abline( slope=0.56, intercept=25.73, col="indianred", size=1.2) +
geom_abline(slope=0.6, intercept=31.73, col="blue", size=1.2) +
scale_color_discrete(name = "Mother with High School Degree",
labels=c("No High School Degree","High School Degree")) +
theme_bw() +
theme(legend.position = "top") +
labs(x="Mother's IQ", y="Kid's Score")
fit4 <- with(iq, lm(kid_score ~ mom_iq + mom_age))
summary(fit4)
##
## Call:
## lm(formula = kid_score ~ mom_iq + mom_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.941 -12.493 2.257 11.614 46.711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.59625 9.08397 1.937 0.0534 .
## mom_iq 0.60357 0.05874 10.275 <2e-16 ***
## mom_age 0.38813 0.32620 1.190 0.2348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.26 on 431 degrees of freedom
## Multiple R-squared: 0.2036, Adjusted R-squared: 0.1999
## F-statistic: 55.08 on 2 and 431 DF, p-value: < 2.2e-16
\[ \hat{Y}=17.6+0.6\cdot 母親智商 +0.39\cdot 母親年齡 \]
雙連續變數的迴歸係數分別代表的意義為:
因為
\[ \mathcal{\frac{\partial (y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2})}{\partial X_{1}}}=\beta_{1} \]
所以\(\hspace{.3em}\beta_{1}\hspace{.3em}\)代表當其他變數在同一個值或是水準時,對於\(\hspace{.3em}y\hspace{.3em}\)的淨作用。
fit5 <- with(iq, lm(kid_score ~ mom_iq + mom_hs+mom_iq:mom_hs))
fit5.1<-with(iq, lm(kid_score[mom_hs==1] ~ mom_iq[mom_hs==1]))
fit5.2<-with(iq, lm(kid_score[mom_hs==0] ~ mom_iq[mom_hs==0]))
Dependent variable: | |||
kid_score[mom_hs == 1] | kid_score[mom_hs == 0] | kid_score | |
(1) | (2) | (3) | |
mom_iq[mom_hs == 1] | 0.485*** | ||
(0.065) | |||
mom_iq[mom_hs == 0] | 0.969*** | ||
(0.157) | |||
mom_iq | 0.969*** | ||
(0.148) | |||
mom_hs | 51.268*** | ||
(15.338) | |||
mom_iq:mom_hs | -0.484*** | ||
(0.162) | |||
Constant | 39.786*** | -11.482 | -11.482 |
(6.663) | (14.601) | (13.758) | |
Observations | 341 | 93 | 434 |
R2 | 0.143 | 0.294 | 0.230 |
Adjusted R2 | 0.140 | 0.286 | 0.225 |
Residual Std. Error | 17.664 (df = 339) | 19.073 (df = 91) | 17.971 (df = 430) |
F Statistic | 56.422*** (df = 1; 339) | 37.874*** (df = 1; 91) | 42.839*** (df = 3; 430) |
Note: | p<0.1; p<0.05; p<0.01 |
重新畫圖:
ggplot(iq, aes(x=mom_iq, y=kid_score, group=mom_hs)) +
geom_point(aes(color=factor(mom_hs)), shape=1, size=2) +
geom_abline( slope=0.969, intercept=-11.482, col="indianred", size=1.2) +
geom_abline(slope=0.485, intercept=39.786, col="blue", size=1.2) +
scale_color_discrete(name = "Mother with High School Degree",
labels=c("No High School Degree","High School Degree")) +
theme_bw() +
theme(legend.position = "top") +
labs(x="Mother's IQ", y="Kid's Score")
\[\hat{u} \equiv \tilde{u} = \frac{1}{n}\mathcal {\sum\limits_{i=1}^n y_{i}} \]
Residual Sum of Squares
\[
\begin{eqnarray}
\sum _{i}^n(\hat{y}_{i}-y_{i})^2 &=&\sum _{i}^n\hat{u}^2\\
&= &SSR \\
& = & Var[\hat{u}]
\end{eqnarray}
\]
Explained Sum of Squares
\[
\begin{eqnarray}
\sum _{i}^n(\hat{y}-\bar{y})^2 & = & SSE\\
& = & Var[\hat{y}]
\end{eqnarray}
\]
先創造兩個變數,並且計算相關係數:
set.seed(02138)
X1 <-rnorm(100, 0, 1); X2 <- X1+rnorm(100, 0, 1)
print(cor(X1, X2))
## [1] 0.6743386
再創造Y,並且估計迴歸模型:
#Y
Y <- X1+rt(100, 40)
f1<-lm(Y ~ X1)
f2<-lm(Y ~ X2)
summary(f1); summary(f2)
##
## Call:
## lm(formula = Y ~ X1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7110 -0.6905 0.1052 0.7826 2.7231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006661 0.108008 0.062 0.951
## X1 1.079323 0.108350 9.961 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.076 on 98 degrees of freedom
## Multiple R-squared: 0.5031, Adjusted R-squared: 0.498
## F-statistic: 99.23 on 1 and 98 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Y ~ X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4921 -0.9750 0.0051 1.0920 2.9693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04759 0.13897 0.342 0.733
## X2 0.50463 0.10171 4.961 2.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.365 on 98 degrees of freedom
## Multiple R-squared: 0.2008, Adjusted R-squared: 0.1926
## F-statistic: 24.62 on 1 and 98 DF, p-value: 2.947e-06
觀察散佈圖:
# graph
par(xpd=NA, mar=par()$mar+c(1, 0, 0, 0))
plot(X1, Y, type='n', xlab="X1, X2")
points(X1, Y, pch=16)
points(X2, Y, col="red", pch=16)
legend("bottomright", c("X1","X2"), col=c("black", "red"),
pch=c(16, 16), bty="n")
\[ R^2_{Adj}=1-\frac{(1-R^2)(n-1)}{n-k-1} \]
mtcars
這筆資料的hp以及mpg繪製散佈圖與迴歸線圖,並且計算迴歸係數
UsingR
的babies
這筆資料,估計年齡(age)與身高(ht)對於嬰兒重量(wt)的影響。注意這些變數有一定的範圍。
nym.2002
資料中,性別與年齡會影響完成馬拉松的時間嗎?