根據上述假設,有X,Z兩個自變數的複迴歸模型可寫成:
\[E[Y|X,Z]=\beta_{0}+\beta_{1}X+\beta_{2}Z\]
\[\hat{\beta}_{1}=\frac{\sum (z-\bar{z})^2\sum (x-\bar{x})(y-\bar{y})-\sum (x-\bar{x})(z-\bar{z})\sum (z-\bar{z})(y-\bar{y})}{\sum (x-\bar{x})^2\sum(z-\bar{z})^2-(\sum (x-\bar{x})(z-\bar{z}))^2}\]
\[\hat{\beta}_{1}=\frac{\sum_{i}^{n}\hat{r}_{xz,i}y_{i}}{\sum_{i}^{n}\hat{r}_{xz,i}^2}\]
其中\(\hat{r}_{xz,i}\)是用\(Z\)預測\(X\)所得到的殘差:
\[X=\lambda+\delta Z+r_{xz}\]
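以下用模擬資料驗證上述兩個公式與\(\tt{lm()}\)的估計結果一致(資料與數值皆為假設,僅供示意):
# 模擬資料:驗證beta_1的離差公式與殘差公式(假設性範例)
set.seed(1234)
n <- 100
z <- rnorm(n)
x <- 0.5*z + rnorm(n)
y <- 1 + 2*x - z + rnorm(n)
# (1) 依離差公式計算
num <- sum((z-mean(z))^2)*sum((x-mean(x))*(y-mean(y))) -
  sum((x-mean(x))*(z-mean(z)))*sum((z-mean(z))*(y-mean(y)))
den <- sum((x-mean(x))^2)*sum((z-mean(z))^2) - (sum((x-mean(x))*(z-mean(z))))^2
num/den
# (2) 依殘差公式計算:先用z預測x,取殘差
r.xz <- residuals(lm(x ~ z))
sum(r.xz*y)/sum(r.xz^2)
# (3) 直接用lm()估計,三者應相同
coef(lm(y ~ x + z))["x"]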
殘差項的平均值等於0,也就是\(\bar{\hat{u}}=0\)
proof: \[\begin{align*} \bar{\hat{u}}&=\bar{y}-\hat{\beta}_{0}-\hat{\beta}_{1}\bar{x}-\hat{\beta}_{2}\bar{z} \\ & = \bar{y}-(\bar{y}-\hat{\beta}_{1}\bar{x}-\hat{\beta}_{2}\bar{z})- \hat{\beta}_{1}\bar{x}-\hat{\beta}_{2}\bar{z}=0 \end{align*}\]
這個特性可導出\(\bar{y} = \bar{\hat{y}}\)
proof: \[\begin{align*} y_{i} & =\hat{y}_{i}+\hat{u}_{i}\Rightarrow \bar{y}\\ &=n^{-1}\sum_{i=1}^{n}y_{i}=n^{-1}\sum_{i=1}^{n}(\hat{y}_{i}+\hat{u}_{i})\\ &=\bar{\hat{y}}+\bar{\hat{u}}=\bar{\hat{y}} \end{align*}\]
\(Cov(x,\hat{u})=Cov(z,\hat{u})=0\)
可導出\(Cov(\hat{y},\hat{u})=0\)
(\(\bar{x},\bar{y},\bar{z}\))一定會落在迴歸平面上。
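以下用模擬資料快速檢查這幾個性質(資料為假設,僅供示意):
# 模擬資料:檢查殘差的性質(假設性範例)
set.seed(02139)
x <- rnorm(50); z <- rnorm(50)
y <- 1 + 0.5*x - 0.3*z + rnorm(50)
m <- lm(y ~ x + z)
mean(resid(m))                        # 殘差的平均值約為0
cov(x, resid(m)); cov(z, resid(m))    # 自變數與殘差的共變數約為0
cov(fitted(m), resid(m))              # 預測值與殘差的共變數約為0
mean(fitted(m)) - mean(y)             # y-bar等於y-hat-bar
predict(m, data.frame(x = mean(x), z = mean(z))) - mean(y)   # (x-bar, z-bar, y-bar)落在迴歸平面上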
\[\bf{y}=\bf{X}\beta+\bf{\epsilon}\]
\[ \bf{y}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} \]
\[ X=\begin{bmatrix} 1 & X_{11} & X_{12} &\ldots & X_{1k} \\ 1 & X_{21} & X_{22} &\ldots & X_{2k} \\ \vdots & \vdots & \vdots &\ldots &\vdots \\ 1 & X_{n1} & X_{n2} &\ldots & X_{nk}\\ \end{bmatrix} \]
\[ \bf{\beta}= (\beta_{0},\beta_{1},\beta_{2},\ldots,\beta_{k})'=\begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{k} \end{bmatrix} \]
\[ \bf{\epsilon}= \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{n} \end{bmatrix} \]
\[\begin{align*} \bf{\hat{y}}&=\bf{X}\hat{\beta} \tag{4.1} \end{align*}\]
\[\begin{align*} \bf{X'}(\bf{y}-\bf{X}\hat{\beta}) & =0\\ \Longleftrightarrow \bf{X'y}-\bf{X'X}\hat{\beta}&=0\\ \Longleftrightarrow \bf{X'X}\hat{\beta}&=\bf{X'y} \tag{4.2} \end{align*}\]
\[\begin{align*} \hat{\beta}&=(\bf{X'X})^{-1}\bf{X'y} \tag{4.3} \end{align*}\]
\[(\bf{X'X})\hat{\beta}=\bf{X'}y\]
\[\begin{align*} \bf{y}&=\bf{X}\beta+\bf{\epsilon}\\ &=\bf{X}\hat{\beta}+u \tag{4.4} \end{align*}\]
\[\begin{align*} (\bf{X'X})\hat{\beta}&=\bf{X'}(\bf{X}\hat{\beta}+u)\\ &= (\bf{X'X})\hat{\beta}+\bf{X'}u \end{align*}\]
\[\bf{X'}u=0\]
換句話說,自變數與殘差沒有相關。還可以推導出預測值\(\hat{y}\)與殘差沒有相關。
\[\bf{\hat{y}'}u=0\]
接下來是推導\(\rm{Var}(\hat{\beta})\)。因為:
\[\rm{Var}(\hat{\beta})=E[\hat{\beta}\hat{\beta}']-E[\hat{\beta}]E[\hat{\beta}]'\]
\[\hat{\beta}=(\bf{X'X})^{-1}\bf{X'y}\]
\[\begin{align*} Var(\hat{\beta})&=E[((\bf{X'X})^{-1}\bf{X'y})((\bf{X'X})^{-1}\bf{X'y})']-\beta\beta' \end{align*}\]
經過整理後,\(Var(\hat{\beta})= \sigma^2\bf{(X'X)^{-1}}\),詳細過程如下:
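因為\(E[\hat{\beta}]=\beta\),上式等於\(E[(\hat{\beta}-\beta)(\hat{\beta}-\beta)']\);再假設\(E[\epsilon\epsilon']=\sigma^2 I\),可得:
\[\begin{align*} \hat{\beta}-\beta&=(\bf{X'X})^{-1}\bf{X'}(\bf{X}\beta+\epsilon)-\beta=(\bf{X'X})^{-1}\bf{X'}\epsilon\\ \rm{Var}(\hat{\beta})&=E[(\hat{\beta}-\beta)(\hat{\beta}-\beta)']\\ &=(\bf{X'X})^{-1}\bf{X'}E[\epsilon\epsilon']\bf{X}(\bf{X'X})^{-1}\\ &=\sigma^2\bf{(X'X)^{-1}} \end{align*}\]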
其中\(\sigma^2\)可用\(\hat{\sigma}^2\)來估計,也就是殘差項的平方和除以n-k(此處的k是包含截距在內的估計參數個數):
\[\hat{\sigma}^2=\frac{\sum u^2}{n-k}\]
m1<-lm(total ~ expend+ratio, data=UsingR::SAT)
stargazer::stargazer(m1, title='SAT分數迴歸模型',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: total |
|---|---|
| expend | -22.310*** |
| | (7.956) |
| ratio | -2.295 |
| | (4.784) |
| Constant | 1,136.000*** |
| | (107.800) |
| Observations | 50 |
| R2 | 0.149 |
| Adjusted R2 | 0.113 |
| Residual Std. Error | 70.480 (df = 47) |
| F Statistic | 4.114** (df = 2; 47) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
dat<-UsingR::SAT
X<-as.matrix(cbind(1,dat$expend,dat$ratio))
head(X)
## [,1] [,2] [,3]
## [1,] 1 4.405 17.2
## [2,] 1 8.963 17.6
## [3,] 1 4.778 19.3
## [4,] 1 4.459 17.1
## [5,] 1 4.992 24.0
## [6,] 1 5.443 18.4
Y<-as.matrix(dat$total)
head(Y)
## [,1]
## [1,] 1029
## [2,] 934
## [3,] 944
## [4,] 1005
## [5,] 902
## [6,] 980
beta_hat<-solve(t(X)%*%X)%*%t(X)%*%Y
beta_hat
## [,1]
## [1,] 1136.336
## [2,] -22.308
## [3,] -2.295
我們應用\(\hat{\sigma}^2=\frac{\sum u^2}{n-k}\)來計算迴歸係數的標準誤。計算時我們可以從\(\tt{anova(model)}\)找出\(\frac{\sum u^2}{n-k}\),也就是表4.2:
anatable=anova(m1)
knitr::kable(anatable, format='simple', caption='迴歸模型變異數分析表')
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| expend | 1 | 39722 | 39722 | 7.9974 | 0.0069 |
| ratio | 1 | 1143 | 1143 | 0.2301 | 0.6337 |
| Residuals | 47 | 233443 | 4967 | NA | NA |
sum((dat$total-fitted(m1))^2)/47
## [1] 4967
var_betaHat <- anova(m1)[[3]][3]* solve(t(X) %*% X)
sqrt(diag(var_betaHat))
## [1] 107.803 7.956 4.784
對照表4.1,可以看出係數的標準誤都相同。
迴歸係數的變異數與共變數矩陣如下:
\[ E[(\hat{\beta}-\beta)(\hat{\beta}-\beta)']=\begin{bmatrix} \rm{var}(\hat{\beta}_{1}) & \rm{cov}(\hat{\beta}_{1},\hat{\beta}_{2}) &\ldots & \rm{cov}(\hat{\beta}_{1},\hat{\beta}_{k}) \\ \rm{cov}(\hat{\beta}_{1},\hat{\beta}_{2}) & \rm{var}(\hat{\beta}_{2}) &\ldots & \rm{cov}(\hat{\beta}_{2},\hat{\beta}_{k}) \\ \vdots & \vdots & \vdots &\vdots \\ \rm{cov}(\hat{\beta}_{k},\hat{\beta}_{1}) & \rm{cov}(\hat{\beta}_{k},\hat{\beta}_{2}) &\ldots & \rm{var}(\hat{\beta}_{k})\\ \end{bmatrix} \]
\[\hat{\beta}\sim N(\beta, \sigma^2\bf{(X'X)^{-1}})\]
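以前面SAT資料的m1為例,也可以直接用\(\tt{vcov(model)}\)取得\(\hat{\sigma}^2\bf{(X'X)^{-1}}\),並計算t統計量(以下僅為示意):
vcov(m1)                         # 應與前面手算的var_betaHat相同
coef(m1)/sqrt(diag(vcov(m1)))    # 係數除以標準誤即為t統計量,可對照表4.1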
library(foreign)
file<-here::here('data','kidiq.dta')
iq<-read.dta(file)
f1 <- with(iq, lm(kid_score ~ mom_hs) )
stargazer::stargazer(f1, title='母親學歷與學生成績',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: kid_score |
|---|---|
| mom_hs | 11.770*** |
| | (2.322) |
| Constant | 77.550*** |
| | (2.059) |
| Observations | 434 |
| R2 | 0.056 |
| Adjusted R2 | 0.054 |
| Residual Std. Error | 19.850 (df = 432) |
| F Statistic | 25.690*** (df = 1; 432) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
f2 <- with(iq, lm(kid_score ~ mom_iq))
stargazer::stargazer(f2, title='母親智力與學生成績',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: kid_score |
|---|---|
| mom_iq | 0.610*** |
| | (0.059) |
| Constant | 25.800*** |
| | (5.917) |
| Observations | 434 |
| R2 | 0.201 |
| Adjusted R2 | 0.199 |
| Residual Std. Error | 18.270 (df = 432) |
| F Statistic | 108.600*** (df = 1; 432) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
\(\widehat{\mathrm {Kid's\hspace{.25em} Scores}}=\hat{\beta}_{0}+\hat{\beta}_{1}(\mathrm {Mom\hspace{.25em} has\hspace{.25em} HS\hspace{.25em} degree})+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ})\)
\[\begin{align*} \widehat{\mathrm {Kid's\hspace{.25em} Scores}} & = \hat{\beta}_{0}+\hat{\beta}_{1}\cdot 1+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \\ & = \hat{\beta}_{0}+\hat{\beta}_{1}+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \end{align*}\]
\[\begin{align*} \widehat{\mathrm {Kid's\hspace{.25em} Scores}} & = \hat{\beta}_{0}+\hat{\beta}_{1}\cdot 0+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \\ & = \hat{\beta}_{0}+\hat{\beta}_{2}(\mathrm {Mom's\hspace{.25em} IQ}) \end{align*}\]
也就是說,這兩個模型的斜率一樣,但是當\(X_{1}=1\)時,截距多了 \(\hat{\beta}_{1}\)。
進行估計如下:
f3 <- with(iq, lm(kid_score ~ mom_hs + mom_iq))
stargazer::stargazer(f3, title='母親智力及學歷與學生成績',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: kid_score |
|---|---|
| mom_hs | 5.950*** |
| | (2.212) |
| mom_iq | 0.564*** |
| | (0.061) |
| Constant | 25.730*** |
| | (5.875) |
| Observations | 434 |
| R2 | 0.214 |
| Adjusted R2 | 0.210 |
| Residual Std. Error | 18.140 (df = 431) |
| F Statistic | 58.720*** (df = 2; 431) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
估計得到的模型為: \[\hat{Y}=25.73+5.95\cdot \rm{母親高中畢業與否} +0.56\cdot \rm{母親智商}\]
迴歸係數分別代表的意義為:截距25.73是母親沒有高中學歷且智商為0時的預測分數;5.95代表控制母親智商之後,母親有高中學歷的小孩平均多出的分數;0.56代表控制母親學歷之後,母親智商每增加1分,小孩分數平均增加0.56分。圖5.1畫出這兩條斜率相同、截距不同的迴歸線:
ggplot(iq, aes(x=mom_iq, y=kid_score, group=mom_hs)) +
geom_point(aes(color=factor(mom_hs)), shape=1, size=2) +
geom_abline( slope=0.56, intercept=25.73, col="indianred", size=1.2) +
geom_abline(slope=0.56, intercept=25.73+5.95, col="blue", size=1.2) +
scale_color_discrete(name = "Mother with High School Degree",
labels=c("No High School Degree","High School Degree")) +
theme_bw() +
theme(legend.position = "top") +
labs(x="Mother's IQ", y="Kid's Score")
Figure 5.1: 雙變數迴歸模型
fit4 <- with(iq, lm(kid_score ~ mom_iq + mom_age))
stargazer::stargazer(fit4, title='母親智力及年齡與學生成績',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: kid_score |
|---|---|
| mom_iq | 0.604*** |
| | (0.059) |
| mom_age | 0.388 |
| | (0.326) |
| Constant | 17.600* |
| | (9.084) |
| Observations | 434 |
| R2 | 0.204 |
| Adjusted R2 | 0.200 |
| Residual Std. Error | 18.260 (df = 431) |
| F Statistic | 55.080*** (df = 2; 431) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
\[ \hat{Y}=17.6+0.6\cdot \rm{母親智商} +0.39\cdot \rm{母親年齡} \]
\[ \frac{\partial Y}{\partial X_{1}}=\frac{\partial (\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2})}{\partial X_{1}}=\beta_{1} \]
fit5 <- with(iq, lm(kid_score ~ mom_iq + mom_hs+mom_iq:mom_hs))
fit5.1<-with(iq, lm(kid_score[mom_hs==1] ~ mom_iq[mom_hs==1]))
fit5.2<-with(iq, lm(kid_score[mom_hs==0] ~ mom_iq[mom_hs==0]))
library(stargazer)
stargazer(fit5.1, fit5.2, fit5, align=TRUE, title='三個模型比較',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| Dependent variable: | kid_score[mom_hs == 1] | kid_score[mom_hs == 0] | kid_score |
|---|---|---|---|
| | (1) | (2) | (3) |
| mom_iq[mom_hs == 1] | 0.485*** | | |
| | (0.065) | | |
| mom_iq[mom_hs == 0] | | 0.969*** | |
| | | (0.157) | |
| mom_iq | | | 0.969*** |
| | | | (0.148) |
| mom_hs | | | 51.270*** |
| | | | (15.340) |
| mom_iq:mom_hs | | | -0.484*** |
| | | | (0.162) |
| Constant | 39.790*** | -11.480 | -11.480 |
| | (6.663) | (14.600) | (13.760) |
| Observations | 341 | 93 | 434 |
| R2 | 0.143 | 0.294 | 0.230 |
| Adjusted R2 | 0.140 | 0.286 | 0.225 |
| Residual Std. Error | 17.660 (df = 339) | 19.070 (df = 91) | 17.970 (df = 430) |
| F Statistic | 56.420*** (df = 1; 339) | 37.870*** (df = 1; 91) | 42.840*** (df = 3; 430) |
| Note: | | | *p<0.1; **p<0.05; ***p<0.01 |
ggplot(iq, aes(x=mom_iq, y=kid_score, group=mom_hs)) +
geom_point(aes(color=factor(mom_hs)), shape=1, size=2) +
geom_abline( slope=0.969, intercept=-11.482, col="indianred", size=1.2) +
geom_abline(slope=0.485, intercept=39.786, col="blue", size=1.2) +
scale_color_discrete(name = "Mother with High School Degree",
labels=c("No High School Degree","High School Degree")) +
theme_bw() +
theme(legend.position = "top") +
labs(x="Mother's IQ", y="Kid's Score")
Figure 7.1: 有交互作用的迴歸模型
options(contrasts = c("contr.treatment", "contr.poly"))
fit5 <- with(iq, lm(kid_score ~ mom_iq + mom_hs+mom_iq:mom_hs))
library(car)
Anova(fit5, type="II")
## Anova Table (Type II tests)
##
## Response: kid_score
## Sum Sq Df F value Pr(>F)
## mom_iq 28504 1 88.26 <2e-16 ***
## mom_hs 2380 1 7.37 0.0069 **
## mom_iq:mom_hs 2878 1 8.91 0.0030 **
## Residuals 138879 430
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
因為交互作用達到統計上的顯著水準,所以拿掉這個交互作用項會影響模型與資料的符合程度,因此可以說兩個群體之間有不同的斜率。
同樣的方法適用在glm模型。
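例如以下的假設性示意:把kid_score以中位數切成二元變數hi_score(此變數是為了示範自行建立的),改用logistic迴歸估計後,同樣以\(\tt{car::Anova()}\)檢定交互作用項:
# 假設性示意:以logistic迴歸檢定交互作用項
iq$hi_score <- as.numeric(iq$kid_score > median(iq$kid_score))
g1 <- glm(hi_score ~ mom_iq + mom_hs + mom_iq:mom_hs, data = iq, family = binomial)
car::Anova(g1, type = "II")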
National Institute of Statistical Sciences 在這篇文章探討如何測試迴歸模型有二元變數時的斜率相等假設,請大家參考。

# 變數的淨作用
我們可以從淨作用的角度詮釋迴歸係數。
假設迴歸模型為:
\[Y=\beta_{0}+\beta_{1}X+\beta_{2}Z+\epsilon\]
估計的迴歸係數可寫成\[\hat{\beta}_{1}=\frac{\sum_{i}^{n}\hat{r}_{xz,i}y_{i}}{\sum_{i}^{n}\hat{r}_{xz,i}^2}\]
其中\(\hat{r}_{xz,i}\)代表用\(Z\)預測\(X\)的殘差:
\[X=\lambda+\delta Z+r_{xz}\]
以\(\tt{pscl}\)套件的UKHouseOfCommons資料為例,v2是工黨的得票率,v3是自由民主黨的得票率,labinc代表該選區現任者為工黨議員。工黨得票率受到自由民主黨的得票率影響,因為兩個黨的意識形態很接近。此外,現任者如果是工黨,這次選舉的得票率應該比較好。
先估計完整的複迴歸模型如表7.2:
library(pscl); library(stargazer)
labormodel<-lm(v2 ~ v3 + labinc , data=UKHouseOfCommons)
stargazer::stargazer(labormodel, title='工黨得票率迴歸模型',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: v2 |
|---|---|
| v3 | -0.998*** |
| | (0.041) |
| labinc | 0.206*** |
| | (0.009) |
| Constant | 0.495*** |
| | (0.010) |
| Observations | 521 |
| R2 | 0.770 |
| Adjusted R2 | 0.769 |
| Residual Std. Error | 0.085 (df = 518) |
| F Statistic | 865.700*** (df = 2; 518) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
\[\rm{v2}=0.495-0.998\cdot \rm{v3}+0.206\cdot \rm{labinc}\]
library(pscl)
m2 <- lm(v3 ~ labinc , data=UKHouseOfCommons)
resd.m2 <- residuals(m2)
m3 <- lm(v2 ~ resd.m2, data=UKHouseOfCommons)
stargazer::stargazer(m2, m3, align=T, title='用殘差值建立的工黨得票率迴歸模型',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| Dependent variable: | v3 | v2 |
|---|---|---|
| | (1) | (2) |
| labinc | -0.080*** | |
| | (0.009) | |
| resd.m2 | | -0.998*** |
| | | (0.073) |
| Constant | 0.211*** | 0.361*** |
| | (0.005) | (0.007) |
| Observations | 521 | 521 |
| R2 | 0.131 | 0.263 |
| Adjusted R2 | 0.129 | 0.261 |
| Residual Std. Error (df = 519) | 0.091 | 0.153 |
| F Statistic (df = 1; 519) | 78.250*** | 185.000*** |
| Note: | | *p<0.1; **p<0.05; ***p<0.01 |
\[v3=0.361-0.998\cdot \mathrm{residual\hspace{.1cm}of\hspace{.1cm}model\hspace{.1cm}2}\]
library(pscl)
m2 <- lm(v3 ~ labinc , data=UKHouseOfCommons)
resd.m2 <- UKHouseOfCommons$v3-fitted(m2)
beta1<-sum(resd.m2*UKHouseOfCommons$v2)/sum((resd.m2)^2)
beta1
## [1] -0.9978
\[\begin{align*} R^2 & =\frac{\sum\limits_{i=1}^{n}(\hat{Y}-\bar{Y})^2}{\sum\limits_{i=1}^{n}(Y-\bar{Y})^2}\\ &=1-\frac{\sum\limits_{i=1}^{n}(Y-\hat{Y})^2}{\sum\limits_{i=1}^{n}(Y-\bar{Y})^2}\\ &= 1-\frac{\sum\limits_{i=1}^{n}e^2}{\sum\limits_{i=1}^{n}(Y-\bar{Y})^2}\\ &=\frac{SSR}{SST} \tag{8.1} \end{align*}\]
\[\rm{adj}\hspace{.2em}R^2=1-\frac{\sum\limits_{i=1}^{n}(Y-\hat{Y})^2}{\sum\limits_{i=1}^{n}(Y-\bar{Y})^2}\times\frac{n-1}{n-k-1}\]
\[ X=\begin{bmatrix} 1 & X_{11} & X_{21} \\ 1 & X_{12} & X_{22} \\ \vdots & \vdots &\vdots \\ 1 & X_{1n} & X_{2n} \\ \end{bmatrix} \]
\[ \hat{\beta}= (\hat{\beta}_{0},\hat{\beta}_{1},\hat{\beta}_{2})'=\begin{bmatrix} \hat{\beta}_{0} \\ \hat{\beta}_{1} \\ \hat{\beta}_{2} \end{bmatrix} \]
\[ \begin{bmatrix} \hat{\beta}_{0}\times 1 +X_{11}\hat{\beta}_{1} + X_{21}\hat{\beta}_{2} \\ \hat{\beta}_{0}\times 1 + X_{12}\hat{\beta}_{1} + X_{22}\hat{\beta}_{2} \\ \vdots\\ \hat{\beta}_{0}\times 1 +X_{1n}\hat{\beta}_{1} + X_{2n}\hat{\beta}_{2} \end{bmatrix} \]
set.seed(02139)
#data
X1 <- rnorm(10, 0, 1)+runif(10, 0, 1)
X2 <- rnorm(10, 0, 1)+runif(10, 0, 0.5)
Y <- rnorm(10, 0, 1)
#model
M0 <-lm(Y ~ X1 + X2)
#predicted value
as.vector(t(M0$coefficients[c(1, 2,3)])%*%t(cbind(1,X1,X2)))
## [1] -0.4156 -0.4394 -0.5445 -0.8770 -0.1836 -0.6284 -0.2466 -0.5551 -1.0363
## [10] -0.4813
R可以直接產生預測值,也就是\(\tt{predict(model)}\):
as.vector(predict(M0))
## [1] -0.4156 -0.4394 -0.5445 -0.8770 -0.1836 -0.6284 -0.2466 -0.5551 -1.0363
## [10] -0.4813
#calculate R2 step-by-step
y = pscl::UKHouseOfCommons$v2
mean.y<-mean(y)
sum.res<-sum((y-labormodel$fitted.values)^2)
var.y <- sum((y - mean.y)^2)
r2 <- 1- (sum.res/var.y); r2
## [1] 0.7697
#function
r2_general <-function(preds, actual){
return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
preds<-labormodel$fitted.values
actual <- y
r2_general(preds,y)
## [1] 0.7697
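調整後的\(R^2\)也可以依照前面的公式自行計算,並與\(\tt{summary()}\)的結果比較(假設preds、actual與labormodel仍在工作環境中):
# 依公式計算adjusted R squared
n <- length(actual)
k <- 2                       # 自變數個數
1 - (sum((preds - actual)^2)/sum((actual - mean(actual))^2))*(n - 1)/(n - k - 1)
summary(labormodel)$adj.r.squared   # 兩者應相同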
\[F=\frac{MSR}{MSE}\sim F_{k,n-k-1}\]
\[R^2=1-\frac{1}{1+\frac{F\times k}{n-k-1}}\]
\[F=\frac{R^2}{1-R^2}\times \frac{n-k-1}{k}\]
#calculate F step-by-step
#function
r2_general <-function(preds, actual){
return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
#model
labormodel<-lm(v2 ~ v3 + labinc , data=UKHouseOfCommons)
#R2
preds<-labormodel$fitted.values
actual <- pscl::UKHouseOfCommons$v2
r2=r2_general(preds, actual)
cat('R squared:', r2, '\n')
## R squared: 0.7697
n=nrow(UKHouseOfCommons);n
## [1] 521
k=2
r2new=r2/(1-r2)
#F2
F_stat<-function(r2new, n, k){
return((r2new*(n-k-1)/k))
}
cat('F-statistics:', F_stat(r2new, n, k))
## F-statistics: 865.7
以\(\tt{airquality}\)資料為例:
fit2 <- with(airquality, lm(Temp ~ Wind+factor(Month)))
stargazer::stargazer(fit2, title='氣溫的迴歸模型',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: Temp |
|---|---|
| Wind | -0.743*** |
| | (0.149) |
| factor(Month)6 | 12.540*** |
| | (1.594) |
| factor(Month)7 | 16.360*** |
| | (1.618) |
| factor(Month)8 | 16.320*** |
| | (1.624) |
| factor(Month)9 | 10.280*** |
| | (1.596) |
| Constant | 74.190*** |
| | (2.054) |
| Observations | 153 |
| R2 | 0.588 |
| Adjusted R2 | 0.574 |
| Residual Std. Error | 6.175 (df = 147) |
| F Statistic | 42.030*** (df = 5; 147) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
y=airquality$Temp
mean.y<-mean(y)
sum.res<-sum((y - fit2$fitted.values)^2)
var.y <- sum((y - mean.y)^2)
r2 <- 1- (sum.res/var.y); r2
## [1] 0.5884
#function
r2_general <-function(preds, actual){
return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
preds<-fit2$fitted.values
actual <- y
r2_general(preds,y)
## [1] 0.5884
\[\rm{wt} \sim gestation+age+ht+wt1+dage+dht+dwt\]
library(UsingR)
wt.lm1 <- with(babies, lm(wt ~ gestation+age+ht+wt1+dage+dht+dwt))
stargazer(wt.lm1, align=TRUE, title='新生兒體重迴歸模型1',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: wt |
|---|---|
| gestation | 0.013* |
| | (0.007) |
| age | 0.056 |
| | (0.102) |
| ht | 0.511*** |
| | (0.123) |
| wt1 | -0.006 |
| | (0.004) |
| dage | 0.024 |
| | (0.077) |
| dht | 0.123 |
| | (0.133) |
| dwt | -0.004 |
| | (0.005) |
| Constant | 73.040*** |
| | (11.520) |
| Observations | 1,236 |
| R2 | 0.022 |
| Adjusted R2 | 0.016 |
| Residual Std. Error | 18.090 (df = 1228) |
| F Statistic | 3.873*** (df = 7; 1228) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
library(UsingR)
wt.lm2 <- lm(wt ~ gestation+age+ht+wt1+dage+dht+dwt, data = babies,
subset = (gestation<350 & age< 99 & ht< 99 & wt1< 999 & dage< 99 & dht< 99 & dwt < 999))
stargazer(wt.lm2, align=TRUE, title='新生兒體重迴歸模型2',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: wt |
|---|---|
| gestation | 0.462*** |
| | (0.040) |
| age | 0.138 |
| | (0.188) |
| ht | 1.216*** |
| | (0.285) |
| wt1 | 0.029 |
| | (0.034) |
| dage | 0.059 |
| | (0.165) |
| dht | -0.066 |
| | (0.270) |
| dwt | 0.078** |
| | (0.033) |
| Constant | -105.500*** |
| | (23.350) |
| Observations | 700 |
| R2 | 0.215 |
| Adjusted R2 | 0.207 |
| Residual Std. Error | 16.410 (df = 692) |
| F Statistic | 27.090*** (df = 7; 692) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
tmp <- data.frame(fv=fitted(wt.lm2), rv=resid(wt.lm2))
ggplot2::ggplot(data=tmp, aes(x=fv, y=rv)) +
geom_point() +
geom_hline(yintercept = 0, col='#FFAA11') +
labs(x='fitted value', y='residual value', caption='UsingR::babies')
Figure 9.1: 新生兒體重迴歸模型2的檢驗
\[\rm{logit}(p)=\rm{log}\frac{p}{1-p}\]
\[(0,1)\rightarrow(-\infty, \infty)\]
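以下簡單示範這個轉換如何把(0,1)之間的機率映射到整個實數線:
# logit轉換:(0,1) -> (-Inf, Inf)
p <- c(0.01, 0.1, 0.5, 0.9, 0.99)
log(p/(1 - p))      # 依定義計算
qlogis(p)           # R內建的logistic分配分位數函數,結果相同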
set.seed(02139)
y<-rnorm(30, 0, 1)
car::qqPlot(y)
Figure 10.1: qqplot散佈圖
## [1] 16 1
car::qqPlot(carData::Prestige$income)
Figure 10.2: qqplot散佈圖1
## [1] 2 24
\[\begin{align*} x'=\begin{cases} \frac{x^p-1}{p} & \rm{when}\hspace{.3em} p\neq 0\\ \rm{log}(x) & \rm{when}\hspace{.3em} p = 0 \end{cases} \tag{10.1} \end{align*}\]
library(car)
par(mfrow=c(2,2))
for (p in c(-1/2, 1/3, 1/2, 1)){
qqPlot(bcPower(carData::Prestige$income, p),
ylab=paste('transformed income, p=', p))
}
Figure 10.3: qqplot散佈圖2
Figure 10.4: qqplot散佈圖2
Figure 10.5: qqplot散佈圖2
Figure 10.6: qqplot散佈圖2
p=1/2
y=carData::Prestige$income
qqPlot((y^p-1)/p)
Figure 10.7: qqplot散佈圖3
## [1] 2 24
par(mfrow=c(2, 3))
for (p1 in c(0, 1)){
for (p2 in c(1/3, 1/2, 2/3)){
scatterplot(bcPower(carData::UN$ppgdp,p1), bcPower(carData::UN$infantMortality,p2),
xlab=paste('transformed PP GDP, p=', p1), ylab=paste('transformed infant mortality, p=',
sprintf("%.2f",p2)))
}
}
Figure 10.8: 雙變數轉換後散佈圖
Figure 10.9: 雙變數轉換後散佈圖
Figure 10.10: 雙變數轉換後散佈圖
Figure 10.11: 雙變數轉換後散佈圖
Figure 10.12: 雙變數轉換後散佈圖
Figure 10.13: 雙變數轉換後散佈圖
dt<-here::here('data','killwound.txt')
mili <- read.table(dt, sep=',', header = T)
#with(mili, hist(Wounded, breaks = 20,
# xlim=c(0, 7e+05)))
ggpubr::gghistogram(mili,x='Wounded', bins = 40,
fill='lightgray',add='mean', rug = TRUE)
Figure 10.14: 受傷人數直方圖
library(lattice)
w_order<-order(mili$Wounded, decreasing=F)
mili_order<-mili[w_order,]
dotplot(mili_order$Location ~ mili_order$Wounded, cex=1.1)
Figure 10.15: 受傷人數點狀圖1
dotchart(mili_order$Wounded, labels=mili_order$Location, pch=16, cex=1,
         xlab="Number of American Soldiers Wounded")
Figure 10.16: 受傷人數點狀圖2
dt<-here::here('data','killwound.txt')
mili <- read.table(dt, sep=',', header = T)
with(mili, plot(Wounded ~ Killed, main="", xlab="Number of American Soldiers Killed",
     ylab="Number of American Soldiers Wounded", pch=16))
mod1<-with(mili,lm(Wounded ~ Killed))
abline(mod1, col="red", lwd=2)
legend("topleft",legend=expression(paste("Wounded=-0.002+1.21 Killed"," ",R^2==0.86)),
lty=1,col="red")
Figure 10.17: 受傷與死亡人數散佈圖1
log.wounded <- log(mili$Wounded)
log.killed<-bcPower(mili$Killed, 0)
newdt<-data.frame(log.wounded, log.killed,mili$Wounded,mili$Killed,mili$Location)
is.na(newdt)<-sapply(newdt,is.infinite)
#newdt
worder<-order(newdt$log.wounded,decreasing=F)
nnewdt<-newdt[worder,]
#jpeg("logwounded.jpg")
dotchart(nnewdt$log.wounded, labels=nnewdt$mili.Location, pch=16, cex=1,
         xlab="log(Number of American Soldiers Wounded)")
Figure 10.18: log轉換變數之點狀圖
with(newdt, plot(log.wounded ~ log.killed, main="", xlab="log(Number of American Soldiers Killed)",
     ylab="log(Number of American Soldiers Wounded)", pch=16))
mod3<-with(newdt,lm(log.wounded ~ log.killed))
abline(mod3,col="red",lwd=2)
legend("topleft",legend=expression(paste("Wounded=0.82+0.83 killed"," ",R^2==0.84)))
Figure 10.19: 受傷與死亡人數散佈圖2
\[\rm{log}(a)=\int_{1}^{a}\frac{1}{x}dx\]
我們可以寫一個R的自訂函數,然後利用\(\tt{integrate}\)求1到a之間的面積,等於是log(a)。例如a=3,代入積分公式如下:
h=function(x){1/x}
a=3
integrate(h, 1, 3)
## 1.099 with absolute error < 7.6e-14
log(3)
## [1] 1.099
h=function(x){1/x}
a=exp(1)
integrate(h, 1, a)
## 1 with absolute error < 1.1e-14
\[\hat{Y}=\hat{\beta}_{0}+\hat{\beta}_{1}\rm{log}X\]
x <- seq(0.1,5,length.out = 100)
set.seed(1)
e <- rnorm(100, mean = 0, sd = 0.2)
y <- 1.2 + 0.5 * log(x) + e
llm1 <- lm(y ~ x)
llm2 <- lm(y ~ log(x))
par(mfrow=c(1,2))
car::crPlot(llm1, variable="x")
car::crPlot(llm2, variable="log(x)")
Figure 10.20: linear-log模型
\[\begin{align*} \rm{log}X+1=\rm{log}X+\rm{log}e=\rm{log}(eX) \end{align*}\]
stargazer::stargazer(llm2, title='linear-log迴歸模型',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: y |
|---|---|
| log(x) | 0.500*** |
| | (0.021) |
| Constant | 1.222*** |
| | (0.023) |
| Observations | 100 |
| R2 | 0.850 |
| Adjusted R2 | 0.849 |
| Residual Std. Error | 0.181 (df = 98) |
| F Statistic | 556.100*** (df = 1; 98) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
cat('X=1, Y=',sum(as.vector(llm2$coefficients)*c(1, 1)),'\n')
## X=1, Y= 1.722
x_1=sum(as.vector(llm2$coefficients)*c(1, 1))
cat('X+0.01,Y+',sum(as.vector(llm2$coefficients)*c(1, 1.01))-x_1,'\n')
## X+0.01,Y+ 0.004998
cat('X+0.02,Y+',sum(as.vector(llm2$coefficients)*c(1, 1.02))-x_1,'\n')
## X+0.02,Y+ 0.009996
cat('X+0.03,Y+',sum(as.vector(llm2$coefficients)*c(1, 1.03))-x_1,'\n')
## X+0.03,Y+ 0.01499
\[\mathrm{log}(e^{a})=a\]
\[\hat{\beta}_{1}=\mathrm{log}(e^{\hat{\beta}_{1}})\]
\[\begin{align*} \rm{log}Y+\hat{\beta}_{1}&=\rm{log}(Y)+\rm{log}(e^{\hat{\beta}_{1}})\\ &=\mathrm{log}(e^{\hat{\beta}_{1}}Y) \end{align*}\]
\[\begin{align*} \mathrm{log}Y_{1} & =\hat{\beta}_{0}+\hat{\beta}_{1}X_{1} \tag{10.2} \end{align*}\] \[\begin{align*} \mathrm{log}Y_{2} & =\hat{\beta}_{0}+\hat{\beta}_{1}(X_{1}+1) \tag{10.3} \end{align*}\]
\[\begin{align*} \hat{\beta}_{1} & =\mathrm{log}Y_{2}-\mathrm{log}Y_{1}\\ & =\mathrm{log}\frac{Y_{2}}{Y_{1}}\\ \frac{Y_{2}}{Y_{1}}=e^{\hat{\beta}_{1}}\\ Y_{2} = Y_{1}\times e^{\hat{\beta}_{1}} \end{align*}\]
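以下用模擬資料驗證這個詮釋(模型與數值皆為假設):X增加1單位,Y平均而言變為原來的\(e^{\hat{\beta}_{1}}\)倍:
# 假設性範例:log-linear模型中X增加1單位的乘法效果
set.seed(5)
x <- runif(200, 0, 10)
y <- exp(0.3 + 0.1*x + rnorm(200, 0, 0.1))
m <- lm(log(y) ~ x)
exp(coef(m)["x"])                                                       # 約為exp(0.1)
exp(predict(m, data.frame(x = 6)))/exp(predict(m, data.frame(x = 5)))   # 與上式相同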
\[ \mathrm{log}Y=\hat{\beta}_{0}+\hat{\beta}_{1}\mathrm{log}X \]
\[\begin{align*} \partial\, \mathrm{log}Y & =\hat{\beta}_{1}\,\partial\, \mathrm{log}X \end{align*}\]
因為\(\partial \mathrm{log}X=\frac{1}{X}\partial X\)
所以\(\hat{\beta}_{1}=\frac{\partial Y/Y}{\partial X/X}\)
library(carData)
ggpubr::ggscatter(data=Leinhardt,x='income', y='infant', color = "#22AF11", fill = "white", add = "loess",shape=4,
cor.coef = TRUE, # Add correlation coefficient. see ?stat_cor
cor.coeff.args = list(method = "pearson", label.x = 2000, label.sep = "\n"))
Figure 10.21: 所得與嬰兒死亡率散佈圖1
M1<-lm(infant~income, data=Leinhardt)
Mres<-data.frame(residual=M1$residuals,fitted=M1$fitted.values)
G<-ggpubr::ggscatter(data=Mres,x='fitted', y='residual', color = "#0F1EDC")
G + geom_hline(yintercept = 0, linetype='dashed')
Figure 10.22: 所得與嬰兒死亡率迴歸模型的殘差值與預測值散佈圖
Leinhardt$log.income<-log(Leinhardt$income,10)
Leinhardt$log.infant<-log(Leinhardt$infant,10)
ggpubr::ggscatter(data=Leinhardt,x='log.income', y='log.infant', color = "#22AF11", add = "reg.line",shape=4,
cor.coef = TRUE,
cor.coeff.args = list(method = "pearson", label.x = 2.8, label.sep = "\n"))
Figure 10.23: 所得與嬰兒死亡率散佈圖2
LM1<-lm(log.infant~log.income,data=Leinhardt)
stargazer::stargazer(LM1, title='所得與嬰兒死亡率',
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: log.infant |
|---|---|
| log.income | -0.512*** |
| | (0.051) |
| Constant | 3.103*** |
| | (0.137) |
| Observations | 101 |
| R2 | 0.502 |
| Adjusted R2 | 0.497 |
| Residual Std. Error | 0.298 (df = 99) |
| F Statistic | 99.840*** (df = 1; 99) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
LM<-data.frame(residual=LM1$residuals,fitted=LM1$fitted.values)
G<-ggpubr::ggscatter(data=LM,x='fitted', y='residual', color = "#AA33C1")
G + geom_hline(yintercept = 0, linetype='dashed')
Figure 10.24: 所得與嬰兒死亡率迴歸模型的殘差值與預測值散佈圖
根據表10.2,我們可以得到結論為收入每增加\(1\%\),平均而言,嬰兒死亡率下降\(.51\%\)。兩者呈負相關的關係。相關資料可以參考William Jacoby的說明。
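以下用LM1的估計值做個簡單的驗證(基準收入1000為任意假設,計算結果與基準無關):
# 收入增加1%,預測的嬰兒死亡率變化(百分比)
b <- coef(LM1)
ratio <- 10^(b[1] + b[2]*log10(1000*1.01))/10^(b[1] + b[2]*log10(1000))
(ratio - 1)*100     # 約為-0.51,與迴歸係數一致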
\[Y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{1}X_{2}\]
est<-here::here('data','realestate.txt')
dat<-read.table(est)
dat<-dat[-1,]
dat<-dat%>%dplyr::select(V1,V2,V5)%>%
mutate_all(as.numeric)
EM<-lm(V1~V2+V5+V2:V5, dat)
EMR<-data.frame(residual=EM$residuals,fitted=EM$fitted.values)
G<-ggpubr::ggscatter(data=EMR,x='fitted', y='residual', color = "#22CCDC")
G + geom_hline(yintercept = 0, linetype='dashed')
Figure 10.25: 房價迴歸模型的殘差值與預測值散佈圖
est<-here::here('data','realestate.txt')
dat<-read.table(est)
dat<-dat[-1,]
dat<-dat%>%dplyr::select(V1,V2,V5)%>%
mutate_all(as.numeric) %>%
mutate_at(c('V1','V2'), log)
EM<-lm(V1~V2+V5+V2:V5, dat)
EMR<-data.frame(residual=EM$residuals,fitted=EM$fitted.values)
G<-ggpubr::ggscatter(data=EMR,x='fitted', y='residual', color = "#BEAADC")
G + geom_hline(yintercept = 0, linetype='dashed')
Figure 10.26: 轉換後房價迴歸模型的殘差值與預測值散佈圖
est<-here::here('data','realestate.txt')
dat<-read.table(est)
dat<-dat[-1,]
dat<-dat%>%dplyr::select(V1,V2,V5)%>%
mutate_all(as.numeric) %>%
mutate_at(c('V1','V2'), log)
EM<-lm(V1~V2+V5+V2:V5, dat)
stargazer::stargazer(EM, title='房價、面積、冷氣',covariate.labels = c('Square footage','Air condition','Square footage X Air condition'), column.labels = c('Price'),
type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
| | Dependent variable: V1 (Price) |
|---|---|
| Square footage | 0.198* |
| | (0.101) |
| Air condition | -0.849*** |
| | (0.280) |
| Square footage X Air condition | 0.270*** |
| | (0.060) |
| Constant | 2.871*** |
| | (0.457) |
| Observations | 521 |
| R2 | 0.584 |
| Adjusted R2 | 0.582 |
| Residual Std. Error | 0.575 (df = 517) |
| F Statistic | 242.200*** (df = 3; 517) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
\[F=\frac{MSR}{MSE}\] \[\sum (\hat{Y}-\bar{Y})^2=SSR\] \[MSR=\frac{SSR}{k}\] \[\sum (Y-\hat{Y})^2=SSE\] \[MSE=\frac{SSE}{n-k-1}\]
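以前面的labormodel為例(假設該物件仍在工作環境中),依照上面的定義計算F統計量:
# 依MSR、MSE的定義計算F統計量
y.hat <- fitted(labormodel)
y <- pscl::UKHouseOfCommons$v2
k <- 2; n <- length(y)
MSR <- sum((y.hat - mean(y))^2)/k
MSE <- sum((y - y.hat)^2)/(n - k - 1)
MSR/MSE             # 應與前面算出的865.7相同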
set.seed(02139)
X1 <- rnorm(200, 0, 1); X2 <- rnorm(200, 0, 0.5); X3 <- rnorm(200, 0, 0.75)
Y<-c(sample(X1, 130), sample(X2, 20), sample(X3, 30), rnorm(20, 0, 0.75))
\[\Delta R=\frac{R_{t}-R_{t-1}}{R_{t-1}}\]
\[\Delta S=\frac{S_{t}-S_{t-1}}{S_{t-1}}\]
\[\rm{\Delta R}=\beta_{0}+\beta_{1}\Delta S+\epsilon\]
\[\rm{\Delta R}=\alpha+\gamma\Delta S+\delta_{1}\times D+\delta_{2}(\Delta S\times D)+\epsilon\]
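以下以模擬資料示意如何估計這個模型(所有變數與參數值皆為假設):
# 假設性模擬:檢視截距與斜率是否隨虛擬變數D改變
set.seed(02139)
n <- 200
D <- rep(c(0, 1), each = n/2)               # 假設的事件虛擬變數
dS <- rnorm(n, 0, 0.02)                     # 假設的市場報酬率變動
dR <- 0.001 + 0.8*dS + 0.002*D + 0.4*dS*D + rnorm(n, 0, 0.01)
summary(lm(dR ~ dS + D + dS:D))             # delta_1對應截距的差異,delta_2對應斜率的差異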
## 最後更新日期 05/19/2022