Econometric Analysis with R (Rによる計量経済分析): Chapter 3

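3.1 Regression model
3.2 Least squares
3.3 Goodness of fit
3.4 Properties of the least squares estimator
3.5 Statistical inference on the parameters
3.6 Hypothesis tests on regression coefficients: the t-test
3.7 Regression analysis in R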

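3.1 Regression model

With economic data, it is natural to treat both the dependent variable and the independent variable as random variables.
The simple regression model rests on the following assumptions, where \({\boldsymbol x} = (x_1, \cdots, x_n)\).

Linearity
- \(y_i = \beta_0 + \beta_1 x_i + u_i\)
Variation in the regressor
- \(x_1, \cdots, x_n\) do not all take the same value
Zero conditional mean of the error term
- \(\mathbb{E} [u_i | {\boldsymbol x}] = 0\)
Conditional homoskedasticity of the error term
- \(\mathbb{E} [u_i^2 |{\boldsymbol x} ] = \sigma^2 > 0 \ \ (i= 1,2, \cdots, n)\)
Conditionally uncorrelated error terms
- \(\mathbb{E} [u_i u_j |{\boldsymbol x} ] = 0 \ \ (i, j= 1,2, \cdots, n, i \neq j)\)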
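The assumptions above can be made concrete by simulating data that satisfies them. The following is a minimal sketch: the parameter values, the sample size, the uniform regressor, and the normal error distribution are all illustrative assumptions, not part of the original notes.

```r
# Simulate data from y_i = beta0 + beta1 * x_i + u_i.
# All numeric values are hypothetical and chosen only for illustration.
set.seed(123)
n     <- 200
beta0 <- 1.0   # hypothetical intercept
beta1 <- 0.5   # hypothetical slope
sigma <- 2.0   # hypothetical error standard deviation

x <- runif(n, min = 0, max = 10)     # regressor with variation
# Errors drawn independently of x with constant variance:
# zero conditional mean, homoskedasticity, no correlation across i.
# Normality is an extra assumption made here for convenience.
u <- rnorm(n, mean = 0, sd = sigma)
y <- beta0 + beta1 * x + u           # linearity

sim <- data.frame(x = x, y = y)

var(sim$x) > 0   # the x values are not all equal
mean(u)          # sample analogue of E[u] = 0, close to 0
cor(x, u)        # close to 0
```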

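3.2 Least squares

Let \(\beta_0, \beta_1\) be the parameters. Least squares chooses the estimates that minimize the sum of squared differences between the observed values \(y_i\) and the fitted values \(b_0 + b_1 x_i\). That is, it solves the optimization problem \[\min_{b_0, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2\]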
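Setting the partial derivatives of the objective with respect to \(b_0\) and \(b_1\) to zero gives the familiar closed-form solution \(\hat{\beta}_1 = \sum_{i}(x_i - \overline{x})(y_i - \overline{y}) / \sum_{i}(x_i - \overline{x})^2\) and \(\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}\). The sketch below checks this against `lm()` and against a direct numerical minimization with `optim()`. The simulated data and parameter values are assumptions made for illustration.

```r
# Hypothetical simulated data (values chosen only for illustration)
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n)

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))

# Direct numerical minimization of the sum of squared residuals
ssr <- function(b) sum((y - b[1] - b[2] * x)^2)
opt <- optim(c(0, 0), ssr, method = "BFGS")
opt$par      # approximately c(b0, b1)
```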

3.3 Goodness of fit

The fitted value of \(y_i\) is \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\), so with the residual \(\hat{u}_i = y_i - \hat{y}_i\) we have \[y_i = \hat{y}_i + \hat{u}_i\] Define \[\begin{aligned} {\rm SST} &= \sum_{i=1}^n (y_i - \overline{y})^2 \\ {\rm SSE} &= \sum_{i=1}^n (\hat{y}_i - \overline{y})^2 \\ {\rm SSR} &= \sum_{i=1}^n \hat{u}_{i}^{2} \end{aligned} \] Then \[{\rm SST} = {\rm SSE} + {\rm SSR}\] holds. Here SST is the total sum of squares (total variation), SSE is the explained sum of squares (variation explained by the independent variable), and SSR is the residual sum of squares (variation of the residuals).

The coefficient of determination \(R^2\) is defined as \[R^2 = \frac{\rm SSE}{\rm SST} = 1 - \frac{\rm SSR}{\rm SST} \]
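The decomposition and the two expressions for \(R^2\) can be checked numerically. This is a small sketch on simulated data (the data-generating values are assumptions for illustration), compared against the value reported by `summary()`.

```r
# Hypothetical simulated data (values chosen only for illustration)
set.seed(2)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)   # fitted values
u_hat <- resid(fit)    # residuals

SST <- sum((y - mean(y))^2)       # total sum of squares
SSE <- sum((y_hat - mean(y))^2)   # explained sum of squares
SSR <- sum(u_hat^2)               # residual sum of squares

all.equal(SST, SSE + SSR)         # decomposition holds (model has an intercept)

R2 <- SSE / SST
all.equal(R2, 1 - SSR / SST)
all.equal(R2, summary(fit)$r.squared)
```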

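3.4 Properties of the least squares estimator

Unbiasedness of the least squares estimator
- An estimator \(\hat{\theta}\) of \(\theta\) is unbiased if \(\mathbb{E}(\hat{\theta}) = \theta\)
Conditional variance of the least squares estimator [W 2.4, p.52]
- The derivation is not reproduced here
Gauss-Markov theorem
- The least squares estimator is the best linear unbiased estimator (BLUE): among all linear unbiased estimators it has the smallest variance
Consistency of the least squares estimator
- As the sample size \(n \to \infty\), the estimator \(\hat{\theta}\) converges in probability to \(\theta\)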
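These properties can be illustrated with a small Monte Carlo experiment. Under the assumptions of 3.1, the standard textbook result for the slope is \(\mathrm{Var}(\hat{\beta}_1 | {\boldsymbol x}) = \sigma^2 / \sum_{i=1}^n (x_i - \overline{x})^2\). The sketch below holds \({\boldsymbol x}\) fixed, redraws the errors many times, and compares the simulated mean and variance of \(\hat{\beta}_1\) with \(\beta_1\) and this formula. The parameter values, the normal errors, and the number of replications are assumptions chosen for illustration.

```r
# Hypothetical design (values chosen only for illustration)
set.seed(42)
n <- 50; beta0 <- 1; beta1 <- 0.5; sigma <- 1
x <- runif(n, 0, 10)               # held fixed: we condition on x
sst_x <- sum((x - mean(x))^2)

# Slope estimate for each of 2000 redraws of the error term
b1_draws <- replicate(2000, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  sum((x - mean(x)) * (y - mean(y))) / sst_x
})

mean(b1_draws)                     # close to beta1 (unbiasedness)
var(b1_draws)                      # close to the theoretical variance
theory_var <- sigma^2 / sst_x

# Consistency: the spread of the estimates shrinks as n grows
sd_by_n <- sapply(c(50, 500, 5000), function(m) {
  xm <- runif(m, 0, 10)
  sd(replicate(200, {
    ym <- beta0 + beta1 * xm + rnorm(m, sd = sigma)
    sum((xm - mean(xm)) * (ym - mean(ym))) / sum((xm - mean(xm))^2)
  }))
})
sd_by_n
```

A simulation only illustrates these properties; it does not prove them.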

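3.5 Statistical inference on the parameters

Under Assumptions 3.1-3.6, \[\frac{\hat{\beta}_j - \beta_j}{{\rm se}(\hat{\beta}_j)} \sim t(n-2) \quad (j = 0, 1)\] holds, where \({\rm se}(\hat{\beta}_j)\) is the standard error of \(\hat{\beta}_j\) and \(t(n-2)\) is the t distribution with \(n-2\) degrees of freedom. Only five assumptions are listed in 3.1; the exact t distribution additionally requires the error term to be normally distributed conditional on \({\boldsymbol x}\), which is presumably the remaining assumption.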
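As an illustration of how this result is used, the sketch below computes the standard error of the slope and a 95% confidence interval by hand and compares them with `summary()` and `confint()`. The simulated data are an assumption for illustration. The standard error uses \(\hat{\sigma}^2 = {\rm SSR}/(n-2)\) as the estimate of the error variance.

```r
# Hypothetical simulated data (values chosen only for illustration)
set.seed(3)
n <- 50
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

# Standard error of the slope by hand
sigma2_hat <- sum(resid(fit)^2) / (n - 2)           # SSR / (n - 2)
se_b1 <- sqrt(sigma2_hat / sum((x - mean(x))^2))
all.equal(se_b1, coef(summary(fit))["x", "Std. Error"])

# 95% confidence interval from the t(n - 2) distribution
b1     <- coef(fit)[["x"]]
t_crit <- qt(0.975, df = n - 2)
ci_b1  <- c(b1 - t_crit * se_b1, b1 + t_crit * se_b1)
all.equal(ci_b1, unname(confint(fit)["x", ]))
```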

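3.6 Hypothesis tests on regression coefficients: the t-test

The typical case is a test of whether a coefficient is zero. Under the null hypothesis \(H_0: \beta_j = 0\), the statistic in 3.5 reduces to \(t = \hat{\beta}_j / {\rm se}(\hat{\beta}_j)\), and \(H_0\) is rejected when \(|t|\) exceeds the critical value of the \(t(n-2)\) distribution.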
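The test of whether the slope is zero can be sketched in R as follows. The data are simulated under assumed parameter values, and the 5% significance level is also an assumption made for illustration. The hand-computed t statistic and two-sided p-value reproduce the `t value` and `Pr(>|t|)` columns of `summary()`.

```r
# Hypothetical simulated data (values chosen only for illustration)
set.seed(4)
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 0.1 * x + rnorm(n)
fit <- lm(y ~ x)

est <- coef(summary(fit))   # coefficient table
df  <- df.residual(fit)     # n - 2 in a simple regression

# t statistic for H0: beta1 = 0 and its two-sided p-value
t_stat <- est["x", "Estimate"] / est["x", "Std. Error"]
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

all.equal(t_stat, est["x", "t value"])
all.equal(p_val,  est["x", "Pr(>|t|)"])

# Decision at the (assumed) 5% level
reject_h0 <- abs(t_stat) > qt(0.975, df = df)
reject_h0
```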

3.7 Regression analysis in R

3.7.0 Loading the library

library(ggplot2)

3.7.1 Data description

3.7.2 Simple regression

Read the data

cvdata <- read.csv("cv.csv", row.names = 1)

Draw a scatter plot

# Population on the x-axis, number of stores on the y-axis
g <- ggplot(cvdata, aes(x=pop, y=conv)) +
  geom_point(shape=1, size=3, na.rm=TRUE) +
  xlab("Population") + ylab("Number of stores") + ggtitle("Scatter plot") +
  theme_bw()
g

Run the regression of conv on pop

result <- lm(conv~pop, data=cvdata)
summary(result)

## 
## Call:
## lm(formula = conv ~ pop, data = cvdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -605.69  -82.42    0.93   82.99  859.51 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -106.62323   47.98638  -2.222   0.0314 *  
## pop            0.37372    0.01283  29.125   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 225.9 on 45 degrees of freedom
## Multiple R-squared:  0.9496, Adjusted R-squared:  0.9485 
## F-statistic: 848.3 on 1 and 45 DF,  p-value: < 2.2e-16
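Several entries of this output are linked by simple arithmetic, which is a useful way to read it. The sketch below recomputes them from the printed (rounded) numbers only, so small discrepancies in the last digit are expected. The variable names are introduced here for illustration. The relation \(R^2 = F / (F + {\rm df})\) follows from \(F = {\rm SSE} / ({\rm SSR}/{\rm df})\) when there is a single regressor.

```r
# Values copied from the summary() output above
est_pop <- 0.37372;     se_pop <- 0.01283
est_int <- -106.62323;  se_int <- 47.98638
df      <- 45           # residual degrees of freedom
F_stat  <- 848.3

n <- df + 2             # two estimated coefficients, so 47 observations

# t value = Estimate / Std. Error
t_pop <- est_pop / se_pop   # about 29.13 (printed: 29.125)
t_int <- est_int / se_int   # about -2.222

# Two-sided p-value for the intercept from the t(45) distribution
p_int <- 2 * pt(abs(t_int), df = df, lower.tail = FALSE)   # about 0.031

# With one regressor, F equals the squared t value of the slope
t_pop^2                     # about 848

# R-squared and adjusted R-squared recovered from F
r2     <- F_stat / (F_stat + df)          # about 0.9496
adj_r2 <- 1 - (1 - r2) * (n - 1) / df     # about 0.9485
```

The 47 observations are consistent with one row per prefecture, which matches the later reference to prefectures in these notes.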

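Add the regression line to the scatter plot

# stat_smooth(method="lm") overlays the fitted least squares line; se=FALSE hides the confidence band
g <- ggplot(cvdata, aes(x=pop, y=conv)) +
  geom_point(shape=1, size=3, na.rm=TRUE) +
  xlab("Population") + ylab("Number of stores") + ggtitle("Scatter plot") +
  stat_smooth(method="lm", se=FALSE) + 
  theme_bw()
g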

Label each point with its row name

# The row names of cvdata (set by row.names = 1 in read.csv) are used as labels
g <- ggplot(cvdata, aes(x=pop, y=conv)) +
  geom_point(shape=1, size=3, na.rm=TRUE) +
  xlab("Population") + ylab("Number of stores") + ggtitle("Scatter plot") +
  stat_smooth(method="lm", se=FALSE) + 
  geom_text(aes(label=rownames(cvdata)), size=3, vjust=1) + 
  theme_bw()
g

The labels overlap and are hard to read, so label only the prefectures with a large population (pop > 3500)

# Rows to label, with the row names stored in an explicit column
big <- subset(cvdata, pop>3500)
big$name <- rownames(big)

g <- ggplot(cvdata, aes(x=pop, y=conv)) +
  geom_point(shape=1, size=3, na.rm=TRUE) +
  xlab("Population") + ylab("Number of stores") + ggtitle("Scatter plot") +
  stat_smooth(method="lm", se=FALSE) + 
  geom_text(data=big, aes(label=name), size=3, vjust=1, hjust=1) + 
  theme_bw()
g