Web version: http://rpubs.com/xup6y3ul6/ALSM_HM3

I. Chapter 6 Conceptual Questions

1. [Model]

(1) Write out the mathematical model and assumptions of general linear regression

(a) Non-matrix form (hint: see textbook pp. 217-218)

The general linear regression model: \[\begin{align} Y_i &= \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \dots + \beta_{p-1}X_{i,p-1} + \epsilon_i \\ &= \sum_{k=0}^{p-1}\beta_kX_{ik} + \epsilon_i, \text{ where } X_{i0} \equiv 1 \end{align}\] Under the normal error regression model, the assumptions are:

  1. X and Y have a linear relationship
  2. Homoscedasticity: \(Var(\epsilon|X) = \sigma^2, \forall X\)
  3. Conditional normality: given \(X\), \(Y\) follows a normal distribution
  4. The observations are independent of one another
  5. \(\epsilon|X \sim N(0,\sigma^2) \rightarrow Y|X \overset{iid}{\sim} N(\sum_{k=0}^{p-1}\beta_kX_{ik},\sigma^2)\)
  6. X is measured without error
  7. No other important predictor variables have been omitted
(b) Matrix form (hint: see textbook pp. 222-223)

The general linear regression model (matrix form): \[\begin{align} \underset{n\times1}{\mathbf{Y}} = \underset{n\times{p}}{\mathbf{X}}\,\underset{p\times1}{\boldsymbol{\beta}} + \underset{n\times1}{\boldsymbol{\epsilon}} \end{align}\] where \(\mathbf{Y}\) is the vector of observations, \(\mathbf{X}\) is the design matrix of the predictors, \(\boldsymbol{\beta}\) is the vector of parameters, and \(\boldsymbol{\epsilon}\) is the vector of error terms.

Under the normal error regression model, the assumptions are:

  1. \(\mathbf{X}\) and \(\mathbf{Y}\) have a linear relationship
  2. Homoscedasticity: \(Var(\boldsymbol{\epsilon}|\mathbf{X}) = \sigma^2, \forall \mathbf{X}\)
  3. Conditional normality: given \(\mathbf{X}\), \(\mathbf{Y}\) follows a (multivariate) normal distribution
  4. The observations are independent of one another
  5. \(\boldsymbol{\epsilon}|\mathbf{X} \sim N(\mathbf{0},\sigma^2\mathbf{I}) \rightarrow \mathbf{Y}|\mathbf{X} \sim N(\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\)
  6. \(\mathbf{X}\) is measured without error
  7. No other important predictor variables have been omitted
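
As a quick illustration of the matrix form (a sketch of mine, not from the textbook), R's model.matrix() builds the design matrix \(\mathbf{X}\), whose leading column of 1s corresponds to the intercept \(\beta_0\); the data here are simulated:

# a minimal sketch: the design matrix for two simulated predictors
set.seed(1)
dat <- data.frame(X1 = rnorm(5), X2 = rnorm(5))
X <- model.matrix(~ X1 + X2, data = dat)
X        # an n x p matrix (here 5 x 3): (Intercept), X1, X2
dim(X)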

(2) Using Figure 1-2 as an example, explain what the general linear regression model and its assumptions mean graphically, and use this to explain the phrase “…‘revert’ or ‘regress’ to the mean of the group” (textbook, p. 5) (hint: see Figure 1-1)

Figure 1-1. Graphical representation of the simple regression model (from textbook Figure 1.4, p. 6)
Figure 1-2. An Example of a First-Order Regression Model with Two Predictor Variables (from textbook Figure 6.1, p. 215)

“Regress to the mean of the group” goes back to Sir Francis Galton (1822-1911), who studied whether advantageous human traits are passed on to offspring. He found that parents' traits are indeed inherited by their children, yet no group of extreme heights emerges: when parents' heights are far from the average, their children's heights do not keep moving away from the average but instead move slightly toward it, i.e., the children are relatively shorter; conversely, the children of very short parents tend to be somewhat closer to the average than their parents. Galton called this movement from the “extreme” toward the “average” “regression to the mean.”
Today statistics uses “regression” to describe statistical relations among variables. Taking Figure 1-2 as an example: at any given point of the predictor variables (\(X_{i1}, X_{i2}, \dots\)), the response \(Y_i\) we want to predict actually has a probability distribution, and the mean of that distribution, \(E\{Y\}\), is systematically related to (\(X_{1}, X_{2}, \dots\)); this systematic relation is the response plane (or hyperplane) in the figure, which can be described by a mathematical (regression) model, while the other observations from the distribution of \(Y_i\) scatter around that plane. When an extreme value of \(Y_i\) is observed at some predictor point (\(X_{i1}, X_{i2}, \dots\)), the next observation \(Y_i'\) at the same predictor point will very likely “regress,” or “revert,” to the vicinity of the mean \(E\{Y_i\}\).

2. [Model Application Contexts]

The general linear regression model covers many data analysis settings; the textbook (pp. 218-221) gives 7 examples. Read through the relationship between these seven examples and the general linear regression model, and state your own view of the model's applications.

The general linear regression model is: \[ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \dots + \beta_{p-1}X_{i,p-1} + \epsilon_i \] In fact, many different kinds of linear models (the 7 examples in the textbook) are special cases of the general linear regression model. Whether the model has one or many predictors, and whether the predictors are qualitative variables, polynomial terms, interaction terms, or even arbitrary transformations of the variables, any combination of the above can be written in the form \[ Y_i = c_{i0}\beta_0 + c_{i1}\beta_1 + c_{i2}\beta_2 + \dots + c_{i,p-1}\beta_{p-1} + \epsilon_i \] where \(c_{i0}, c_{i1},\dots, c_{i,p-1}\) are coefficients constructed from the predictor variables, i.e., \(c_{ik} = f_k(X_{i1}, X_{i2}, \dots)\). In fact, the “linear” in general linear regression refers to linearity in the parameters \(\beta\) to be estimated, so no matter what functions map the X's to the c's, as long as the model takes the form above and satisfies the assumptions, it is a general linear regression model; its applications are therefore extremely versatile (see the sketch below).
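
To make the “linear in the parameters” point concrete, here is a small sketch (simulated data, not one of the textbook's seven examples) showing that polynomial, transformed, and interaction terms are all still fit by lm() as general linear regression models:

# each formula below is linear in the betas, so lm() handles all of them
set.seed(42)
n <- 100
X1 <- runif(n, 1, 10); X2 <- runif(n, 1, 10)
Y <- 3 + 2*X1 - 0.5*X1^2 + log(X2) + 0.3*X1*X2 + rnorm(n)

lm(Y ~ X1 + X2)          # first-order model
lm(Y ~ X1 + I(X1^2))     # polynomial: c_i1 = X1, c_i2 = X1^2
lm(Y ~ X1 + log(X2))     # transformed predictor: c_i2 = log(X2)
lm(Y ~ X1 * X2)          # interaction: c_i3 = X1 * X2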

3. [Basic Analysis Results of the Model]

Having understood the model's mathematical form and assumptions, and knowing how to apply the model to a research setting, the last step is to answer the research question through the analysis results. The basic analysis results of the general linear regression model are explained in detail in the textbook (pp. 223-236); based on that material, summarize the analysis steps briefly in your own way. (Note: do not go into excessive, lengthy detail.)

0. Data collection and preprocessing

1. Specify the regression model
The general linear regression model (matrix form): \[\begin{align} \underset{n\times1}{\mathbf{Y}} = \underset{n\times{p}}{\mathbf{X}}\,\underset{p\times1}{\boldsymbol{\beta}} + \underset{n\times1}{\boldsymbol{\epsilon}} \end{align}\] where \(\mathbf{Y}\) is the vector of observations, \(\mathbf{X}\) is the design matrix of the predictors, \(\boldsymbol{\beta}\) is the vector of parameters, and \(\boldsymbol{\epsilon}\) is the vector of error terms with \(\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})\).

2. Parameter estimation
The least squares method yields the normal equations: \[ \mathbf{X'Xb = X'Y} \] so the point estimate of the parameter vector \(\boldsymbol{\beta}\) is: \[ \mathbf{b = (X'X)^{-1}X'Y} \] with variance-covariance matrix: \[ \boldsymbol{\sigma^2}\{\mathbf{b}\} = \sigma^2(\mathbf{X'X})^{-1} \] Since the true \(\sigma^2\) is unknown, it is replaced by MSE, giving the estimated variance-covariance matrix of \(\mathbf{b}\): \[ \mathbf{s^2\{b\}} = MSE\,(\mathbf{X'X})^{-1} \]

The fitted values of the response: \[ \mathbf{\hat{Y} = Xb = X(X'X)^{-1}X'Y = HY} \] where the hat matrix \(\mathbf{H = X(X'X)^{-1}X'}\) is symmetric and idempotent.

The sample residuals: \[ \mathbf{e = Y - \hat{Y} = (I-H)Y} \] where \(\mathbf{(I-H)}\) is likewise symmetric and idempotent.

Finally, the variance-covariance matrix of the residuals is: \[ \boldsymbol{\sigma^2}\{\mathbf{e}\} = \sigma^2(\mathbf{I-H}) \] and, again because the true \(\sigma^2\) is unknown, replacing it with MSE gives the estimate: \[ \mathbf{s^2\{e\}} = MSE\,(\mathbf{I-H}) \]
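
The estimation formulas above can be checked directly in R; the following sketch (simulated data, variable names are mine) computes b, H, e, and s²{b} by matrix algebra and verifies b against lm():

# matrix-form least squares on simulated data
set.seed(2)
n <- 30
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))  # design matrix with intercept column
beta <- c(1, 2, -1)
Y <- drop(X %*% beta) + rnorm(n)

XtX_inv <- solve(t(X) %*% X)
b <- XtX_inv %*% t(X) %*% Y       # b = (X'X)^{-1} X'Y
H <- X %*% XtX_inv %*% t(X)       # hat matrix H = X(X'X)^{-1}X'
e <- drop((diag(n) - H) %*% Y)    # residuals e = (I - H)Y
MSE <- sum(e^2) / (n - ncol(X))
s2_b <- MSE * XtX_inv             # estimated variance-covariance matrix of b

fit <- lm(Y ~ X[, 2] + X[, 3])
all.equal(unname(drop(b)), unname(coef(fit)))  # TRUE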

3. Analysis of variance

| Source of Variation | SS | df | MS | F value |
|---|---|---|---|---|
| Regression | \(SSR = \mathbf{Y'}[\mathbf{H}-(\frac{1}{n})\mathbf{J}]\mathbf{Y}\) | \(p-1\) | \(MSR = \frac{SSR}{p-1}\) | \(F^* = \frac{MSR}{MSE}\) |
| Error | \(SSE = \mathbf{Y'(I-H)Y}\) | \(n-p\) | \(MSE = \frac{SSE}{n-p}\) | |
| Total | \(SSTO = \mathbf{Y'}[\mathbf{I}-(\frac{1}{n})\mathbf{J}]\mathbf{Y}\) | \(n-1\) | | |

Coefficient of multiple determination: \[ R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \] This measures how much of the total variation in Y is explained by this set of predictors X.

Adjusted \(R^2\): \[ R_a^2 = 1 - \left(\frac{n-1}{n-p}\right) \frac{SSE}{SSTO} \] which adjusts for the number of predictor variables.

Coefficient of Multiple Correlation: \[ R = \sqrt{R^2} \]
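
The following sketch (simulated data; the names are mine, not the textbook's) computes R², adjusted R², and R from SSE and SSTO, and checks them against summary.lm():

# R^2 and adjusted R^2 from the sums of squares
set.seed(3)
n <- 50; p <- 3                             # p = number of parameters
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 + 0.5*d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

SSTO <- sum((d$y - mean(d$y))^2)
SSE  <- sum(residuals(fit)^2)
R2   <- 1 - SSE/SSTO
R2a  <- 1 - ((n - 1)/(n - p)) * SSE/SSTO
c(R2 = R2, R2a = R2a, R = sqrt(R2))
all.equal(R2,  summary(fit)$r.squared)      # TRUE
all.equal(R2a, summary(fit)$adj.r.squared)  # TRUE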

4. Statistical tests
Significance test of the regression relation (F test):
hypothesis alternatives: \[\begin{align} &H_0: \beta_1 = \beta_2 = \dots = \beta_{p-1} = 0 \\ &H_1: \text{not all } \beta_k\ (k = 1,\dots,p-1) \text{ equal zero} \end{align}\]
decision rule: \[\begin{align} &\text{if } F^* \leq F(1-\alpha;\, p-1,\, n-p), \text{ conclude } H_0 \\ &\text{if } F^* > F(1-\alpha;\, p-1,\, n-p), \text{ conclude } H_1 \end{align}\]

Significance tests of individual parameters (t test):
hypothesis alternatives: \[\begin{align} &H_0: \beta_k = 0 \\ &H_1: \beta_k \neq 0 \end{align}\]
decision rule: \[\begin{align} &\text{if } |t^*| \leq t(1-\alpha/2;\, n-p), \text{ conclude } H_0 \\ &\text{if } |t^*| > t(1-\alpha/2;\, n-p), \text{ conclude } H_1 \end{align}\]
confidence interval: \[ b_k \pm Bs\{b_k\} \] where \[ B = t(1-\alpha/2; n-p) \]

5. Model diagnostics
Points to consider include (a sketch of common R tools follows this list):

  • the scatter plot matrix and correlations among all variables, or even a three-dimensional scatter cloud
  • residual plots
    • \(e_i\) vs. \(X_i\)
    • \(e_i\) vs. \(\hat{Y_i}\)
    • \(e_i\) vs. time or other sequence
    • \(e_i\) vs. omitted predictor variables or interaction terms
  • Q-Q plot & correlation test for normality
  • tests for constancy of error variance
    • Brown-Forsythe test
    • Breusch-Pagan test
  • F test for lack of fit
  • remedial measures
    • Box-Cox transformation
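
A sketch of how these diagnostics are commonly run in R (the package functions below are standard tools I am assuming here, not taken from the textbook), given a fitted model fit:

par(mfrow = c(2, 2))
plot(fit)                     # residuals vs fitted, Q-Q, scale-location, leverage
lmtest::bptest(fit)           # Breusch-Pagan test for constancy of error variance
shapiro.test(residuals(fit))  # a normality test (in place of the correlation test)
MASS::boxcox(fit)             # Box-Cox transformation of Y as a remedial measure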

6. Model selection
If the model as first specified turns out to be problematic, remedial measures may be needed, such as transforming the original data or respecifying the model, and then repeating steps 1-5.
F tests (the ANOVA approach) can also be used to compare competing models.
Once the desired regression model is chosen, we can proceed to the next step.

7. Estimating mean responses and predicting new observations
Possible applications include the following (see the predict() sketch after this list):

  1. Confidence limits for \(E\{Y\}\)
  2. Confidence region for the regression surface
  3. Simultaneous confidence intervals for several mean responses
    • Working-Hotelling method
    • Bonferroni method
  4. Prediction of a new observation \(Y_{h(new)}\)
  5. Prediction of the mean of m new observations at \(X_h\)
  6. Prediction of g new observations
    • Scheffe method
    • Bonferroni method
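
In R, items 1 and 4 correspond directly to predict(); a sketch, reusing the simulated fit from the earlier step:

newdat <- data.frame(x1 = 0.5, x2 = -1)
predict(fit, newdata = newdat, interval = "confidence", level = 0.95)  # limits for E{Y_h}
predict(fit, newdata = newdat, interval = "prediction", level = 0.95)  # limits for Y_h(new)
# the simultaneous methods replace the t multiplier, e.g. the Working-Hotelling
# W = sqrt(p * qf(1 - alpha, p, n - p)) for the whole regression surface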

II. Chapters 7 & 8 Conceptual Questions

1. [Extra Sums of Squares]

(1) Define each of the following extra sums of squares:

(a) \(SSR(X_5|X_1)\)

\[\begin{align} SSR(X_5|X_1) &= SSR(X_1, X_5) - SSR(X_1) \\ &= SSE(X_1) - SSE(X_1, X_5) \end{align}\]

(b) \(SSR(X_3, X_4|X_1)\)

\[\begin{align} SSR(X_3, X_4|X_1) &= SSR(X_1, X_3, X_4) - SSR(X_1) \\ &= SSE(X_1) - SSE(X_1, X_3, X_4) \end{align}\]

(c) \(SSR(X_4|X_1, X_2, X_3)\)

\[\begin{align} SSR(X_4|X_1, X_2, X_3) &= SSR(X_1, X_2, X_3, X_4) - SSR(X_1, X_2, X_3) \\ &= SSE(X_1, X_2, X_3) - SSE(X_1, X_2, X_3, X_4) \end{align}\]
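
These identities are easy to verify numerically; a sketch on simulated data (variable names are mine):

# SSR(X4 | X1, X2, X3) = SSE(X1, X2, X3) - SSE(X1, X2, X3, X4)
set.seed(4)
n <- 60
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n), X4 = rnorm(n))
d$Y <- 1 + d$X1 + d$X2 + 0.5*d$X4 + rnorm(n)

sse <- function(m) sum(residuals(m)^2)
reduced <- lm(Y ~ X1 + X2 + X3, data = d)
full    <- lm(Y ~ X1 + X2 + X3 + X4, data = d)
sse(reduced) - sse(full)  # SSR(X4 | X1, X2, X3)
anova(reduced, full)      # the same quantity appears in the "Sum of Sq" column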

(2) The following regression model is being considered in a market research study: \[Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \epsilon_i\] State the reduced models for testing whether or not:

(a) \(\beta_1 = \beta_3 = 0\)

reduced model:\[ Y_i = \beta_0 + \beta_2X_{i2} + \epsilon_i \]

(b) \(\beta_0 = 0\)

reduced model:\[ Y_i = \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \epsilon_i \]

(c) \(\beta_3 = 5\)

reduced model:\[\begin{align} &Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + 5X_{i3} + \epsilon_i \\ \Rightarrow\quad &Y_i' = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \epsilon_i \end{align}\] where \(Y_i' = Y_i - 5X_{i3}\)

(d) \(\beta_0 = 10\)

reduced model:\[\begin{align} &Y_i = 10 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \epsilon_i \\ \Rightarrow\quad &Y_i' = \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \epsilon_i \end{align}\] where \(Y_i' = Y_i - 10\)

(e) \(\beta_1 = \beta_2\)

reduced model:\[\begin{align} Y_i = \beta_0 + \beta_4(X_{i1} + X_{i2}) + \beta_3X_{i3} + \epsilon_i \end{align}\] where \(\beta_4 = \beta_1 = \beta_2\) is the common coefficient (see the sketch below).
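
In R, this reduced model can be fit directly with I(X1 + X2), which constructs the common-slope predictor; a sketch on simulated data:

set.seed(5)
n <- 60
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
d$Y <- 2 + 1.5*(d$X1 + d$X2) + d$X3 + rnorm(n)

full    <- lm(Y ~ X1 + X2 + X3, data = d)
reduced <- lm(Y ~ I(X1 + X2) + X3, data = d)  # enforces beta_1 = beta_2
anova(reduced, full)                          # general linear test of H0: beta_1 = beta_2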

2. [Multicollinearity]

Using the illustration in textbook Figure 7.2 (p. 282), explain why collinearity among the predictor variables makes the significance tests of the regression parameter estimates nonsignificant.

Figure 7.2 (p. 282)

When the predictor variables are collinear, many different parameter estimates fit the data equally well. In textbook Figure 7.2, the two planes represent two different parameter-estimate solutions; under collinearity, any plane that contains the line of intersection of these two planes is also a solution for the regression parameters, so in fact several, even infinitely many, solutions fit the data equally well. As a result, the estimated regression coefficients are highly sensitive to the particular sample and have large sampling variability, so the estimate of any single regression coefficient is very imprecise. At the same time, the large sampling variance makes the confidence intervals of the statistical tests too wide, so the tests easily fail to reach significance, as the simulation sketch below illustrates.
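
A sketch of this inflation of sampling variability (simulated data, not from the textbook): as the correlation between X1 and X2 approaches 1, the standard errors of b1 and b2 blow up and their t tests become nonsignificant, even though each predictor is strongly related to Y:

set.seed(6)
n <- 100
X1 <- rnorm(n)
X2 <- X1 + rnorm(n, sd = 0.05)           # nearly collinear with X1
Y  <- 1 + X1 + X2 + rnorm(n)
summary(lm(Y ~ X1 + X2))$coefficients    # huge SEs, nonsignificant t tests
summary(lm(Y ~ X1))$coefficients         # alone, X1 is highly significant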

3. [Interaction Terms]

Prepare a contour plot for the quadratic response surface \(E\{Y\} = 140 + 4x_1^2 - 2x_2^2 + 5x_1x_2\). Describe the shape of the response surface.

x1 <- seq(-9, 9, 0.05)
x2 <- seq(-9, 9, 0.05)
y_func <- function(x1, x2){140 + 4*x1^2 - 2*x2^2 + 5*x1*x2}
dt <- expand.grid(x1 = x1, x2 = x2)
dt$y <- with(dt, y_func(x1, x2))

brks <- cut(dt$y, breaks = seq(-200, 800, len = 11))
brks <- gsub(",", " ~ ", brks, fixed=TRUE)
brks <- gsub("\\(|\\[|\\]", "", brks)
dt$brks <- factor(brks, levels = rev(unique(brks)))

library(ggplot2)
library(plotly)
library(RColorBrewer)
ggplot(dt, aes(x = x1, y = x2, z = y)) +
  geom_tile(aes(fill = brks)) +
  geom_contour(color = "black") +
  scale_fill_manual("y", values = rev(brewer.pal(10, "RdYlBu"))) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()

# the 3D view reuses the grid and breaks computed above
plot_ly(data = dt, x = ~x1, y = ~x2, z = ~y, type = "scatter3d", color = ~brks, colors = rev(brewer.pal(10, "RdYlBu")))

The 2D and 3D contour plots above show that when \(x_1\) and \(x_2\) increase together or decrease together, y increases; conversely, when one of them increases while the other decreases, y gradually drops. We can also see that the contour lines are curved, non-parallel, and unequally spaced, which comes from the quadratic terms (\(x_1^2\), \(x_2^2\)) and the interaction term (\(x_1x_2\)): the response surface is saddle-shaped.

III. Data Analysis 1

[Commercial properties] (Problem 6.18, pp. 251-252)

A commercial real estate company evaluates vacancy rates, square footage, rental rates, and operating expenses for commercial properties in a large metropolitan area in order to provide clients with quantitative information upon which to make rental decisions. The data from the attached file “HW3_Commercial properties.txt” are taken from 81 suburban commercial properties that are the newest, best located, most attractive, and expensive for five specific geographic areas. Shown here are the age (X1), operating expenses and taxes (X2), vacancy rates (X3), total square footage (X4), and rental rates (Y).

# Commercial properties
# load data
library(dplyr)
library(DT)
cp <- read.table("HW3_Commercial properties (0504修改).txt", header = TRUE, sep = "\t")
datatable(cp)

1. Obtain the scatter plot matrix and the correlation matrix. Interpret these and state your principal findings.

#library(psych)
#pairs.panels(cp, method = "pearson", hist.col = "grey", ellipse = FALSE, stars = TRUE)
library(GGally)
ggpairs(cp) + theme_bw()

In the plot above, the diagonal shows a histogram of each variable, the lower triangle shows the pairwise scatter plots, and the upper triangle shows the corresponding correlations.
Y is moderately positively correlated with X2 and with X4, so each of these two variables may do reasonably well predicting Y on its own. Y and X3, however, are uncorrelated (not significant), with the data concentrated near 0, so the data will likely need to be transformed later. Among the predictors, X2 is moderately correlated with each of X1, X3, and X4, so collinearity may arise when the regression model is estimated, and some variable may perform poorly because the predictors compete with one another.

2. Fit regression model (6.5) for four predictor variables to the data. State the estimated regression function.

cp_lm <- lm(Y ~ X1 + X2 + X3 + X4, cp)
summary(cp_lm)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4, data = cp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1872 -0.5911 -0.0910  0.5579  2.9441 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.220e+01  5.780e-01  21.110  < 2e-16 ***
## X1          -1.420e-01  2.134e-02  -6.655 3.89e-09 ***
## X2           2.820e-01  6.317e-02   4.464 2.75e-05 ***
## X3           6.193e-01  1.087e+00   0.570     0.57    
## X4           7.924e-06  1.385e-06   5.722 1.98e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.137 on 76 degrees of freedom
## Multiple R-squared:  0.5847, Adjusted R-squared:  0.5629 
## F-statistic: 26.76 on 4 and 76 DF,  p-value: 7.272e-14

Using equation (6.5) from the textbook as required, the estimated regression function is: \[ \hat{Y} = 12.2 - 0.14X_1 + 0.28X_2 + 0.62X_3 + 7.924\times10^{-6}X_4 \] According to this function, with the other predictors held fixed, a 1-unit increase in X1, X2, X3, or X4 changes \(\hat{Y}\) by -0.14, 0.28, 0.62, or \(7.924\times10^{-6}\) units, respectively.
In this model, however, the parameter \(\beta_3\) for \(X_3\) is not significant (p = 0.57), so X3 may not belong in the model, or the data may need some transformation.

3. [Residual Analysis]

(1) Obtain the residuals and prepare a box plot of the residuals.
cp$res <- residuals(cp_lm)
cp$res %>% round(digits = 2)
##  [1] -1.04 -1.51 -0.59 -0.13  0.31 -3.19 -0.54  0.24  1.99  0.11  0.02
## [12] -0.34  0.72 -0.39 -0.20 -0.81  0.10 -1.76 -1.21 -0.63 -0.37  0.29
## [23] -0.09  0.23 -0.85 -2.12  0.47 -0.57 -1.07 -0.20 -1.12 -0.17 -1.03
## [34] -0.09  0.22  0.78  1.08 -2.13 -0.19 -1.12 -0.01  2.50 -1.58  0.93
## [45]  0.39  0.12  0.82  1.61  0.56  0.49  0.21 -0.03  1.16  0.23 -1.07
## [56]  1.06 -0.26  1.03 -0.35  0.20  0.92  2.94  2.46  1.86  1.45 -0.48
## [67] -0.76  2.01  0.08  0.01  1.77 -0.46 -0.51 -0.11  1.21 -0.26 -0.63
## [78]  0.91 -0.55 -2.03 -0.91
ggplot(cp, aes(x = "", y = res)) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, geom = "errorbar", aes(ymax = ..y.., ymin = ..y..), linetype = "dashed", color = "grey", width = 0.75) +
  labs(y = "Residuals") +
  theme_bw()

The box plot shows that most of the residuals lie within about ±1 (the middle 50% roughly within -0.6 to 0.6); the median is slightly below 0 while the mean (grey dashed line) is almost exactly 0, and three data points might be treated as outliers. Overall, the distribution of the residuals is fairly symmetric.

(2) Plot the residuals against
(a) \(\hat{Y}\)
cp$fit <- fitted(cp_lm)
ggplot(cp, aes(x = fit, y = res)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 0, color = "red", linetype = "dashed") +
  labs(x = expression(paste("Fitted value (", hat(Y), ")")), 
       y = "Residuals") +
  theme_bw()

(b) each predictor variable, and
g <- lapply(list("X1", "X2", "X3", "X4"), function(.X){
  ggplot(cp, aes_string(x = .X, y = "res")) +
    geom_point() +
    geom_abline(intercept = 0, slope = 0, color = "red", linetype = "dashed") +
    labs(y = "Residuals") +
    theme_bw()
})

gridExtra::grid.arrange(g[[1]], g[[2]], g[[3]], g[[4]], nrow = 2)

(c) each two-factor interaction term on separate graphs.
cp <- cp %>% 
  mutate(X1X2 = X1*X2, X1X3 = X1*X3, X1X4 = X1*X4, 
         X2X3 = X2*X3, X2X4 = X2*X4, X3X4 = X3*X4)

g <- lapply(list("X1X2", "X1X3", "X1X4", "X2X3", "X2X4", "X3X4"), function(.X){
  ggplot(cp, aes_string(x = .X, y = "res")) +
    geom_point() +
    geom_abline(intercept = 0, slope = 0, color = "red", linetype = "dashed") +
    labs(y = "Residuals") +
    theme_bw()
})

gridExtra::grid.arrange(g[[1]], g[[2]], g[[3]], g[[4]], g[[5]], g[[6]], nrow = 3)

(d) Also prepare a normal probability plot.
ggplot(cp, aes(sample = res)) +
  geom_qq() +
  geom_qq_line(color = "blue", line.p = c(0.25, 0.75), linetype = "dashed") + 
  theme_bw()

(3) Analyze your plots and summarize your findings.
(a)

In plot (2a), the residuals look fairly symmetric about 0 across the fitted values \(\hat{Y}\); from what can be observed, the original errors are plausibly homoscedastic.

(b)

Plot (2b), top left: the residuals vs. X1 may well be heteroscedastic, since at low and high values of X1 the spread of the residuals is much larger than at middle values, unlike the ideal pattern in which the residuals spread similarly across values of X1. A possible cause, though, is that too few points were sampled at middle X1 values to show the full variability.
Plot (2b), top right: the residuals vs. X2 look fairly symmetric about 0 with no other notable pattern; from what can be observed, plausibly homoscedastic.
Plot (2b), bottom left: the residuals vs. X3 may well be heteroscedastic, since the spread is much larger at low X3 than at high X3, forming a megaphone shape. A possible cause is that too few points were sampled at high X3 values to show the full variability.
Plot (2b), bottom right: the residuals vs. X4, as with X2, show no special pattern and spread evenly; from what can be observed, plausibly homoscedastic.

(c)

Plot (2c), left column: scatter plots of the residuals against X1X2, X1X4, and X2X4. The residuals spread fairly evenly with no other notable pattern, so these look plausibly homoscedastic; the exception is the top-left panel, where the spread of the residuals against X1X2 in the range of about 30-100 drops considerably, so for X1X2, unlike the other two, the residuals may well not be homoscedastic.
Plot (2c), right column: scatter plots of the residuals against X1X3, X2X3, and X3X4. For all three, the residual spread is roughly constant from low to middle values of the interaction term but drops suddenly at high values. A possible cause is again the sparse sampling at high X3, but from what can be observed the residuals may well be heteroscedastic with respect to these three interaction terms.

(d)

Plot (2d) shows the Q-Q plot with the Q-Q line (blue dashed), a scatter plot of the observed quantiles of the residuals against the corresponding theoretical quantiles under normality. The observed and theoretical values roughly agree overall, except that the extreme tails bend downward on the left and upward on the right; the actual distribution may well have heavier tails than the normal, hinting that the residuals may not be normal, which would violate the model's assumption that the errors come from a normal distribution.

In summary, the plots of the residuals against \(\hat{Y}\), X2, and X4 show little problem; against X1 there is mild heteroscedasticity, but its impact is presumably small. For X3, however, high values are so sparse that the plot may well look heteroscedastic; consistent with this, the estimated parameter \(\beta_3\) in question 2 of this section was not significant, so transforming X3 is recommended.
Also, although the residuals depart slightly from the normality assumption, the impact should be small, because running a correlation test for normality

cp_qq <- qqnorm(cp$res, plot.it = FALSE)
cor(cp_qq$x, cp_qq$y)
## [1] 0.9913374

shows that the correlation between the observed and expected quantiles of the residuals is \(r^* = 0.99 > 0.985 = r_{critical}(n=81, \alpha = 0.05)\); at \(\alpha = .05\) we cannot reject that the residuals come from a normal distribution, so Y probably needs no further transformation.
Finally, the residuals show no particularly unusual pattern against the interaction terms, so for now the interaction terms need not be added to the original model.

4. Can you conduct a formal test for lack of fit here? If no, why?

No: a formal lack-of-fit test requires replicate observations at identical combinations of the predictor values (X1, X2, X3, X4), and this data set has none. (For contrast, a sketch with replicates follows.)
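
Here is a sketch of how the lack-of-fit F test would run if replicates existed (simulated data with repeated X levels, not the commercial-properties data): the full model treats each distinct X value as its own factor level.

set.seed(7)
x <- rep(1:5, each = 4)                   # 5 X levels, 4 replicates each
y <- 2 + 0.5*x + rnorm(length(x), sd = 0.3)
reduced <- lm(y ~ x)                      # the regression model
full    <- lm(y ~ factor(x))              # saturated model (cell means)
anova(reduced, full)                      # F test for lack of fit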

5. Divide the 81 cases into two groups, placing the 40 cases with the smallest fitted values \(\hat{Y_i}\) into group 1 and the remaining cases into group 2. Conduct the Brown-Forsythe test for constancy of the error variance, using alpha = .05. State the decision rule and conclusion.

n1 = 40
n2 = 41
.c <- sort(cp$fit)[n1]
cp$group <- cut(cp$fit, c(min(cp$fit), .c, max(cp$fit)), labels = c("group 1", "group 2"), include.lowest = TRUE)
e1 <- median(filter(cp, group == "group 1")$res)
e2 <- median(filter(cp, group == "group 2")$res)
d1 <- filter(cp, group == "group 1")$res - e1
d2 <- filter(cp, group == "group 2")$res - e2
d1_mean <- mean(d1)
d2_mean <- mean(d2)
s <- sqrt((sum((d1-d1_mean)^2)+sum((d2-d2_mean)^2)) / (n1+n2-2))
t_BF <- abs((d1_mean-d2_mean) / (s*sqrt(1/n1+1/n2))) # == 0.95

t_critical <- qt(1-0.05/2, n1+n2-2) # == 1.99
print(ifelse((t_BF > t_critical), "reject H0", "accept H0"))
## [1] "accept H0"

decision rule: \[\begin{align} &\text{if } |t_{BF}^*| \leq t(1-\alpha/2;\, n-2), \text{ conclude } H_0 \\ &\text{if } |t_{BF}^*| > t(1-\alpha/2;\, n-2), \text{ conclude } H_1 \end{align}\]

From the statistics above, \(|t_{BF}^*| = 0.95 < 1.99 = t_{critical}\), so we fail to reject \(H_0\): at \(\alpha = .05\), the variance of the residuals is statistically constant and shows no systematic change with the level of X.

6. Obtain the analysis of variance table that decomposes the regression sum of squares into extra sums of squares associated with X4; with X1, given X4; with X2, given X1 and X4; and with X3, given X1, X2 and X4.

# SS type I
.aov <- anova(lm(Y ~ X4 + X1 + X2 + X3, cp))
.aov
## Analysis of Variance Table
## 
## Response: Y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## X4         1 67.775  67.775 52.4369 3.073e-10 ***
## X1         1 42.275  42.275 32.7074 2.004e-07 ***
## X2         1 27.857  27.857 21.5531 1.412e-05 ***
## X3         1  0.420   0.420  0.3248    0.5704    
## Residuals 76 98.231   1.293                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SS <- data.frame(
  SS = c("SSR(X4)" = .aov[1, 2],
         "SSR(X1|X4)" = .aov[2, 3],
         "SSR(X2|X1,X4)" = .aov[3, 2],
         "SSR(X3|X1,X2,X4)" = .aov[4, 2],
         "SSE" = .aov[5, 2], 
         "SSTO" = sum(.aov[, 2]))
)
SS
##                           SS
## SSR(X4)           67.7750980
## SSR(X1|X4)        42.2745683
## SSR(X2|X1,X4)     27.8574935
## SSR(X3|X1,X2,X4)   0.4197463
## SSE               98.2305939
## SSTO             236.5575000

The table above gives SSR(X4) = 67.78, SSR(X1|X4) = 42.27, SSR(X2|X1,X4) = 27.86, and SSR(X3|X1,X2,X4) = 0.42.

7. Test whether \(\beta_1 = -.1\) and \(\beta_2 = .4\); use alpha = .01. State the alternatives, full and reduced models, decision rule, and conclusion.

beta1 = -0.1; beta2 = 0.4; alpha = 0.01
cp_rest <- cp %>% mutate(
  Y_rest = Y - (beta1*X1 + beta2*X2)
)
model_R <- lm(Y_rest ~ X3 + X4, cp_rest)
model_F <- lm(Y ~ X1 + X2 + X3 + X4, cp)
anova_R <- anova(model_R)
anova_F <- anova(model_F)

SSE_R <- anova_R[3, 2]
SSE_F <- anova_F[5, 2]
df_R <- anova_R[3, 1]
df_F <- anova_F[5, 1]
F_value <- ((SSE_R-SSE_F)/(df_R-df_F)) / (SSE_F/df_F) # == 4.61
F_critical <- qf(1-alpha, df_R-df_F, df_F) # == 4.90

print(ifelse((F_value > F_critical), "reject H0", "accept H0"))
## [1] "accept H0"

The hypotheses for this test are: \[ \begin{align} &H_0: \beta_1 = -.1 \text{ and } \beta_2 = .4 \\ &H_1: \text{not both } \beta_1 = -.1 \text{ and } \beta_2 = .4 \text{ are true} \end{align} \]
where \(H_0\) and \(H_1\) correspond to the reduced model: \[ Y_i= \beta_0 - 0.1X_{i1} + 0.4X_{i2} + \beta_3X_{i3} + \beta_4X_{i4} + \epsilon_i \] and the full model: \[ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \beta_4X_{i4} + \epsilon_i \]
Decision rule: \[\begin{align} &\text{if } F^* \leq F(1-\alpha;\, df_R-df_F,\, df_F), \text{ conclude } H_0 \\ &\text{if } F^* > F(1-\alpha;\, df_R-df_F,\, df_F), \text{ conclude } H_1 \end{align}\]
From the results above, \(F^* = 4.61 \leq 4.90 = F_{critical}\), so we conclude \(H_0\): at \(\alpha = .01\), the data are consistent with \(\beta_1 = -.1\) and \(\beta_2 = .4\), i.e., the full model can be replaced by the reduced model.

summary(model_R)
## 
## Call:
## lm(formula = Y_rest ~ X3 + X4, data = cp_rest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8267 -0.6642 -0.0671  0.5533  3.5096 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.094e+01  2.446e-01  44.737  < 2e-16 ***
## X3          2.142e+00  9.906e-01   2.162   0.0337 *  
## X4          5.804e-06  1.222e-06   4.751 9.06e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.188 on 78 degrees of freedom
## Multiple R-squared:  0.2716, Adjusted R-squared:  0.253 
## F-statistic: 14.55 on 2 and 78 DF,  p-value: 4.28e-06

The estimated regression function is therefore \[ \hat{Y} = 10.94 - 0.1X_1 + 0.4X_2 + 2.14X_3 + 5.80\times10^{-6}X_4 \]

8. Calculate \(R_{Y4}^2, R_{Y1}^2, R_{Y1|4}^2, R_{14}^2, R_{Y2|14}^2, R_{Y3|124}^2\), and \(R^2\). How is the degree of marginal linear association between Y and X1 affected, when adjusted for X4?

.aovY1 <- anova(lm(Y ~ X1, cp))
R2 <- data.frame(
  Rsquare = c("R2_Y4" = .aov[1, 2] / sum(.aov[, 2]),
              "R2_Y1" = .aovY1[1, 2] / sum(.aovY1[, 2]),
              "R2_Y1|4" = (.aov[2, 2]-.aov[1, 2]) / sum(.aov[, 2] - .aov[1, 2]),
              "R2_Y14" = sum(.aov[1:2, 2]) / sum(.aov[, 2]),
              "R2_Y2|14" = .aov[3, 2] / (sum(.aov[, 2]) - sum(.aov[1:2, 2])),
              "R2_Y3|124" = .aov[4, 2] / (sum(.aov[, 2]) - sum(.aov[1:3, 2])),
              "R2" = sum(.aov[1:4, 2]) / sum(.aov[, 2]))
)
R2
##               Rsquare
## R2_Y4     0.286505809
## R2_Y1     0.062642359
## R2_Y1|4   0.250467867
## R2_Y14    0.465213178
## R2_Y2|14  0.220203703
## R2_Y3|124 0.004254889
## R2        0.584749611

From the table: \(R_{Y4}^2 = 0.29\), \(R_{Y1}^2 = 0.06\), \(R_{Y1|4}^2 = 0.25\), \(R_{14}^2 = 0.47\), \(R_{Y2|14}^2 = 0.22\), \(R_{Y3|124}^2 = 0.00\), and \(R^2 = 0.58\).
These results show that the marginal linear association between Y and X1 (\(R_{Y1}^2\)) is only 0.06, very low; yet once X4 is already in the model, the association between Y and X1 (\(R_{Y1|4}^2\)) turns out to be 0.25, several times larger than \(R_{Y1}^2\). The main reason is that X4 acts as a suppressor variable for X1: Y and X1 are only weakly related on their own, but X4 is moderately related to both Y and X1, and conditioning on X4 suppresses error variance in the relation between X1 and Y, making \(R_{Y1|4}^2\) larger than \(R_{Y1}^2\).

IV. Data Analysis 2

[Assessed valuations] (Problem 8.24, p. 339)

A tax consultant studied the current relation between selling price and assessed valuation of one-family residential dwellings in a large tax district by obtaining data for a random sample of 16 recent “arm’s-length” sales transactions of one-family dwellings located on corner lots and for a random sample of 48 recent sales of one-family dwellings not located on corner lots. In the data file “HW3_Assessed valuations.txt”, both selling price (Y) and assessed valuation (X1) are expressed in thousand dollars, whereas lot location (X2) is coded 1 for corner lots and 0 for non-corner lots.
Assume that the error variances in the two populations are equal and that regression model (8.49) is appropriate.

# Assessed valuation
# load data
av <- read.table("HW3_Assessed valuations.txt", header = TRUE, sep = "\t")
av$X2 <- factor(av$X2)
datatable(av)

1. Plot the sample data for the two populations as a symbolic scatter plot. Does the regression relation appear to be the same for the two populations?

ggplot(av, aes(x = X1, y = Y, group = factor(X2), colour = factor(X2), shape = factor(X2))) +
  geom_point() +
  labs(x = "Assessed valuation (X1)", y = "Selling prince (Y)") +
  scale_colour_discrete(name = "Lot location (X2)",
                        labels = c("non-corner", "corner")) +
  scale_shape_discrete(name = "Lot location (X2)",
                       labels = c("non-corner", "corner")) +
  theme_bw()

The scatter plot above shows that the data as a whole appear positively associated, and Y and X1 plausibly have a linear relation. Splitting by lot location, the non-corner and corner lots each also look linear; however, as assessed valuation grows, the difference in selling price between non-corner and corner lots widens, which hints that the two categories may follow different regression relations.

2. Test for identity of the regression functions for dwellings on corner lots and dwellings in other locations; control the risk of Type I error at .05. State the alternatives, decision rule, and conclusion.

As required, textbook regression model (8.49), \[ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i1}X_{i2} + \epsilon_i \] is fit to the data. This question asks whether the regression functions for the two lot locations can be identical, i.e., whether \(\beta_2\) and \(\beta_3\) in the model above both equal 0, indicating no effect of the lot location variable or of the interaction term. This is done in two steps:

  1. Test whether the interaction term \(\beta_3\) has no effect; if it is not significant, then
  2. Test whether the lot location term \(\beta_2\) has no effect; if it is also not significant, the regression functions for the two lot locations can be treated as identical.

First, the hypotheses in step 1 are: \[\begin{align} &H_0: \beta_3 = 0 \\ &H_1: \beta_3 \neq 0 \end{align}\] where \(H_0\) and \(H_1\) correspond to the reduced model: \[ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \epsilon_i \] and the full model: \[ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i1}X_{i2} + \epsilon_i \] and the decision rule is: \[\begin{align} &\text{if } F^* \leq F(1-\alpha;\, df_R-df_F,\, df_F), \text{ conclude } H_0 \\ &\text{if } F^* > F(1-\alpha;\, df_R-df_F,\, df_F), \text{ conclude } H_1 \end{align}\]

model_paral <- lm(Y ~ X1 + X2, av)
model_inter <- lm(Y ~ X1 + X2 + X1:X2, av)
anova(model_paral, model_inter)
## Analysis of Variance Table
## 
## Model 1: Y ~ X1 + X2
## Model 2: Y ~ X1 + X2 + X1:X2
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     61 1022.1                                
## 2     60  909.1  1       113 7.4578 0.008281 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, \(F^* = 7.46 > 4.00 = F(.95;\, 1, 60) = F_{critical}\), so we reject \(H_0\) in favor of \(H_1\): at \(\alpha = .05\), the full model cannot be replaced by the reduced model.

summary(model_inter)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X1:X2, data = av)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8470  -2.1639   0.0913   1.9348   9.9836 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -126.9052    14.7225  -8.620 4.33e-12 ***
## X1             2.7759     0.1963  14.142  < 2e-16 ***
## X21           76.0215    30.1314   2.523  0.01430 *  
## X1:X21        -1.1075     0.4055  -2.731  0.00828 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.893 on 60 degrees of freedom
## Multiple R-squared:  0.8233, Adjusted R-squared:  0.8145 
## F-statistic: 93.21 on 3 and 60 DF,  p-value: < 2.2e-16

The fitted regression function for this question is therefore \[ \hat{Y} = -126.91 + 2.78X_1 + 76.02X_2 - 1.11X_1X_2 \] This means that for dwellings on non-corner lots (\(X_2 = 0\)), the regression of selling price on assessed valuation is \[ \hat{Y} = -126.91 + 2.78X_1 \] whereas for dwellings on corner lots (\(X_2 = 1\)) it is \[\begin{align} \hat{Y} &= (-126.91 + 76.02) + (2.78 - 1.11)X_1 \\ &= -50.88 + 1.67X_1 \end{align}\] The two functions differ in both slope and intercept.

Because step 1 already shows that omitting the interaction term makes a significant difference, step 2 (testing whether the lot location term \(\beta_2\) equals 0) can be skipped: the regression functions for non-corner and corner lots differ, with an interaction effect between the two predictors.
Even testing \(H_0: \beta_2 = \beta_3 = 0\) directly, the reduced model \(Y_i = \beta_0 + \beta_1X_{i1} + \epsilon_i\) and the full model \(Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i1}X_{i2} + \epsilon_i\) still differ significantly (\(p < .001\)).

model_same <- lm(Y ~ X1, av)
anova(model_same, model_inter)
## Analysis of Variance Table
## 
## Model 1: Y ~ X1
## Model 2: Y ~ X1 + X2 + X1:X2
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     62 1475.2                                  
## 2     60  909.1  2    566.15 18.683 4.925e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. Plot the estimated regression functions for the two populations and describe the nature of the differences between them.

ggplot(av, aes(x = X1, y = Y, group = factor(X2), colour = factor(X2), shape = factor(X2))) +
  geom_point() +
  geom_smooth(aes(linetype = factor(X2)), method = "lm", formula = y ~ x, se = FALSE, size = 0.5) +
  labs(x = "Assessed valuation (X1)", y = "Selling prince (Y)") +
  scale_colour_discrete(name = "Lot location (X2)",
                        labels = c("non-corner", "corner")) +
  scale_shape_discrete(name = "Lot location (X2)",
                       labels = c("non-corner", "corner")) +
  scale_linetype_discrete(name = "Lot location (X2)",
                          labels = c("non-corner", "corner")) +
  theme_bw()

Combining the results of questions 1 and 2 with the plot above, the regression functions for the two lot locations are not the same, though for both, selling price and assessed valuation are linearly related: the larger the assessed valuation, the higher the selling price.
Lot location and assessed valuation interact in their effect on selling price: the regression lines for non-corner and corner lots differ in both slope and intercept, and each 1-unit increase in assessed valuation raises selling price more at non-corner locations. In other words, the difference in mean selling price between non-corner and corner locations grows as assessed valuation increases.