應用線性統計模式　作業一：Chapter 1 & 2

一、Chapter 1 觀念題

1. 試整合圖1之迴歸模型之圖示法，以解釋何謂「…“revert” or “regress” to the mean of the group, [which represents the] … tendency to be a regression to “mediocrity.”」（課本，p.5）。

data(father.son, package = "UsingR")
.rect <- data.frame(.xmin = c(63.5, 71.5), .xmax = c(64.5, 72.5), .ymin = -Inf, .ymax = Inf, .group = factor(c(64, 72)))
ggplot(father.son, aes(fheight, sheight)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, size = 0.5) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "blue") +
  geom_rect(data = .rect, aes(xmin = .xmin, xmax = .xmax, ymin = .ymin, ymax = .ymax, fill = .group), alpha = 0.3, inherit.aes = FALSE) +
  coord_cartesian(xlim = c(min(father.son$fheight), max(father.son$fheight))) +
  theme_bw()

Sir Francis Galton (1822-1911) 當時在研究親代的身高會不會因爲遺傳的因素而影響子代的身高，他發現高於平均身高的親代，他們的子代會傾向變矮些；而低於平均身高的親代，他們的子代則會傾向變高於他們的父母，反而不是說越高的父母會生出更高的小孩，因此Galton將這種現象稱為 “regression towards medicority” 趨於平凡。
在此我想用上圖來作為解釋。上圖中的資料點為 Pearson 紀錄1078對父親身高(X)和兒子身高(Y)(單位：in.)，其中\(\bar{X} = 67.7\)、\(\bar{Y} = 68.7\)，相關約為0.50。藍色虛線為 \(\tilde{Y} = 1 + X_i\)，來表示我們預期兒子身高應該會多於父親身高1單位；藍色實線為迴歸線 \(\hat{Y_i} = 33.87＋0.51X_i\)的結果。

Y_72mean <- father.son %>% filter(fheight < 72.5 & fheight >= 71.5) %>% summarise(mean(sheight))
Y_72mean

##   mean(sheight)
## 1      70.67719

fs_lm <- lm(sheight ~ fheight, father.son)
Y_72predict <- predict(fs_lm, data.frame(fheight = 72))
Y_72predict

##       1 
## 70.9013

我們以X = 72為例，圖中右方的淺藍垂直帶表示兒子身高的分配。依照我們先前的觀察，既然兒子平均身高高於父親1單位，那麼應該會\(\tilde{Y} = 73\)。然而實際上在父親身高為72 in.下，該兒子身高的平均約為70.67，反而與迴歸線預測的\(\hat{Y} = 70.9\)非常接近，這表示著高父親的兒子身高分配之平均變矮了。同樣的，在較矮的父親X = 64，兒子應預期為\(\tilde{Y} = 65\)，但實際上兒子身高分配(圖中左方的淺紅垂直帶)的平均為66.70，也是接近我們迴歸線所預測的66.79，也表示著矮父親的兒子身高分配之平均變高了。總體來說兒子的身高似乎都趨向於整體的平均點。

2. Consider the normal error regression model (1.24). Suppose that the parameter values are \(\beta_0 = 200\), \(\beta_1 = 5.0\), and \(\sigma = 4\).

(1) Plot this normal error regression model in the fashion of Figure 1.6. Show the distributions of Y for X =10, 20, and 40.

beta0 <- 200
beta1 <- 5
sigma <- 4
.X <- c(10, 20, 40)
.Y_hat <- beta0+beta1*.X 
.data <- data.frame(X = rep(.X, each = 10000),
                    Y_hat = rep(.Y_hat, each = 10000),
                    Y_pseudo = c(sapply(.Y_hat, function(.){rnorm(10000, ., sigma)})))

ggplot(.data, aes(X, Y_pseudo)) +
  geom_flat_violin(aes(group = Y_hat), fill = alpha(0.5)) +
  stat_summary(fun.y=mean, geom="point", colour="red") +
  geom_smooth(method="lm", se=FALSE) +
  theme_bw()

(2) Explain the meaning of the parameters \(\beta_0\) and \(\beta_1\). Assume that the scope of the model includes X=0

\(\beta_1\)為迴歸線的斜率，表示著當每增加X一個單位時，會增加多少的單位的Y(<-Y分配之平均值)
\(\beta_0\)為迴歸線的在Y軸上的截距，表示著當\(X=0\)時，Y分配之平均值會是多少。

3. In a simulation exercise, regression model (1.1) applies with \(\beta_0 = 100\), \(\beta_1 = 20\), and \(\sigma^2 = 25\). An observation on Y will be made for X=5.

(1) Can you state the exact probability that Y will fall between 195 and 205? Explain.

迴歸模型(1.1)為 \(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\) 說明了X和Y之間的線性統計關係，其中的 \(\epsilon_i\) 為一個 \(E\{\epsilon_i\}=0\) 且 \(\sigma^2\{\epsilon_i\}=\sigma^2\) 的隨機變數，而 \(\beta_0\)、\(\beta_1\)和\(X_i\) 皆為定值，因此 \(Y_i\) 在上述的模型關係中會是一個隨機變數，其平均值和變異數分別為 \(E\{Y_i\}=\beta_0 + \beta_1X_i\) 和 \(\sigma^2\{Y_i\}=\sigma^2\) 。然而我們並不知道 \(Y_i\) 確切的分配模型是什麼 (在模型1.1也為假設 \(\epsilon_i\) 的分配為何)，因此我們沒辦法確定 \(Y_i\) 的真實座落範圍為何。

(2) If the normal error regression model (1.24) is applicable, can you now state the exact probability that Y will fall between 195 and 205? If so, state it.

迴歸模型(1.24)為 \(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\)，但與上提不同的假設在於我們的隨機變數\(\epsilon_i\)是服從於 \(N(0, \sigma^2)\)，因此隨機變數 \(Y_i\) 也服從於 \(N(\beta_0 + \beta_1X_i, \sigma^2)\)。當 \(X = 5\) 時，該 Y 的平均值和變異數分別為 200 和 25 ，而座落於 [195, 205] 之間的實際機率為\[P(195<Y<205) = P(\frac{195-200}{5}<\frac{Y-\mu}{\sigma}<\frac{205-200}{5}) \approx 0.68\]

pnorm(205, mean = 200, sd = 5) - pnorm(195, mean = 200, sd = 5)

## [1] 0.6826895

4. Evaluate the following statement: “For the least squares method to be fully valid, it is required that the distribution of Y be normal.”

不一定。在使用 LSE 時並不需要假設 \(Y\) 或是 \(\epsilon\) 有任何的分配假設，我們還是可以依照給定的資料做出迴歸線。可是當我們想做某些參數的假設檢定時，就必須要他們有常態分佈的假設。

5. According to (1.17), \(\sum{\epsilon_i} = 0\) when regression model (1.1) is fitted to a set of n cases by the method of least squares. Is it also true\(\sum{\epsilon_i} = 0\)? Comment.

不一定會。隨機變數 \(\epsilon_i = Y_i-(\beta_0+\beta_1X)\)，我們在預設中 \(E\{\epsilon_i\} = 0\)、 \(\sigma^2\{\epsilon_i\} = \sigma^2\)，我們只能知道 \(E\{\sum(\epsilon_i\} = 0\) 但不能確定 \(\sum{\epsilon_i}\) 會不會也等於0。

6. Computer programmers employed by a software developer were asked to participate in a month-long training seminar. During the seminar, each employee was asked to record the number of hours spent in class preparation each week. After completing the seminar, the productivity level of each participant was measured. A positive linear statistical relationship between participants’ productivity levels and time spent in class preparation was found. The seminar leader concluded that increases in employee productivity are caused by increased class preparation time.

(1) Were the data used by the seminar leader observational or experimental data?

此情境中所使用的資料算是觀察資料 (observational data)，由於我們所觀察的預測變項並非是由 leader 所操弄控制的，而是經由觀察得到 preparation time 和 productivity 的值。

(2) Comment on the validity of the conclusion reached by the seminar leader.

Seminar leader 最終推論 preparation time 增加造成 productivity 增加之間有因果關係是有問題的。我們只能知道者兩者間有統計上顯著的正相關或線性關係，但還是不能進行因果推論，在這之中很有可能有其他沒觀察到的混淆變項共同響著 preparation time 和 productivity。

(3) Identify two or three alternative variables that might cause both the employee productivity scores and the employee class participation times to increase (decrease) simultaneously.

其他可能的變項如：員工對整體任務的動機、他們是否有其他多重工作在身、對軟體使用的熟悉度…等。

(4) How might the study be changed so that a valid conclusion about causal relationship between class preparation time and employee productivity can be reached?

如果進行實驗控制的話就較能說明 preparation time 和 productivity 之間的因果關聯。實驗可以將 preparation time 的長度設計成獨變項中不同組別，但在不同組別中除了操弄變項 (課程準備時間) 不同外，其他的人口變項或是任何因素盡可能的組間都相同，以及採隨機分配員工到不同的組別來降低系統性的誤差，最後才來觀察我們的依變項 productivity 在不同組別間的差距。如果最終發現當 preparation time 增加時， productivity 也跟著增加的話，同時也排除掉其他可能的原因影響，這時我們才能說 preparation time 和 productivity 之間有因果關係。

二、Chapter 2 觀念題

1. A person asks if there is a difference between the “mean response at \(X = X_h\)” and the “mean of m new observations at \(X = X_h\).” Reply.

Mean response at \(X = X_h\): 指的是在 normal error regression model 的假設下得到的迴歸線，在既有的 X 觀察值中，當預測變數 \(X = X_h\)，用來估計\(E[Y_h]\)可能真實的位置在哪。其估計的信賴區間 (C.I.) 為 \(\hat{Y}_h \pm t(1-\alpha/2; n-2)s\{\hat{Y}_h\}\)。
Mean of m new observations at \(X = X_h\): 則是指當有新個 m 組觀察值都在 \(X = X_{h(new)}\) 下，所預測值的平均 \(\hat{Y}_{h(new)}\) 用來預測真實的 \(E[Y_{h(new)}]\) 會座落在何處。其估計得預測區間 (P.I.) 為 \(\hat{Y}_{h(new)} \pm t(1-\alpha/2; n-2)s^2\{predmean\}\)，其中 \(s^2\{predmean\} = s\{\frac{\sum{Y_{h(new)}}}{m}-\hat{Y}_{h(new)}\} = \frac{MSE}{m } + s\{\hat{Y}_{h(new)}\}\)，這比前者的差別在於該變異數估計包含了 Y 本身在 \(X = X_h\) 的平均變異，再加上我們迴歸線估計出的 \(\hat{Y}_{h(new)}\) 抽樣分配的變異。然而在前者的 C.I.中，就只有迴歸線估計出的 \(\hat{Y}_{h}\) 抽樣分配的變異而已。因此 P.I. 通常會寬於 C.I.。

2. For conducting statistical tests concerning the parameter \(\beta_1\), why is the t test more versatile than the F test?

t-test 比較多用的原因在於：

t-test 可以使用 test one-sided hypothesis 或是 test two-sided hypothesis，而 F-test 則只能 test one-sided hypothesis;
除此之外 t-test 可以獲得參數的估計值和標準誤，也好去檢測不同種假設檢定;
在單迴歸下 t-test 能分別對 \(\beta_0\) 和 \(\beta_1\) 進行統計檢地，而 ANOVA approach 下的 F-test 來檢測迴歸的顯著性時，剛好會是檢測 \(\beta_1\) 而已。但是如果遇到多元迴歸時， t-test 較能方便地分別檢測各個參數的顯著性。

3. When testing whether or not \(\beta_1 = 0\), why is the F test a one-sided test even though \(H_a\) includes both \(\beta_1 < 0\) and \(\beta_1 > 0\)? [Hint: Refer to (2.57).]

\[\begin{align} F(1-\alpha; 1, n-2) &= \frac{SSR/1}{SSE/n-2} \\ &= \frac{b_1^2\sum{(X_i-\bar{X})^2}}{MSE} \\ &= \frac{b_1^2}{MSE/\sum{(X_i-\bar{X})^2}} \\ &= \frac{b_1^2}{s^2\{b_1\}} \\ &= \left(\frac{b_1-0}{s\{b_1\}}\right)^2 = t^2(1-\frac{\alpha}{2};n-2) \end{align}\] 由上方推導可以知道 F-ratio 其實是 t-ratio 的平方，所以在 t-test 雙尾檢定中 (\(\beta_1 > 0\) 和 \(\beta_1 < 0\)) 會等同於 F-test 的單尾檢定 (\(\beta_1^2 > 0\))。

三、資料分析

1. 初步分析

(1) Obtain the estimated regression function. Plot the estimated regression function and the data. Does a linear regression function appear to give a good fit here?

ampules <- data.frame(X = c(1, 0, 2, 0, 3, 1, 0, 1, 2, 0),
                      Y = c(16, 9, 17, 12, 22, 13, 8, 15, 19, 11))
ampules_lm <- lm(Y ~ X, ampules)
summary(ampules_lm)

## 
## Call:
## lm(formula = Y ~ X, data = ampules)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.2   -1.2    0.3    0.8    1.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.2000     0.6633  15.377 3.18e-07 ***
## X             4.0000     0.4690   8.528 2.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.483 on 8 degrees of freedom
## Multiple R-squared:  0.9009, Adjusted R-squared:  0.8885 
## F-statistic: 72.73 on 1 and 8 DF,  p-value: 2.749e-05

ggplot(ampules, aes(X, Y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  geom_point(aes(mean(X), mean(Y)), col = "red", shape = "cross", size = 3) + 
  theme_bw()

迴歸線為：\(\hat{Y} = 10.2 + 4.0X\)，且 \(R^2 = 0.90\) 非常高，此外，上圖中的資料點的趨勢和迴歸線，也看起來似乎是蠻吻合的，

(2) Verify that your fitted regression line goes through the point (\(\bar{X}\), \(\bar{Y}\)).

x_bar = mean(ampules$X) # x_bar == 1
y_bar = mean(ampules$Y) # y_bar == 14.2
predict(ampules_lm, data.frame(X = x_bar)) # == 14.2

##    1 
## 14.2

估計出來的回歸線確實會通過 X, Y 分別的平均值 (1, 14.2)。此外，其實在上圖中的紅色叉叉及為 (\(\bar{X}\), \(\bar{Y}\))，也確實通過迴歸線。

2. 殘差

(1) Obtain the residual for the first case. What is its relation to \(\epsilon_{1}\)?

e <- residuals(ampules_lm)
e

##    1    2    3    4    5    6    7    8    9   10 
##  1.8 -1.2 -1.2  1.8 -0.2 -1.2 -2.2  0.8  0.8  0.8

ampules 各別的殘差值如上所示。而第一項的殘差 e[1] == 1.8，是我們第一項的實際觀察值減去我們用迴歸線所預測的 \(\hat{Y_1}-(b_0+b_1X_1)\)；而誤差 \(\epsilon_{1}\) 則是只迴歸模型中實際觀測值和觀測值之期望值之間的差距 \(Y_1-E[Y-1] = Y_1-(\beta_0+\beta_1X_1)\)，是無法得知的。

(2) Compute \(\sum{e_i^2}\) and MSE. What is estimated by MSE?

SSE <- sum(e^2)
SSE # == 17.6

## [1] 17.6

MSE <- SSE / (10-2)
MSE # == 2.2

## [1] 2.2

MSE 是 \(\sigma^2\{\epsilon\}\) 的不偏估計值。

3. 估計與檢定

(1) \(\beta_1\)之估計與檢定

I. Estimate \(\beta_1\) with a 95 percent confidence interval. Interpret your interval estimate.

summary(ampules_lm)

## 
## Call:
## lm(formula = Y ~ X, data = ampules)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.2   -1.2    0.3    0.8    1.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.2000     0.6633  15.377 3.18e-07 ***
## X             4.0000     0.4690   8.528 2.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.483 on 8 degrees of freedom
## Multiple R-squared:  0.9009, Adjusted R-squared:  0.8885 
## F-statistic: 72.73 on 1 and 8 DF,  p-value: 2.749e-05

confint(ampules_lm)

##                2.5 %    97.5 %
## (Intercept) 8.670370 11.729630
## X           2.918388  5.081612

\(b_1\) 為 \(\beta_1\) 的點估計值，為 4.00。在 95% 的信心水準下我們知道他的信賴區間為 \[b_1 \pm t_{1-a/2;n-2}s\{b_1\} = 4.00 \pm t_{0.975,8}\times0.469\] 即為 [2.92, 5.08]。

II. Conduct a t test to decide whether or not there is a linear association between number of times a carton is transferred (X) and number of broken ampules (Y). Use a level of significance of .05. State the alternatives, decision rule, and conclusion. What is the P-value of the test?

本題的null hypothesis \(H_0: \beta_1 = 0\) vs. alternative hypothesis \(H_1: \beta_1 \neq 0\)。來表示 X 和 Y 是否有呈現線性關係。
Decision rule: \(if\,|t^*| \leq t(1-\alpha/2; n-2),\, conclude\,H_0 \\ if\,|t^*| > t(1-\alpha/2; n-2),\, conclude\,H_1\)
在上題的summary(ampules_lm)中就顯示了 t-test 的結果。 \(b_1\) 的 t_value 為 8.53 (> t_criteria = qt((1-0.05/2), 8) = 2.31)，而 \(b_1\) 的p_value為 2.75e-05 (< 0.001)，拒絕 \(H_0\) 而接受 \(H_1: \beta_1 \neq 0\) ，表示 X 和 Y 呈現線性關係在統計上是極為顯著的。也就是說如果每多一次轉運 (X)，會造成 4 個 ampules 破損。

(2) \(\beta_0\)之估計與檢定

I. \(\beta_0\) presents here the mean number of ampules broken when no transfers of the shipment are made-i.e., when X=0. Obtain a 95 percent confidence interval for \(\beta_ㄢ\) and interpret it.

本題的 null hypothesis \(H_0: \beta_0 = 0\) vs. alternative hypothesis \(H_1: \beta_0 \neq 0\) 。來表示當 X=0 時 (也就是沒有轉運) ， Y₀ 會不會等於 0 (也就是有沒有造成破損)。
Decision rule: \(if\,|t^*| \leq t(1-\alpha/2; n-2),\, conclude\,H_0 \\ if\,|t^*| > t(1-\alpha/2; n-2),\, conclude\,H_1\)
在上上題的 summary(ampules_lm) 中也顯示了 t-test 的結果。 \(b_0\) 為 \(\beta_0\) 的點估計值，為 10.20。其中 \(b_0\) 的 t_value 為 15.38 (> t_criteria = qt((1-0.05/2), 8) = 2.31)，而 \(b_0\) 的 p_value 為 3.18e-07 (< 0.001)，拒絕 \(H_0\) 而接受 \(H_1: \beta_1 \neq 0\) ，表示著該迴歸線得在Y軸上的截距不等於 0 在統計上是非常顯著的，也表示說當 X=0 (也就是沒有轉運) 的情況下，該 Y 值很可能不等於 0 (也就是 ampules 本來就有破損)。
而在95%的信心水準下我們知道 \(\beta_0\) 的信賴區間為 \[b_0 \pm t_{1-a/2;n-2}s\{b_0\} = 10.20 \pm t_{0.975,8}\times0.66\] 即為 [8.67, 11.72]。也就是我們有 95% 的信心水準 ampules 可能本來就有破損的個數約 8~12 之間。

II. A consultant has suggested, on the basis of previous experience, that the mean number of broken ampules should not exceed 9.0 when no transfers are made. Conduct an appropriate test, using α=.025. State the alternatives, decision rule, and conclusion. What is the P-value of the test?

本題的 null hypothesis \(H_0: \beta_0 \leq 9.0\) vs. alternative hypothesis \(H_1: \beta_0 > 9.0\)。
Decision rule: \(if\,t^* \leq t(1-\alpha; n-2),\, conclude\,H_0 \\ if\,t^* > t(1-\alpha; n-2),\, conclude\,H_1\)
我們知道\(\beta_0\)的抽樣分配會遵守t分配，而我們要檢驗單尾的t-test，因此

alpha <- 0.025
beta0 <- 9
beta0_hat <- summary(ampules_lm)$coefficients[1, 1] # == 10.2
beta0_hat_se <- summary(ampules_lm)$coefficients[1, 2] # == 0.663325
df <- ampules_lm$df.residual # == 8
t_value <- (beta0_hat - beta0) / beta0_hat_se # == 1.809068
t_criteria <- qt(1-alpha, df) # == 2.306004
print(ifelse((t_value > t_criteria), "reject H0", "accept H0"))

## [1] "accept H0"

p_value <- 1-pt(t_value, df)
p_value

## [1] 0.05402227

透過 t-test 我們得知接受 \(H_0: \beta_0 \leq 9.0\)，在沒有傳送的情況下沒有打破超過 9 個 ampules 在統計上雖是顯著的，然而他的 p-value 約大於 0.05 一些些，因此還是很有可能實際上超過 9 個以上。

(3) Obtain the power of your test in part (1) if actually \(\beta_1 = 2.0\). Assume \(\sigma\{b_1\} = .50\). Also obtain the power of your test in part (2) if actually \(\beta_0 = 11\). Assume \(\sigma\{b_0\} = .75\). (不確定這邊助教有沒有打錯數值，我私自改成跟課本習題一樣的了)

根據 part (1) 中的 null hypothesis \(H_0: \beta_1 = 0\) vs. alternative hypothesis \(H_1: \beta_1 \neq 0\)。在 \(\beta_1 = 2.0\) 和 \(\sigma\{b_1\} = .50\) 下其 power 為 \[\begin{align} Power &= Pr(Accept\,H_1|H_1\,is\,true) = Pr(Accept\,H_1|\beta_1 = 2) \\ &= Pr\left(\left| \frac{\hat{\beta}_1-0}{0.5} \right|> t_{1-0.05/2,8}|\beta_1 = 2\right) \\ &= Pr\left(\frac{\hat{\beta}_1-2}{0.5}>2.31-\frac{2}{0.5}\right) + Pr\left(\frac{\hat{\beta}_1-2}{0.5}<-2.31-\frac{2}{0.5}\right) \\ &= 0.9352 + 0.0001151 \approx 0.94 \end{align}\]

根據 part (2) 中的 null hypothesis \(H_0: \beta_0 \leq 9\) vs. alternative hypothesis \(H_1: \beta_0 > 0\)。在 \(\beta_0 = 11\)和\(\sigma\{b_1\} = .75\) 下其 power 為 \[\begin{align} Power &= Pr(Accept\,H_1|H_1\,is\,true) = Pr(Accept\,H_1|\beta_0 = 11) \\ &= Pr\left(\frac{\hat{\beta}_0-0}{0.75} > t_{1-0.05,8}|\beta_0 = 11\right) \\ &= Pr\left(\frac{\hat{\beta}_0-11}{0.75} > 1.86-\frac{(11-9)}{0.75}\right) \\ &= 1-Pr\left(z < 1.86-\frac{(2)}{0.75}\right) \\ &\approx 0.78 \end{align}\]

4. Prediction

(1) Because of changes in airline routes, shipments may have to be transferred more frequently than in the past. Estimate the mean breakage for the following numbers of transfers: X=2,4. Use separate 99 percent confidence intervals. Interpret your results.

predict(ampules_lm, data.frame(X = c(2, 4)), interval = "confidence", level = 0.99)

##    fit      lwr      upr
## 1 18.2 15.97429 20.42571
## 2 26.2 21.22316 31.17684

此題中我們是要估計在 X 重複測量下， \(\hat{Y}\) 的期望值與信賴區間會是多少。
在 X = 2 時， \(E[\hat{Y}] = 18.2\)，且 99% 信心水準下的 C.I.: \(15.97 < E[\hat{Y}] < 20.43\);
在 X = 4 時， \(E[\hat{Y}] = 26.2\)，且 99% 信心水準下的 C.I.: \(21.22 < E[\hat{Y}] < 31.18\)

(2) The next shipment will entail two transfers. Obtain a 99 percent prediction interval for the number of broken ampules for this shipment. Interpret your prediction interval.

predict(ampules_lm, data.frame(X = 2), interval = "prediction", level = 0.99)

##    fit      lwr      upr
## 1 18.2 12.74814 23.65186

與上題不同的是，我們這次是要來預估當新的 X 觀察值出現時，新的 \(\hat{Y}_{new}\) 的期望值和預測區間 (prediction interval, P.I.) 會是多少。
在 X = 2 時， \(E[\hat{Y}_{new}] = 18.2\)，但就 99% 信心水準下的PI則是: \(12.75 < E[\hat{Y}] < 23.65\)，比起上提預估的 C.I. 還要來得寬，主要是因為我們對 \(\hat{Y}_{new}\) 的變異數估計為 \[\begin{align} s^2\{pred\} &= s^2\{Y_{h(new)}\}+s^2\{\hat{Y}_h\}\\ &= MSE\left[1+\frac{1}{n}+\frac{(X_h-\bar{X})^2}{\sum{(X_i-\bar{X})^2}}\right] \end{align}\] 比起原本的 C.I.中只包含了 \(s^2\{\hat{Y}_h\}\) 的變異， P.I. 中還包含了自己本身 \(s^2\{Y_{h(new)}\}\) 的變異，因此 P.I. 的範圍自然要大些。

(3) In the next several days, three independent shipments will be made, each entailing two transfers. Obtain a 99 percent prediction interval for the mean number of ampules broken in the three shipments. Convert this interval into a 99 percent prediction interval for the total number of ampules broken in the three shipments.

我們對多個新觀察值 (在同 X level 下) 的平均下的預測，其變異數為 \[\begin{align} s^2\{predmean\} &= \sum{s^2\{\frac{Y_{h(new)}}{m}\}}+s^2\{\hat{Y}_h\}\\ &= MSE\left[\frac{1}{m}+\frac{1}{n}+\frac{(X_h-\bar{X})^2}{\sum{(X_i-\bar{X})^2}}\right] \end{align}\]

X_new <- data.frame(X = 2)
Y_hat_new <- predict(ampules_lm, X_new)# == 18.2
alpha <- 0.01
m <- 3
n <- nrow(ampules)
s_predmean = (MSE*(1/m+1/n+(X_new-mean(ampules$X))^2/sum((ampules$X-mean(ampules$X))^2))) %>% sqrt() # == 1.083
t_value_new <- qt(1-alpha/2, n-2) 
PI <-  data.frame(Y_hat_new,
                  Y_hat_new-t_value_new*s_predmean, 
                  Y_hat_new+t_value_new*s_predmean)
names(PI) <- c("fit", "lwr", "upr")
PI

##    fit      lwr      upr
## 1 18.2 14.56543 21.83457

PI_total <-  PI*3
PI_total

##    fit      lwr      upr
## 1 54.6 43.69628 65.50372

在三次皆為轉運兩次下其平均打破的 ampules，其 \(E[\hat{Y}_{new}] = 18.2\)，且 99% 信心水準下的 P.I.: \(14.57 < E[\hat{Y}_{new}] < 21.83\)，比上題中單一一次預測的 P.I. 的範圍來的窄些，由於我們三次平均下對平均的預測可能較穩；而若算回三次總共打破的 ampules，預期 \(E[\hat{Y}_{total}] = 54.6\)，且 99% 信心水準下的 P.I.: \(43.70 < E[\hat{Y}_{total}] < 65.50\)，是前者的三倍範圍。

(4) Determine the boundary values of the 99 percent confidence band for the regression line when \(X_h = 2\) and when \(X_h = 4\). Is your confidence band wider at these two points than the corresponding confidence intervals in part (1)? Should it be?

我們在計算 confidence band (C.B.) 採用的是 Working-Hotelling 的方法：\(\hat{Y_h} \pm Ws\{\hat{Y_h}\}\)，其中 \(W^2 = 2F(1-\alpha; 2, n-2)\)

X_h <-  data.frame(X = c(2, 4))
Y_hat_h <-  predict(ampules_lm, X_h) # c(18.2, 26.2)
alpha <- 0.01
n <- nrow(ampules)
W <- (2*qf(1-alpha, 2, n-2)) %>% sqrt()
s_Y_hat_h <- (MSE*(1/n+(X_h-mean(ampules$X))^2/sum((ampules$X-mean(ampules$X))^2))) %>% sqrt()
Y_hat_CB <- data.frame(Y_hat_h,
                       Y_hat_h-W*s_Y_hat_h, 
                       Y_hat_h+W*s_Y_hat_h, 
                       row.names = c("X_h=2", "X_h=4"))
names(Y_hat_CB) <- c("fit", "lwr", "upr")
Y_hat_CB

##        fit      lwr      upr
## X_h=2 18.2 15.44116 20.95884
## X_h=4 26.2 20.03104 32.36896

在 \(X_h = 2\) 下， \(E[\hat{Y_h}] = 18.2\)，且 99% 信心水準下的 C.B.: \(15.44 < \beta_0+\beta_1X_h < 22.96\)，確實比 (1) 中的 C.I. 還要來得寬；
在 \(X_h = 4\) 下， \(E[\hat{Y_h}] = 26.2\)，且 99% 信心水準下的 C.B.: \(20.03 < \beta_0+\beta_1X_h < 32.37\)，同樣也是比 (1) 中的 C.I. 還要來得寬。
確實是如此，因為最主要的原因在於算信賴區間中 C.I. 是以 t multiple，而這邊則是以 W multiple 來算 C.B.，而且 W 一定比 t 來得大，推導如下 \[\begin{align} t(1-\alpha; n-2) &= \frac{z}{\sqrt{\chi_{n-2}^2/n-2}} \\ &= \frac{\sqrt{\chi_1^2/1}}{\sqrt{\chi_{n-2}^2/n-2}} \\ &\leq \frac{\sqrt{\chi_2^2/1}}{\sqrt{\chi_{n-2}^2/n-2}} \\ &= \sqrt{2}\frac{\sqrt{\chi_2^2/2}}{\sqrt{\chi_{n-2}^2/n-2}} \\ &= \sqrt{2}F(1-\alpha; 2, n-2) = W \end{align}\] 所以在同樣的信心水準下， C.B. 一定比 C.I. 來得寬。

5. ANOVA table

(1) Set up the ANOVA table.

anova(ampules_lm)

## Analysis of Variance Table
## 
## Response: Y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## X          1  160.0   160.0  72.727 2.749e-05 ***
## Residuals  8   17.6     2.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

如上表所示。

(2) Conduct an F test to decide whether or not there is a linear association between the number of times a carton is transferred and the number of broken ampules; control the α risk at .05. State the alternatives, decision rule, and conclusion.

在本題中的 null hypothesis \(H_0: \beta_1 = 0\) vs. alternative hypothesis \(H_1: \beta_1 \neq 0\)。我們又曉得 \(MSE/MSR \sim F(1, n-2)\)，可以用來檢驗 X 和 Y 是否為線性關係。
Decision rule: \(if\,F^* \leq F(1-\alpha; 1, n-2),\, conclude\,H_0 \\ if\,F^* > F(1-\alpha; 1, n-2),\, conclude\,H_1\)
由上表可知結果是拒絕 \(H_0\) 的，且 p-value < 0.001，表示著 \(\beta_1\) 極可能不等於 0，也就是說 X 和 Y 的線性關係在統計上是非常顯著。

(3) Calculate \(R^2\) and r. What proportion of the variation in Y is accounted for by introducing X into the regression model?

r_squared <- summary(ampules_lm)$r.squared 
r_squared

## [1] 0.9009009

r <- sqrt(r_squared)
r

## [1] 0.949158

決定係數 \(R^2 = \frac{SSR}{SST}= 1 - \frac{SSE}{SSTO}\) 表示著 Y 的總變異量有多少的比例能被我們建立的迴歸線給解釋，而我們得到的 \(R^2\) 約為 0.9，顯示著迴歸線 fit 的非常不錯。
而相關係數 \(r = +\sqrt{R^2}\) 約為 0.95 (由於斜率是正的所以取正值)，顯示者 X, Y 這兩個變項的關連非常密切，當 X 越大， Y 也會跟著越大。

6. Suppose that the test in Airfreight breakage Problems is to be carried out by means of a general linear test.

(1) State the full and reduced models.

Full model: \(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\) vs. reduced model: \(Y_i = \beta_0 + \epsilon_i\)。若寫成假設檢定的方式為 null hypothesis \(H_0: \beta_1 = 0\) vs. alternative hypothesis \(H_1: \beta_1 \neq 0\)。

(2) Obtain (a) SSE(F), (b) SSE(R), (c) df_F, (d) df_R, (e) test statistic F* for the general linear test, (f) decision rule.

# (a)
SSE_F <- anova(ampules_lm)[2, 2] 
SSE_F

## [1] 17.6

# (c)
df_F <- anova(ampules_lm)[2, 1]
df_F

## [1] 8

# (b)
SSE_R <- anova(ampules_lm)[2] %>% sum() # == SSTO
SSE_R

## [1] 177.6

# (d)
df_R <- anova(ampules_lm)[1] %>% sum() # == df_total
df_R

## [1] 9

# (e)
F_value <- ((SSE_R-SSE_F)/(df_R-df_F)) / (SSE_F/df_F)
F_value

## [1] 72.72727

F_critical <- qf(1-0.05, (df_R-df_F), df_F)
F_critical

## [1] 5.317655

print(ifelse((F_value > F_critical), "reject H0", "accept H0"))

## [1] "reject H0"

SSE(F) = 17.6;
SSE(R) = 177.6;
df_F = 8;
df_R = 9;
F* = 72.73;
\(if\,F^* \leq F(1-\alpha; df_R-df_F, df_F),\, conclude\,H_0 \\ if\,F^* > F(1-\alpha; df_R-df_F, df_F),\, conclude\,H_1\)
=> F* = 1.14 > 5.31 = F_critical => accept H₁

(3) Are the test statistic F* and the decision rule for the general linear test numerically equivalent to those in Problem 5-(2)?

在單回歸模型中是一模一樣的。
兩者的 \(F^* = 72.23\)， \(F_{critical} = F(1-0.05, 1, 8) ＝ 5.31\)。
而兩者的 decision rule 也皆相同 \(if\,F^* \leq F_{critical},\, conclude\,H_0 \\ if\,F^* > F_{critical},\, conclude\,H_1\)。
結果也是相同的，皆是拒絕 H₀、接受 H₁。

應用線性統計模式 作業一：Chapter 1 & 2

B03605059 心理五 林子堯

2019-04-01