應用線性統計模式　作業二：Chapter 3, 4, 5

一、觀念題與推導

1. When joint confidence intervals for $\beta_0$ and $\beta_1$ are developed by the Bonferroni method with a family confidence coefficient of 90 percent, does this imply that 10 percent of the time the confidence interval for $\beta_0$ will be incorrect? That 5 percent of the time the confidence interval for $\beta_0$ will be incorrect and 5 percent of the time that for $\beta_1$ will be incorrect? Discuss.

兩者的敘述皆不正確。90% family confidence coefficient 指的是對 $\beta_0$ 和 $\beta_1$ 估計的信賴區間，兩者皆有包含到真值 $\beta_0$ 和 $\beta_1$ 的機率至少大於 90%。
換句話說，發生錯誤 (不管是沒有包含到 $\beta_0$ 或是沒有包含到 $\beta_1$，還是兩者參數同時都沒包含到) 的機率總和會小於等於 10%。此外，並沒有說 $\beta_0$ 和 $\beta_1$ 各自錯誤的機率一定都會是 5%。

2. Show that $\hat{Y}$ as defined in (4.15) for Iinear regression through the origin is an unbiased estimator of E{Y}.

通過原點 (0, 0) 的回歸模型為 $E\{Y\} = \beta_1X$。
根據 (4.14) 的結果，我們得知估計值 $b_1 = \frac{\sum{X_iY_i}}{\sum{X_i^2}}$，其期望值為 \[\begin{align} E\{b_1\} &= E\{\frac{\sum{X_iY_i}}{\sum{X_i^2}}\} \\ &= \frac{1}{\sum{X_i^2}}\sum{X_iE\{Y_i\}} \\ &= \frac{1}{\sum{X_i^2}}\sum{X_i(\beta_1X_1)} \\ &= \frac{1}{\sum{X_i^2}}\beta_1\sum{X_i^2} &= \beta_1 \end{align}\] 會是 $\beta_1$ 的不偏估計。
而根據 (4.14) 的結果，估計值 $\hat{Y} = b_1X_1$，該期望值則為 \[\begin{align} E\{\hat{Y}\} &= E\{b_1X_1\} \\ &= E\{b_1\}X_1 \\ &= \beta_1X_1 &= E\{Y\} \end{align}\] 會是 $E\{Y\}$ 的不偏估計，得證。

3. Consider the least squares estimator b given in (5.60). Using matrix methods show that b is an unbiased estimator.

已矩陣的方式描述迴歸模型為 $\mathbf{Y = \boldsymbol\beta X}$。
根據 (5.60) 的結果，我們得知 $\beta$ 的估計值 $\mathbf{b = (X'X)^{-1}X'Y}$，其期望值為 \[\begin{align} \mathbf{E\{b\}} &= \mathbf{E\{(X'X)^{-1}X'Y}\} \\ &= \mathbf{(X'X)^{-1}X'E\{Y\}} \\ &= \mathbf{(X'X)^{-1}X'(\boldsymbol\beta X)} \\ &= \mathbf{(X'X)^{-1}(X'X)\boldsymbol\beta} = \boldsymbol\beta \end{align}\] 為 $\boldsymbol\beta$ 得不偏估計，得證。

二、資料分析 1

Diagnostics and Remedial Measures

X <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) # year
Y <- c(98, 135, 162, 178, 221, 232, 283, 300, 374, 395) # sales in thousands of units
sales <- data.frame(X, Y)
sales

##    X   Y
## 1  0  98
## 2  1 135
## 3  2 162
## 4  3 178
## 5  4 221
## 6  5 232
## 7  6 283
## 8  7 300
## 9  8 374
## 10 9 395

(a) Prepare a scatter plot of the data. Does a linear relation appear adequate here?

library(ggplot2)
library(dplyr)
ggplot(sales, aes(X, Y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, alpha = 0.5) +
  scale_x_continuous(breaks = X) +
  labs(x = "Year", y = "Sales (in thousand)") +
  theme_bw()

由上圖中資料點看似成線性關係，但又有一點點彎曲曲線的感覺，同時某些點卻無法落入迴歸線的信賴區間內，因此該迴歸線可能還不夠完好，可能真的不那麼線性。

(b) Use the Box-Cox procedure and standardization (3.36) to find an appropriate power transformation of Y. Evaluate SSE for λ = .3, .4, .5, .6, .7. What transformation of Y is suggested?

boxcox.sse <- function(lambda, model) {
  X <- model.frame(model)$X
  Y <- model.frame(model)$Y
  n <- nrow(model.frame(model))
  K2 <- prod(Y)^(1/n)            # (3.36a)
  K1 <- 1/(lambda*K2^(lambda-1)) # (3.36b)
  
  if (lambda == 0) {
    W <- K2*log(Y)        # (3.36)
  } else{
    W <- K1*(Y^lambda-1)  # (3.36)
  }
  # Deviance = Residual Sum of Squares, SSE
  return(deviance(lm(W ~ X))) 
}

lambda <- seq(-1, 1, 0.1)
SSE <- sapply(lambda, boxcox.sse, lm(Y ~ X, sales))
.data <- data.frame(lambda, SSE)
.data

##    lambda        SSE
## 1    -1.0 13670.3269
## 2    -0.9 11629.0334
## 3    -0.8  9842.1378
## 4    -0.7  8281.9520
## 5    -0.6  6924.2528
## 6    -0.5  5747.8831
## 7    -0.4  4734.4047
## 8    -0.3  3867.7951
## 9    -0.2  3134.1829
## 10   -0.1  2521.6190
## 11    0.0  2019.8767
## 12    0.1  1620.2804
## 13    0.2  1315.5569
## 14    0.3  1099.7093
## 15    0.4   967.9088
## 16    0.5   916.4048
## 17    0.6   942.4498
## 18    0.7  1044.2384
## 19    0.8  1220.8598
## 20    0.9  1472.2614
## 21    1.0  1799.2242

ggplot(.data, aes(lambda, SSE)) +
  geom_line() +
  theme_bw()

透過 Box-Cox 轉換得知，在 $\lambda$ = .3, .4, .5, .6, .7，其 W 對 X 的線性模型之 SSE 分別為 1099.71, 969.91, 916.40, 942.45, 1044.24，如上表所示。從上表和上圖中可發現當 $\lambda = .5$ 時，有最小值的 SSE 出現，且相較於附近的其它值， .5 較容易解釋其意義性，因此根據 Box-Cox 轉換結果暗示，新的預測變項可作 $Y' = \sqrt{Y}$ 此轉換，較能符合線性模型的基本預設。

(c) Use the transformation $Y' = \sqrt{Y}$ and obtain the estimated linear regression function for the transformed data.

sales$Y_t <- sqrt(sales$Y)
sales_lm <- lm(Y_t ~ X, sales)
summary(sales_lm)

## 
## Call:
## lm(formula = Y_t ~ X, data = sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47447 -0.30811  0.01549  0.29541  0.46781 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.26093    0.21290   48.20 3.80e-11 ***
## X            1.07629    0.03988   26.99 3.83e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3622 on 8 degrees of freedom
## Multiple R-squared:  0.9891, Adjusted R-squared:  0.9878 
## F-statistic: 728.4 on 1 and 8 DF,  p-value: 3.826e-09

由統計報表得知，經轉換後的迴歸模型的結果為 $\hat{Y'} = 10.26 + 1.08 X$。其中，截距 $b_0$ 和斜率 $b_1$ 在統計上都顯著不等於 0 (p-value < 0.001)， $R^2 \approx 0.99$ 表示該回歸線幾乎快 99% $Y'$ 的變異了，也表示此模型較有解釋能力。

(d) Plot the estimated regression line and the transformed data. Does the regression line appear to be a good fit to the transformed data?

ggplot(sales, aes(X, Y_t)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, alpha = 0.5) +
  scale_x_continuous(breaks = X) +
  labs(x = "Year", y = "Square root of Sales (in thousand)") +
  theme_bw()

轉換後的資料點，看起來比起 part (a) 線性關係更為明顯了，而且該迴歸線似乎 fitted 得更好了，且資料點也大致多都落入在信賴區間內。

(e) Obtain the residuals and plot them against the fitted values. Also prepare a normal probability plot. What do your plots show?

sales$Y_fit <- fitted.values(sales_lm)
sales$residuals <- residuals(sales_lm)
sales %>% select(X, Y_t, Y_fit, residuals)

##    X       Y_t    Y_fit   residuals
## 1  0  9.899495 10.26093 -0.36143656
## 2  1 11.618950 11.33722  0.28172678
## 3  2 12.727922 12.41352  0.31440703
## 4  3 13.341664 13.48981 -0.14814273
## 5  4 14.866069 14.56610  0.29997018
## 6  5 15.231546 15.64239 -0.41084412
## 7  6 16.822604 16.71868  0.10392174
## 8  7 17.320508 17.79497 -0.47446579
## 9  8 19.339080 18.87127  0.46781397
## 10 9 19.874607 19.94756 -0.07295049

上方資料表中 X 為 years，Y_t 為轉換過開根號後的 sales (in thousands)，而 Y_fit 和 residuals 分別代表迴歸線 $\hat{Y'} = 10.26 + 1.08 X$ 的估計值 $\hat{Y'}$ 和殘差 $e = Y' - \hat{Y'}$。

ggplot(sales, aes(Y_fit, residuals)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 0, color = "red", linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals") +
  theme_bw()

而上圖為 residuals plot，可以觀察到 residuals 的分佈在不同的 X 值下是蠻均勻上下分布的 (對稱於 0)，這也暗示著 residuals 很可能具有變異數同質 (homoscedasticity)，而符合線性模型一開始對 error 的假設。

ggplot(sales, aes(sample = residuals)) +
  geom_qq() +
  geom_qq_line(color = "blue", line.p = c(0.25, 0.75), linetype = "dashed") +
  labs(x = "Theoretical quantiles", y = "residuals quantiles", title = "Normal Q-Q plot") +
  theme_bw()

上圖為 Q-Q plot 和 Q-Q line (藍色虛線)，顯示著 residuals 的 quantiles 分佈以及相對應理論上 (常態分佈下) 的 quantiles 之間的散佈圖。可觀察到大致上觀察值與理論值似乎呈現線性 (與藍色虛線的 Q-Q line 相對應)，暗示著 residuals 很有可能與常態分配相似，也表示著 error 是出自於常態分配的而符合迴歸模型一開始的預設。雖然在極端的兩側左方上翹、右方下彎，很有可能實際上該分配比起常態兩側的尾部要 thinner。

(f) Express the estimated regression function in the original units.

我們在前幾題的迴歸線中的 responsed variable 是經過 $Y' = \sqrt{Y}$ 轉換而來的，因此換回原先資料點，該迴歸線會為 $\hat{Y} = \hat{Y'}^2 = (10.26 + 1.08 X)^2$。

三、資料分析 2

ampules <- data.frame(X = c(1, 0, 2, 0, 3, 1, 0, 1, 2, 0),
                      Y = c(16, 9, 17, 12, 22, 13, 8, 15, 19, 11))
ampules

ampules.lm <- lm(Y ~ X, ampules)

1. Diagnostics and Remedial Measures

(a) Plot the residuals $e_i$ against $X_i$ to ascertain whether any departures from regression model (2.1) are evident. What is your conclusion?

ampules$residuals <- residuals(ampules.lm)
ggplot(ampules, aes(X, residuals)) +
  geom_point() +
  geom_abline(slope = 0, intercept = 0, color = "red", linetype = "dashed") + 
  labs(x = "Shipment route", y = "Residuals") +
  theme_bw()

可發現 residuals plot 中，residuals 在 shipment route (X) 較小的時候偏差較大，而在 shipment route 較大時則偏差較小，整體呈現 megaphone 的形式，暗示著變異數很有可能不同質，這會導致不符合迴歸模型一開始的假設。

(b) Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation between the ordered residuals and their expected values under normality to ascertain whether the normality assumption is reasonable here. Use Table B.6 and α = .01. What do you conclude?

# Q-Q plot
ggplot(ampules, aes(sample = residuals)) +
  geom_qq() +
  geom_qq_line(color = "blue", line.p = c(0.25, 0.75), linetype = "dashed") +
  labs(x = "Theoretical quantiles", y = "residuals quantiles", title = "Normal Q-Q plot") +
  theme_bw()

上圖為 Q-Q plot 和 Q-Q line (藍色虛線)，顯示著 residuals 的 quantiles 分佈以及相對應理論上 (常態分佈下) 的 quantiles 之間的散佈圖。可觀察到大致上觀察值與理論值似乎呈現線性 (與藍色虛線的 Q-Q line 相對應)，暗示著 residuals 很有可能與常態分配相似，但有些地方還是有蠻大的偏移。

# correlation
ampules_qq <- qqnorm(ampules$residuals, plot.it = FALSE)
cor(ampules_qq$x, ampules_qq$y)

## [1] 0.9609751

Correlation test for normality 可以來檢定 residuals 是否為常態分配。null hypothesis $H_0: normal$ vs. alternative hypothesis $H_1: not\,normal$。計算其 residuals 和 expected values 分別 quantiles 之相關 $r \approx 0.96$，而查 Table B.6 在 α = .01 & n = 10 下，其 $criterial\,r^* \approx .879$，$r = 0.96 > 0.879 = r^*$ 表示接受 $H_0$，因此在統計上我們沒有拒絕 residual 是從常態分佈而來的。
總結來說，由上述的 Q-Q plot 和 correlation test for normality 結果可得知，residual 似乎是有遵從是常態分佈的假設。

(c) Prepare a time plot of the residuals. What information is provided by your plot?

ggplot(ampules, aes(1:length(residuals), residuals)) + 
  geom_point() + 
  geom_line(linetype = "longdash") +
  geom_abline(slope = 0, intercept = 0, color = "red", linetype = "dashed") +
  scale_x_continuous(breaks = 1:length(Y)) +
  labs(x = "Run", y = "Residuals") +
  theme_bw()

上圖顯示著 time plot of the residuals，似乎 residuals 對時間上沒有特別的規律存在。

(d) Assume that (3.10) is applicable and conduct the Breusch-Pagan test to determine whether or not the error variance varies with the level of X. Use α = .10. State the alternatives, decision rule, and conclusion. Does your conclusion support your preliminary findings in part (a)?

library(lmtest)
bptest(ampules.lm, student = FALSE)

## 
##  Breusch-Pagan test
## 
## data:  ampules.lm
## BP = 1.0331, df = 1, p-value = 0.3094

qchisq(p = 1-.10, df = 1)

## [1] 2.705543

雖然說 Breusch-Pagan test 假定是大樣本的資料且 error 得要符合獨立和常態的假設下來檢驗 error 使否變異數同質，但在本題假設可以使用 B-P test 情況下來計算 $log_e\sigma_i^2 = \gamma_0 + \gamma_1X_i$ 表 residuals 的變異 ($\sigma_i^2$) 會不會隨著預測變項 $X_i$ 有線性關係。我們的假設檢定為 null hypothesis $H_0: \gamma_0 = 0$ vs. alternative hypothesis $H_1: \gamma_0 \neq 0$。
其結果為 $\chi_{BP} = 1.03$，而統計量 $\chi_{BP}$ 會趨近 Chi-sq 分佈，而 $\chi_{criteria} = 2.71 > 1.03 = \chi_{BP}$, p-value = 0.31 > 0.10 = $\alpha$，因此接受 null hypothesis $H_0$，表示在統計上 error 的變異數很可能是同質的。
但此結果並與 part (a) 中觀察到 residual plot 的結果有些違背，該圖顯示著有 megaphone 的形式且暗示著變異數可能不同質，但 B-P test 的結果卻是不顯著，我們無法說變異數不是不同質。然而這其實是可以理解的，由於此題沒有符合原先 B-P test 原先的假設，因此很有可能得到我們不預期的結果。

(e) Perform the F test to determine whether or not there is lack of fit of a linear regression function; use α = .01. State the alternatives, decision rule, and conclusion.

ampules$fit <- fitted.values(ampules.lm)
.Yj_bar <- ampules %>% 
  group_by(X) %>% 
  summarise(num = length(X), Yj_bar = mean(Y))
ampules <- ampules %>% 
  left_join(.Yj_bar, by = "X")
ampules

##    X  Y residuals  fit num   Yj_bar
## 1  1 16       1.8 14.2   3 14.66667
## 2  0  9      -1.2 10.2   4 10.00000
## 3  2 17      -1.2 18.2   2 18.00000
## 4  0 12       1.8 10.2   4 10.00000
## 5  3 22      -0.2 22.2   1 22.00000
## 6  1 13      -1.2 14.2   3 14.66667
## 7  0  8      -2.2 10.2   4 10.00000
## 8  1 15       0.8 14.2   3 14.66667
## 9  2 19       0.8 18.2   2 18.00000
## 10 0 11       0.8 10.2   4 10.00000

SSLF<- sum((ampules$Yj_bar-ampules$fit)^2) # 0.93
SSPE <- sum((ampules$Y-ampules$Yj_bar)^2)  # 16.l67
LF_df <- 4 - 2  # == c - p
PE_df <- 10 - 4 # == n - c
MSLF <- SSLF / LF_df # 0.47
MSPE <- SSPE / PE_df # 2.78
F_value <- MSLF / MSPE # 0.168
F_critical <- qf(p = 1-0.01, df1 = LF_df, df2 = PE_df) # 10.92
print(ifelse((F_value > F_critical), "reject H0", "accept H0"))

## [1] "accept H0"

在 Lack of fit 的 F test 中，會檢定此筆資料適不適合使用我們預想的回歸模型，我們的 full model 為 $Y_{ij} = \mu_j + \epsilon_{ij}$ 而 reduced model 為 $Y_{ij} = \beta_0 + \beta_1X_j + \epsilon_{ij}$，其中 i 指的是受試者 (i = 1,…,n)、j 指的是有多少個不同 X 值 (j = 1,…,c)、$\mu_j$ 指的是在分別不同 $X_j$ 值下其對應的觀察值 $Y_{ij}$ 之期望值 (平均數)。若我們寫成假設檢定會是：null hypothesis $H_0: E\{Y_{ij}\} = \beta_0 + \beta_1X_j$ vs. alternative hypothesis $H_1: E\{Y_{ij}\} \neq \beta_0 + \beta_1X_j$。
Decision rule: $if\,F^* \leq F(1-\alpha; c-p, n-c),\, conclude\,H_0 \\ if\,F^* > F(1-\alpha; c-p, n-c),\, conclude\,H_1$
最終結果為 $F^* ＝ 0.17 \leq 10.92 ＝ F_{critical}$，因此接受 $H_0$，表示此迴歸是線性的，在統計上是可以用 $Y_{ij} = \beta_0 + \beta_1X_j + \epsilon_{ij}$ 來估計此比資料的線性迴歸。

(補充：r code for lack of fit & anova table。Reference: https://rpubs.com/bryangoodrich/)

# 補充：General ANOVA Table for Testing Lack of Fit of Simple Linear Regression Function and ANOVA Table
fit.aov <- anova(update(ampules.lm, . ~ . + factor(X)))
fit.aov

## Analysis of Variance Table
## 
## Response: Y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## X          1 160.000 160.000  57.600 0.0002722 ***
## factor(X)  2   0.933   0.467   0.168 0.8491966    
## Residuals  6  16.667   2.778                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

as.table(cbind(
  'SS' = c('SSR'   =     fit.aov[1,   2], 
           'SSE'   = sum(fit.aov[2:3, 2]),
           'SSLF'  =     fit.aov[2,   2],
           'SSPE'  =     fit.aov[3,   2],
           'Total' = sum(fit.aov[1:3, 2])), 

  'Df' = c(              fit.aov[1,   1],
                     sum(fit.aov[2:3, 1]),
                         fit.aov[2,   1],
                         fit.aov[3,   1],
                     sum(fit.aov[1:3, 1])),  

  'MS' = c(              fit.aov[1,   3],
                     sum(fit.aov[2:3, 2]) / sum(fit.aov[2:3, 1]),
                         fit.aov[2,   3],
                         fit.aov[3,   3],
                         NA)  
))

##                SS          Df          MS
## SSR   160.0000000   1.0000000 160.0000000
## SSE    17.6000000   8.0000000   2.2000000
## SSLF    0.9333333   2.0000000   0.4666667
## SSPE   16.6666667   6.0000000   2.7777778
## Total 177.6000000   9.0000000

2. Simultaneous Inferences and Other Topics in Regression Analysis

(a) Obtain Bonferroni joint confidence intervals for $\beta_0$ and $\beta_1$, using a 99 percent family confidence coefficient. Interpret your confidence intervals.

# confci(ampules.lm, level = 0.995)
.alpha <-  1 - 0.99
.df <-  10 - 2
B <-  qt(1-.alpha/2/2, .df) # == 3.83
beta <- summary(ampules.lm)$coefficients[,1]
s_beta <- summary(ampules.lm)$coefficients[,2]
conf <- data.frame(coef = beta,
                   sd = s_beta,
                   lower = beta-s_beta*B, 
                   upper = beta+s_beta*B)
conf

##             coef        sd    lower     upper
## (Intercept) 10.2 0.6633250 7.657795 12.742205
## X            4.0 0.4690416 2.202389  5.797611

我們要對 $\beta_0$ 和 $\beta_1$ 做同時推論時，family confidence coefficient = 99% ($\alpha = 0.01$) 下，依照 Bonferroni joint confidence intervals 的計算，$\beta_0$ 和 $\beta_1$ 各自的 confidence coefficient 得是 $1 - \alpha/2 = 0.995$ 能滿組此條件。因此，$\beta_0$ 和 $\beta_1$ 分別的 confidence intervals 為：\[ b_0 \pm t_{1-a/4;n-2}s\{b_0\} = 10.20 \pm 3.83\times0.66 = [7.66,\,12.74]\]\[b_1 \pm t_{1-a/4;n-2}s\{b_1\} = 4.0 \pm 3.83\times0.47 = [2.20,\,5.80]\] 表示我們對 $\beta_0$ 和 $\beta_1$ 同時估計出的點估計 $b_0$ 和 $b_1$ 及其兩者的信賴區間，我們有 99% 的信心水準下會「同時」包含到真值 $\beta_0$ 和 $\beta_1$。簡單來說，如果我們對 $\beta_0$ 和 $\beta_1$ 重複抽樣且用同樣的方式估計出其信賴區間， 100 次中只少不會小於 99 次，這兩個信賴區間同時會分別含括到自己要估計的參數 $\beta_0$ 和 $\beta_1$ 的真實值。

(b) It is desired to obtain interval estimates of the mean number of broken ampules when there are 0, I, and 2 transfers for a shipment, using a 95 percent family confidence coefficient. Obtain the desired confidence intervals, using the Working-Hotelling procedure.

ci.sim <- function(model, data, isNew = FALSE, method = c("B", "W", "S"), alpha = 0.05){
  df <- nrow(model.frame(model)) - length(coef(model))
  g <- length(unique(data$X))
  method <- match.arg(method)
  if (isNew) {
    .interval <- "prediction"
    multiple <- switch(method,
                       "B" = qt(1-alpha/2/g, df),
                       "S" = sqrt(g*qf(1-alpha, g, df)))
  } else {
    .interval <- "confidence"
    multiple <- switch(method,
                       "B" = qt(1-alpha/2/g, df),
                       "W" = sqrt(2*qf(1-alpha, 2, df)))
  }
  CI <- predict(model, data, se.fit = TRUE, interval = .interval)
  if (isNew) {
    sprd <- sqrt(CI$residual.scale^2 + CI$se.fit^2)
  } else {
    sprd <- CI$se.fit 
  }
  estimate <- list()
  estimate$CI <- data.frame(X = data, 
                         fit = CI$fit[,1],
                         lower = CI$fit[,1]-multiple*sprd,
                         upper = CI$fit[,1]+multiple*sprd)
  estimate$multiple <- data.frame(multiple = multiple, row.names = method)
  return(estimate)
}

X_h <- data.frame(X = c(0, 1, 2))
Y_h_hat <- ci.sim(ampules.lm, X_h, isNew = FALSE, method = "W", alpha = 0.05)
Y_h_hat

## $CI
##   X  fit     lower    upper
## 1 0 10.2  8.219118 12.18088
## 2 1 14.2 12.799305 15.60070
## 3 2 18.2 16.219118 20.18088
## 
## $multiple
##   multiple
## W 2.986292

本題中是要估計在不同的 $X_h = 0, 1,2$ 且在 95% family confidence coefficient ($\alpha = 0.05$) 下，其對應的期望值 $E\{\hat{Y}_h\}$ 之信賴區間 (confident interval, C.I.) 為多少。在這邊採用 Working-Hotelling procedure，因此我們算 C.I. 的計算如下：\[\hat{Y}_h \pm Ws\{\hat{Y}_h\}\] 其中的 multiple W 為：\[ W = \sqrt{2F(1-\alpha; 2, n-2)}\]
最終計算的結果我們知道 multiple W = 2.99，而各個 $X_h = 0, 1, 2$ 下，$E\{Y_h\}$ 分別的 C.I. 為 [8.22, 12.18]、[12.80, 15.60]、[16.22, 20.18]。

(c) Are the confidence intervals obtained in part (b) more efficient than Bonferroni intervals here? Explain.

Y_h_hat <- ci.sim(ampules.lm, X_h, isNew = FALSE, method = "B", alpha = 0.05)
Y_h_hat

## $CI
##   X  fit    lower    upper
## 1 0 10.2  8.19957 12.20043
## 2 1 14.2 12.78548 15.61452
## 3 2 18.2 16.19957 20.20043
## 
## $multiple
##   multiple
## B 3.015762

如果我們採用 Bonferroni procedure 計算的話，其 C.I. 的計算則為：\[\hat{Y}_h \pm Bs\{\hat{Y}_h\}\] 與上題不同的是 multiple B 為：\[ B = t(1-\alpha/2g; n-2)\] 其中的 g 指的是有多少個不同的 $X_h$ 值。
由上方的計算結果我們得知 multiple B = 3.02，要比 multiple W 要來得大，導致 $E\{Y_h\}$ 分別的 C.I. 也會比 Working-Hotelling procedure 還要來的大。如果要做假設檢定的情況下會較難拒絕，因此在 part (b) 所使用的 Working-Hotelling procedure 會較有效能的。

(d) The next three shipments will make 0, 1, and 2 transfers, respectively. Obtain prediction intervals for the number of broken ampules for each of these three shipments, using the Scheffé procedure and a 95 percent family confidence coefficient.

X_pred <- data.frame(X = c(0, 1, 2))
Y_pred_hat <- ci.sim(ampules.lm, X_pred, isNew = TRUE, method = "S", alpha = 0.05)
Y_pred_hat

## $CI
##   X  fit     lower    upper
## 1 0 10.2  4.525130 15.87487
## 2 1 14.2  8.766726 19.63327
## 3 2 18.2 12.525130 23.87487
## 
## $multiple
##   multiple
## S 3.492641

與上方 part (b), (c) 不同的是，在這邊是要預估新的不同的預測變項 $X_{pred} = 0, 1,2$，且在 95% family confidence coefficient ($\alpha = 0.05$) 下，其預測的反應變項 $\hat{Y}_{h(new)}$ 之預測區間 (prediction, interval, P.I.) 為多少，。在這邊採用 Scheffé procedure，因此我們算 P.I. 的計算如下：\[\hat{Y}_h \pm Ss\{pred\}\] 其中的 multiple S 為：\[ S = \sqrt{gF(1-\alpha; g, n-2)}\] 同樣的，g 指的是有多少個不同的 $X_{pred}$ 值。
最終計算的結果我們知道 multiple S = 3.49，而各個 $X_{pred} = 0, 1, 2$ 下，$\hat{Y}_{h(new)}$ 分別的 P.I. 為 [4.52, 15.88]、[8.76, 19.63]、[12.52, 23.88]。

(e) Would the Bonferroni procedure have been more efficient in developing the prediction intervals in part (d)? Explain.

Y_pred_hat <- ci.sim(ampules.lm, X_pred, isNew = TRUE, method = "B", alpha = 0.05)
Y_pred_hat

## $CI
##   X  fit     lower    upper
## 1 0 10.2  5.299967 15.10003
## 2 1 14.2  9.508576 18.89142
## 3 2 18.2 13.299967 23.10003
## 
## $multiple
##   multiple
## B 3.015762

如果我們採用 Bonferroni procedure 計算的話，其 P.I. 的計算則為：\[\hat{Y}_{h(new)} \pm Bs\{pred\}\] 與上題不同的是 multiple B 為：\[ B = t(1-\alpha/2g; n-2)\] 其中的 g 指的是有多少個不同的 $X_h$ 值。
計算結果得知 multiple B = 3.02，要比 multiple S 要來得小，導致 P.I. 也要比 Scheffé procedure 還要來的窄。如果要做假設檢定的情況下會較容易拒絕，因此比起 part (d)，part (e) 所使用的 Bonferroni procedure 會較有效能。

3. Matrix Approach to Simple Linear Regression Analysis

(a) Using matrix methods, obtain the following: (1) $\mathbf{(X'X)^{-1}}$, (2) $\mathbf{b}$, (3) $\mathbf{e}$, (4) $\mathbf{H}$, (5) SSE, (6) $\mathbf{s^2\{b\}}$, (7) $\hat{Y_h}$ when $X_h = 2$2, (8) $s^2\{\hat{Y_h}\}$ when $X_h = 2$.

以知 \[\mathbf{X} = \begin{bmatrix} 1 & 1\\ 1 & 0\\ 1 & 2\\ 1 & 0\\ 1 & 3\\ 1 & 1\\ 1 & 0\\ 1 & 1\\ 1 & 2\\ 1 & 0 \end{bmatrix}, \mathbf{Y} = \begin{bmatrix} 16\\ 9\\ 17\\ 12\\ 22\\ 13\\ 8\\ 15\\ 19 \\ 11 \end{bmatrix}, \mathbf{X'X} = \begin{bmatrix} \dots & 1 & \dots\\ \dots & x_i & \dots \end{bmatrix} \begin{bmatrix} \vdots & \vdots\\ 1 & x_i \\ \vdots & \vdots \end{bmatrix} = \begin{bmatrix} n & \sum{X_i}\\ \sum{X_i} & \sum{X_i^2} \end{bmatrix} = \begin{bmatrix} 10 & 10\\ 10 & 20 \end{bmatrix}\]

$\mathbf{(X'X)^{-1}}$
首先，先求出 $X'X$ 的 determinate: \[D = \mathbf{|X'X|} = |10*20-10*10| = 100\] 接續，計算 $X'X$ 的 inverse: \[\mathbf{(X'X)^{-1}} = \frac{1}{D} \begin{bmatrix} 20 & -10\\ -10 & 10 \end{bmatrix} = \begin{bmatrix} 0.2 & -0.1\\ -0.1 & 0.1 \end{bmatrix} \]
$\mathbf{b}$
\[\begin{align} \mathbf{b} &= \mathbf{(X'X)^{-1}X'Y} \\ &= \begin{bmatrix} 0.2 & -0.1\\ -0.1 & 0.1\end{bmatrix} \begin{bmatrix} \dots & 1 & \dots\\ \dots & x_i & \dots\end{bmatrix} \begin{bmatrix} \vdots\\ y_i\\ \vdots \end{bmatrix} \\ &= \begin{bmatrix} 0.2 & -0.1\\ -0.1 & 0.1\end{bmatrix} \begin{bmatrix} \sum{y_i}\\ \sum{x_iy_i}\end{bmatrix} \\ &= \begin{bmatrix} 0.2 & -0.1\\ -0.1 & 0.1 \end{bmatrix} \begin{bmatrix} 142\\ 182 \end{bmatrix} \\ &= \begin{bmatrix} 10.2\\ 4.0 \end{bmatrix} \end{align}\]
$\mathbf{e}$
\[\begin{align} \mathbf{e} &= \mathbf{Y-\hat{Y}} = \mathbf{Y-Xb} \\ &= \begin{bmatrix} \vdots\\ y_i\\ \vdots \end{bmatrix} - \begin{bmatrix} \vdots & \vdots\\ 1 & x_i\\ \vdots & \vdots \end{bmatrix} \begin{bmatrix} 10.2\\ 4.0 \end{bmatrix} \\ &= \begin{bmatrix} 1.2\\ -1.2\\ -1.2\\ 1.8\\ -0.2\\ -1.2\\ -2.2\\ 0.8\\ 0.8\\ 0.8\end{bmatrix} \end{align}\]
$\mathbf{H}$
\[\begin{align} \mathbf{H} &= \mathbf{X(X'X)^{-1}X'} \\ &= \begin{bmatrix} \vdots & \vdots\\ 1 & x_i\\ \vdots & \vdots \end{bmatrix} \begin{bmatrix} 0.2 & -0.1\\ -0.1 & 0.1 \end{bmatrix} \begin{bmatrix} \dots & 1 & \dots\\ \dots & x_i & \dots \end{bmatrix}\\ &= \begin{bmatrix} 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1\\ 0.1 & 0.2 & 0.0 & 0.2 & -0.1 & 0.1 & 0.2 & 0.1 & 0.0 & 0.2\\ 0.1 & 0.0 & 0.2 & 0.0 & 0.3 & 0.1 & 0.0 & 0.1 & 0.2 & 0.0\\ 0.1 & 0.2 & 0.0 & 0.2 & -0.1 & 0.1 & 0.2 & 0.1 & 0.0 & 0.2\\ 0.1 & -0.1 & 0.3 & -0.1 & 0.5 & 0.1 & -0.1 & 0.1 & 0.3 & -0.1\\ 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1\\ 0.1 & 0.2 & 0.0 & 0.2 & -0.1 & 0.1 & 0.2 & 0.1 & 0.0 & 0.2\\ 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1\\ 0.1 & 0.0 & 0.2 & 0.0 & 0.3 & 0.1 & 0.0 & 0.1 & 0.2 & 0.0\\ 0.1 & 0.2 & 0.0 & 0.2 & -0.1 & 0.1 & 0.2 & 0.1 & 0.0 & 0.2\\ \end{bmatrix} \end{align}\]
SSE
\[\begin{align} SSE &= \mathbf{e'e} \\ &= \begin{bmatrix} \dots & e_i & \dots \end{bmatrix} \begin{bmatrix} \vdots\\ e_i\\ \vdots \end{bmatrix} \\ &= \sum{e^2} \\ &= 17.6 \end{align}\]
$\mathbf{s^2\{b\}}$
\[\begin{align} \mathbf{s^2\{b\}} &= MSE(X'X)^{-1} \\ &= \frac{SSE}{n-p}(X'X)^{-1} \\ &= \frac{17.6}{10-2}\begin{bmatrix} 0.2 & -0.1\\ -0.1 & 0.1\end{bmatrix} \\ &= \begin{bmatrix} 0.44 & -0.22\\ -0.22 & 0.22\end{bmatrix} \end{align}\]
$\hat{Y_h}$ when $X_h = 2$
\[\begin{align} \mathbf{\hat{Y_h}} &= \mathbf{X_hb} \\ &= \begin{bmatrix} 1 & x_h \end{bmatrix} \begin{bmatrix} b_0\\ b_1 \end{bmatrix} \\ &= \begin{bmatrix} 1 & 2 \end{bmatrix} \begin{bmatrix} 10.2\\ 4.0 \end{bmatrix} \\ &= 18.2 \end{align}\]
$s^2\{\hat{Y_h}\}$ when $X_h = 2$
\[\begin{align} s^2\{\hat{Y_h}\} &= \mathbf{MSE(X_h(X'X)^{-1}X_h)} \\ &= \frac{SSE}{n-p} \begin{bmatrix} 1 & x_h \end{bmatrix} \begin{bmatrix} 0.44 & -0.22\\ -0.22 & 0.22 \end{bmatrix} \begin{bmatrix} 1\\ x_h \end{bmatrix} \\ &= \frac{17.6}{10-2} \begin{bmatrix} 1 & 2 \end{bmatrix} \begin{bmatrix} 0.44 & -0.22\\ -0.22 & 0.22 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} \\ &= 0.44 \end{align}\]

(b) From part (a6), obtain the following: (1) $s^2\{b_0\}$; (2) $s\{b_0, b_1\}$; (3) $s\{b_1\}$.

由 (a6) 我們知道 \[\begin{align} \begin{bmatrix} 0.44 & -0.22\\ -0.22 & 0.22\end{bmatrix} = \mathbf{s^2\{b\}} = \begin{bmatrix} s^2\{b_0\} & s\{b_0, b_1\}\\ s\{b_0, b_1\} & s^2\{b_1\}\end{bmatrix} \end{align}\] 因此，

$s^2\{b_0\} = 0.44$
$s\{b_0, b_1\} = -0.22$
$s\{b_1\} = \sqrt{s^2\{b_1\}} = \sqrt{0.22} \approx 0.47$

(c) Find the matrix of the quadratic form for SSR

第一步，求出 SSTO 為 \[\begin{align} SSTO &= \sum(Y_i-\bar{Y})^2 \\ &= (\mathbf{Y}-\frac{1}{n}\mathbf{1'Y1})'(\mathbf{Y}-\frac{1}{n}\mathbf{1'Y1}) \end{align}\] 上述為 SSTO 寫成矩陣的形式，其中 $\mathbf{\bar{Y}} = \mathbf{\frac{1}{n}1'Y1}$。此外 $\mathbf{1'Y}$ 為 1x1 的純量，其特性可以在矩陣乘法中可以任意換位，且他的轉置也會是他自己 ($\mathbf{(1'Y)} = \mathbf{(1'Y)'} = \mathbf{(Y'1)}$)。因此， \[\begin{align} SSTO &= (\mathbf{Y}-\frac{1}{n}\mathbf{(1'Y)1})'(\mathbf{Y}-\frac{1}{n}\mathbf{(1'Y)1}) \\ &= (\mathbf{Y'}-\frac{1}{n}\mathbf{1'(Y'1)})(\mathbf{Y}-\frac{1}{n}\mathbf{(1'Y)1}) \\ &= (\mathbf{Y'}-\frac{1}{n}\mathbf{(Y'1)1'})(\mathbf{Y}-\frac{1}{n}\mathbf{(1'Y)1}) \\ &= \mathbf{Y'Y}-\frac{1}{n}\mathbf{(1'Y)(Y'1)}-\frac{1}{n}\mathbf{(Y'1)(1'Y)}+\frac{1}{n^2}\mathbf{(Y'1)(1'Y)(1'1)} \\ &= \mathbf{Y'Y}-\frac{1}{n}\mathbf{(Y'1)(1'Y)}-\frac{1}{n}\mathbf{(Y'1)(1'Y)}+\frac{1}{n^2}\mathbf{(Y'1)(1'Y)(n)} \\ &= \mathbf{Y'Y}-\frac{1}{n}\mathbf{(Y'1)(1'Y)} \\ &= \mathbf{Y'Y}-\mathbf{Y}\frac{1}{n}\mathbf{(11')Y} \\ &= \mathbf{Y'(I}-\frac{1}{n}\mathbf{J)Y} \end{align}\]

第二步，求出 SSE 為 \[\begin{align} SSE &= \mathbf{e'e} \\ &= \mathbf{(Y-Xb)'(Y-Xb)} \\ \end{align}\] 由於 $\mathbf{Xb = X(X'X)^{-1}XY = HY}$，其中 $\mathbf{H = X(X'X)^{-1}X}$，所以上式可繼續寫成 \[\begin{align} SSE &= \mathbf{(Y-HY)'(Y-HY)} \\ &= \mathbf{((I-H)Y)'(I-H)Y} \\ &= \mathbf{Y'(I-H)'(I-H)Y} \end{align}\] 由於 $\mathbf{(I-H)}$ 為 nxn 的矩陣，具有 symmetric ($\mathbf{(I-H)' = (I-H)}$)和 idempotent ($\mathbf{(I-H)(I-H) = (I-H)}$) 的特性，因此 \[\begin{align} SSE &= \mathbf{Y'(I-H)(I-H)Y} \\ &= \mathbf{Y'(I-H)Y} \\ \end{align}\]

最後，SSE 為 \[\begin{align} SSE &= SSTO-SST \\ &= \mathbf{Y'(I}-\frac{1}{n}\mathbf{J)Y} - \mathbf{Y'(I-H)Y} \\ &= \mathbf{Y'(H}-\frac{1}{n}\mathbf{J)Y} \end{align}\]

應用線性統計模式　作業二：Chapter 3, 4, 5

B03605059 心理五林子堯

2019-04-29

一、觀念題與推導

2. Show that \(\hat{Y}\) as defined in (4.15) for Iinear regression through the origin is an unbiased estimator of E{Y}.

3. Consider the least squares estimator b given in (5.60). Using matrix methods show that b is an unbiased estimator.

二、資料分析 1

Diagnostics and Remedial Measures

(a) Prepare a scatter plot of the data. Does a linear relation appear adequate here?

(b) Use the Box-Cox procedure and standardization (3.36) to find an appropriate power transformation of Y. Evaluate SSE for λ = .3, .4, .5, .6, .7. What transformation of Y is suggested?

(c) Use the transformation \(Y' = \sqrt{Y}\) and obtain the estimated linear regression function for the transformed data.

(d) Plot the estimated regression line and the transformed data. Does the regression line appear to be a good fit to the transformed data?

(e) Obtain the residuals and plot them against the fitted values. Also prepare a normal probability plot. What do your plots show?

(f) Express the estimated regression function in the original units.

三、資料分析 2

1. Diagnostics and Remedial Measures

(a) Plot the residuals \(e_i\) against \(X_i\) to ascertain whether any departures from regression model (2.1) are evident. What is your conclusion?

(b) Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation between the ordered residuals and their expected values under normality to ascertain whether the normality assumption is reasonable here. Use Table B.6 and α = .01. What do you conclude?

(c) Prepare a time plot of the residuals. What information is provided by your plot?

(d) Assume that (3.10) is applicable and conduct the Breusch-Pagan test to determine whether or not the error variance varies with the level of X. Use α = .10. State the alternatives, decision rule, and conclusion. Does your conclusion support your preliminary findings in part (a)?

(e) Perform the F test to determine whether or not there is lack of fit of a linear regression function; use α = .01. State the alternatives, decision rule, and conclusion.

2. Simultaneous Inferences and Other Topics in Regression Analysis

(a) Obtain Bonferroni joint confidence intervals for \(\beta_0\) and \(\beta_1\), using a 99 percent family confidence coefficient. Interpret your confidence intervals.

(b) It is desired to obtain interval estimates of the mean number of broken ampules when there are 0, I, and 2 transfers for a shipment, using a 95 percent family confidence coefficient. Obtain the desired confidence intervals, using the Working-Hotelling procedure.

(c) Are the confidence intervals obtained in part (b) more efficient than Bonferroni intervals here? Explain.

(d) The next three shipments will make 0, 1, and 2 transfers, respectively. Obtain prediction intervals for the number of broken ampules for each of these three shipments, using the Scheffé procedure and a 95 percent family confidence coefficient.

(e) Would the Bonferroni procedure have been more efficient in developing the prediction intervals in part (d)? Explain.

3. Matrix Approach to Simple Linear Regression Analysis

(a) Using matrix methods, obtain the following: (1) \(\mathbf{(X'X)^{-1}}\), (2) \(\mathbf{b}\), (3) \(\mathbf{e}\), (4) \(\mathbf{H}\), (5) SSE, (6) \(\mathbf{s^2\{b\}}\), (7) \(\hat{Y_h}\) when $X_h = 2$2, (8) \(s^2\{\hat{Y_h}\}\) when \(X_h = 2\).

(b) From part (a6), obtain the following: (1) \(s^2\{b_0\}\); (2) \(s\{b_0, b_1\}\); (3) \(s\{b_1\}\).

(c) Find the matrix of the quadratic form for SSR

應用線性統計模式 作業二：Chapter 3, 4, 5

B03605059 心理五 林子堯

2019-04-29

一、觀念題與推導

2. Show that \(\hat{Y}\) as defined in (4.15) for Iinear regression through the origin is an unbiased estimator of E{Y}.

3. Consider the least squares estimator b given in (5.60). Using matrix methods show that b is an unbiased estimator.

二、資料分析 1

Diagnostics and Remedial Measures

(a) Prepare a scatter plot of the data. Does a linear relation appear adequate here?

(b) Use the Box-Cox procedure and standardization (3.36) to find an appropriate power transformation of Y. Evaluate SSE for λ = .3, .4, .5, .6, .7. What transformation of Y is suggested?

(c) Use the transformation \(Y' = \sqrt{Y}\) and obtain the estimated linear regression function for the transformed data.

(d) Plot the estimated regression line and the transformed data. Does the regression line appear to be a good fit to the transformed data?

(e) Obtain the residuals and plot them against the fitted values. Also prepare a normal probability plot. What do your plots show?

(f) Express the estimated regression function in the original units.

三、資料分析 2

1. Diagnostics and Remedial Measures

(a) Plot the residuals \(e_i\) against \(X_i\) to ascertain whether any departures from regression model (2.1) are evident. What is your conclusion?

(b) Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation between the ordered residuals and their expected values under normality to ascertain whether the normality assumption is reasonable here. Use Table B.6 and α = .01. What do you conclude?

(c) Prepare a time plot of the residuals. What information is provided by your plot?

(d) Assume that (3.10) is applicable and conduct the Breusch-Pagan test to determine whether or not the error variance varies with the level of X. Use α = .10. State the alternatives, decision rule, and conclusion. Does your conclusion support your preliminary findings in part (a)?

(e) Perform the F test to determine whether or not there is lack of fit of a linear regression function; use α = .01. State the alternatives, decision rule, and conclusion.

2. Simultaneous Inferences and Other Topics in Regression Analysis

(a) Obtain Bonferroni joint confidence intervals for \(\beta_0\) and \(\beta_1\), using a 99 percent family confidence coefficient. Interpret your confidence intervals.

(b) It is desired to obtain interval estimates of the mean number of broken ampules when there are 0, I, and 2 transfers for a shipment, using a 95 percent family confidence coefficient. Obtain the desired confidence intervals, using the Working-Hotelling procedure.

(c) Are the confidence intervals obtained in part (b) more efficient than Bonferroni intervals here? Explain.

(d) The next three shipments will make 0, 1, and 2 transfers, respectively. Obtain prediction intervals for the number of broken ampules for each of these three shipments, using the Scheffé procedure and a 95 percent family confidence coefficient.

(e) Would the Bonferroni procedure have been more efficient in developing the prediction intervals in part (d)? Explain.

3. Matrix Approach to Simple Linear Regression Analysis

(a) Using matrix methods, obtain the following: (1) \(\mathbf{(X'X)^{-1}}\), (2) \(\mathbf{b}\), (3) \(\mathbf{e}\), (4) \(\mathbf{H}\), (5) SSE, (6) \(\mathbf{s^2\{b\}}\), (7) \(\hat{Y_h}\) when $X_h = 2$2, (8) \(s^2\{\hat{Y_h}\}\) when \(X_h = 2\).

(b) From part (a6), obtain the following: (1) \(s^2\{b_0\}\); (2) \(s\{b_0, b_1\}\); (3) \(s\{b_1\}\).

(c) Find the matrix of the quadratic form for SSR

應用線性統計模式　作業二：Chapter 3, 4, 5

B03605059 心理五林子堯