應用線性統計模式　作業四：Chapter 9, 10, & 11

一、觀念題

Chapter 9觀念題

1. 試著談談「縮減模型之解釋變項數量（reduction of explanation variables）」為何在「exploratory observational studies」中相當重要，而「controlled experiment、controlled experiment with covariates、confirmatory observational studies」不太重要。[提示：參考課本pp. 347-348]

在 exploratory observational studies 中，通常一開始選擇大量的變項就是為了要解釋我們觀察的現象，但在迴歸模型中多個變量會產生許多問題像是：1. 難以去對每個變項去做解釋； 2. 如果變項間有共線性的話，會使得整體的參數估計上變異大且不容易顯著，整體的反應變項的估計和解釋能力也都下降； 3. 越多變項增加，會造成收集資料的困難、以及時間和金錢的成本; 4. 可能產生 overfitted 的問題或是有效的預測範圍過窄。因此我們必須得對解釋變項數量進行縮減，除了比避免上述的問題外，在模型建立和解釋上會更容易。 Controlled experiment 對於縮減模型之解釋變項數量不太重要的原因在於，那些解釋變項的選擇都是透過實驗操弄而產生，為了來解釋預測我們的反應變項，並沒有什麼理由該刪減他。同樣的 controlled experiment with covariates 中，比前者多加的 covariates 是為了降低在 controlled experiment 原先模型中的 error variance，而且通常也僅僅數個，因此在沒有甚麼特別原因情況下，也並不會隨意丟棄。最後在 confirmatory observational studies 中，變項的選擇一是基於實驗假設的 primary variables、二是從先驗知識來的 control variable 為了比較先前研究，因此這些變項的存留是有相當性的。

2. An engineer has stated: “Reduction of the number of the explanatory variables should always be done using the objective forward stepwise regression procedure.” Discuss. [提示：參考課本 pp. 361-364]

Forward stepwise regression procedure 的好處在於能鑑定出「單個」「最好的」模型，特別是在當預測變項過多時，我們無法使用 best subsets algorithm 來檢驗每項指標的表現性。然而 forward stepwise regression procedure 並不一定會找到實際上真正最佳的模型，有時也會錯誤地選擇了一個「局部」最好的模型。此外，不同種的 stepwise regression procedure (forward / backward / both direction) 以及不同的增加或剔除變數的準則指標，都有可能找出不同的最佳模型，因此在不相同的 stepwise regression 設立下，就有可能產生不同種的解，此外某些選出的迴歸模型只適合某些目的下才精準。因此這些問題還有賴於研究者的「主觀判斷」，找出數個可能的「最佳模型」後，依照自己本身的研究目的選出心中的最佳解，而非純粹依賴電腦計算所給予的。

3. An attendee at a regression modeling short course stated: “I rarely see validation of regression models mentioned in published papers, so it must rarely not be an important component of model building.” Comment.

在過去發表的研究中很少對模型去做重複驗證是值得去關心的議題，但少見並不代表不重要。Model validation 通常在整個模型建立中的最後一步，來檢驗數個可能的候選模型的有效性，從中選出最有預測力、穩定性等指標中表現最好的模型。通常在 controlled experiment 和 confirmatory observational studies 會透過重複實驗來測量模型的有效性，但是在 exploratory observational studies 中就很難再次收集新的資料，因此可能地做法會是比較舊有的文獻結果，或是將現行有的資料拆分成 training data 和 testing data，前者用來建立迴歸模型而後者則是檢驗模型的有效性。Model validation 的貢獻在於除了提供我們選擇最終模型的依據，也提供我們一些資訊此模型的解釋能力到什麼程度，因此也是不可或缺的一環。

4. 試問 Data splitting 如何協助評估迴歸模型的效度? 此方式有什麼限制? [提示：參考課本 p. 372]

Data splitting 透過將資料切分成 training data 和 testing data，前者用來建立迴歸模型而後者則是檢驗模型的有效性，像是比對這兩者 data set 所得到的參數估計是否像似、正負號是否相同，模型表顯相關指標 \(MSE\)、\(R^2_{a, p}\)、\(PRESS_p\)、\(C_p\)、\(MSPR\) 使否變化過大，都可以來判斷從 training data 所建立的 model 使否具備一定程度的效度，不會應為資料改變而差異很大。
然而 data splitting 來檢驗 model validation 的缺點在於：

在收集到的資料點很少時便不適合使用;
training data 建立出的模型得注意會不會對 testing data 產生 extrapolation 的問題；
因為將資料對半的關係，training data 建立出的模型參數之變異可能會比用全部資料來 fit 的參數變異還要來大，因此在 model validation 過後要做最後的推論時，最好重新使用所有資料來估計最終的模型。

Chapter 10觀念題

5. A student asked: “Why is it necessary to perform diagnostic checks of the fit when \(R^2\) is large?” Comment.

\(R^2\) 並不適合當作模型表現的唯一依據，雖然說當 \(R^2\) 越大，模型能解釋資料的變異也越多，然而它缺點在於當模型的變項越多時，\(R^2\) 卻只會增加不會減少。若依據精簡原則會不知道那個模型才是合適的，因為變項最多的 \(R^2\) 才是最大的。除此之外，當變數太多時我們也難以一一對每個變項做解釋。因此，在合適的迴歸模型選擇上，除了該模型的 \(R^2\) 大以外，我們得考慮其它會考慮變項數目的判別標準，如 \(R^2_{a,p}\)、\(C_p\)、\(AIC_p\)、\(SBC_p\)、\(PRESS_p\) 等，把參數數量做為懲罰，再從中找出我們主觀認定好的模型。

6. Describe several informal methods that can be helpful in identifying multicollinearity among the X variables in a multiple regression model.

課本提供了 5 種非正式的判斷方法，如下：

當預測變項的加入/剔除，或是改變/增減觀察值時，使得迴歸參數的估計變化很大
一些重要的預測變項之迴歸參數達不到顯著
迴歸參數的正負號與理論上或先前經驗相異
變項間的相關係數非常大
一些重要的預測變項的迴歸參數之信賴區間特別寬

值得注意的是，這些對共線性非正式的判斷有些限制，它們無法提供共線性量化的程度，以及分析時發現有上述的現象，但並不代表會有共線性的發生。

Chapter 11觀念題

7. One student remarked to another: “Your residuals show that nonconstancy of error variance is clearly present. Therefore, your regression results are completely invalid.” Comment.

不一定是這樣。從做完迴歸模型後得到的 residuals plot 觀察到變異數不同值，很有可能資料的性質本來就是如此。如課本 pp.429 的 Comments 2 提到：反應變項 Y 本身就從遵從 Poisson distribution，Y 的 mean 和 variance 都會隨預測變項 X 增加而增加，所以在這種情況做 linear regression 會許也不會不好，還是有可能很精準的。
當然我們可以嘗試用別種模型，像是用 hierarchical linear modeling 把 poisson distribution 給納入。或者，將 Y 值先行做些 Box-Cox 轉換，甚至是做 weighted least squares 減少變異數不同值的影響。

8. An analyst suggested: “One nice thing about robust regression is that you need not worry about outliers and influential observations.” Comment.

Robust regression 並不是一個必要處理 outliers 和 influential observations 的方式，而是當研究者很清楚自己的目的，在沒由其他種可能發生 outliers 的原因或處理方式時，同時又不想輕易丟棄觀察值下，想將迴歸線能 fit 最主要的資料點能更加 smooth，所採取最後的手段。Robust regression 透過不同權重的方法將那些 noise 的資料點 (outliers, influential observations) 影響模型的建立降低。如前所述，在採取 robust regression 研究者得先考量以下幾種情境，包括：

是否資料點一開始就有人為登入錯誤。
測量儀器是否故障。
Outliers 提供些一些變異的訊息，提醒研究者當前的模型是否適當，是否少了一些潛在的重要變項、交互作用項、二次項⋯⋯等。
當模型改變或是新的資料點加入時，可能 outliers 和 influential observations 的問題就不見了。
有時在某些變項下看某些資料點是 outliers 但在其他變項中並不一定會是。
有些 outliers 不一定對迴歸模型具有很強的影響力。

9. Regression trees become difficult to utilize when there are many predictors and the sample size is small. Discuss the nature of this problem.

Regression trees 透過將各個 X 變項的數值範圍切段，形成數個區域後，使用該區域的平均值來估計迴歸式，非常適合用在大量資料或是類別變項使用。但當預測變項過多以及樣本點過少，變數數目多容易產生切割區域的數量與資料點接近、甚至是超過，而造成 overfitted 產生，因此，當有新資料加入或模型檢驗時就易表現不好。另一方面，在建立 regression trees 時，預測變項多會造成切分點的選擇倍數成長，增加運算效率上的負擔。

10. Describe how bootstrapping might be used to obtain confidence intervals for regression coefficients when ridge regression is employed.

在 ridge regression 的參數估計中並不能提供參數確切的分佈，因此可能需要靠 bootstrapping 來獲得。而 bootstrapping 的做法上，會將原始的資料做為接下來的抽樣樣本，每此從中抽取 n 筆資料 (取後放回、可重複)，並利用該次資料重新進行迴歸係數的估計，得到 \(b_{k,i}^\star\) (其中 k 指的是第幾個參數、 i 指的是第幾次抽樣)。
將上述步驟重複做 m 次 (i = 1,…, m)，因此我們就可以獲得估計參數 \(b_{k}^\star\) 的機率分配圖，再從中計算該變異量或是百分位數，進而作為 \(b_k\) 的變異估計量以及信賴區間。

二、資料分析

[Patient satisfaction]

A hospital administrator wished to study the relation between patient satisfaction (Y) and patient’s age (X1, in years), severity of illness (X2, an index), and anxiety level (X3 , an index). The administrator randomly selected 46 patients and collected the data presented below, where larger values of Y, X2, and X3 are, respectively, associated with more satisfaction, increased severity of illness, and more anxiety.
The hospital administrator wishes to determine the best subset of predictor variables for predicting patient satisfaction.

patient <- read.table("HW4_Patient satisfaction.txt", header = TRUE, sep = "\t")
DT::datatable(patient)

[Regression Model]

1. Fit the regression model for the three predictor variables to the data and state the estimated regression function.

patient_lm <- lm(Y ~ ., patient)
summary(patient_lm)

## 
## Call:
## lm(formula = Y ~ ., data = patient)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.3524  -6.4230   0.5196   8.3715  17.1601 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 158.4913    18.1259   8.744 5.26e-11 ***
## X1           -1.1416     0.2148  -5.315 3.81e-06 ***
## X2           -0.4420     0.4920  -0.898   0.3741    
## X3          -13.4702     7.0997  -1.897   0.0647 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.06 on 42 degrees of freedom
## Multiple R-squared:  0.6822, Adjusted R-squared:  0.6595 
## F-statistic: 30.05 on 3 and 42 DF,  p-value: 1.542e-10

根據上述的 first order linear regression 的結果得知此模型的迴歸為： \[\hat{Y} = 158.49 - 1.14X_1 - 0.44X_2 - 13.47X_3\]

[Model Selection]

2. Indicate which subset of predictor variables you would recommend as best for predicting patient satisfaction according to each of the following criteria: (1) \(R^2_p\), (2) R^2_{a,p}, (3) \(C_p\), (4) \(AIC_p\), (5) \(SBC_p\), (6) \(PRESS_p\). Support your recommendations with appropriate graphs.

library(leaps)
library(MPV)

best <- function(model, nbest = 3, nvmax = 4, ...) {
  subsets <- regsubsets(formula(model), model.frame(model), nbest = nbest, nvmax = nvmax, ...)
  subsets <- with(summary(subsets),
                  cbind(p = as.numeric(rownames(which))+1, which, rss,
                        rsq, adjr2, cp, bic))
  
  subsets <- as.data.frame(subsets)

  subsets$aic <- subsets$bic - log(nrow(model.frame(model)))*subsets$p + 2*subsets$p

  rownames(subsets) <- NULL
  
  # PRESS
#  parNames <- names(patient[, -1])
#  n <- length(parNames)
#  .combination <- unlist(lapply(1:n, function(x){combn(1:n, x, simplify = FALSE)}), recursive = F) 
  .combination <- lapply(1:nrow(subsets), function(x){
    colnames(subsets)[which(subsets[x, 3:(ncol(subsets)-6)] == 1)+2]
  })
  combination <- lapply(.combination, function(x){paste("Y~", paste(x, collapse = "+"))}) 
  lm_combination <- lapply(combination, function(x){lm(as.formula(x), data = patient)}) 
  press <- sapply(lm_combination, function(x){PRESS(x)})
  
  subsets$PRESS <- press
  
  return(subsets)
} 

# output table
subsetTable <- best(patient_lm, nbest = 3, nvmax = 4)
round(subsetTable, 3)

##   p (Intercept) X1 X2 X3      rss   rsq adjr2     cp     bic     aic
## 1 2           1  1  0  0 5093.915 0.619 0.610  8.354 -36.729 -40.386
## 2 2           1  0  0  1 7814.391 0.415 0.402 35.246 -17.044 -20.702
## 3 2           1  0  1  0 8509.044 0.364 0.349 42.112 -13.127 -16.784
## 4 3           1  1  0  1 4330.500 0.676 0.661  2.807 -40.369 -45.855
## 5 3           1  1  1  0 4613.000 0.655 0.639  5.600 -37.462 -42.948
## 6 3           1  0  1  1 7106.394 0.468 0.444 30.247 -17.585 -23.070
## 7 4           1  1  1  1 4248.841 0.682 0.659  4.000 -37.416 -44.730
##      PRESS
## 1 5569.562
## 2 8451.432
## 3 9254.489
## 4 4902.751
## 5 5235.192
## 6 8115.912
## 7 5057.886

library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)

data <- subsetTable %>% 
  gather(key = index, value = value, rsq:PRESS)

criterion <- data.frame(
  index = c("rsq", "adjr2", "cp", "aic", "bic", "PRESS"),
  critera.fun = c("max", "max", "min", "min", "min", "min")
)

g <- list()

for(i in 1:nrow(criterion)){
  .index <- criterion[i, "index"]
  .critera.fun <- criterion[i, "critera.fun"] %>% as.character()
  
  g[[i]] <- data %>% 
    filter(index == .index) %>% 
    ggplot(aes(x = p, y = value)) +
    geom_point() +
    scale_x_continuous(breaks = unique(data$p)) +
    stat_summary(fun.y = .critera.fun, colour = "orange", geom = "line") +
    labs(title = .index) +
    theme_bw()
}

grid.arrange(grobs = g, nrow = 2)

從上表和圖示中可得知：
(1) 以 \(R^2_p\) 為判定標準，則是找出最大值的模型，因此 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3\) 為最佳模型。
(2) 以 \(R^2_{a,p}\) 為判定標準，則是找出最大值的模型，因此 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_3X_3\) 為最佳模型。
(3) 以 \(C_p\) 為判定標準，則是找出最小值和最接近 p (# of par) 的模型，因此 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3\) 為最佳模型。然而模型 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_3X_3\) 雖然 \(C_p\) 為最小值 2.8，但卻小於自己本身的參數個數 3，可能有取樣偏差產生，因此有待考慮。
(4) 以 \(AIC_p\) 為判定標準，則是找出最小值的模型，因此 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_3X_3\) 為最佳模型。
(5) 以 \(SBC_p\) 為判定標準，則是找出最小值的模型，因此 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_3X_3\) 為最佳模型。
(6) 以 \(PRESS_p\) 為判定標準，則是找出最小值的模型，因此 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_3X_3\) 為最佳模型。

3. Do the four six criteria in part (1) (2) identify the same best subset? Does this always happen?

以上六種判別標準不竟然完全相同，像是在 \(R^2_p\) 和 \(C_p\) 指標中可能 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3\) 會是最好的模型；而在 \(R^2_{a,p}\)、\(AIC_p\)、\(SBC_p\) 和 \(PRESS_p\) (及 \(C_p\) 也可能納入考量) 指標則可能認為 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_3X_3\) 才是最好的模型。不同指標可能選出的最佳模型並不保證會是相同的，在前述例子就是一個案例。

4. Would forward stepwise regression have any advantages here as a screening procedure over the all-possible-regressions procedure?

Forward stepwise regression 透過 t 檢定和 p-value 來判定模型中預測變項是否加入或剔除模型中，由於是自動化的程序，能大大減輕我們對每個模型的組合進行篩檢，像是每個都得去計算前題中六種指標值，而 forward stepwise regression 最終能提供「一個」「最佳」的迴歸模型。但要記得的是，模型的選擇不能只單單透過這個流程而決定，forward stepwise regression 還是有其缺點的，有賴研究者主觀判斷。

[Model Diagnostics]

Model Adequacy for a Predictor Variable

5. Prepare an added-variable plot for each of the predictor variables.

add_plot <- car::avPlots(patient_lm)

上圖顯示在模型 \(\hat{Y} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3\) 中，在其他參數加入模型中下，該參數 (X1, X2 or X3) 與 Y 分別的 residuals 之散佈圖。

6. Do your plots in part (5) suggest that the regression relationships in the fitted regression function in part (1) are inappropriate for any of the predictor variables? Explain.

partialRsq <- sapply(add_plot, function(X){cor(X[,1], X[,2])})
names(partialRsq) <- c("Rsq_Y1|23", "Rsq_Y2|13", "Rsq_Y3|12")
partialRsq

##  Rsq_Y1|23  Rsq_Y2|13  Rsq_Y3|12 
## -0.6341216 -0.1373198 -0.2809662

從 added-variable plot 可觀察到 X1 或 X3 在其他變項 (X2、X3 或 X1、X2) 加入下，其與 Y 的 partial residuals 皆呈現線性關係，表示有加入 X1 或 X3 的必要性。然而在對 X2 在 X1、X3 模型下的 added-variable plot 則線性關係則似乎不那麼明顯，因此加入是否加入 X2 可多加考慮。

Identifying Outlying Y & X Observations

7. Obtain the studentized deleted residuals and identify any outlying Y observations. Use the Bonferroni outlier test procedure with \(\alpha = .10\). State the decision rule and conclusion.

outlierDF <- data.frame(
  "e_i" = round(resid(patient_lm), 3),
  "h_ii" = round(hatvalues(patient_lm), 3),
  "t_i" = round(rstudent(patient_lm), 3))

alpha <- 0.1
n <- nrow(patient)
p <- 4
critical_by_Y <- qt((1-alpha/(2*n)), n-p-1) # == 3.27
outlierDF$outlier_by_Y <- outlierDF$t_i > critical_by_Y

critical_by_X <- 2*p/n # == 0.17
outlierDF$outlier_by_X <- outlierDF$h_ii > critical_by_X

outlierDF

##        e_i  h_ii    t_i outlier_by_Y outlier_by_X
## 1    0.113 0.078  0.012        FALSE        FALSE
## 2   -9.080 0.067 -0.933        FALSE        FALSE
## 3    4.024 0.037  0.404        FALSE        FALSE
## 4    2.009 0.154  0.215        FALSE        FALSE
## 5    5.726 0.097  0.594        FALSE        FALSE
## 6   -3.621 0.129 -0.382        FALSE        FALSE
## 7  -12.809 0.034 -1.307        FALSE        FALSE
## 8    0.426 0.075  0.044        FALSE        FALSE
## 9   -6.660 0.184 -0.729        FALSE         TRUE
## 10   2.003 0.058  0.203        FALSE        FALSE
## 11  17.160 0.088  1.836        FALSE        FALSE
## 12  13.353 0.031  1.362        FALSE        FALSE
## 13 -14.165 0.090 -1.498        FALSE        FALSE
## 14 -15.153 0.033 -1.558        FALSE        FALSE
## 15  12.517 0.143  1.358        FALSE        FALSE
## 16  -2.795 0.047 -0.282        FALSE        FALSE
## 17  16.610 0.120  1.807        FALSE        FALSE
## 18   8.541 0.062  0.875        FALSE        FALSE
## 19 -10.873 0.034 -1.102        FALSE        FALSE
## 20   8.168 0.129  0.868        FALSE        FALSE
## 21   5.581 0.078  0.573        FALSE        FALSE
## 22   8.439 0.137  0.901        FALSE        FALSE
## 23   3.680 0.033  0.368        FALSE        FALSE
## 24  -3.866 0.136 -0.409        FALSE        FALSE
## 25  -4.734 0.043 -0.477        FALSE        FALSE
## 26  -4.159 0.103 -0.432        FALSE        FALSE
## 27 -18.352 0.087 -1.974        FALSE        FALSE
## 28   5.395 0.186  0.590        FALSE         TRUE
## 29  -9.647 0.059 -0.989        FALSE        FALSE
## 30   3.368 0.090  0.347        FALSE        FALSE
## 31 -16.314 0.117 -1.769        FALSE        FALSE
## 32  11.511 0.110  1.220        FALSE        FALSE
## 33   0.613 0.045  0.062        FALSE        FALSE
## 34 -14.976 0.037 -1.542        FALSE        FALSE
## 35   0.925 0.103  0.096        FALSE        FALSE
## 36  11.616 0.027  1.176        FALSE        FALSE
## 37  11.507 0.121  1.228        FALSE        FALSE
## 38  -5.372 0.071 -0.549        FALSE        FALSE
## 39  -8.987 0.181 -0.987        FALSE         TRUE
## 40  -5.713 0.087 -0.590        FALSE        FALSE
## 41  11.006 0.038  1.119        FALSE        FALSE
## 42  -0.893 0.154 -0.095        FALSE        FALSE
## 43 -13.696 0.061 -1.422        FALSE        FALSE
## 44  13.058 0.051  1.345        FALSE        FALSE
## 45  -5.538 0.073 -0.567        FALSE        FALSE
## 46  10.052 0.083  1.045        FALSE        FALSE

由上表得知，以 studentized deleted residual 透過 Bonferroni outlier test，在 \(\alpha = .10\) 的信心水準下，從 Y 的角度看是沒有資料點被判定為 outlier (其中判定的 critical 為 \(t(1-\alpha/2n; n-p-1) = 3.27\))。

8. Obtain the diagonal elements of the hat matrix. Identify any outlying X observations.

同樣也是從 part (7) 的表中得知，以 diagonal elements of the hat matrix 從 X 的角度看是沒有資料點是 outlier，可知 cases 9、28 和 39 都超過 critical (\(2p/n = 0.17\)) 而被視為 outliers。

9. Hospital management wishes to estimate mean patient satisfaction for patients who are XI = 30 years old, whose index of illness severity is X2 = 58, and whose index of anxiety level is X3 = 2.0. Use (10.29, see textbook p. 400) to determine whether this estimate will involve a hidden extrapolation.

X_new <- c(1, 30, 58, 2)
h_new <- t(X_new)

X <- cbind(1, patient[,-1]) %>% as.matrix()
h_new <- t(X_new) %*% solve(t(X) %*% X) %*% X_new # == 0.33

. <- ifelse((h_new > max(outlierDF$h_ii)) | (h_new < min(outlierDF$h_ii)),
       "extrapolation", 
       "non-extrapolation") %>% as.character()
print(.)

## [1] "extrapolation"

經過檢驗得知，新病人的 \(h_{nwe, new} = 0.33 > 0.186 = max\{h_{ii}\}\)，超過了整體的 leverages 的範圍，引此屬於 extrapolation。

Identifying Influential Case

10. The three largest absolute studentized deleted residuals are for cases II, 17, and 27. Obtain the DFFITS, DFBETAS, and Cook’s distance values for this case to assess its influence. What do you conclude?

influentialDF <- data.frame(
  "DFFITS"    = dffits(patient_lm),
  "DFBETA_beta0" = dfbetas(patient_lm)[,1],
  "DFBETA_beta1" = dfbetas(patient_lm)[,2],
  "DFBETA_beta2" = dfbetas(patient_lm)[,3],
  "DFBETA_beta3" = dfbetas(patient_lm)[,4],
  "Cooks_D"   = cooks.distance(patient_lm)
)

influentialDF[c(11, 17, 27),] %>% round(digits = 3)

##    DFFITS DFBETA_beta0 DFBETA_beta1 DFBETA_beta2 DFBETA_beta3 Cooks_D
## 11  0.569        0.099       -0.363       -0.190        0.390   0.077
## 17  0.666       -0.449       -0.471        0.443        0.089   0.105
## 27 -0.609       -0.017        0.417       -0.250        0.161   0.087

以 DFFITS 角度來說，由於此筆資料算在小到中樣本數，因此判定是否為 influential point 則看 DFFITS 有無超過 1。由上表得知 case 11, 17, 27 都未超過 1，所以還不能算是 influential point。
以 DFBETA 角度來說，由於此筆資料算在小到中樣本數，因此判定是否為 influential point 則看各個 DFBETAS 之絕對值有無超過 1。由上表得知 case 11, 17, 27 不管是在 \(\beta_0\)、\(\beta_1\)、\(\beta_2\) 和 \(\beta_3\) 的 DFBETAS 絕對值皆小於 1，所以也都不能算是 influential point。

print(paste0("Case ", c(11, 17, 27), ": ",
             round(pf(influentialDF$Cooks_D[c(11, 17, 27)], p, n-p) * 100, digits = 2), 
             "%", collapse = "; "),
      quote = FALSE)

## [1] Case 11: 1.1%; Case 17: 1.99%; Case 27: 1.39%

以 Cook’s distance 角度來說，即便我們檢查 case 11, 17, 27 之 Cook’s distance 的 percentile 都小於 2%，遠遠小於 critical value 50%，因此也都沒有充分的證據算是 influential point。

11. Calculate Cook’s distance Di for each case and prepare an index plot. Are any cases influential according to this measure?

.data <- data.frame("res" = residuals(patient_lm), 
                    "fit" = predict(patient_lm),
                    "Cooks_D" = influentialDF$Cooks_D,
                    "index" = 1:nrow(patient)
)
g_proportional <- ggplot(.data, aes(x = res, y = fit)) +
  geom_point(aes(size = Cooks_D)) +
  theme_bw()
g_index <- ggplot(.data, aes(x = index, y = Cooks_D)) +
  geom_line() +
  geom_point(color = "grey") +
  theme_bw()
gridExtra::grid.arrange(g_proportional, g_index, nrow = 1)

Index plot 如上圖所示。其中 case 17 有最大的 \(Cook's distance = 0.105\)，但從上題得知 case 17 並無充分證據屬於 influential point，因此其他 case 也應都不屬於 influential point。

Multicollinearity Diagnostics

12. What do the scatter plot matrix and the correlation matrix show about pairwise linear associations among the predictor variables?

source("my_custom_function.R") # reference: https://github.com/ggobi/ggally/issues/139 
library(GGally)
ggpairs(patient, 
        upper = list(continuous = my_custom_cor_color),
        lower = list(continuous = my_custom_smooth))

從上圖 scatter and correlation plot 中得知，預測變項 (X1, X2, X3) 兩兩之間都有中度到高度的正相關，當一方值較高時，另一方的值也會較高。

13. Obtain the three variance inflation factors. What do these results suggest about the effects of multicollinearity here? Are these results more revealing than those in part (12)?

patient_lm_scale <- lm(scale(Y) ~ scale(X1) + scale(X2) + scale(X3), patient)
VIF <- data.frame(
  "beta"       = as.vector(coef(patient_lm)[-1]),
  "beta_scale" = as.vector(coef(patient_lm_scale)[-1]),
  "VIF"        = car::vif(lm(patient_lm))
)
VIF %>% round(digits = 3)

##       beta beta_scale   VIF
## X1  -1.142     -0.591 1.632
## X2  -0.442     -0.111 2.003
## X3 -13.470     -0.234 2.009

儘管在上題 part (12) 中顯示預測變相間有中、高度的正相關，但從 variance inflation factor (VIF) 指標得知，這三者 VIF 最大值也才到 2 而已，與理想的無相關 (VIF = 1) 相去不遠，也遠小於 critical value = 10，因此可知這三者沒有共線性的問題。
使用 VIF 指標也比單純看 scatter and correlation plot 更能知曉有無共線性的問題，由於後者只能看兩個變相間的相關，但共線性有時會發生在數個變項間的線性組合，因此 scatter and correlation plot 只提供部分的訊息而已。雖然在此比資料中並沒有這個問題產生。

[Remedial Measures]

Multicollinearity Remedial Measures

14. Obtain the estimated ridge standardized regression coefficients, variance inflation factors, and \(R^2\) for the following biasing constants: c = .000, .005, .01, .02, .03, .04, .05.

c <- c(.000, .005, .01, .02, .03, .04, .05)

Y_corTrans <- scale(patient[,1]) / sqrt(nrow(patient)-1)
X_corTrans <- scale(patient[,-1]) / sqrt(nrow(patient)-1)
rxx <- t(X_corTrans) %*% X_corTrans
rxy <- t(X_corTrans) %*% Y_corTrans

solveBeta <- function(c, rxx, rxy) {solve(rxx + c * diag(ncol(rxx)), rxy)} # (11.34)
getVIF <- function(c, rxx){ # (11.36)
  bias <- solve(rxx + c*diag(ncol(rxx)))
  diag(bias %*% rxx %*% bias)                                          
}
getRsq <- function(c, X_corTrans, Y_corTrans, rxx, rxy){
  Y_corTrans_hat <- X_corTrans %*% solve(rxx + c * diag(ncol(rxx)), rxy)
  SSE_R <- sum((Y_corTrans_hat - Y_corTrans)^2)
  Rsq_R <- 1 - SSE_R
}

ridgeDF <- data.frame(c = c, 
                      b = t(sapply(c, solveBeta, rxx, rxy)),
                      VIF = t(sapply(c, getVIF, rxx)),
                      Rsq = sapply(c, getRsq, X_corTrans, Y_corTrans, rxx, rxy)
)
ridgeDF

##       c        b.1        b.2        b.3   VIF.X1   VIF.X2   VIF.X3
## 1 0.000 -0.5906664 -0.1106149 -0.2339312 1.632296 2.003235 2.009062
## 2 0.005 -0.5868414 -0.1123112 -0.2338038 1.599989 1.950614 1.956122
## 3 0.010 -0.5830859 -0.1139526 -0.2336748 1.568684 1.900185 1.905392
## 4 0.020 -0.5757757 -0.1170790 -0.2334115 1.508916 1.805423 1.810081
## 5 0.030 -0.5687197 -0.1200098 -0.2331400 1.452684 1.718067 1.722236
## 6 0.040 -0.5619036 -0.1227592 -0.2328593 1.399712 1.637351 1.641084
## 7 0.050 -0.5553137 -0.1253400 -0.2325687 1.349749 1.562605 1.565949
##         Rsq
## 1 0.6821943
## 2 0.6821839
## 3 0.6821533
## 4 0.6820356
## 5 0.6818482
## 6 0.6815975
## 7 0.6812892

結果如上表所示。

15. Make a ridge trace plot for the given c values. Do the ridge regression coefficients exhibit substantial changes near c = 0?

c_all <- seq(0, 1, 0.05)
ridgeDF_all <- data.frame(c = c_all, 
                          b = t(sapply(c_all, solveBeta, rxx, rxy)),
                          VIF = t(sapply(c_all, getVIF, rxx)),
                          Rsq = sapply(c_all, getRsq, X_corTrans, Y_corTrans, rxx, rxy)
)

g <- lapply(list(ridgeDF, ridgeDF_all), function(data){
  data %>% 
  gather(key = beta, value = value, b.1:b.3) %>% 
  ggplot(aes(x = c, y = value)) +
    geom_line(aes(color = beta, linetype = beta)) +
    labs(y = expression(b[k]^R)) +
    scale_x_continuous(breaks = c(.000, .05, 0.1, 0.25, 0.5, 0.75, 1)) + 
    
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "top") 
})

gridExtra::grid.arrange(grobs = g, nrow = 1)

若以題目 part (15) 給定的 c = .000, .005, .01, .02, .03, .04, .05 尺度下來說，\(b_k^R\) 在 c 靠近 0 附近確實感覺變化不大 (左圖)。若將 c 從 0 到 1 範圍來看的話 (右圖)，c 靠近 0 附近 \(b_k^R\) 有較大的變化，但我想這只是做圖上尺度上的觀感，實際上來說 \(b_1^R\) 最大的變化也才多了 0.3，因此整體上來說透過 ridge regression 係數的變化上並不大而且也有趨於穩定的跡象。

16. Suggest a reasonable value for the biasing constant c based on the ridge trace, the VIF values, and \(R^2\).

ridgeDF_all

##       c        b.1        b.2        b.3    VIF.X1    VIF.X2    VIF.X3
## 1  0.00 -0.5906664 -0.1106149 -0.2339312 1.6322963 2.0032351 2.0090618
## 2  0.05 -0.5553137 -0.1253400 -0.2325687 1.3497487 1.5626054 1.5659489
## 3  0.10 -0.5253542 -0.1360930 -0.2309584 1.1378291 1.2601975 1.2621197
## 4  0.15 -0.4995220 -0.1440235 -0.2290893 0.9745800 1.0429498 1.0440238
## 5  0.20 -0.4769284 -0.1498870 -0.2269934 0.8459796 0.8811514 0.8817039
## 6  0.25 -0.4569302 -0.1542009 -0.2247136 0.7427311 0.7570782 0.7573035
## 7  0.30 -0.4390498 -0.1573309 -0.2222918 0.6584714 0.6596071 0.6596249
## 8  0.35 -0.4229244 -0.1595430 -0.2197647 0.5887253 0.5814619 0.5813478
## 9  0.40 -0.4082735 -0.1610346 -0.2171632 0.5302699 0.5177145 0.5175173
## 10 0.45 -0.3948758 -0.1619550 -0.2145128 0.4807360 0.4649281 0.4646798
## 11 0.50 -0.3825549 -0.1624190 -0.2118343 0.4383496 0.4206437 0.4203656
## 12 0.55 -0.3711676 -0.1625160 -0.2091442 0.4017606 0.3830636 0.3827699
## 13 0.60 -0.3605966 -0.1623163 -0.2064559 0.3699266 0.3508471 0.3505474
## 14 0.65 -0.3507449 -0.1618762 -0.2037799 0.3420319 0.3229778 0.3226785
## 15 0.70 -0.3415309 -0.1612407 -0.2011247 0.3174306 0.2986728 0.2983781
## 16 0.75 -0.3328862 -0.1604463 -0.1984968 0.2956059 0.2773208 0.2770335
## 17 0.80 -0.3247522 -0.1595229 -0.1959013 0.2761398 0.2584387 0.2581606
## 18 0.85 -0.3170787 -0.1584947 -0.1933422 0.2586916 0.2416403 0.2413724
## 19 0.90 -0.3098226 -0.1573820 -0.1908226 0.2429806 0.2266137 0.2263566
## 20 0.95 -0.3029463 -0.1562015 -0.1883445 0.2287746 0.2131045 0.2128583
## 21 1.00 -0.2964167 -0.1549670 -0.1859096 0.2158793 0.2009036 0.2006683
##          Rsq
## 1  0.6821943
## 2  0.6812892
## 3  0.6790412
## 4  0.6759204
## 5  0.6722077
## 6  0.6680797
## 7  0.6636529
## 8  0.6590071
## 9  0.6541990
## 10 0.6492707
## 11 0.6442538
## 12 0.6391733
## 13 0.6340486
## 14 0.6288960
## 15 0.6237284
## 16 0.6185570
## 17 0.6133908
## 18 0.6082378
## 19 0.6031045
## 20 0.5979967
## 21 0.5929192

由於此筆資料共線性本來就不高，即便尚未做 ridge regression 前，各個變相的 VIF 做多也才到 2 而已，因此不一定要使用 ridge regression 來做估計。但若要從 ridge trace 中選定一個 biasing constant c 的話，我認為 c = 0.15 是的不錯的選擇，其一 \(R_R^2\) 跟原始相差不多，VIF 在不同變相間也都接近 1，而 \(b_1^R\)、\(b_2^R\) 和 \(b_3^R\) 之後的變化也都不大了。

應用線性統計模式 作業四：Chapter 9, 10, & 11

B03605059 心理五 林子堯

2019-06-06