suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("MASS"))

Problem 1

prob1 <- read.csv("prob1.csv", header=TRUE)

Part A

Fit a regression model to all 30 observations using only three predictors: x1 (height), x2 (gender), and x5 (age).

model1 <- lm(y ~ x1 + x2 + x5, prob1)
summary(model1)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x5, data = prob1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.639 -14.208  -4.454  13.186  42.236 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.7471    63.1481  -0.139 0.890899    
## x1            1.3483     0.9478   1.423 0.166735    
## x2           44.7311    10.5345   4.246 0.000245 ***
## x5            2.1473     1.3897   1.545 0.134399    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.26 on 26 degrees of freedom
## Multiple R-squared:  0.6409, Adjusted R-squared:  0.5995 
## F-statistic: 15.47 on 3 and 26 DF,  p-value: 5.64e-06

How significant a factor would you say x1 is?

Based on the P-value, it appears that x1 is not terribly significant, which violates intuition.

Part B

Now leave x1 out of the model and fit a regression model using only x2 and x5 as the predictors.

model2 <- lm(y ~ x2 + x5, prob1)
summary(model2)
## 
## Call:
## lm(formula = y ~ x2 + x5, data = prob1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.945 -13.562  -4.806   8.734  41.951 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   71.428     29.022   2.461   0.0205 *  
## x2            52.396      9.223   5.681 4.92e-06 ***
## x5             2.487      1.395   1.783   0.0858 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.62 on 27 degrees of freedom
## Multiple R-squared:  0.613,  Adjusted R-squared:  0.5843 
## F-statistic: 21.38 on 2 and 27 DF,  p-value: 2.722e-06

Plot the standardized residuals versus x1, using one symbol for males and another symbol for females.

qplot(x=prob1$x1, y=rstandard(model2), color=factor(prob1$x2), shape=factor(prob1$x2)) + geom_point()

Comment on what you see and whether this would cause you to reconsider the conclusion from part (a) that height has very little relationship with weight after you factor out gender.

Within the female subgroup, there is clearly a linear trend between the residuals and x1. A trend is less evident within the male subgroup. However, if we remove a couple of the male outliers (the two + symbols on the far left), then it might appear as though males have a linear trend too).

The conclusion is that maybe weight actually is somewhat tied to height, which seems quite reasonable. We may just need a larger sample to minimize the effects of a few outliers.

Part C

Now consider again the model from part (a) with all three predictors. Plot the standardized residuals versus x5.

qplot(x=prob1$x5, y=rstandard(model1), color=factor(prob1$x2), shape=factor(prob1$x2)) + geom_point()

Provide an interpretation, commenting on anything unusual.

Students with higher age seem to have either a large positive residual (quite a bit heavier than the model predicts) or a large negative residual (quite a bit lighter than the model predicts). Perhaps we tend to go to extremes more as we get older.

In order to really tell if weight is related to age (with all other predictors factored out), we need a larger sample size.

Part D

What is the difference between the residuals and the standardized residuals?

Standardized residuals are divided by each residual’s estimated standard deviation.

For which observations will you have the largest difference?

Residuals for observations with the largest leverage have the smallest standard deviation. Therefore, standardized residuals for the high leverage observations are inflated the most.

Problem 2

Calculate the standardized regression coefficients \(\hat{\beta}_1^*\), \(\hat{\beta}_2^*\), and \(\hat{\beta}_3^*\) (i.e., the coefficients you get by fitting a model to the standardized predictors and response) using two different methods

prob2 <- read.csv("prob2.csv", header=TRUE)
model3 <- lm(PIQ ~ MRI + Height + Weight, prob2)
summary(model3)
## 
## Call:
## lm(formula = PIQ ~ MRI + Height + Weight, data = prob2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.74 -12.09  -3.84  14.17  51.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.114e+02  6.297e+01   1.768 0.085979 .  
## MRI          2.060e+00  5.634e-01   3.657 0.000856 ***
## Height      -2.732e+00  1.229e+00  -2.222 0.033034 *  
## Weight       5.599e-04  1.971e-01   0.003 0.997750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared:  0.2949, Adjusted R-squared:  0.2327 
## F-statistic: 4.741 on 3 and 34 DF,  p-value: 0.007215

(i) From the unstandardized coefficients B1, B2 , and β3, and the sample standard deviations of y, x1, x2, and x3: sy = 22.60, sx1 = 7.25, sx2 = 3.99, and sx3 = 23.48

sy <- 22.60
sx <- c(7.25, 3.99, 23.48)
c(model3$coefficients[["MRI"]], model3$coefficients[["Height"]], model3$coefficients[["Weight"]]) * sx/sy
## [1]  0.66095838 -0.48231847  0.00058174

(ii) From directly fitting a regression model to the standardized predictors and standardized response. Check that you get the same answers.

standardized_prob2 <- prob2 %>% mutate(
  PIQ = scale(PIQ),
  MRI = scale(MRI),
  Height = scale(Height),
  Weight = scale(Weight)
)

model4 <- lm(PIQ ~ MRI + Height + Weight, standardized_prob2)
c(model4$coefficients[["MRI"]], model4$coefficients[["Height"]], model4$coefficients[["Weight"]])
## [1]  0.661593272 -0.482822138  0.000581758

The coefficients are the same in both methods.

How do the relative effects of brain size, height, and weight compare for predicting IQ?

The largest effect on IQ is brain size. The second largest effect on brain size is height. An increase in brain size and weight both correlate to an increase in IQ, with brain size having the larger effect. An increase in height shows a decrease in IQ.

Problem 3

The

cormatrix <- cor(data.frame(prob2$MRI, prob2$Height, prob2$Weight))
cormatrix
##              prob2.MRI prob2.Height prob2.Weight
## prob2.MRI    1.0000000    0.5883668     0.513487
## prob2.Height 0.5883668    1.0000000     0.699614
## prob2.Weight 0.5134870    0.6996140     1.000000

Height and weight have the highest correlation.

x1 onto x2 and x3

summary(lm(MRI~Height+Weight, prob2))$r.squared
## [1] 0.366497

x2 onto x1 and x3

summary(lm(Height~MRI+Weight, prob2))$r.squared
## [1] 0.5607563

x3 onto x1 and x2

summary(lm(Weight~MRI+Height, prob2))$r.squared
## [1] 0.5053278

\(VIF_1 = \frac{1}{1-0.366} = 1.578\) \(VIF_2 = \frac{1}{1-0.561} = 2.277\) \(VIF_3 = \frac{1}{1-0.505} = 2.022\)

solve(cormatrix)
##               prob2.MRI prob2.Height prob2.Weight
## prob2.MRI     1.5785244   -0.7084223   -0.3149296
## prob2.Height -0.7084223    2.2766406   -1.2290041
## prob2.Weight -0.3149296   -1.2290041    2.0215407

The VIFs are the same in both methods. Since the VIFs are less than 10, there is not high multicollinearity.

Problem 4

prob4 <- read.csv("prob4.csv", header=TRUE)
standardized_prob4 <- prob4 %>% mutate(
  X100m = scale(X100m),
  X200m = scale(X200m),
  X400m = scale(X400m),
  X800m = scale(X800m),
  X1500m = scale(X1500m),
  X5000m = scale(X5000m),
  X10000m = scale(X10000m),
  marathon = scale(marathon)
)

Part A

model5 <- lm.ridge(
  marathon ~ X100m + X200m + X400m + X800m + X1500m + X5000m + X10000m, 
  standardized_prob4, 
  lambda=5
)
model5
##                       X100m         X200m         X400m         X800m 
##  2.669399e-16 -9.540014e-02 -5.005037e-02  2.499942e-02  9.500216e-02 
##        X1500m        X5000m       X10000m 
##  9.439399e-02  3.608137e-01  4.548958e-01
model6 <- lm(
  marathon ~ X100m + X200m + X400m + X800m + X1500m + X5000m + X10000m, 
  standardized_prob4
)
summary(model6)
## 
## Call:
## lm(formula = marathon ~ X100m + X200m + X400m + X800m + X1500m + 
##     X5000m + X10000m, data = standardized_prob4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86331 -0.19269 -0.00361  0.19144  0.82039 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.966e-17  4.461e-02   0.000 1.000000    
## X100m       -1.316e-01  1.271e-01  -1.036 0.305551    
## X200m       -1.913e-02  1.345e-01  -0.142 0.887506    
## X400m        1.109e-02  1.143e-01   0.097 0.923113    
## X800m        9.906e-02  1.312e-01   0.755 0.454080    
## X1500m      -1.241e-01  1.658e-01  -0.748 0.458007    
## X5000m       2.795e-01  2.083e-01   1.342 0.186145    
## X10000m      7.886e-01  2.185e-01   3.609 0.000744 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3309 on 47 degrees of freedom
## Multiple R-squared:  0.9047, Adjusted R-squared:  0.8905 
## F-statistic: 63.75 on 7 and 47 DF,  p-value: < 2.2e-16

The ridge regression model seems to make more sense. The coefficients increase from negative values to positive values as the distance of the race increases. This makes sense as the prediction should get harder as the race gets longer.

Part B, C

lambda <- vector(mode="numeric", length=10)
lambda[1] = 0
for (i in 10:2) {
  if (i==10) {
    lambda[i] = 64
  }
  else {
    lambda[i] = lambda[i+1] / 2
  }
}
lambda
##  [1]  0.00  0.25  0.50  1.00  2.00  4.00  8.00 16.00 32.00 64.00
outridge <- lm.ridge(
  marathon ~ X100m + X200m + X400m + X800m + X1500m + X5000m + X10000m, 
  standardized_prob4, 
  lambda=lambda
)

outridge
##                            X100m         X200m      X400m      X800m
##  0.00 2.443221e-17 -1.316401e-01 -0.0191333962 0.01109401 0.09905998
##  0.25 5.592475e-17 -1.250040e-01 -0.0275392071 0.01192336 0.09520559
##  0.50 8.187493e-17 -1.201898e-01 -0.0335844269 0.01272075 0.09251374
##  1.00 1.228005e-16 -1.136797e-01 -0.0414393173 0.01425258 0.08942909
##  2.00 1.792364e-16 -1.063738e-01 -0.0486810364 0.01715818 0.08818450
##  4.00 2.453715e-16 -9.848150e-02 -0.0510486403 0.02252341 0.09226052
##  8.00 3.099652e-16 -8.678071e-02 -0.0444267202 0.03163762 0.10242556
## 16.00 3.620795e-16 -6.561450e-02 -0.0269099161 0.04465782 0.11348980
## 32.00 3.984468e-16 -3.386091e-02 -0.0009720094 0.05891800 0.11824913
## 64.00 4.256163e-16  6.767748e-05  0.0249941889 0.06901211 0.11311644
##            X1500m    X5000m   X10000m
##  0.00 -0.12410447 0.2794730 0.7885656
##  0.25 -0.09504711 0.3097242 0.7329444
##  0.50 -0.07083623 0.3295515 0.6909180
##  1.00 -0.03249987 0.3521918 0.6306617
##  2.00  0.01963097 0.3678847 0.5570671
##  4.00  0.07712541 0.3659854 0.4794569
##  8.00  0.12542225 0.3430665 0.4046418
## 16.00  0.15269715 0.3030761 0.3342812
## 32.00  0.15640159 0.2526305 0.2678666
## 64.00  0.14193163 0.1989605 0.2063733
plot(outridge)

select(outridge)
## modified HKB estimator is 0.750373 
## modified L-W estimator is 0.6162232 
## smallest value of GCV  at 2

A good choice for stabilizing the estimates would be a lambda between 10 and 20. These lambdas are large enough to have stable but non-bias predictions. From the following graph, a reasonable choice might be 10 < lambda < 20. The generalized cross-validation (GCV) shows that a lambda of 2 is best.

Part D

Use multiple (at least 20) replicates of 10-fold cross-validation to choose the best λ.

k_fold_cross_validation <- function(data, k) {
  id <- sample(rep(seq.int(k), length.out=nrow(data)))
  matrices <- lapply(seq.int(k), function(x) list(
    train_matrix = data[x!=id, ],
    test_matrix  = data[x==id, ]
  ))
  return(matrices)
}

sse_cross_validation <- function(data, matrices) {

  SSE_J <- lapply(matrices, function(x) {

    models <- lm.ridge(
      marathon ~ X100m + X200m + X400m + X800m + X1500m + X5000m + X10000m, 
      x$train_matrix, 
      lambda=lambda
    )
    
    X <- as.matrix(x$test_matrix[c("X100m", "X200m", "X400m", "X800m", "X1500m", "X5000m", "X10000m")])
    
    SSE = lapply(seq(1:length(lambda)), function(y) { 
      pred = X %*% as.matrix(models$coef[,y]) 
      sum((x$test_matrix$marathon - pred)^2)
    })

  })
  
  SUM_SSE <- vector(mode="numeric", length=length(lambda))
  
  for(i in 1:length(lambda)) {
    for(j in 1:length(matrices)) {
      SUM_SSE[i] = SUM_SSE[i] + SSE_J[[j]][[i]]
    }
  }
  
  return(SUM_SSE)
}

seeds <- c(23423, 834, 2342, 456, 3467, 5778, 983, 22, 13, 6757, 6356, 2323, 8578, 8872, 3223, 43432, 3444, 45575, 2213, 9989)

SSE_CV_20 <- as.data.frame(do.call(rbind, lapply(1:20, function(i) {
  set.seed(seeds[i])
  matrices <- k_fold_cross_validation(prob4, 10)
  sse_cross_validation(prob4, matrices)  
})))

SSE_CV_20
##          V1       V2       V3       V4       V5       V6       V7       V8
## 1  514195.2 400000.8 322686.3 225112.7 128177.2 55632.86 16921.78 4420.541
## 2  576540.8 444835.4 356422.5 246240.6 138880.4 60389.93 19330.03 5988.268
## 3  461971.0 353492.7 281403.9 192313.7 106291.2 43917.53 11766.24 2256.701
## 4  495432.6 384849.2 309604.4 215002.0 122051.5 53454.93 16918.28 4304.885
## 5  485303.9 373349.2 298070.5 204034.7 112461.8 46400.78 13194.97 3689.132
## 6  507835.8 388306.8 308910.7 211065.6 116980.8 48974.35 13696.29 2412.920
## 7  469859.0 365245.0 293351.6 202437.5 113125.8 47953.35 14234.23 3679.067
## 8  468666.7 360710.0 287887.1 197106.1 109277.0 46182.94 14023.80 4073.818
## 9  485863.3 370747.5 294797.8 201624.8 112341.8 47723.36 13812.94 2657.696
## 10 541541.0 412108.7 326573.4 221755.1 122019.7 51238.06 15330.07 3984.826
## 11 540357.6 410855.3 325221.1 220220.3 120200.9 49119.99 13125.00 1927.258
## 12 510717.4 389504.9 309388.8 210965.7 116721.5 49064.92 14298.54 3282.495
## 13 472019.1 369672.6 298578.3 207674.3 117002.1 49757.47 14561.20 3399.960
## 14 559718.8 428772.3 341656.6 234228.1 131019.1 56611.59 17865.33 4812.291
## 15 501733.3 388330.7 310903.8 213651.1 118735.8 49831.31 14270.98 2888.576
## 16 520704.9 401300.1 320852.4 220553.7 123020.4 52022.55 15048.04 3063.928
## 17 484344.8 377747.6 304572.8 211921.5 120210.5 52168.29 16031.97 3970.374
## 18 534130.2 406982.9 322912.0 219725.3 121174.8 50663.75 14437.91 2871.895
## 19 525692.7 411024.5 331987.1 231399.9 131536.8 57891.09 19347.76 6578.058
## 20 481841.3 368875.8 293224.3 199541.2 109538.3 45186.56 12396.59 2188.303
##          V9      V10
## 1  6718.364 23895.14
## 2  7743.597 24333.91
## 3  5630.245 23174.98
## 4  5181.042 20726.48
## 5  7008.095 24382.34
## 6  4444.063 20927.30
## 7  6210.488 23367.05
## 8  6449.448 22938.11
## 9  4495.578 20707.62
## 10 5919.078 22222.69
## 11 4100.693 20734.78
## 12 5328.550 21701.74
## 13 5567.316 22147.14
## 14 5815.606 21573.51
## 15 4701.146 20880.12
## 16 4920.113 21452.05
## 17 5458.861 21499.48
## 18 4895.708 21491.70
## 19 7935.213 24231.36
## 20 4519.405 21128.11
colMeans(SSE_CV_20)
##        V1        V2        V3        V4        V5        V6        V7 
## 506923.47 390335.60 311950.27 214328.70 119538.37  50709.28  15030.60 
##        V8        V9       V10 
##   3622.55   5652.13  22175.78

The best lambda is 16 since the sum of SSE for 10 fold CV averaged across 20 replicates shows this value to have the lowest SSE.