suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("MASS"))
prob1 <- read.csv("prob1.csv", header=TRUE)
Fit a regression model to all 30 observations using only three predictors: x1 (height), x2 (gender), and x5 (age).
model1 <- lm(y ~ x1 + x2 + x5, prob1)
summary(model1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x5, data = prob1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.639 -14.208 -4.454 13.186 42.236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.7471 63.1481 -0.139 0.890899
## x1 1.3483 0.9478 1.423 0.166735
## x2 44.7311 10.5345 4.246 0.000245 ***
## x5 2.1473 1.3897 1.545 0.134399
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.26 on 26 degrees of freedom
## Multiple R-squared: 0.6409, Adjusted R-squared: 0.5995
## F-statistic: 15.47 on 3 and 26 DF, p-value: 5.64e-06
How significant a factor would you say x1 is?
Based on the P-value, it appears that x1 is not terribly significant, which violates intuition.
Now leave x1 out of the model and fit a regression model using only x2 and x5 as the predictors.
model2 <- lm(y ~ x2 + x5, prob1)
summary(model2)
##
## Call:
## lm(formula = y ~ x2 + x5, data = prob1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.945 -13.562 -4.806 8.734 41.951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.428 29.022 2.461 0.0205 *
## x2 52.396 9.223 5.681 4.92e-06 ***
## x5 2.487 1.395 1.783 0.0858 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.62 on 27 degrees of freedom
## Multiple R-squared: 0.613, Adjusted R-squared: 0.5843
## F-statistic: 21.38 on 2 and 27 DF, p-value: 2.722e-06
Plot the standardized residuals versus x1, using one symbol for males and another symbol for females.
qplot(x=prob1$x1, y=rstandard(model2), color=factor(prob1$x2), shape=factor(prob1$x2)) + geom_point()
Comment on what you see and whether this would cause you to reconsider the conclusion from part (a) that height has very little relationship with weight after you factor out gender.
Within the female subgroup, there is clearly a linear trend between the residuals and x1. A trend is less evident within the male subgroup. However, if we remove a couple of the male outliers (the two + symbols on the far left), then it might appear as though males have a linear trend too).
The conclusion is that maybe weight actually is somewhat tied to height, which seems quite reasonable. We may just need a larger sample to minimize the effects of a few outliers.
Now consider again the model from part (a) with all three predictors. Plot the standardized residuals versus x5.
qplot(x=prob1$x5, y=rstandard(model1), color=factor(prob1$x2), shape=factor(prob1$x2)) + geom_point()
Provide an interpretation, commenting on anything unusual.
Students with higher age seem to have either a large positive residual (quite a bit heavier than the model predicts) or a large negative residual (quite a bit lighter than the model predicts). Perhaps we tend to go to extremes more as we get older.
In order to really tell if weight is related to age (with all other predictors factored out), we need a larger sample size.
What is the difference between the residuals and the standardized residuals?
Standardized residuals are divided by each residual’s estimated standard deviation.
For which observations will you have the largest difference?
Residuals for observations with the largest leverage have the smallest standard deviation. Therefore, standardized residuals for the high leverage observations are inflated the most.
Calculate the standardized regression coefficients \(\hat{\beta}_1^*\), \(\hat{\beta}_2^*\), and \(\hat{\beta}_3^*\) (i.e., the coefficients you get by fitting a model to the standardized predictors and response) using two different methods
prob2 <- read.csv("prob2.csv", header=TRUE)
model3 <- lm(PIQ ~ MRI + Height + Weight, prob2)
summary(model3)
##
## Call:
## lm(formula = PIQ ~ MRI + Height + Weight, data = prob2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.74 -12.09 -3.84 14.17 51.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.114e+02 6.297e+01 1.768 0.085979 .
## MRI 2.060e+00 5.634e-01 3.657 0.000856 ***
## Height -2.732e+00 1.229e+00 -2.222 0.033034 *
## Weight 5.599e-04 1.971e-01 0.003 0.997750
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared: 0.2949, Adjusted R-squared: 0.2327
## F-statistic: 4.741 on 3 and 34 DF, p-value: 0.007215
(i) From the unstandardized coefficients B1, B2 , and β3, and the sample standard deviations of y, x1, x2, and x3: sy = 22.60, sx1 = 7.25, sx2 = 3.99, and sx3 = 23.48
sy <- 22.60
sx <- c(7.25, 3.99, 23.48)
c(model3$coefficients[["MRI"]], model3$coefficients[["Height"]], model3$coefficients[["Weight"]]) * sx/sy
## [1] 0.66095838 -0.48231847 0.00058174
(ii) From directly fitting a regression model to the standardized predictors and standardized response. Check that you get the same answers.
standardized_prob2 <- prob2 %>% mutate(
PIQ = scale(PIQ),
MRI = scale(MRI),
Height = scale(Height),
Weight = scale(Weight)
)
model4 <- lm(PIQ ~ MRI + Height + Weight, standardized_prob2)
c(model4$coefficients[["MRI"]], model4$coefficients[["Height"]], model4$coefficients[["Weight"]])
## [1] 0.661593272 -0.482822138 0.000581758
The coefficients are the same in both methods.
How do the relative effects of brain size, height, and weight compare for predicting IQ?
The largest effect on IQ is brain size. The second largest effect on brain size is height. An increase in brain size and weight both correlate to an increase in IQ, with brain size having the larger effect. An increase in height shows a decrease in IQ.
The
cormatrix <- cor(data.frame(prob2$MRI, prob2$Height, prob2$Weight))
cormatrix
## prob2.MRI prob2.Height prob2.Weight
## prob2.MRI 1.0000000 0.5883668 0.513487
## prob2.Height 0.5883668 1.0000000 0.699614
## prob2.Weight 0.5134870 0.6996140 1.000000
Height and weight have the highest correlation.
x1 onto x2 and x3
summary(lm(MRI~Height+Weight, prob2))$r.squared
## [1] 0.366497
x2 onto x1 and x3
summary(lm(Height~MRI+Weight, prob2))$r.squared
## [1] 0.5607563
x3 onto x1 and x2
summary(lm(Weight~MRI+Height, prob2))$r.squared
## [1] 0.5053278
\(VIF_1 = \frac{1}{1-0.366} = 1.578\) \(VIF_2 = \frac{1}{1-0.561} = 2.277\) \(VIF_3 = \frac{1}{1-0.505} = 2.022\)
solve(cormatrix)
## prob2.MRI prob2.Height prob2.Weight
## prob2.MRI 1.5785244 -0.7084223 -0.3149296
## prob2.Height -0.7084223 2.2766406 -1.2290041
## prob2.Weight -0.3149296 -1.2290041 2.0215407
The VIFs are the same in both methods. Since the VIFs are less than 10, there is not high multicollinearity.
prob4 <- read.csv("prob4.csv", header=TRUE)
standardized_prob4 <- prob4 %>% mutate(
X100m = scale(X100m),
X200m = scale(X200m),
X400m = scale(X400m),
X800m = scale(X800m),
X1500m = scale(X1500m),
X5000m = scale(X5000m),
X10000m = scale(X10000m),
marathon = scale(marathon)
)
model5 <- lm.ridge(
marathon ~ X100m + X200m + X400m + X800m + X1500m + X5000m + X10000m,
standardized_prob4,
lambda=5
)
model5
## X100m X200m X400m X800m
## 2.669399e-16 -9.540014e-02 -5.005037e-02 2.499942e-02 9.500216e-02
## X1500m X5000m X10000m
## 9.439399e-02 3.608137e-01 4.548958e-01
model6 <- lm(
marathon ~ X100m + X200m + X400m + X800m + X1500m + X5000m + X10000m,
standardized_prob4
)
summary(model6)
##
## Call:
## lm(formula = marathon ~ X100m + X200m + X400m + X800m + X1500m +
## X5000m + X10000m, data = standardized_prob4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.86331 -0.19269 -0.00361 0.19144 0.82039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.966e-17 4.461e-02 0.000 1.000000
## X100m -1.316e-01 1.271e-01 -1.036 0.305551
## X200m -1.913e-02 1.345e-01 -0.142 0.887506
## X400m 1.109e-02 1.143e-01 0.097 0.923113
## X800m 9.906e-02 1.312e-01 0.755 0.454080
## X1500m -1.241e-01 1.658e-01 -0.748 0.458007
## X5000m 2.795e-01 2.083e-01 1.342 0.186145
## X10000m 7.886e-01 2.185e-01 3.609 0.000744 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3309 on 47 degrees of freedom
## Multiple R-squared: 0.9047, Adjusted R-squared: 0.8905
## F-statistic: 63.75 on 7 and 47 DF, p-value: < 2.2e-16
The ridge regression model seems to make more sense. The coefficients increase from negative values to positive values as the distance of the race increases. This makes sense as the prediction should get harder as the race gets longer.
lambda <- vector(mode="numeric", length=10)
lambda[1] = 0
for (i in 10:2) {
if (i==10) {
lambda[i] = 64
}
else {
lambda[i] = lambda[i+1] / 2
}
}
lambda
## [1] 0.00 0.25 0.50 1.00 2.00 4.00 8.00 16.00 32.00 64.00
outridge <- lm.ridge(
marathon ~ X100m + X200m + X400m + X800m + X1500m + X5000m + X10000m,
standardized_prob4,
lambda=lambda
)
outridge
## X100m X200m X400m X800m
## 0.00 2.443221e-17 -1.316401e-01 -0.0191333962 0.01109401 0.09905998
## 0.25 5.592475e-17 -1.250040e-01 -0.0275392071 0.01192336 0.09520559
## 0.50 8.187493e-17 -1.201898e-01 -0.0335844269 0.01272075 0.09251374
## 1.00 1.228005e-16 -1.136797e-01 -0.0414393173 0.01425258 0.08942909
## 2.00 1.792364e-16 -1.063738e-01 -0.0486810364 0.01715818 0.08818450
## 4.00 2.453715e-16 -9.848150e-02 -0.0510486403 0.02252341 0.09226052
## 8.00 3.099652e-16 -8.678071e-02 -0.0444267202 0.03163762 0.10242556
## 16.00 3.620795e-16 -6.561450e-02 -0.0269099161 0.04465782 0.11348980
## 32.00 3.984468e-16 -3.386091e-02 -0.0009720094 0.05891800 0.11824913
## 64.00 4.256163e-16 6.767748e-05 0.0249941889 0.06901211 0.11311644
## X1500m X5000m X10000m
## 0.00 -0.12410447 0.2794730 0.7885656
## 0.25 -0.09504711 0.3097242 0.7329444
## 0.50 -0.07083623 0.3295515 0.6909180
## 1.00 -0.03249987 0.3521918 0.6306617
## 2.00 0.01963097 0.3678847 0.5570671
## 4.00 0.07712541 0.3659854 0.4794569
## 8.00 0.12542225 0.3430665 0.4046418
## 16.00 0.15269715 0.3030761 0.3342812
## 32.00 0.15640159 0.2526305 0.2678666
## 64.00 0.14193163 0.1989605 0.2063733
plot(outridge)
select(outridge)
## modified HKB estimator is 0.750373
## modified L-W estimator is 0.6162232
## smallest value of GCV at 2
A good choice for stabilizing the estimates would be a lambda between 10 and 20. These lambdas are large enough to have stable but non-bias predictions. From the following graph, a reasonable choice might be 10 < lambda < 20. The generalized cross-validation (GCV) shows that a lambda of 2 is best.
Use multiple (at least 20) replicates of 10-fold cross-validation to choose the best λ.
k_fold_cross_validation <- function(data, k) {
id <- sample(rep(seq.int(k), length.out=nrow(data)))
matrices <- lapply(seq.int(k), function(x) list(
train_matrix = data[x!=id, ],
test_matrix = data[x==id, ]
))
return(matrices)
}
sse_cross_validation <- function(data, matrices) {
SSE_J <- lapply(matrices, function(x) {
models <- lm.ridge(
marathon ~ X100m + X200m + X400m + X800m + X1500m + X5000m + X10000m,
x$train_matrix,
lambda=lambda
)
X <- as.matrix(x$test_matrix[c("X100m", "X200m", "X400m", "X800m", "X1500m", "X5000m", "X10000m")])
SSE = lapply(seq(1:length(lambda)), function(y) {
pred = X %*% as.matrix(models$coef[,y])
sum((x$test_matrix$marathon - pred)^2)
})
})
SUM_SSE <- vector(mode="numeric", length=length(lambda))
for(i in 1:length(lambda)) {
for(j in 1:length(matrices)) {
SUM_SSE[i] = SUM_SSE[i] + SSE_J[[j]][[i]]
}
}
return(SUM_SSE)
}
seeds <- c(23423, 834, 2342, 456, 3467, 5778, 983, 22, 13, 6757, 6356, 2323, 8578, 8872, 3223, 43432, 3444, 45575, 2213, 9989)
SSE_CV_20 <- as.data.frame(do.call(rbind, lapply(1:20, function(i) {
set.seed(seeds[i])
matrices <- k_fold_cross_validation(prob4, 10)
sse_cross_validation(prob4, matrices)
})))
SSE_CV_20
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 514195.2 400000.8 322686.3 225112.7 128177.2 55632.86 16921.78 4420.541
## 2 576540.8 444835.4 356422.5 246240.6 138880.4 60389.93 19330.03 5988.268
## 3 461971.0 353492.7 281403.9 192313.7 106291.2 43917.53 11766.24 2256.701
## 4 495432.6 384849.2 309604.4 215002.0 122051.5 53454.93 16918.28 4304.885
## 5 485303.9 373349.2 298070.5 204034.7 112461.8 46400.78 13194.97 3689.132
## 6 507835.8 388306.8 308910.7 211065.6 116980.8 48974.35 13696.29 2412.920
## 7 469859.0 365245.0 293351.6 202437.5 113125.8 47953.35 14234.23 3679.067
## 8 468666.7 360710.0 287887.1 197106.1 109277.0 46182.94 14023.80 4073.818
## 9 485863.3 370747.5 294797.8 201624.8 112341.8 47723.36 13812.94 2657.696
## 10 541541.0 412108.7 326573.4 221755.1 122019.7 51238.06 15330.07 3984.826
## 11 540357.6 410855.3 325221.1 220220.3 120200.9 49119.99 13125.00 1927.258
## 12 510717.4 389504.9 309388.8 210965.7 116721.5 49064.92 14298.54 3282.495
## 13 472019.1 369672.6 298578.3 207674.3 117002.1 49757.47 14561.20 3399.960
## 14 559718.8 428772.3 341656.6 234228.1 131019.1 56611.59 17865.33 4812.291
## 15 501733.3 388330.7 310903.8 213651.1 118735.8 49831.31 14270.98 2888.576
## 16 520704.9 401300.1 320852.4 220553.7 123020.4 52022.55 15048.04 3063.928
## 17 484344.8 377747.6 304572.8 211921.5 120210.5 52168.29 16031.97 3970.374
## 18 534130.2 406982.9 322912.0 219725.3 121174.8 50663.75 14437.91 2871.895
## 19 525692.7 411024.5 331987.1 231399.9 131536.8 57891.09 19347.76 6578.058
## 20 481841.3 368875.8 293224.3 199541.2 109538.3 45186.56 12396.59 2188.303
## V9 V10
## 1 6718.364 23895.14
## 2 7743.597 24333.91
## 3 5630.245 23174.98
## 4 5181.042 20726.48
## 5 7008.095 24382.34
## 6 4444.063 20927.30
## 7 6210.488 23367.05
## 8 6449.448 22938.11
## 9 4495.578 20707.62
## 10 5919.078 22222.69
## 11 4100.693 20734.78
## 12 5328.550 21701.74
## 13 5567.316 22147.14
## 14 5815.606 21573.51
## 15 4701.146 20880.12
## 16 4920.113 21452.05
## 17 5458.861 21499.48
## 18 4895.708 21491.70
## 19 7935.213 24231.36
## 20 4519.405 21128.11
colMeans(SSE_CV_20)
## V1 V2 V3 V4 V5 V6 V7
## 506923.47 390335.60 311950.27 214328.70 119538.37 50709.28 15030.60
## V8 V9 V10
## 3622.55 5652.13 22175.78
The best lambda is 16 since the sum of SSE for 10 fold CV averaged across 20 replicates shows this value to have the lowest SSE.