In this exercise, you will further analyze the Wage data set considered throughout this chapter.
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
First obtain a summary of the dataset, and set up a 10 fold cross-validation method.
set.seed(509)
library(ISLR)
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
dim(Wage)
## [1] 3000 11
train_control <- trainControl(method = "cv", number = 10)
After cross validation, the optimal degree d is 3 (cubic function). The degrees above three all had p values above 0.05 (although \(age^{4}\) was barely above the threshold).
lm.fit <- train(wage ~ poly(age, 5), data = Wage, trControl = train_control,
method = 'lm')
print(lm.fit)
## Linear Regression
##
## 3000 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2701, 2700, 2699, 2699, 2699, 2701, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 39.86224 0.08740126 27.75652
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(lm.fit)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.049 -24.386 -5.028 15.344 202.886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7288 153.278 < 2e-16 ***
## `poly(age, 5)1` 447.0679 39.9161 11.200 < 2e-16 ***
## `poly(age, 5)2` -478.3158 39.9161 -11.983 < 2e-16 ***
## `poly(age, 5)3` 125.5217 39.9161 3.145 0.00168 **
## `poly(age, 5)4` -77.9112 39.9161 -1.952 0.05105 .
## `poly(age, 5)5` -35.8129 39.9161 -0.897 0.36968
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.92 on 2994 degrees of freedom
## Multiple R-squared: 0.08651, Adjusted R-squared: 0.08498
## F-statistic: 56.71 on 5 and 2994 DF, p-value: < 2.2e-16
Create hypothesis tests using ANOVA to compare the selection. The results of the hypothesis tests agreed with the results of the t-tests for the coefficients.
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fit the selected model to the data for use in prediction.
fit <- lm(wage ~ poly(age, 3), data = Wage)
coef(summary(fit))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7290826 153.211181 0.000000e+00
## poly(age, 3)1 447.0679 39.9334995 11.195309 1.570802e-28
## poly(age, 3)2 -478.3158 39.9334995 -11.977808 2.511784e-32
## poly(age, 3)3 125.5217 39.9334995 3.143268 1.687063e-03
Create a grid of values for age and use the predict() function. Set up standard error bands for displaying a confidence interval band for the final plot.
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
par(mfrow = c(1,1), mar = c(4.5, 4.5, 1, 1), oma = c(0,0,2,0))
plot(wage ~ age, data = Wage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Degree-3 Polynomial", outer = TRUE)
lines(age.grid, preds$fit, lwd = 2, col = "darkblue")
matlines(age.grid, se.bands, lwd =1, col = "blue", lty = 2)
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
The plot shows that the cross-validated error is minimized at 8 cuts.
all.cvs = rep(NA, 10)
for (i in 2:10) {
Wage$age.cut = cut(Wage$age, i)
lm.fit = glm(wage ~ age.cut, data = Wage)
all.cvs[i] = cv.glm(Wage, lm.fit, K = 10)$delta[2]
}
plot(2:10, all.cvs[-1], xlab = "Number of Cuts", ylab = "CV Error", type = "l",
pch = 20, lwd = 2)
The grid of values was previously set up for age.grid, which is used here.
lmfit <- glm(wage ~ cut(age, 8), data = Wage)
lmpred <- predict(lmfit, newdata = list(age = age.grid))
plot(wage ~ age, data = Wage, col = "darkgrey")
lines(age.grid, lmpred, col = "red", lwd = 2)
title("Step Function with 8 Cuts")
This question relates to the College data set.
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward step-wise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
Review a summary of the dataset and split into training and test sets.
library(leaps)
attach(College)
set.seed(1)
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
dim(College)
## [1] 777 18
sum(is.na(College))
## [1] 0
train <- sample(length(Outstate), length(Outstate)/2)
test <- -train
college_train <- College[train, ]
college_test <- College[test, ]
At 6 variables, the BIC is minimized while Cp is low enough to fall within standard deviation band and Adjusted \(R^{2}\) is high enough to fall within the standard deviation band.
reg_fit <- regsubsets(Outstate ~ ., data = college_train, nvmax = 17,
method = "forward")
reg_summary <- summary(reg_fit)
par(mfrow = c(1, 3))
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min_cp <- min(reg_summary$cp)
std_cp <- sd(reg_summary$cp)
abline(h = min_cp + 0.2*std_cp, col = "red", lty = 2)
abline(h = min_cp - 0.2*std_cp, col = "red", lty = 2)
plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min_bic <- min(reg_summary$bic)
std_bic <- sd(reg_summary$bic)
abline(h = min_bic + 0.2*std_cp, col = "red", lty = 2)
abline(h = min_bic - 0.2*std_cp, col = "red", lty = 2)
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2",
type = "l", ylim = c(0.4, 0.84))
max_adjr2 <- max(reg_summary$adjr2)
std_adjr2 <- sd(reg_summary$adjr2)
abline(h = max_adjr2 + 0.2*std_adjr2, col = "red", lty = 2)
abline(h = max_adjr2 - 0.2*std_adjr2, col = "red", lty = 2)
The best 6 explanatory variables are Private, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate. The model was used with the test data to obtain a MSE of 3,620,826.
coefi <- coef(reg_fit, id = 6)
names(coefi)
## [1] "(Intercept)" "PrivateYes" "Room.Board" "Terminal" "perc.alumni"
## [6] "Expend" "Grad.Rate"
lm_test <- lm(Outstate ~ Private + Room.Board + Terminal + perc.alumni + Expend +
Expend + Grad.Rate, data = college_test)
lm_pred <- predict(lm_test, college_test)
lm_err <- mean((college_test$Outstate - lm_pred)^2)
lm_err
## [1] 3620826
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
## Loading required package: splines
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loaded gam 1.16.1
gam_fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) +
s(perc.alumni, df = 2) + s(Expend, df = 2) + s(Grad.Rate, df = 2),
data = college_train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
par(mfrow = c(2, 3))
plot(gam_fit, se = T, col = "blue")
(c) Evaluate the model obtained on the test set, and explain the results obtained.
The MSE for the GAM model was 3,438,192 compared to 3,620,826 for the linear model
gam_pred <- predict(gam_fit, college_test)
gam_err <- mean((college_test$Outstate - gam_pred)^2)
gam_err
## [1] 3438192
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
The ANOVA results for nonparametric effects lends evidence of a non-linear relationship between Expend and Outstate. The p value for the null hypothesis of a linear relationship is nearly 0. This would result in rejection of the null hypothesis.
summary(gam_fit)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,
## df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 2) + s(Grad.Rate,
## df = 2), data = college_train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6500.0 -1344.2 -104.2 1350.9 8792.9
##
## (Dispersion Parameter for gaussian family taken to be 3946537)
##
## Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1483898477 on 376.0001 degrees of freedom
## AIC: 7007.986
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1843025465 1843025465 466.998 < 2.2e-16 ***
## s(Room.Board, df = 2) 1 1692725515 1692725515 428.914 < 2.2e-16 ***
## s(PhD, df = 2) 1 397159048 397159048 100.635 < 2.2e-16 ***
## s(perc.alumni, df = 2) 1 351049729 351049729 88.951 < 2.2e-16 ***
## s(Expend, df = 2) 1 419096845 419096845 106.194 < 2.2e-16 ***
## s(Grad.Rate, df = 2) 1 81869675 81869675 20.745 7.101e-06 ***
## Residuals 376 1483898477 3946537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, df = 2) 1 2.005 0.15763
## s(PhD, df = 2) 1 3.781 0.05258 .
## s(perc.alumni, df = 2) 1 0.314 0.57572
## s(Expend, df = 2) 1 47.156 2.725e-11 ***
## s(Grad.Rate, df = 2) 1 1.057 0.30446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1