Exercise 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

First obtain a summary of the dataset, and set up a 10 fold cross-validation method.

set.seed(509)
library(ISLR)
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
## 
dim(Wage)
## [1] 3000   11
train_control <- trainControl(method = "cv", number = 10)

After cross validation, the optimal degree d is 3 (cubic function). The degrees above three all had p values above 0.05 (although \(age^{4}\) was barely above the threshold).

lm.fit <- train(wage ~ poly(age, 5), data = Wage, trControl = train_control, 
                method = 'lm')
print(lm.fit)
## Linear Regression 
## 
## 3000 samples
##    1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2701, 2700, 2699, 2699, 2699, 2701, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   39.86224  0.08740126  27.75652
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(lm.fit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.049 -24.386  -5.028  15.344 202.886 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      111.7036     0.7288 153.278  < 2e-16 ***
## `poly(age, 5)1`  447.0679    39.9161  11.200  < 2e-16 ***
## `poly(age, 5)2` -478.3158    39.9161 -11.983  < 2e-16 ***
## `poly(age, 5)3`  125.5217    39.9161   3.145  0.00168 ** 
## `poly(age, 5)4`  -77.9112    39.9161  -1.952  0.05105 .  
## `poly(age, 5)5`  -35.8129    39.9161  -0.897  0.36968    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.92 on 2994 degrees of freedom
## Multiple R-squared:  0.08651,    Adjusted R-squared:  0.08498 
## F-statistic: 56.71 on 5 and 2994 DF,  p-value: < 2.2e-16

Create hypothesis tests using ANOVA to compare the selection. The results of the hypothesis tests agreed with the results of the t-tests for the coefficients.

fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fit the selected model to the data for use in prediction.

fit <- lm(wage ~ poly(age, 3), data = Wage)
coef(summary(fit))
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)    111.7036  0.7290826 153.211181 0.000000e+00
## poly(age, 3)1  447.0679 39.9334995  11.195309 1.570802e-28
## poly(age, 3)2 -478.3158 39.9334995 -11.977808 2.511784e-32
## poly(age, 3)3  125.5217 39.9334995   3.143268 1.687063e-03

Create a grid of values for age and use the predict() function. Set up standard error bands for displaying a confidence interval band for the final plot.

agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
par(mfrow = c(1,1), mar = c(4.5, 4.5, 1, 1), oma = c(0,0,2,0))
plot(wage ~ age, data = Wage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Degree-3 Polynomial", outer = TRUE)
lines(age.grid, preds$fit, lwd = 2, col = "darkblue")
matlines(age.grid, se.bands, lwd =1, col = "blue", lty = 2)

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

The plot shows that the cross-validated error is minimized at 8 cuts.

all.cvs = rep(NA, 10)
for (i in 2:10) {
  Wage$age.cut = cut(Wage$age, i)
  lm.fit = glm(wage ~ age.cut, data = Wage)
  all.cvs[i] = cv.glm(Wage, lm.fit, K = 10)$delta[2]
}
plot(2:10, all.cvs[-1], xlab = "Number of Cuts", ylab = "CV Error", type = "l", 
     pch = 20, lwd = 2)

The grid of values was previously set up for age.grid, which is used here.

lmfit <- glm(wage ~ cut(age, 8), data = Wage)
lmpred <- predict(lmfit, newdata = list(age = age.grid))
plot(wage ~ age, data = Wage, col = "darkgrey")
lines(age.grid, lmpred, col = "red", lwd = 2)
title("Step Function with 8 Cuts")

Exercise 10

This question relates to the College data set.

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward step-wise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

Review a summary of the dataset and split into training and test sets.

library(leaps)
attach(College)
set.seed(1)
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
dim(College)
## [1] 777  18
sum(is.na(College))
## [1] 0
train <- sample(length(Outstate), length(Outstate)/2)
test <- -train
college_train <- College[train, ]
college_test <- College[test, ]

At 6 variables, the BIC is minimized while Cp is low enough to fall within standard deviation band and Adjusted \(R^{2}\) is high enough to fall within the standard deviation band.

reg_fit <- regsubsets(Outstate ~ ., data = college_train, nvmax = 17, 
                      method = "forward")
reg_summary <- summary(reg_fit)
par(mfrow = c(1, 3))
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min_cp <- min(reg_summary$cp)
std_cp <- sd(reg_summary$cp)
abline(h = min_cp + 0.2*std_cp, col = "red", lty = 2)
abline(h = min_cp - 0.2*std_cp, col = "red", lty = 2)
plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min_bic <- min(reg_summary$bic)
std_bic <- sd(reg_summary$bic)
abline(h = min_bic + 0.2*std_cp, col = "red", lty = 2)
abline(h = min_bic - 0.2*std_cp, col = "red", lty = 2)
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", 
     type = "l", ylim = c(0.4, 0.84))
max_adjr2 <- max(reg_summary$adjr2)
std_adjr2 <- sd(reg_summary$adjr2)
abline(h = max_adjr2 + 0.2*std_adjr2, col = "red", lty = 2)
abline(h = max_adjr2 - 0.2*std_adjr2, col = "red", lty = 2)

The best 6 explanatory variables are Private, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate. The model was used with the test data to obtain a MSE of 3,620,826.

coefi <- coef(reg_fit, id = 6)
names(coefi)
## [1] "(Intercept)" "PrivateYes"  "Room.Board"  "Terminal"    "perc.alumni"
## [6] "Expend"      "Grad.Rate"
lm_test <- lm(Outstate ~ Private + Room.Board + Terminal + perc.alumni + Expend + 
                Expend + Grad.Rate, data = college_test)
lm_pred <- predict(lm_test, college_test)
lm_err <- mean((college_test$Outstate - lm_pred)^2)
lm_err
## [1] 3620826

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)
## Loading required package: splines
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded gam 1.16.1
gam_fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + 
                 s(perc.alumni, df = 2) + s(Expend, df = 2) + s(Grad.Rate, df = 2),  
               data = college_train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
par(mfrow = c(2, 3))
plot(gam_fit, se = T, col = "blue")

(c) Evaluate the model obtained on the test set, and explain the results obtained.

The MSE for the GAM model was 3,438,192 compared to 3,620,826 for the linear model

gam_pred <- predict(gam_fit, college_test)
gam_err <- mean((college_test$Outstate - gam_pred)^2)
gam_err
## [1] 3438192

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

The ANOVA results for nonparametric effects lends evidence of a non-linear relationship between Expend and Outstate. The p value for the null hypothesis of a linear relationship is nearly 0. This would result in rejection of the null hypothesis.

summary(gam_fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, 
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 2) + s(Grad.Rate, 
##     df = 2), data = college_train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -6500.0 -1344.2  -104.2  1350.9  8792.9 
## 
## (Dispersion Parameter for gaussian family taken to be 3946537)
## 
##     Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1483898477 on 376.0001 degrees of freedom
## AIC: 7007.986 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private                  1 1843025465 1843025465 466.998 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1692725515 1692725515 428.914 < 2.2e-16 ***
## s(PhD, df = 2)           1  397159048  397159048 100.635 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  351049729  351049729  88.951 < 2.2e-16 ***
## s(Expend, df = 2)        1  419096845  419096845 106.194 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1   81869675   81869675  20.745 7.101e-06 ***
## Residuals              376 1483898477    3946537                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df Npar F     Pr(F)    
## (Intercept)                                        
## Private                                            
## s(Room.Board, df = 2)        1  2.005   0.15763    
## s(PhD, df = 2)               1  3.781   0.05258 .  
## s(perc.alumni, df = 2)       1  0.314   0.57572    
## s(Expend, df = 2)            1 47.156 2.725e-11 ***
## s(Grad.Rate, df = 2)         1  1.057   0.30446    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1