ISLR Chapter 7: Moving Beyond Linearity

Problem 6.

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

library(ISLR)
attach(Wage)
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(boot)
set.seed(1)
cv.error <- rep(NA,10)
for (i in 1:10) {
  glm.fit <- glm(wage~poly(age, i), data = Wage)
  cv.error[i] <- cv.glm(Wage, glm.fit, K=10)$delta[1]
}
cv.error
##  [1] 1676.826 1600.763 1598.399 1595.651 1594.977 1596.061 1594.298 1598.134
##  [9] 1593.913 1595.950
plot(1:10, cv.error, xlab = 'Degree', ylab = 'CV Error', type = 'l')
d.min <-which.min(cv.error)
points(d.min, cv.error[d.min], col ="dark blue", cex = 2, pch = 20)

Based on the polynomial regression plot, D=9 is the optimal degree for the polynomial with a cross validation error of 1593.913.

I will now conduct ANOVA hypothesis testing.

fit.1 <- lm(wage~poly(age, 1), data=Wage)
fit.2 <- lm(wage~poly(age, 2), data=Wage)
fit.3 <- lm(wage~poly(age, 3), data=Wage)
fit.4 <- lm(wage~poly(age, 4), data=Wage)
fit.5 <- lm(wage~poly(age, 5), data=Wage)
fit.6 <- lm(wage~poly(age, 6), data=Wage)
fit.7 <- lm(wage~poly(age, 7), data=Wage)
fit.8 <- lm(wage~poly(age, 8), data=Wage)
fit.9 <- lm(wage~poly(age, 9), data=Wage)
fit.10 <- lm(wage~poly(age, 10), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7, fit.8, fit.9, fit.10)
## Analysis of Variance Table
## 
## Model  1: wage ~ poly(age, 1)
## Model  2: wage ~ poly(age, 2)
## Model  3: wage ~ poly(age, 3)
## Model  4: wage ~ poly(age, 4)
## Model  5: wage ~ poly(age, 5)
## Model  6: wage ~ poly(age, 6)
## Model  7: wage ~ poly(age, 7)
## Model  8: wage ~ poly(age, 8)
## Model  9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
##    Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    2998 5022216                                    
## 2    2997 4793430  1    228786 143.7638 < 2.2e-16 ***
## 3    2996 4777674  1     15756   9.9005  0.001669 ** 
## 4    2995 4771604  1      6070   3.8143  0.050909 .  
## 5    2994 4770322  1      1283   0.8059  0.369398    
## 6    2993 4766389  1      3932   2.4709  0.116074    
## 7    2992 4763834  1      2555   1.6057  0.205199    
## 8    2991 4763707  1       127   0.0796  0.777865    
## 9    2990 4756703  1      7004   4.4014  0.035994 *  
## 10   2989 4756701  1         3   0.0017  0.967529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA table results, all of the polynomials above degree 3 are not significant with the exception of D=9 at the .05 significance level. However, D=2 and D=3 are significant at higher significance levels, indicating that a more complex model is required such as cubic model.

plot(wage~age, data=Wage, col="darkgrey", main ="Polynomial Degree 9 and 3")
agelims <- range(Wage$age)
age.grid <- seq(from=agelims[1], to=agelims[2])
lm.fit <- lm(wage~poly(age, 3), data=Wage)
lm.pred <- predict(lm.fit, data.frame(age=age.grid))
lines(age.grid, lm.pred, col="blue", lwd=2)
lm.fit2 <- lm(wage~poly(age, 9), data=Wage)
lm.pred2 <- predict(lm.fit2, data.frame(age=age.grid))
lines(age.grid, lm.pred2, col="red", lwd=2)

Based on the plot of the two polynomial predictions, I would have more confidence in the D=3 due to the fit of the line and the evidence in the ANOVA results.

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

cvstep <- rep(NA, 10)
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  lm.fit3 <- glm(wage~age.cut, data = Wage)
  cvstep[i] <- cv.glm(Wage, lm.fit3, K=10)$delta[2]
}
cvstep
##  [1]       NA 1735.017 1682.147 1636.701 1631.334 1625.009 1613.236 1599.865
##  [9] 1611.936 1606.397
plot(2:10, cvstep[-1], xlab = "Cuts", ylab ="CV Error", type ="l")
d.min2 <- which.min(cvstep)
points(d.min2, cv.error[d.min2], col ="dark blue", cex = 2, pch = 20)

Based on the cross validation results and plot, the optimal number of cuts is K=8 for a cross validation error rate of 1600.350.

lm.fit4 <- glm(wage~cut(age, 8), data=Wage)
agelims2 = range(Wage$age)
age.grid2 = seq(from=agelims2[1], to=agelims2[2])
lm.pred3 = predict(lm.fit4, data.frame(age=age.grid))
plot(wage~age, data=Wage, col="darkgrey")
lines(age.grid, lm.pred3, col="blue", lwd=2)

detach(Wage)

Problem 10.

This question relates to the College data set.

library(ISLR)
library(leaps)
attach(College)
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

set.seed(1)
train <- sample(length(Outstate), length(Outstate)/2)
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]
reg.fit <- regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
reg.summary = summary(reg.fit)
reg.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College.train, nvmax = 17, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
##           PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 3  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 4  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 )  "*"        " "  " "    " "    "*"       " "       " "        
## 9  ( 1 )  "*"        " "  " "    " "    "*"       " "       " "        
## 10  ( 1 ) "*"        " "  " "    " "    "*"       " "       " "        
## 11  ( 1 ) "*"        " "  "*"    " "    "*"       " "       " "        
## 12  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       " "        
## 13  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       " "        
## 14  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       " "        
## 15  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       " "        
## 16  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
## 17  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 6  ( 1 )  " "         "*"        " "   " "      " " "*"      " "      
## 7  ( 1 )  " "         "*"        " "   "*"      " " "*"      " "      
## 8  ( 1 )  " "         "*"        " "   "*"      " " "*"      " "      
## 9  ( 1 )  " "         "*"        " "   "*"      " " "*"      "*"      
## 10  ( 1 ) " "         "*"        "*"   "*"      " " "*"      "*"      
## 11  ( 1 ) " "         "*"        "*"   "*"      " " "*"      "*"      
## 12  ( 1 ) " "         "*"        "*"   "*"      " " "*"      "*"      
## 13  ( 1 ) " "         "*"        "*"   "*"      " " "*"      "*"      
## 14  ( 1 ) " "         "*"        "*"   "*"      " " "*"      "*"      
## 15  ( 1 ) "*"         "*"        "*"   "*"      " " "*"      "*"      
## 16  ( 1 ) "*"         "*"        "*"   "*"      " " "*"      "*"      
## 17  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         " "    " "      
## 2  ( 1 )  "*"         " "    " "      
## 3  ( 1 )  "*"         "*"    " "      
## 4  ( 1 )  "*"         "*"    " "      
## 5  ( 1 )  "*"         "*"    "*"      
## 6  ( 1 )  "*"         "*"    "*"      
## 7  ( 1 )  "*"         "*"    "*"      
## 8  ( 1 )  "*"         "*"    "*"      
## 9  ( 1 )  "*"         "*"    "*"      
## 10  ( 1 ) "*"         "*"    "*"      
## 11  ( 1 ) "*"         "*"    "*"      
## 12  ( 1 ) "*"         "*"    "*"      
## 13  ( 1 ) "*"         "*"    "*"      
## 14  ( 1 ) "*"         "*"    "*"      
## 15  ( 1 ) "*"         "*"    "*"      
## 16  ( 1 ) "*"         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"
reg.summary$adjr2
##  [1] 0.4486533 0.6157583 0.6806427 0.7262750 0.7418534 0.7516133 0.7538978
##  [8] 0.7559309 0.7577210 0.7587138 0.7587900 0.7646744 0.7690583 0.7706887
## [15] 0.7703905 0.7698240 0.7692238
which.max(reg.summary$adjr2)
## [1] 14
reg.summary$cp
##  [1] 538.19147 259.02390 151.39447  76.27854  51.30582  36.07408  33.23602
##  [8]  30.83073  28.84098  28.16937  28.99957  20.39253  14.26824  12.63239
## [15]  14.11928  16.03507  18.00000
which.min(reg.summary$cp)
## [1] 14
reg.summary$bic
##  [1] -220.0937 -355.2429 -422.0560 -476.9311 -494.7198 -504.7297 -503.3736
##  [8] -501.6536 -499.5740 -496.2339 -491.4261 -496.0811 -498.4523 -496.2790
## [15] -490.8555 -484.9828 -479.0586
which.min(reg.summary$bic)
## [1] 6
reg.summary$rss
##  [1] 3843936983 2671956475 2214992160 1893552628 1781123214 1709296809
##  [7] 1689130494 1670768074 1654137740 1643001768 1638125736 1593912800
## [13] 1560048632 1544893426 1542754678 1542403678 1542257478
which.min(reg.summary$rss)
## [1] 17
par(mfrow = c(1, 3))
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "l")
points(14, reg.summary$adjr2[14], col ='red', cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "CP", type = "l")
points(14, reg.summary$cp[14], col ='green', cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(6, reg.summary$bic[6], col ='blue', cex = 2, pch = 20)

coef(reg.fit, 14)
##   (Intercept)    PrivateYes          Apps        Accept        Enroll 
## -1105.5419407  2218.1453812    -0.3925751     0.9165783    -0.9704322 
##     Top10perc     Top25perc    Room.Board         Books      Personal 
##    61.3795033   -23.7045316     1.0714537    -0.9555993    -0.3123567 
##      Terminal     S.F.Ratio   perc.alumni        Expend     Grad.Rate 
##    33.3259647   -68.9039073    47.8594759     0.1471632    26.3490845

Based on the plots of forward stepwise selection on the training set, the model would fit the best predictors of Out of State with 14 variables (PrivateYes, Apps, Accept, Enroll, Top10perc, Top25perc, Room.Board, Books, Personal, Terminal, S.F.Ratio, perc.alumni, Expend, Grad.Rate).

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)
gam.fit <- gam(Outstate ~ Private + s(Apps, df = 2) + s(Accept, df =2) + s(Enroll, df =2) + s(Top10perc, df = 2) + s(Top25perc, df =2) + s(Room.Board, df =2) + s(Books, df=2) + s(Personal, df =2) + s(Terminal, df=2) + s(S.F.Ratio, df=2) + s(perc.alumni, df=2) + s(Expend, df=2) + s(Grad.Rate, df=2), data = College.train)

par(mfrow = c(2,3))
plot(gam.fit, se= T, col = 'blue')

The GAM plots suggest a potential non-linear relationship between Books, S.F.Ratio and Personal.

(c) Evaluate the model obtained on the test set, and explain the results obtained.

gam.pred <- predict(gam.fit, College.test)
gam.err <- mean((College.test$Outstate - gam.pred)^2)
gam.err
## [1] 3327954
gam.tss = mean((College.test$Outstate - mean(College.test$Outstate))^2)
test.rss = 1 - gam.err/gam.tss
test.rss
## [1] 0.7674922

Using GAM on the test set, we obtain a test r-squared of .77 and mean squared of 3327954.

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Apps, df = 2) + s(Accept, 
##     df = 2) + s(Enroll, df = 2) + s(Top10perc, df = 2) + s(Top25perc, 
##     df = 2) + s(Room.Board, df = 2) + s(Books, df = 2) + s(Personal, 
##     df = 2) + s(Terminal, df = 2) + s(S.F.Ratio, df = 2) + s(perc.alumni, 
##     df = 2) + s(Expend, df = 2) + s(Grad.Rate, df = 2), data = College.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -6098.91 -1241.11   -10.78  1222.05  8680.80 
## 
## (Dispersion Parameter for gaussian family taken to be 3613005)
## 
##     Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1300682882 on 360.0003 degrees of freedom
## AIC: 6988.854 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Private                  1 1929816146 1929816146 534.1305 < 2.2e-16 ***
## s(Apps, df = 2)          1  691457239  691457239 191.3801 < 2.2e-16 ***
## s(Accept, df = 2)        1   77177822   77177822  21.3611 5.308e-06 ***
## s(Enroll, df = 2)        1   10541100   10541100   2.9175 0.0884832 .  
## s(Top10perc, df = 2)     1 1157444142 1157444142 320.3550 < 2.2e-16 ***
## s(Top25perc, df = 2)     1   25267529   25267529   6.9935 0.0085388 ** 
## s(Room.Board, df = 2)    1  573575250  573575250 158.7530 < 2.2e-16 ***
## s(Books, df = 2)         1   40871174   40871174  11.3122 0.0008528 ***
## s(Personal, df = 2)      1   42455695   42455695  11.7508 0.0006786 ***
## s(Terminal, df = 2)      1   76318619   76318619  21.1233 5.968e-06 ***
## s(S.F.Ratio, df = 2)     1   75223534   75223534  20.8202 6.930e-06 ***
## s(perc.alumni, df = 2)   1  115818908  115818908  32.0561 3.065e-08 ***
## s(Expend, df = 2)        1  214622358  214622358  59.4027 1.264e-13 ***
## s(Grad.Rate, df = 2)     1   47800735   47800735  13.2302 0.0003157 ***
## Residuals              360 1300682882    3613005                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df Npar F     Pr(F)    
## (Intercept)                                        
## Private                                            
## s(Apps, df = 2)              1  3.339   0.06849 .  
## s(Accept, df = 2)            1  3.526   0.06122 .  
## s(Enroll, df = 2)            1  0.355   0.55144    
## s(Top10perc, df = 2)         1  0.624   0.42993    
## s(Top25perc, df = 2)         1  1.569   0.21119    
## s(Room.Board, df = 2)        1  1.627   0.20289    
## s(Books, df = 2)             1  2.693   0.10168    
## s(Personal, df = 2)          1  2.152   0.14326    
## s(Terminal, df = 2)          1  0.741   0.38976    
## s(S.F.Ratio, df = 2)         1  4.864   0.02806 *  
## s(perc.alumni, df = 2)       1  0.504   0.47817    
## s(Expend, df = 2)            1 39.333 1.025e-09 ***
## s(Grad.Rate, df = 2)         1  0.253   0.61526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA for Non-parametric effects, there is evidence of a non-linear relationship between the response variable Outstate with S.F. Ratio and Expend.

detach(College)