Chapter 07 (page 297): 6, 10

Problem 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

library(ISLR)
attach(Wage)
# View summary for the Wage dataset 
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
## 
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

k-Fold Cross-Validation

# seed the random number generator to aid reproducibility
library(boot)
set.seed(17)
cv.error.10 <- rep(0, 10) # vector to store the computed cross-validation errors
for (i in 1:10) { # loop over polynomial degrees 1 through 10
  glm.fit <- glm(wage ~ poly(age, i), data = Wage) # fit a degree-i polynomial regression
  cv.error.10[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1] # store the 10-fold CV error
}
cv.error.10 # print the resulting cross-validation errors
##  [1] 1675.109 1601.423 1595.850 1594.073 1593.960 1597.031 1594.023 1594.544
##  [9] 1592.573 1593.557
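The minimum CV error occurs at degree 9, but the curve is essentially flat from degree 3 onward, so a much lower degree is adequate. As a quick check (a sketch in base R; note that sd() across the ten errors is only a rough stand-in for the true standard error of the CV estimate), we can read off both the minimizing degree and the smallest degree within that one-SD threshold:

# degree that minimizes the CV error
d.min <- which.min(cv.error.10)
# smallest degree whose CV error falls within one SD of the minimum
d.1se <- min(which(cv.error.10 <= min(cv.error.10) + sd(cv.error.10)))
c(minimum = d.min, one.sd.rule = d.1se)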

Plot

# illustrate the results with a line plot connecting the CV-error points
plot( x = 1:10, y = cv.error.10, xlab = "power of age", ylab = "CV error", 
      type = "b", pch = 19, lwd = 2, bty = "n", 
      ylim = c( min(cv.error.10) - sd(cv.error.10), max(cv.error.10) + sd(cv.error.10) ) )

# horizontal line at the minimum plus one standard deviation
# (a one-SE-style threshold that favors a less complex model)
abline(h = min(cv.error.10) + sd(cv.error.10), lty = "dotted")

# mark the minimum
points( x = which.min(cv.error.10), y = min(cv.error.10), col = "red", pch = "X", cex = 1.5 )

ANOVA

fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
fit.6 <- lm(wage ~ poly(age, 6), data = Wage)
fit.7 <- lm(wage ~ poly(age, 7), data = Wage)
fit.8 <- lm(wage ~ poly(age, 8), data = Wage)
fit.9 <- lm(wage ~ poly(age, 9), data = Wage)
fit.10 <- lm(wage ~ poly(age, 10), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7, fit.8, fit.9, fit.10)
## Analysis of Variance Table
## 
## Model  1: wage ~ age
## Model  2: wage ~ poly(age, 2)
## Model  3: wage ~ poly(age, 3)
## Model  4: wage ~ poly(age, 4)
## Model  5: wage ~ poly(age, 5)
## Model  6: wage ~ poly(age, 6)
## Model  7: wage ~ poly(age, 7)
## Model  8: wage ~ poly(age, 8)
## Model  9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
##    Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    2998 5022216                                    
## 2    2997 4793430  1    228786 143.7638 < 2.2e-16 ***
## 3    2996 4777674  1     15756   9.9005  0.001669 ** 
## 4    2995 4771604  1      6070   3.8143  0.050909 .  
## 5    2994 4770322  1      1283   0.8059  0.369398    
## 6    2993 4766389  1      3932   2.4709  0.116074    
## 7    2992 4763834  1      2555   1.6057  0.205199    
## 8    2991 4763707  1       127   0.0796  0.777865    
## 9    2990 4756703  1      7004   4.4014  0.035994 *  
## 10   2989 4756701  1         3   0.0017  0.967529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero (\(<10^{-15}\)), indicating that a linear fit is not sufficient. Similarly, the p-value comparing the quadratic Model 2 to the cubic Model 3 is very low (0.001669), so the quadratic fit is also insufficient. The p-value comparing the cubic Model 3 to the degree-4 Model 4 is approximately 5%, while the degree-5 Model 5 seems unnecessary because its p-value is 0.369398. Hence either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, while lower- or higher-order models are not justified. This agrees with the cross-validation result, where the CV error stopped improving appreciably after degree 3 or 4.
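As a cross-check (output not shown): because poly() generates orthogonal polynomials, the sequential F-statistics in the ANOVA table equal the squared t-statistics of the corresponding coefficients in the largest model, so the same p-values can be recovered without calling anova():

# squared t-statistics from the degree-10 fit reproduce the ANOVA F-statistics
(coef(summary(fit.10))[-1, "t value"])^2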

Plot of the resulting polynomial fit

# create a grid of age values at which we want predictions
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit.4, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit) # approximate 95% bands
# plot the data and add the fit from the degree-4 polynomial
par(mfrow = c(1, 1), mar = c(4.5, 4.5, 1, 1), oma = c(0, 0, 2, 0))
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Degree-4 Polynomial", outer = TRUE)
lines(age.grid, preds$fit, lwd = 2, col = "darkblue")
matlines(age.grid, se.bands, lwd = 1, col = "lightblue", lty = 3)

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

table(cut(age,4))
## 
## (17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1] 
##         750        1399         779          72
fit_cut <- lm(wage ~ cut(age,4),data = Wage)
coef(summary(fit_cut))
##                         Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)            94.158392   1.476069 63.789970 0.000000e+00
## cut(age, 4)(33.5,49]   24.053491   1.829431 13.148074 1.982315e-38
## cut(age, 4)(49,64.5]   23.664559   2.067958 11.443444 1.040750e-29
## cut(age, 4)(64.5,80.1]  7.640592   4.987424  1.531972 1.256350e-01

The age < 33.5 category is left out, so the intercept coefficient of $94,160 can be interpreted as the average salary for those under 33.5 years of age, and the other coefficients as the average additional salary for those in the other age groups. We can produce predictions and plots just as we did in the case of the polynomial fit.
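As a quick sanity check of that interpretation (output not shown), the group mean of wage within each bin should equal the intercept plus the corresponding coefficient:

# mean wage within each of the four age bins
tapply(wage, cut(age, 4), mean)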

# seed the random number generator to aid reproducibility
set.seed(18)
# use NAs (not zeros) so the unused first element cannot masquerade as the minimum below
cv.error.15 <- rep(NA, 15)
for (i in 2:15) { # a step function needs at least 2 cuts
  Wage$age.cut <- cut(Wage$age, i) # cut age into i intervals
  lm.fit <- glm(wage ~ age.cut, data = Wage) # fit the step function
  cv.error.15[i] <- cv.glm(Wage, lm.fit, K = 10)$delta[1] # store the 10-fold CV error
}


# the first element of cv.error.15 is NA because the loop started at 2
plot(2:15, cv.error.15[-1], xlab = "Number of cuts", ylab = "CV error", 
     type = "b", pch = 19, lwd = 2, bty = "n")

# horizontal line at the minimum plus one standard deviation (favoring fewer cuts)
abline(h = min(cv.error.15, na.rm = TRUE) + sd(cv.error.15, na.rm = TRUE), lty = "dotted")

# highlight the minimum (which.min ignores the leading NA)
points(x = which.min(cv.error.15), y = min(cv.error.15, na.rm = TRUE), col = "red", pch = "X", cex = 1.5)
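The CV-preferred number of cuts can also be read off programmatically (its exact value depends on the folds drawn for this seed):

# number of cuts with the lowest CV error (the leading NA is ignored)
which.min(cv.error.15)

The fit below uses 4 cuts, matching the table above; substitute the CV-chosen value here if it differs.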

lm.fit <- glm(wage ~ cut(age, 4), data = Wage)
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
lm.pred <- predict(lm.fit, data.frame(age = age.grid), se = TRUE)
plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
lines(age.grid, lm.pred$fit, col = "red", lwd = 2)
matlines(age.grid, cbind( lm.pred$fit + 2* lm.pred$se.fit,
                          lm.pred$fit - 2* lm.pred$se.fit),
         col = "red", lty ="dashed")

Problem 10

This question relates to the College data set.

library(ISLR)
attach(College)
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

# set the starting point for the random number generator to aid reproducibility
set.seed(123)
# randomly sample 70% of the row numbers to split into training and test sets
indices <- sample(nrow(College), 0.70 * nrow(College))
train <- College[indices, ] # 70% of the data
test <- College[-indices, ] # 30% of the data
# fit a forward stepwise model
library(leaps)
regfit.fwd <- regsubsets(Outstate ~., data = train, nvmax = 17, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, nvmax = 17, method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
##           PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 3  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 )  "*"        " "  "*"    " "    " "       " "       " "        
## 8  ( 1 )  "*"        " "  "*"    " "    " "       " "       "*"        
## 9  ( 1 )  "*"        " "  "*"    " "    " "       "*"       "*"        
## 10  ( 1 ) "*"        "*"  "*"    " "    " "       "*"       "*"        
## 11  ( 1 ) "*"        "*"  "*"    " "    " "       "*"       "*"        
## 12  ( 1 ) "*"        "*"  "*"    " "    " "       "*"       "*"        
## 13  ( 1 ) "*"        "*"  "*"    " "    "*"       "*"       "*"        
## 14  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
## 15  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
## 16  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
## 17  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         " "        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 7  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 8  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 9  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 10  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 11  ( 1 ) " "         "*"        " "   " "      "*" "*"      " "      
## 12  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 13  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 14  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 15  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
## 16  ( 1 ) " "         "*"        "*"   "*"      "*" "*"      "*"      
## 17  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         "*"    " "      
## 2  ( 1 )  " "         "*"    " "      
## 3  ( 1 )  " "         "*"    " "      
## 4  ( 1 )  "*"         "*"    " "      
## 5  ( 1 )  "*"         "*"    " "      
## 6  ( 1 )  "*"         "*"    "*"      
## 7  ( 1 )  "*"         "*"    "*"      
## 8  ( 1 )  "*"         "*"    "*"      
## 9  ( 1 )  "*"         "*"    "*"      
## 10  ( 1 ) "*"         "*"    "*"      
## 11  ( 1 ) "*"         "*"    "*"      
## 12  ( 1 ) "*"         "*"    "*"      
## 13  ( 1 ) "*"         "*"    "*"      
## 14  ( 1 ) "*"         "*"    "*"      
## 15  ( 1 ) "*"         "*"    "*"      
## 16  ( 1 ) "*"         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"
# examine these to try to select the best overall model
reg.summary <- summary(regfit.fwd)
names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
# R^2
reg.summary$rsq
##  [1] 0.4341109 0.6054366 0.6774467 0.7123154 0.7313616 0.7401159 0.7429579
##  [8] 0.7508792 0.7540080 0.7563797 0.7581694 0.7598391 0.7611480 0.7622390
## [15] 0.7629536 0.7630786 0.7631066
# which model size maximizes adjusted R^2?
which.max(reg.summary$adjr2)
## [1] 15
# which model sizes minimize RSS, Cp, and BIC?
which.min(reg.summary$rss)
## [1] 17
which.min(reg.summary$cp)
## [1] 14
which.min(reg.summary$bic)
## [1] 9
# plotting RSS, adjusted R^2, Cp, and BIC across model sizes helps us decide which model to select;
# highlight the optimum of each criterion using the which.min()/which.max() results above
par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
points(which.min(reg.summary$rss), min(reg.summary$rss), col = "red", cex = 2, pch = 20)

plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
points(which.max(reg.summary$adjr2), max(reg.summary$adjr2), col = "red", cex = 2, pch = 20)

plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(reg.summary$cp), min(reg.summary$cp), col = "red", cex = 2, pch = 20)

plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), min(reg.summary$bic), col = "red", cex = 2, pch = 20)

par(mfrow=c(1,4), mar=c(0,0,0,0))
plot(regfit.fwd, scale="r2")
plot(regfit.fwd, scale="adjr2")
plot(regfit.fwd, scale="Cp")
plot(regfit.fwd, scale="bic")

BIC favors a 9-variable model, so I will build a model with nine predictors: Private, Apps, Accept, F.Undergrad, Room.Board, Terminal, perc.alumni, Expend, and Grad.Rate.
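The membership of the size-9 model can be read from row 9 of the selection matrix above, or extracted directly (output not shown):

# coefficients of the 9-variable model chosen by forward stepwise selection
coef(regfit.fwd, id = 9)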

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

# load in library
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.16.1
# fit a GAM using the lm() function with natural splines (the df for each term are ad hoc choices)
gam1 <- lm(Outstate ~ Private + ns(Apps, 2) + ns(Accept, 3) + ns(F.Undergrad, 4) + ns(Room.Board, 5) + ns(Terminal, 2) + ns(perc.alumni, 3) + ns(Expend, 4) + ns(Grad.Rate, 5), data = train)

# plot the model
par(mfrow=c(1,3))
plot.Gam(gam1, se=TRUE,col ="#1c9099")

# fit a GAM with the gam() function, using smoothing splines via s()

gam2 <- gam(Outstate ~ Private + s(Apps,2) + s(Accept,3) + s(F.Undergrad,4) + s(Room.Board, 5) + s(Terminal, 2) + s(perc.alumni, 3) + s(Expend, 4) + s(Grad.Rate, 5), data = train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
# plot the model
par(mfrow=c(1,3))
plot(gam2, se=TRUE,col ="#1c9099")

Based on the shapes of the fitted curves, Expend and Grad.Rate show strongly non-linear relationships with Outstate.

# fit a reduced GAM guided by the plots above, keeping smooth terms for Terminal and Expend
gam3 <- gam(Outstate ~ Private + Apps + Accept + F.Undergrad + s(Terminal, 4) + perc.alumni + s(Expend, 4) + Grad.Rate + Room.Board, data = train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
# plot the model
par(mfrow=c(1,3))
plot(gam3, se=TRUE,col ="#1c9099")

# view summary
summary(gam3)
## 
## Call: gam(formula = Outstate ~ Private + Apps + Accept + F.Undergrad + 
##     s(Terminal, 4) + perc.alumni + s(Expend, 4) + Grad.Rate + 
##     Room.Board, data = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -6986.4 -1172.0   138.4  1288.3  7859.6 
## 
## (Dispersion Parameter for gaussian family taken to be 3586630)
## 
##     Null Deviance: 9035173364 on 542 degrees of freedom
## Residual Deviance: 1890153766 on 526.9999 degrees of freedom
## AIC: 9754.076 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 2598966645 2598966645 724.626 < 2.2e-16 ***
## Apps             1 1034656950 1034656950 288.476 < 2.2e-16 ***
## Accept           1  126879256  126879256  35.376 4.956e-09 ***
## F.Undergrad      1  189270680  189270680  52.771 1.356e-12 ***
## s(Terminal, 4)   1  835129066  835129066 232.845 < 2.2e-16 ***
## perc.alumni      1  314186305  314186305  87.599 < 2.2e-16 ***
## s(Expend, 4)     1  677679602  677679602 188.946 < 2.2e-16 ***
## Grad.Rate        1  123612652  123612652  34.465 7.685e-09 ***
## Room.Board       1  150367042  150367042  41.924 2.176e-10 ***
## Residuals      527 1890153766    3586630                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df  Npar F  Pr(F)    
## (Intercept)                              
## Private                                  
## Apps                                     
## Accept                                   
## F.Undergrad                              
## s(Terminal, 4)       3  0.4589 0.7111    
## perc.alumni                              
## s(Expend, 4)         3 30.8764 <2e-16 ***
## Grad.Rate                                
## Room.Board                               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(c) Evaluate the model obtained on the test set, and explain the results obtained.

# We can perform a series of ANOVA tests in order to determine which of these three models is best
# (gam3 has fewer degrees of freedom than gam2, hence the negative Df in the last row of the table)
anova(gam1, gam2, gam3, test = "F")
## Analysis of Variance Table
## 
## Model 1: Outstate ~ Private + ns(Apps, 2) + ns(Accept, 3) + ns(F.Undergrad, 
##     4) + ns(Room.Board, 5) + ns(Terminal, 2) + ns(perc.alumni, 
##     3) + ns(Expend, 4) + ns(Grad.Rate, 5)
## Model 2: Outstate ~ Private + s(Apps, 2) + s(Accept, 3) + s(F.Undergrad, 
##     4) + s(Room.Board, 5) + s(Terminal, 2) + s(perc.alumni, 3) + 
##     s(Expend, 4) + s(Grad.Rate, 5)
## Model 3: Outstate ~ Private + Apps + Accept + F.Undergrad + s(Terminal, 
##     4) + perc.alumni + s(Expend, 4) + Grad.Rate + Room.Board
##   Res.Df        RSS         Df  Sum of Sq         F    Pr(>F)    
## 1    513 1774544966                                              
## 2    513 1765461129  1.452e-04    9083837 18178.209 9.627e-06 ***
## 3    527 1890153766 -1.400e+01 -124692637     2.588  0.001275 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## Loading required package: ggplot2
# We can make predictions from gam objects, just like from lm objects, using the predict() method for class Gam. Here we make predictions on the test set.
preds <- predict(gam3, newdata=test)
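As a cross-check on caret's postResample below (output not shown), the test RMSE for gam3 can also be computed directly from these predictions:

# root-mean-squared prediction error on the held-out test set
sqrt(mean((test$Outstate - preds)^2))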

# RMSE for model1
postResample(predict(gam1, test), test$Outstate)
##         RMSE     Rsquared          MAE 
## 1665.9535207    0.8164515 1291.3729441
# RMSE for model2
postResample(predict(gam2, test), test$Outstate)
##         RMSE     Rsquared          MAE 
## 1724.9172087    0.8034085 1326.5103422
# RMSE for model3
postResample(predict(gam3, test), test$Outstate)
##         RMSE     Rsquared          MAE 
## 1749.6211825    0.7978709 1332.9535628

All three GAMs explain roughly 80% of the variance in out-of-state tuition on the held-out data. The natural-spline model gam1 achieves the lowest test RMSE (about 1666) and the highest R-squared (about 0.816); the smoothing-spline model gam2 is close behind, and the reduced model gam3 is only slightly worse (RMSE about 1750, R-squared about 0.798). The simpler gam3 therefore sacrifices little predictive accuracy.

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam2)
## 
## Call: gam(formula = Outstate ~ Private + s(Apps, 2) + s(Accept, 3) + 
##     s(F.Undergrad, 4) + s(Room.Board, 5) + s(Terminal, 2) + s(perc.alumni, 
##     3) + s(Expend, 4) + s(Grad.Rate, 5), data = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -6949.3 -1139.5   134.8  1217.1  7903.7 
## 
## (Dispersion Parameter for gaussian family taken to be 3441446)
## 
##     Null Deviance: 9035173364 on 542 degrees of freedom
## Residual Deviance: 1765461129 on 512.9999 degrees of freedom
## AIC: 9745.018 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 2584551575 2584551575 751.008 < 2.2e-16 ***
## s(Apps, 2)          1 1065415570 1065415570 309.584 < 2.2e-16 ***
## s(Accept, 3)        1  137788700  137788700  40.038 5.433e-10 ***
## s(F.Undergrad, 4)   1  175425141  175425141  50.974 3.218e-12 ***
## s(Room.Board, 5)    1  666776478  666776478 193.749 < 2.2e-16 ***
## s(Terminal, 2)      1  339237348  339237348  98.574 < 2.2e-16 ***
## s(perc.alumni, 3)   1  294136734  294136734  85.469 < 2.2e-16 ***
## s(Expend, 4)        1  588203384  588203384 170.917 < 2.2e-16 ***
## s(Grad.Rate, 5)     1   65174292   65174292  18.938 1.630e-05 ***
## Residuals         513 1765461129    3441446                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df  Npar F     Pr(F)    
## (Intercept)                                    
## Private                                        
## s(Apps, 2)              1  1.1553 0.2829554    
## s(Accept, 3)            2  7.5925 0.0005632 ***
## s(F.Undergrad, 4)       3  2.7963 0.0396751 *  
## s(Room.Board, 5)        4  2.3583 0.0525731 .  
## s(Terminal, 2)          1  0.6492 0.4207470    
## s(perc.alumni, 3)       2  0.5480 0.5784167    
## s(Expend, 4)            3 30.4854 < 2.2e-16 ***
## s(Grad.Rate, 5)         4  2.3586 0.0525568 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “Anova for Nonparametric Effects” table shows that Expend has a strong non-linear relationship with Outstate (p < 2.2e-16) and Accept a clear one (p ≈ 0.0006), while F.Undergrad, Room.Board, and Grad.Rate show only weak-to-moderate evidence of non-linearity (p ≈ 0.04–0.05). This coincides with the curve shapes seen in (b).