Question 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
set.seed(1)
plot(wage ~ age, data = Wage, col = "blue")
agerange = range(Wage$age)
agegrid = seq(from = agerange[1], to = agerange[2])
fitlm = lm(wage ~ poly(age, 3), data = Wage)
predic = predict(fitlm, newdata = list(age = agegrid))
lines(agegrid, predic, col = "Black", lwd = 2)

(b) Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.

Question 10

  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
train = sample(1: nrow(College), nrow(College)/2)
test = -train
fit = regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
fit.summary = summary(fit)
fit.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 ) "*"        " "  " "    " "    "*"       " "       " "        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 6  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
## 7  ( 1 ) " "         "*"        " "   "*"      " " "*"      " "      
## 8  ( 1 ) " "         "*"        " "   "*"      " " "*"      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         " "    " "      
## 2  ( 1 ) "*"         " "    " "      
## 3  ( 1 ) "*"         "*"    " "      
## 4  ( 1 ) "*"         "*"    " "      
## 5  ( 1 ) "*"         "*"    "*"      
## 6  ( 1 ) "*"         "*"    "*"      
## 7  ( 1 ) "*"         "*"    "*"      
## 8  ( 1 ) "*"         "*"    "*"
coef(fit, id = 6)
##   (Intercept)    PrivateYes    Room.Board      Terminal   perc.alumni 
## -4726.8810613  2717.7019276     1.1032433    36.9990286    59.0863753 
##        Expend     Grad.Rate 
##     0.1930814    33.8303314
  1. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
gamm = gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
par(mfrow = c(2,3))
plot(gamm, se = TRUE, col = 'Black')

Based off the graphs, Room/Board costs, alumni percent and out of state tution increase. Expend and Graduation rates are nonlinear with relation to the out of state tution.

  1. Evaluate the model obtained on the test set, and explain the results obtained.
prediction = predict(gamm,newdata = College)
mse = mean((College$Outstate - prediction)^2)
mse
## [1] 3385754
preds = predict(gamm, College[test, ])
RSS = sum((College[test, ]$Outstate - preds)^2) 
TSS = sum((College[test, ]$Outstate - mean(College[test, ]$Outstate)) ^ 2)
1 - (RSS / TSS)
## [1] 0.7652114

The mean squared error is 3412013 and the R-squared value of 0.616 means that this model has a moderate strength based off the data.

  1. For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gamm)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 
##     5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), 
##     data = College, subset = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7270.32 -1115.76   -58.54  1231.45  7013.47 
## 
## (Dispersion Parameter for gaussian family taken to be 3666101)
## 
##     Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1323460787 on 360.9995 degrees of freedom
## AIC: 6993.591 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 1780121359 1780121359 485.562 < 2.2e-16 ***
## s(Room.Board, 5)    1 1635038271 1635038271 445.988 < 2.2e-16 ***
## s(Terminal, 5)      1  281113708  281113708  76.679 < 2.2e-16 ***
## s(perc.alumni, 5)   1  351003562  351003562  95.743 < 2.2e-16 ***
## s(Expend, 5)        1  607193060  607193060 165.624 < 2.2e-16 ***
## s(Grad.Rate, 5)     1   89036829   89036829  24.287 1.267e-06 ***
## Residuals         361 1323460787    3666101                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df  Npar F     Pr(F)    
## (Intercept)                                    
## Private                                        
## s(Room.Board, 5)        4  2.0626    0.0852 .  
## s(Terminal, 5)          4  1.5754    0.1803    
## s(perc.alumni, 5)       4  0.4126    0.7996    
## s(Expend, 5)            4 21.2640 8.882e-16 ***
## s(Grad.Rate, 5)         4  0.8399    0.5006    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Out of State tution and Expend has a strong nonlinear relationship. Along with Out of State tution, Room/ Board also has a strong nonlinear relationship with Expend.