Exercise 6 (a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR)  # Wage and College data sets
library(boot)  # cv.glm()
attach(Wage)
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...
set.seed(1)
delta = rep(NA,10)
for (i in 1:10){
  fit = glm(wage ~ poly(age, i), data = Wage)
  delta[i] = cv.glm(Wage, fit, K=10)$delta[1]
}
plot(1:10, delta, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min = which.min(delta)
points(d.min, delta[d.min], col = "red", cex = 2, pch = 20)

From the plot above, we can observe that the degree-9 polynomial has the lowest estimated test error, so cross-validation chooses d = 9.
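Although degree 9 minimizes the CV error, the errors beyond degree 3 or 4 are typically very close to one another, so a lower-degree fit may do nearly as well. Tabulating the CV errors (a quick check on the delta vector computed above) makes this easy to see:

# CV estimate of test MSE for each polynomial degree (uses `delta` from the loop above)
data.frame(degree = 1:10, cv.error = round(delta, 1))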

set.seed(1)
fit.1 = lm(wage~poly(age,1), data = Wage)
fit.2 = lm(wage~poly(age,2), data = Wage)
fit.3 = lm(wage~poly(age,3), data = Wage)
fit.4 = lm(wage~poly(age,4), data = Wage)
fit.5 = lm(wage~poly(age,5), data = Wage)
fit.6 = lm(wage~poly(age,6), data = Wage)
fit.7 = lm(wage~poly(age,7), data = Wage)
fit.8 = lm(wage~poly(age,8), data = Wage)
fit.9 = lm(wage~poly(age,9), data = Wage)
fit.10 = lm(wage~poly(age,10), data = Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5,fit.6,fit.7,fit.8,fit.9,fit.10)
## Analysis of Variance Table
## 
## Model  1: wage ~ poly(age, 1)
## Model  2: wage ~ poly(age, 2)
## Model  3: wage ~ poly(age, 3)
## Model  4: wage ~ poly(age, 4)
## Model  5: wage ~ poly(age, 5)
## Model  6: wage ~ poly(age, 6)
## Model  7: wage ~ poly(age, 7)
## Model  8: wage ~ poly(age, 8)
## Model  9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
##    Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    2998 5022216                                    
## 2    2997 4793430  1    228786 143.7638 < 2.2e-16 ***
## 3    2996 4777674  1     15756   9.9005  0.001669 ** 
## 4    2995 4771604  1      6070   3.8143  0.050909 .  
## 5    2994 4770322  1      1283   0.8059  0.369398    
## 6    2993 4766389  1      3932   2.4709  0.116074    
## 7    2992 4763834  1      2555   1.6057  0.205199    
## 8    2991 4763707  1       127   0.0796  0.777865    
## 9    2990 4756703  1      7004   4.4014  0.035994 *  
## 10   2989 4756701  1         3   0.0017  0.967529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA results, the quadratic, cubic and degree-9 terms have significant p-values, while the degree-4 term is borderline (p ≈ 0.05). A polynomial of degree 2, 3 or 9 could therefore provide a reasonable fit to the data.
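Because poly() produces orthogonal polynomials, the same conclusions can be read from the coefficient table of the largest fit: the square of each t-statistic equals the corresponding sequential F-statistic in the table above. A quick check:

# t-statistics from the degree-10 fit; squaring them reproduces the ANOVA F-values
coef(summary(fit.10))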

Plotting the data with the degree-3 polynomial fit, we get the following graph:

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims = range(Wage$age)
age.grid  = seq(from = agelims[1], to = agelims[2])
fit = lm(wage ~ poly(age,3), data  = Wage)
preds = predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

Similarly, plotting the degree-9 polynomial fit, we get the following graph:

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
fit = lm(wage ~ poly(age, 9), data = Wage)
preds = predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
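Approximate 95% pointwise confidence bands can be added to either plot; a minimal sketch reusing age.grid and the degree-9 fit from above:

# Standard errors of the fitted curve, drawn as dashed bands around the fit
preds.se = predict(fit, newdata = list(age = age.grid), se.fit = TRUE)
matlines(age.grid,
         cbind(preds.se$fit + 2 * preds.se$se.fit, preds.se$fit - 2 * preds.se$se.fit),
         col = "red", lty = 3)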

(b) Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(1)
delta = rep(NA,10)  # NA (not 0) so which.min() below ignores the unused first element

for (i in 2:10){
  Wage$age.cut = cut(Wage$age, i)
  fit = glm(wage ~ age.cut, data  = Wage)
  delta[i] = cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(2:10, delta[-1], xlab = "cuts", ylab = "CV error", type = "l")
d.min  = which.min(delta)
points(d.min, delta[d.min], col = "red", pch = 20)

The plot shows that the CV error is minimized at 8 cuts, so 8 is the optimal number of cuts.
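The CV errors by number of cuts can also be tabulated directly (the first element of delta is unused because the loop starts at 2):

# CV error for each number of cuts
data.frame(cuts = 2:10, cv.error = round(delta[-1], 1))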

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
fit = lm(wage ~ cut(age, 8), data = Wage)
preds = predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

Exercise 10

This question relates to the College data set.

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

library(caret)  # createDataPartition(), trainControl(), train()
library(leaps)  # backend for method = "leapForward"
library(gam)    # gam() with smoothing splines
detach(Wage)
attach(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
set.seed(1)
train.index = createDataPartition(Outstate, p = 0.8, list = FALSE, times = 1)
College.train = College[train.index,]
College.test = College[-train.index,]
ctrl = trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 1, 
                     selectionFunction = "oneSE")

set.seed(1)

model.forward = train(Outstate ~ .,
                       data = College.train,
                       method = "leapForward",
                       maximize = F,
                       trControl = ctrl,
                       tuneGrid = data.frame(nvmax = 1:17))

model.forward
## Linear Regression with Forward Selection 
## 
## 623 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 560, 560, 562, 560, 561, 561, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    1     3169.576  0.4229785  2473.480
##    2     2673.444  0.5769138  2027.409
##    3     2315.210  0.6760545  1813.115
##    4     2119.253  0.7250021  1666.489
##    5     2051.198  0.7413953  1615.754
##    6     2011.108  0.7507054  1583.030
##    7     2032.862  0.7457135  1601.087
##    8     2054.518  0.7404374  1611.457
##    9     2054.324  0.7402399  1614.336
##   10     2033.989  0.7455431  1608.504
##   11     2036.540  0.7447230  1608.870
##   12     2033.428  0.7453446  1598.234
##   13     2022.145  0.7492188  1592.980
##   14     1999.300  0.7544045  1572.820
##   15     1996.559  0.7547738  1575.071
##   16     1994.824  0.7550069  1575.145
##   17     1995.677  0.7548594  1575.779
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 6.

Forward stepwise selection, combined with the one-standard-error rule, chooses nvmax = 6. The resampled RMSE continues to decrease slightly up to nvmax = 16, but the 6-variable model is within one standard error of that minimum, so the simpler model is selected.
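The tuning profile and the selected value can be inspected directly on the caret object (a quick check):

# Resampled RMSE versus nvmax; bestTune holds the value picked by the one-SE rule
plot(model.forward)
model.forward$bestTune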

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

coef(model.forward$finalModel, id = 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3748.7268770  2885.1884620     0.9448039    41.0276564    51.9906135 
##        Expend     Grad.Rate 
##     0.1997015    28.4932592

The predictors selected by forward stepwise selection are Private, Room.Board, PhD, perc.alumni, Expend and Grad.Rate.

gam.model = gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = College.train )

Smoothing splines are fit for all the predictors except Private, which is qualitative.

par(mfrow = c(2,3))
plot(gam.model, se = T, col = 'red')

There is clear non-linearity in Expend, and some non-linearity in Grad.Rate as well; for Room.Board, PhD and perc.alumni a linear relationship appears adequate.
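One way to check this (a sketch, not part of the fit above) is to refit the GAM with the apparently linear terms entered linearly and compare the two models with an F-test:

# Alternative GAM: linear terms for Room.Board, PhD and perc.alumni
gam.linear = gam(Outstate ~ Private + Room.Board + PhD + perc.alumni +
                   s(Expend) + s(Grad.Rate), data = College.train)
anova(gam.linear, gam.model, test = "F")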

(c) Evaluate the model obtained on the test set, and explain the results obtained.

Test MSE for the GAM model:

mean((predict(gam.model, newdata = College.test) - College.test$Outstate)^2)
## [1] 3816401

Test MSE for the model chosen by forward stepwise selection:

mean((predict(model.forward, newdata = College.test) - College.test$Outstate)^2)
## [1] 4960797

The GAM has a lower test MSE than the linear model chosen by forward stepwise selection.
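The same predictions can also be summarized as a test-set R-squared (a short sketch):

# Fraction of test-set variance in Outstate explained by the GAM
gam.pred = predict(gam.model, newdata = College.test)
1 - sum((College.test$Outstate - gam.pred)^2) /
    sum((College.test$Outstate - mean(College.test$Outstate))^2)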

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.model)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + 
##     s(Expend) + s(Grad.Rate), data = College.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7238.47 -1185.55    85.31  1280.35  4604.05 
## 
## (Dispersion Parameter for gaussian family taken to be 3411970)
## 
##     Null Deviance: 9778743994 on 622 degrees of freedom
## Residual Deviance: 2050592947 on 600.9997 degrees of freedom
## AIC: 11163.26 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 2634772462 2634772462  772.21 < 2.2e-16 ***
## s(Room.Board)    1 1955517765 1955517765  573.13 < 2.2e-16 ***
## s(PhD)           1  756514727  756514727  221.72 < 2.2e-16 ***
## s(perc.alumni)   1  386996490  386996490  113.42 < 2.2e-16 ***
## s(Expend)        1  681419771  681419771  199.71 < 2.2e-16 ***
## s(Grad.Rate)     1  103860089  103860089   30.44 5.123e-08 ***
## Residuals      601 2050592947    3411970                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df  Npar F   Pr(F)    
## (Intercept)                               
## Private                                   
## s(Room.Board)        3  1.9582 0.11910    
## s(PhD)               3  1.3924 0.24407    
## s(perc.alumni)       3  1.1768 0.31779    
## s(Expend)            3 27.3661 < 2e-16 ***
## s(Grad.Rate)         3  2.3871 0.06803 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, the ANOVA for parametric effects is significant for every predictor, so each predictor has at least a linear relationship with the response. In the ANOVA for non-parametric effects, only Expend has a significant p-value (Grad.Rate is marginal at p ≈ 0.07), which is evidence of a non-linear relationship between Expend and Outstate. This matches the plots shown above.