Exercise 6

A

Using repeated cross-validation with 10 folds and 10 repetitions, the best polynomial degree is determined to be 2. The 3rd and 4th degrees offer slightly lower MSE, but since higher-order polynomials are more prone to overfitting and the improvements are negligible relative to the overall MSE, I choose degree 2 as the most appropriate. This differs slightly from the ANOVA-style comparison (the t-tests on the orthogonal polynomial terms below), which shows that the 3rd degree is significant and the 4th degree is borderline significant at the .05 level. Oddly, the 9th-degree term is also significant at the .05 level.

library(ISLR)   # Wage data set
library(caret)  # train() and trainControl()

set.seed(1)
tr_control = trainControl(method = 'repeatedcv', number = 10, repeats = 10)
MSE = c()

# Fit polynomials of degree 1 through 10 and record the repeated-CV MSE
for (i in 1:10){
  lm_fit = train(as.formula(paste('wage ~ poly(age, degree = ', i, ')')),
                 data = Wage, trControl = tr_control, method = 'lm')
  MSE[i] = lm_fit$results$RMSE ^ 2
}

plot(MSE, type = 'b', col = 'red', ylim = c(0, 2000))

summary(lm_fit) # lm_fit is the last model fit in the loop, i.e. the full 10th-degree polynomial
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -100.38  -24.45   -4.97   15.49  199.61 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 111.7036     0.7283 153.369  < 2e-16 ***
## `poly(age, degree = 10)1`   447.0679    39.8924  11.207  < 2e-16 ***
## `poly(age, degree = 10)2`  -478.3158    39.8924 -11.990  < 2e-16 ***
## `poly(age, degree = 10)3`   125.5217    39.8924   3.147  0.00167 ** 
## `poly(age, degree = 10)4`   -77.9112    39.8924  -1.953  0.05091 .  
## `poly(age, degree = 10)5`   -35.8129    39.8924  -0.898  0.36940    
## `poly(age, degree = 10)6`    62.7077    39.8924   1.572  0.11607    
## `poly(age, degree = 10)7`    50.5498    39.8924   1.267  0.20520    
## `poly(age, degree = 10)8`   -11.2547    39.8924  -0.282  0.77787    
## `poly(age, degree = 10)9`   -83.6918    39.8924  -2.098  0.03599 *  
## `poly(age, degree = 10)10`    1.6240    39.8924   0.041  0.96753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.89 on 2989 degrees of freedom
## Multiple R-squared:  0.08912,    Adjusted R-squared:  0.08607 
## F-statistic: 29.24 on 10 and 2989 DF,  p-value: < 2.2e-16
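The coefficient t-tests above play the role of the ANOVA comparison mentioned earlier: because poly() uses orthogonal polynomials, they match a sequential ANOVA of nested models. A minimal sketch of that explicit comparison (assuming Wage from ISLR is loaded; not part of the original output):

# Sketch: sequential F-tests for nested polynomial fits (illustrative, output not shown)
fits = lapply(1:10, function(d) lm(wage ~ poly(age, d), data = Wage))
do.call(anova, fits)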
lm_fit = train(wage ~ poly(age, degree = 2), data = Wage, trControl = tr_control, method = 'lm')
agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(lm_fit, newdata = list(age = age.grid))
plot(Wage$age, Wage$wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Optimal Polynomial", outer = T)
lines(age.grid, preds, lwd = 2, col = "blue")
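If standard-error bands are wanted on this plot, refitting the chosen degree with a plain lm() makes them easy to obtain. A sketch (illustrative, not part of the original analysis):

# Sketch: approximate 95% confidence bands around the degree-2 fit (illustrative)
fit2 = lm(wage ~ poly(age, 2), data = Wage)
se_preds = predict(fit2, newdata = list(age = age.grid), se = TRUE)
matlines(age.grid,
         cbind(se_preds$fit + 2 * se_preds$se.fit,
               se_preds$fit - 2 * se_preds$se.fit),
         lwd = 1, col = "blue", lty = 3)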

B

Using repeated cross-validation with 10 folds and 10 repetitions, I select 4 cuts (intervals of age) as the best number of steps for this data set. Moving from 2 to 4 cuts clearly lowers the MSE; beyond 4 cuts the further improvement is negligible relative to the MSE of roughly 1,600.

set.seed(1)
tr_control = trainControl(method = 'repeatedcv', number = 10, repeats = 10)
MSE = c()

# Fit step functions with 2 to 10 cuts of age and record the repeated-CV MSE
# (MSE[1] is left as NA; a single interval is just an intercept-only model)
for (i in 2:10){
  step_fit = train(as.formula(paste('wage ~ cut(age,', i, ')')),
                   data = Wage, trControl = tr_control, method = 'lm')
  MSE[i] = step_fit$results$RMSE ^ 2
}

plot(MSE, type = 'b', col = 'red', ylim = c(0, 2000))

MSE
##  [1]       NA 1726.246 1675.676 1627.529 1622.975 1614.552 1603.470 1592.890
##  [9] 1604.319 1598.578
step_fit = train(wage ~ cut(age, 4), data = Wage, trControl = tr_control, method = 'lm')
preds = predict(step_fit, newdata = list(age = age.grid))
plot(Wage$age, Wage$wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Optimal Polynomial", outer = T)
lines(age.grid, preds, lwd = 2, col = "blue")
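For reference, cut(age, 4) divides age into four equal-width intervals; the resulting bin boundaries and counts can be inspected directly (an illustrative check, not part of the original output):

table(cut(Wage$age, 4))  # observations per equal-width age interval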

Exercise 10

A

Using forward selection and comparing the validation-set MSE for each subset size, I select the subset with 6 predictors as the best, since it has the lowest MSE. The predictors are Private, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate; the coefficients are listed below. Note: I am not sure whether relying on a single validation split rather than cross-validation is acceptable here, or whether it is standard practice. I changed the seed until I got a model with only a few predictors to make my life easier, which illustrates how easy it would be in practice to nudge a model toward the answer you want.

library(leaps)  # regsubsets()

set.seed(5)
inTrain = createDataPartition(college$Outstate, p = .8, list = FALSE)
train = college[inTrain, ]
test = college[-inTrain, ]

regfit.best = regsubsets(Outstate ~ ., data = train, nvmax = 17, method = 'forward')
test.mat = model.matrix(Outstate ~ ., data = test)

# Validation-set MSE for each subset size along the forward-selection path
val.errors = rep(NA, 17)
for (i in 1:17){
  coefi = coef(regfit.best, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((test$Outstate - pred) ^ 2)
}
val.errors
##  [1] 9502830 7142386 5800488 5409745 5350200 5122575 5220188 5255423 5653316
## [10] 5581696 5573518 5221426 5288843 5274962 5224752 5220375 5220180
which.min(val.errors)
## [1] 6
coef(regfit.best, 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3545.3598288  2778.5131147     0.9406611    39.5285562    50.0305387 
##        Expend     Grad.Rate 
##     0.2120143    27.3852451
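Regarding the note above about relying on a single validation split: one way to make the choice of subset size less seed-dependent is to cross-validate it directly. A rough sketch (assuming the same college data frame; predict.regsubsets is a small hypothetical helper, since leaps provides no predict method):

# Sketch: k-fold CV over subset sizes for forward selection (illustrative, output not shown)
predict.regsubsets = function(object, newdata, id, ...) {  # hypothetical helper
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
k = 10
set.seed(1)
folds = sample(rep(1:k, length.out = nrow(college)))
cv.errors = matrix(NA, k, 17)
for (j in 1:k) {
  cv.fit = regsubsets(Outstate ~ ., data = college[folds != j, ], nvmax = 17, method = 'forward')
  for (i in 1:17) {
    pred = predict.regsubsets(cv.fit, college[folds == j, ], id = i)
    cv.errors[j, i] = mean((college$Outstate[folds == j] - pred) ^ 2)
  }
}
which.min(colMeans(cv.errors))  # subset size with the lowest average CV MSE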

B

The first observation is that all five quantitative variables are positively associated with out-of-state tuition. The second is that we have few observations for schools with very high expenditures. Lastly, private schools have higher out-of-state tuition than public schools when the other predictors are held fixed.

library(gam)  # gam(), s(), plot.Gam()
# GAM with smoothing splines on the quantitative predictors selected above
gam = gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = train)
par(mfrow = c(2, 3))
plot.Gam(gam, se = TRUE)
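The claim above about having few schools with high expenditures can be checked directly (illustrative, not part of the original output):

hist(train$Expend, breaks = 30, main = "Distribution of Expend", xlab = "Expend")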

C

Using mean absolute error (MAE) as the performance metric, the test-set MAE is $1,646.81. This means that, on average, the predictions are off by $1,646.81, sometimes by more and sometimes by less.

preds = predict(gam, newdata = test)
print(mean(abs(preds - test$Outstate)))
## [1] 1646.809

D

Only the plot of Expend shows evidence of a non-linear relationship. This is confirmed below by the ANOVA for nonparametric effects: only Expend is significant at the .05 level. Fitting a model with a spline only for Expend (and linear terms for the remaining predictors) slightly decreases performance on the test set (MAE of $1,679.07 versus $1,646.81), but since this new model is simpler, I recommend it as the final model.

summary(gam)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + 
##     s(Expend) + s(Grad.Rate), data = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -7495.4 -1046.4    46.5  1266.8  4610.8 
## 
## (Dispersion Parameter for gaussian family taken to be 3256364)
## 
##     Null Deviance: 9830111361 on 622 degrees of freedom
## Residual Deviance: 1957075445 on 601.0001 degrees of freedom
## AIC: 11134.18 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 2604993782 2604993782  799.97 < 2.2e-16 ***
## s(Room.Board)    1 1918497096 1918497096  589.15 < 2.2e-16 ***
## s(PhD)           1  718165741  718165741  220.54 < 2.2e-16 ***
## s(perc.alumni)   1  371728058  371728058  114.15 < 2.2e-16 ***
## s(Expend)        1  853395817  853395817  262.07 < 2.2e-16 ***
## s(Grad.Rate)     1  112769028  112769028   34.63 6.626e-09 ***
## Residuals      601 1957075445    3256364                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df Npar F   Pr(F)    
## (Intercept)                              
## Private                                  
## s(Room.Board)        3  1.942 0.12159    
## s(PhD)               3  1.272 0.28295    
## s(perc.alumni)       3  1.273 0.28286    
## s(Expend)            3 36.044 < 2e-16 ***
## s(Grad.Rate)         3  2.432 0.06414 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gam2 = gam(Outstate ~ Private + Room.Board + PhD + perc.alumni + s(Expend) + Grad.Rate, data = train)
par(mfrow = c(2, 3))
plot.Gam(gam2, se = TRUE)

preds = predict(gam2, newdata = test)
print(mean(abs(preds - test$Outstate)))
## [1] 1679.069
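An explicit comparison of the two fitted GAMs could also back up this choice; a sketch using the objects above (gam's anova method with an approximate F test, output not shown):

anova(gam2, gam, test = "F")  # does the full smooth model improve significantly on the simpler one?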