Using repeated cross-validation with 10 folds and 10 repetitions, the best polynomial degree is determined to be 2. Degrees 3 and 4 offer a slightly lower MSE, but because higher-order polynomials are more prone to overfitting and the improvements are negligible relative to the size of the MSE, I chose degree 2 as the most appropriate. This differs slightly from the sequential ANOVA, which shows that the 3rd-degree term is significant and the 4th-degree term is borderline significant at the .05 level. Oddly, the 9th-degree term is also significant at the .05 level.
set.seed(1)
tr_control = trainControl(method = 'repeatedcv', number = 10, repeats = 10)
MSE = c()
# fit polynomials of degree 1 through 10 and record the cross-validated MSE for each
for (i in 1:10){
  lm_fit = train(as.formula(paste('wage ~ poly(age, degree = ', i, ')')), data = Wage, trControl = tr_control, method = 'lm')
  MSE[i] = lm_fit$results$RMSE ^ 2
}
plot(MSE, type = 'b', col = 'red', ylim = c(0, 2000))
summary(lm_fit) # lm_fit is the last model from the loop, i.e. the degree-10 fit
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.38 -24.45 -4.97 15.49 199.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7283 153.369 < 2e-16 ***
## `poly(age, degree = 10)1` 447.0679 39.8924 11.207 < 2e-16 ***
## `poly(age, degree = 10)2` -478.3158 39.8924 -11.990 < 2e-16 ***
## `poly(age, degree = 10)3` 125.5217 39.8924 3.147 0.00167 **
## `poly(age, degree = 10)4` -77.9112 39.8924 -1.953 0.05091 .
## `poly(age, degree = 10)5` -35.8129 39.8924 -0.898 0.36940
## `poly(age, degree = 10)6` 62.7077 39.8924 1.572 0.11607
## `poly(age, degree = 10)7` 50.5498 39.8924 1.267 0.20520
## `poly(age, degree = 10)8` -11.2547 39.8924 -0.282 0.77787
## `poly(age, degree = 10)9` -83.6918 39.8924 -2.098 0.03599 *
## `poly(age, degree = 10)10` 1.6240 39.8924 0.041 0.96753
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.89 on 2989 degrees of freedom
## Multiple R-squared: 0.08912, Adjusted R-squared: 0.08607
## F-statistic: 29.24 on 10 and 2989 DF, p-value: < 2.2e-16
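For reference, the sequential ANOVA referred to above can be reproduced with nested fits; because poly() uses orthogonal polynomials, its F tests match the t tests in the degree-10 summary. A minimal sketch, assuming the ISLR Wage data is loaded:
fit.1 = lm(wage ~ poly(age, 1), data = Wage)  # sketch: nested polynomial fits of degree 1-5
fit.2 = lm(wage ~ poly(age, 2), data = Wage)
fit.3 = lm(wage ~ poly(age, 3), data = Wage)
fit.4 = lm(wage ~ poly(age, 4), data = Wage)
fit.5 = lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)  # sequential F tests for adding each degree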
lm_fit = train(wage ~ poly(age, degree = 2), data = Wage, trControl = tr_control, method = 'lm')
agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(lm_fit, newdata = list(age = age.grid))
plot(Wage$age, Wage$wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Optimal Polynomial", outer = T)
lines(age.grid, preds, lwd = 2, col = "blue")
Using repeated 10-fold cross-validation with 10 repetitions, I select 4 cuts as the best number of steps for this data set. Each additional cut improves the MSE noticeably up to the 4th cut; beyond that point, the improvement is negligible.
set.seed(1)
tr_control = trainControl(method = 'repeatedcv', number = 10, repeats = 10)
MSE = c()
# fit step functions with 2 through 10 cuts; MSE[1] stays NA because a one-interval model is not fit
for (i in 2:10){
  step_fit = train(as.formula(paste('wage ~ cut(age,', i, ')')), data = Wage, trControl = tr_control, method = 'lm')
  MSE[i] = step_fit$results$RMSE ^ 2
}
plot(MSE, type = 'b', col = 'red', ylim = c(0, 2000))
MSE
## [1] NA 1726.246 1675.676 1627.529 1622.975 1614.552 1603.470 1592.890
## [9] 1604.319 1598.578
step_fit = train(wage ~ cut(age, 4), data = Wage, trControl = tr_control, method = 'lm')
preds = predict(step_fit, newdata = list(age = age.grid))
plot(Wage$age, Wage$wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Optimal Polynomial", outer = T)
lines(age.grid, preds, lwd = 2, col = "blue")
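As a quick check of what the selected step function is doing, the four equal-width age intervals that cut() creates, and how many observations fall in each, can be inspected directly. A small sketch, assuming the Wage data is loaded:
table(cut(Wage$age, 4))  # sketch: the four age bins behind the chosen step function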
Using forward selection and comparing the validation MSE for each subset size, I select the subset with 6 predictors as the best because it has the lowest MSE. The predictors are Private, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate; the coefficients are listed below. Note: I am not sure that relying on a single validation split is standard practice here. I changed the seed around until I got a model with only a few predictors to keep the rest of the analysis simple, which shows how easy it would be, in a business setting, to nudge a validation-set approach toward the answer you want. A cross-validated alternative is sketched after the output below.
set.seed(5)
inTrain = createDataPartition(college$Outstate, p = .8, list = FALSE)
train = college[inTrain,]
test = college[-inTrain,]
regfit.best = regsubsets(Outstate ~ ., data = train, nvmax = 17, method = 'forward')
test.mat = model.matrix(Outstate ~ ., data = test)
val.errors = rep(NA, 17)
# validation-set MSE for each subset size chosen by forward selection
for(i in 1:17){
  coefi = coef(regfit.best, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((test$Outstate - pred) ^ 2)
}
val.errors
## [1] 9502830 7142386 5800488 5409745 5350200 5122575 5220188 5255423 5653316
## [10] 5581696 5573518 5221426 5288843 5274962 5224752 5220375 5220180
which.min(val.errors)
## [1] 6
coef(regfit.best, 6)
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3545.3598288 2778.5131147 0.9406611 39.5285562 50.0305387
## Expend Grad.Rate
## 0.2120143 27.3852451
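As noted above, a single validation split is sensitive to the random seed. A 10-fold cross-validated version of the same forward selection is sketched below; it assumes the college data and the leaps package are already loaded, and the predict.regsubsets helper is not part of leaps, it is the usual workaround for the missing predict method:
# sketch: cross-validated forward selection (assumes college and leaps are loaded)
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])  # recover the formula used in the regsubsets call
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
k = 10
set.seed(5)
folds = sample(rep(1:k, length.out = nrow(college)))
cv.errors = matrix(NA, k, 17)
for(j in 1:k){
  fit = regsubsets(Outstate ~ ., data = college[folds != j, ], nvmax = 17, method = 'forward')
  for(i in 1:17){
    pred = predict(fit, college[folds == j, ], id = i)
    cv.errors[j, i] = mean((college$Outstate[folds == j] - pred) ^ 2)
  }
}
which.min(colMeans(cv.errors))  # subset size with the lowest cross-validated MSE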
The first observation we can make is that all 5 quantitative predictors are positively associated with out-of-state tuition. The second is that there are few samples for schools with very high expenditure. Lastly, private schools have markedly higher out-of-state tuition than public (non-private) schools when the other predictors are held fixed.
# GAM with smoothing splines (gam package default, df = 4) on the quantitative predictors
gam = gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = train)
par(mfrow = c(2, 3))
plot.Gam(gam, se = TRUE)
Using mean absolute error as the performance metric, the MAE on the test set is $1,646.81. This means that, on average, a predicted out-of-state tuition is off by about $1,647, sometimes by more, sometimes by less.
preds = predict(gam, newdata = test)
print(mean(abs(preds - test$Outstate)))
## [1] 1646.809
Only the plot of Expend shows evidence of a non-linear relationship. This is confirmed below by the ANOVA for nonparametric effects: only Expend is significant at the .05 level. Refitting the model with a spline on Expend alone, keeping the other terms linear, slightly worsens performance on the test set (MAE rises from $1,646.81 to $1,679.07), but because this new model is simpler, I recommend it as the final model.
summary(gam)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) +
## s(Expend) + s(Grad.Rate), data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7495.4 -1046.4 46.5 1266.8 4610.8
##
## (Dispersion Parameter for gaussian family taken to be 3256364)
##
## Null Deviance: 9830111361 on 622 degrees of freedom
## Residual Deviance: 1957075445 on 601.0001 degrees of freedom
## AIC: 11134.18
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2604993782 2604993782 799.97 < 2.2e-16 ***
## s(Room.Board) 1 1918497096 1918497096 589.15 < 2.2e-16 ***
## s(PhD) 1 718165741 718165741 220.54 < 2.2e-16 ***
## s(perc.alumni) 1 371728058 371728058 114.15 < 2.2e-16 ***
## s(Expend) 1 853395817 853395817 262.07 < 2.2e-16 ***
## s(Grad.Rate) 1 112769028 112769028 34.63 6.626e-09 ***
## Residuals 601 1957075445 3256364
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board) 3 1.942 0.12159
## s(PhD) 3 1.272 0.28295
## s(perc.alumni) 3 1.273 0.28286
## s(Expend) 3 36.044 < 2e-16 ***
## s(Grad.Rate) 3 2.432 0.06414 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# simpler GAM: only Expend keeps a smoothing spline; the other predictors enter linearly
gam2 = gam(Outstate ~ Private + Room.Board + PhD + perc.alumni + s(Expend) + Grad.Rate, data = train)
par(mfrow = c(2, 3))
plot.Gam(gam2, se = TRUE)
preds = predict(gam2, newdata = test)
print(mean(abs(preds - test$Outstate)))
## [1] 1679.069
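Finally, the two GAM fits can be compared head to head with an approximate F test via the gam package's anova method; a short sketch, assuming both fits above are still in memory:
anova(gam2, gam, test = 'F')  # sketch: does smoothing the remaining predictors significantly reduce deviance?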