In this exercise, you will further analyze the Wage data set considered throughout this chapter.
**(A) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree *d* for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.**
data("Wage")
set.seed(532)
poly.mse=c()
for(degree in 1:7){
poly.fit=glm(wage~poly(age,degree,raw=TRUE),data=Wage)
mse=cv.glm(poly.fit,data = Wage,K=10)$delta[1]
poly.mse=c(poly.mse,mse)
}
plot(poly.mse, main = 'Wage Dataset: Polynomial Regression With Cross-Validation',
     xlab = 'Degree of Polynomial', ylab = 'Cross-Validation Error', type = 'b')
x=which.min(poly.mse)
points(x,poly.mse[x],pch=20,cex=2,col='red')
set.seed(532)
# Fit nested polynomial models of degree 1-7 and compare them with sequential F-tests
anova.fit1=lm(wage~poly(age,1,raw=TRUE),data=Wage)
anova.fit2=lm(wage~poly(age,2,raw=TRUE),data=Wage)
anova.fit3=lm(wage~poly(age,3,raw=TRUE),data=Wage)
anova.fit4=lm(wage~poly(age,4,raw=TRUE),data=Wage)
anova.fit5=lm(wage~poly(age,5,raw=TRUE),data=Wage)
anova.fit6=lm(wage~poly(age,6,raw=TRUE),data=Wage)
anova.fit7=lm(wage~poly(age,7,raw=TRUE),data=Wage)
anova(anova.fit1,anova.fit2,anova.fit3,anova.fit4,anova.fit5,anova.fit6,anova.fit7)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1, raw = TRUE)
## Model 2: wage ~ poly(age, 2, raw = TRUE)
## Model 3: wage ~ poly(age, 3, raw = TRUE)
## Model 4: wage ~ poly(age, 4, raw = TRUE)
## Model 5: wage ~ poly(age, 5, raw = TRUE)
## Model 6: wage ~ poly(age, 6, raw = TRUE)
## Model 7: wage ~ poly(age, 7, raw = TRUE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.6926 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8956 0.001673 **
## 4 2995 4771604 1 6070 3.8125 0.050966 .
## 5 2994 4770322 1 1283 0.8055 0.369516
## 6 2993 4766389 1 3932 2.4697 0.116165
## 7 2992 4763834 1 2555 1.6049 0.205311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The cross-validation plot identifies the sixth-degree polynomial, highlighted in red, as having the lowest estimated test error. This does not fully concur with the analysis-of-variance results: the sequential F-tests indicate that a cubic or quartic polynomial provides a reasonable fit to the data, while lower-order models are insufficient and higher-order terms (degrees 5 through 7) are not justified.
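As an aside, because these models are nested, the same sequential F-tests can be recovered from a single fit with orthogonal polynomials (the default when poly() is called without raw = TRUE): the square of each coefficient's t-statistic equals the corresponding ANOVA F-statistic. A minimal sketch:

# Single orthogonal-polynomial fit; each coefficient's t^2 equals the
# corresponding sequential ANOVA F-statistic
coef(summary(lm(wage ~ poly(age, 7), data = Wage)))

The sixth-degree fit selected by cross-validation is plotted below.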
library(ggplot2)
ggplot(Wage, aes(age, wage)) +
  geom_point(color = "green") +
  stat_smooth(method = "lm", formula = y ~ poly(x, 6), size = 1)
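For reference, the same fit can also be drawn in base R with approximate 95% confidence bands; a sketch using the degree-6 raw-polynomial fit from above:

# Predict over a grid of ages and add +/- 2 SE bands around the fit
age.grid <- seq(min(Wage$age), max(Wage$age))
pred <- predict(lm(wage ~ poly(age, 6, raw = TRUE), data = Wage),
                newdata = data.frame(age = age.grid), se.fit = TRUE)
plot(Wage$age, Wage$wage, col = "darkgrey", xlab = "age", ylab = "wage")
lines(age.grid, pred$fit, lwd = 2, col = "blue")
matlines(age.grid, cbind(pred$fit - 2 * pred$se.fit,
                         pred$fit + 2 * pred$se.fit),
         lty = 2, col = "blue")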
**(B) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.**
set.seed(532)
library(caret)  # train() and trainControl() for cross-validation
ctrl <- trainControl(method = "cv", number = 10)
CV_RMSE <- c()
for (i in 2:20) {
  # Cut age into i intervals and fit a piecewise-constant (step) model
  model_temp <- train(y = Wage$wage,
                      x = data.frame(cut(Wage$age, i)),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i - 1] <- model_temp$results$RMSE
}
library(dplyr)  # %>% and mutate()
data.frame(cuts = 2:20, CV_RMSE = CV_RMSE) %>%
mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>%
ggplot(aes(x = cuts, y = CV_RMSE)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
scale_x_continuous(breaks = seq(2, 20), minor_breaks = NULL) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "Wage Dataset: Step Function",
x = "Intervals",
y = "CV RMSE")
To fit the step function, age was cut into between 2 and 20 intervals, and ten-fold cross-validation was used to estimate the test RMSE for each choice. The optimal number of intervals is 8, so wage is now predicted with an eight-interval step function of age.
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ cut(x, 8)) +
  labs(title = "Wage Dataset: Step Function (Eight Intervals)")
This question relates to the College data set.
**(A) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.**
library(rsample)  # initial_split()
library(leaps)    # backend for caret's "leapForward" method
data(College)
set.seed(234)
College.split <- initial_split(College, strata = "Outstate", prop = .7)
College.train <- training(College.split)
College.test <- testing(College.split)
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 1,
                     selectionFunction = "oneSE")
model.fwd <- train(Outstate ~ .,
                   data = College.train,
                   method = "leapForward",
                   metric = "RMSE",   # caret reports RMSE rather than MSE
                   maximize = FALSE,
                   trControl = ctrl,
                   tuneGrid = data.frame(nvmax = 1:17))
model.fwd
## Linear Regression with Forward Selection
##
## 542 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 487, 487, 488, 489, 489, 488, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 2821.222 0.5421468 2263.313
## 2 2432.681 0.6511000 1868.439
## 3 2226.759 0.7038217 1723.816
## 4 2081.243 0.7405207 1643.016
## 5 2025.848 0.7518705 1595.431
## 6 2015.033 0.7547385 1596.790
## 7 1993.910 0.7601704 1583.245
## 8 2009.634 0.7558769 1598.434
## 9 2029.106 0.7502363 1601.002
## 10 2034.763 0.7506942 1602.750
## 11 2037.931 0.7506845 1602.482
## 12 1995.502 0.7625791 1577.820
## 13 1997.814 0.7624667 1580.192
## 14 2007.636 0.7598428 1578.395
## 15 2007.645 0.7596646 1577.778
## 16 2001.546 0.7608492 1574.528
## 17 1996.867 0.7618954 1570.084
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was nvmax = 5.
Using the one-standard-error rule, the forward stepwise selection process identified a five-variable model (nvmax = 5) as a satisfactory solution.
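caret's plot method shows the cross-validated RMSE profile across model sizes, which makes the one-SE choice easy to see; a quick sketch:

# CV RMSE versus the number of predictors retained by forward selection
plot(model.fwd)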
**(B) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.**
The variables in the selected solution are:
coef(model.fwd$finalModel, id = 5)
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -2567.2683333 2642.6569135 0.9689926 37.4598121 64.1658168
## Expend
## 0.2808643
library(gam)  # s() smoothing splines and plot.Gam()
model.gam <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) +
                   s(perc.alumni) + s(Expend), data = College.train)
par(mfrow = c(2, 3))
plot(model.gam, se = TRUE, col = "red")
The variable Private is a qualitative variable indicating whether a university is private or public; the majority of schools in the data set are private, and private schools are associated with substantially higher out-of-state tuition. The effect of room and board cost (Room.Board) appears roughly linear, which is expected since out-of-state students require room and board whether it is provided by the university or the private sector. As out-of-state tuition rises, so does the percentage of faculty with PhDs (PhD). The model also suggests that more expensive schools have a higher percentage of alumni (perc.alumni) who donate. The relationship with instructional expenditure per student (Expend) begins to plateau at roughly $16,000, suggesting that out-of-state tuition is not the driver of a university's instructional expenditures.
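As a quick check of the claim about class balance (output not shown), one can tabulate Private in the training set:

# Counts of public ("No") versus private ("Yes") schools in the training data
table(College.train$Private)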
**(C) Evaluate the model obtained on the test set, and explain the results obtained.**
The GAM's test-set MSE and R^2 for College.test are:
gam.pred <- predict(model.gam, newdata = College.test)  # test-set predictions from the GAM
(gam.mse <- mean((College.test$Outstate - gam.pred)^2))
## [1] 5023810
gam.tss <- sum((College.test$Outstate - mean(College.test$Outstate))^2)  # total sum of squares
gam.rss <- sum((gam.pred - College.test$Outstate)^2)                     # residual sum of squares
(1 - gam.rss/gam.tss)                                                    # test-set R^2
## [1] 0.6930349
The GAM explains about 69.3% of the variance in the test data.
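For comparison, the test-set RMSE of the linear forward-selection model from part (A) can be computed the same way; a sketch, output not shown:

# Test RMSE of the forward-selection linear model, for comparison with the GAM
lm.pred <- predict(model.fwd, College.test)
sqrt(mean((College.test$Outstate - lm.pred)^2))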
**(D) For which variables, if any, is there evidence of a non-linear relationship with the response?**
summary(model.gam)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) +
## s(Expend), data = College.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5440.1 -1136.9 103.7 1202.5 7010.9
##
## (Dispersion Parameter for gaussian family taken to be 3448194)
##
## Null Deviance: 8712201403 on 541 degrees of freedom
## Residual Deviance: 1806853709 on 524 degrees of freedom
## AIC: 9716.746
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2257060436 2257060436 654.56 < 2.2e-16 ***
## s(Room.Board) 1 1880675047 1880675047 545.41 < 2.2e-16 ***
## s(PhD) 1 672093668 672093668 194.91 < 2.2e-16 ***
## s(perc.alumni) 1 409119952 409119952 118.65 < 2.2e-16 ***
## s(Expend) 1 757963449 757963449 219.81 < 2.2e-16 ***
## Residuals 524 1806853709 3448194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board) 3 2.7045 0.04481 *
## s(PhD) 3 2.3431 0.07225 .
## s(perc.alumni) 3 1.7452 0.15675
## s(Expend) 3 26.9990 2.22e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using a significance threshold of p < 0.05, the ANOVA for Nonparametric Effects shows strong evidence of a non-linear relationship for Expend (p < 2.2e-16) and weaker evidence for Room.Board (p = 0.045); PhD is marginal (p = 0.072) and perc.alumni shows no evidence of non-linearity. The ANOVA for Parametric Effects, by contrast, tests only the linear components: it indicates that every variable has a significant (at least linear) relationship with the response, not that those relationships are non-linear.
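One way to double-check the non-linearity of Expend is to refit the GAM with a linear Expend term and compare the nested models with an F-test; a sketch, where model.gam.lin is introduced here for illustration:

# Same GAM but with Expend entering linearly
model.gam.lin <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) +
                       s(perc.alumni) + Expend, data = College.train)
# F-test comparing the nested models; a small p-value favors the smooth term
anova(model.gam.lin, model.gam, test = "F")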