In this exercise, you will further analyze the Wage data set considered throughout this chapter.
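The analyses below rely on several packages; the setup chunk is not shown in this write-up, so here is a sketch of the libraries assumed to be attached:
# Packages assumed to be loaded for the analyses in this exercise
library(ISLR)        # Wage and College data sets
library(caret)       # train(), trainControl(), createDataPartition(), RMSE(), R2()
library(tidyverse)   # ggplot2, dplyr, and the %>% pipe
library(gam)         # smoothing-spline fits used by caret's gamSpline method
library(scales)      # percent()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()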
Perform polynomial regression to predict Wage using Age. Use cross-validation to select the optimal degree for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
set.seed(1)
# 10-fold cross-validation error for polynomial degrees 1 through 10
RMSE <- c()
for(i in 1:10){
  Wage$age.degree <- poly(Wage$age, i)
  lm.fit <- train(wage ~ age.degree, data = Wage, method = "lm",
                  metric = "RMSE", trControl = trainControl(method = "cv", number = 10))
  RMSE[i] <- lm.fit$results$RMSE
}
data.frame(degree = 1:10, RMSE = RMSE) %>%
  ggplot(aes(x = degree, y = RMSE)) +
  geom_line(size = 1, color = "#1c9099") +
  geom_point(color = "#1c9099") +
  scale_x_continuous(breaks = 1:10, minor_breaks = NULL) +
  labs(x = "\n Degree", y = "Cross-Validation RMSE\n")
A polynomial degree of 6 gives us the lowest error in the cross-validation plot. However, based on the ANOVA results, the 6th-order polynomial term is not significant; a polynomial of degree 3 is sufficient. Below is a plot of the cubic polynomial fit of Age to predict Wage.
fit.1 <- lm(wage ~ poly(age, 1, raw = TRUE), data = Wage)
fit.2 <- lm(wage ~ poly(age, 2, raw = TRUE), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5, raw = TRUE), data = Wage)
fit.6 <- lm(wage ~ poly(age, 6, raw = TRUE), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1, raw = TRUE)
## Model 2: wage ~ poly(age, 2, raw = TRUE)
## Model 3: wage ~ poly(age, 3, raw = TRUE)
## Model 4: wage ~ poly(age, 4, raw = TRUE)
## Model 5: wage ~ poly(age, 5, raw = TRUE)
## Model 6: wage ~ poly(age, 6, raw = TRUE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.6636 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8936 0.001675 **
## 4 2995 4771604 1 6070 3.8117 0.050989 .
## 5 2994 4770322 1 1283 0.8054 0.369565
## 6 2993 4766389 1 3932 2.4692 0.116201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(Wage, aes(age, wage)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, formula = y ~ poly(x, 3, raw = TRUE)) +
  labs(title = "Predicting Wage with a Cubic Polynomial Function of Age",
       x = "Age", y = "Wage")
Fit a step function to predict Wage using Age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
RMSE <- c()
for(i in 2:10){   # cut() requires at least two intervals, so start at 2
  Wage$age.cut <- cut(Wage$age, i)
  lm.fit <- train(wage ~ age.cut, data = Wage, method = "lm",
                  metric = "RMSE", trControl = trainControl(method = "cv", number = 10))
  RMSE[i] <- lm.fit$results$RMSE
}
data.frame(cuts = 2:10, RMSE = RMSE[2:10]) %>%
  ggplot(aes(x = cuts, y = RMSE)) +
  geom_line(size = 1, color = "#1c9099") +
  geom_point(color = "#1c9099") +
  scale_x_continuous(breaks = 2:10, minor_breaks = NULL) +
  labs(x = "\n Number of Cuts", y = "Cross-Validation RMSE \n")
Using 10-fold cross-validation, the optimal number of cuts that gives us the lowest CV error is k = 7. Below is the fit of a 7-Step Function of Age.
ggplot(Wage, aes(age, wage)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, formula = y ~ cut(x, 7)) +
  labs(title = "Predicting Wage with a 7-Step Function of Age", x = "Age", y = "Wage")
This question relates to the College data set.
set.seed(2)
train <- College$Outstate %>%
  createDataPartition(p = 0.6, list = FALSE)
college.train <- College[train, ]
college.test <- College[-train, ]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
lm.forward <- train(Outstate ~ ., data = college.train,
                    method = "leapForward", metric = "RMSE",
                    tuneGrid = data.frame(nvmax = 1:17),
                    trControl = ctrl)
plot(lm.forward)
A subset of six variables minimizes the cross-validation error. The six selected variables are Private, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate.
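The coefficients shown next were presumably extracted from the six-variable model; with caret's leapForward method the final model is a leaps::regsubsets object, so a call along these lines would reproduce them:
# Coefficients of the six-variable model chosen by forward stepwise selection
# (assumes lm.forward$finalModel is the regsubsets object stored by caret)
coef(lm.forward$finalModel, id = 6)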
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3426.1279395 2625.0467512 1.0544391 31.6186696 62.2164054
## Expend Grad.Rate
## 0.1835718 27.5736003
gam.fit <- train(Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate,
                 data = college.train, method = "gamSpline", metric = "RMSE",
                 trControl = ctrl, tuneGrid = data.frame(df = seq(2, 10, by = 1)))
plot(gam.fit)
Based on the cross-validation plot, smoothing splines with three degrees of freedom give us the lowest CV error. The GAM plots show that there is evidence of non-linearity in PhD, Grad.Rate, and Expend. We will use ANOVA to test for nonparametric effects.
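The partial-effect plots are not reproduced in this write-up; a minimal sketch of how they could be drawn, assuming gam.fit$finalModel is the gam-package fit produced by caret's gamSpline method:
# Draw the smooth term for each predictor with standard-error bands
# (assumes gam.fit$finalModel is a gam-package object)
par(mfrow = c(2, 3))
plot(gam.fit$finalModel, se = TRUE)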
Evaluate the model obtained on the test set, and explain the results obtained.
The generalized additive model (GAM) achieved a lower test RMSE and a higher test \(R^2\) than the linear model, indicating that the smoothing splines provide a better fit for predicting out-of-state tuition.
gam.preds <- gam.fit %>%
  predict(college.test)
gam.rmse <- RMSE(college.test$Outstate, gam.preds)
gam.r2 <- R2(college.test$Outstate, gam.preds)
lm.preds <- lm.forward %>%
  predict(college.test)
lm.rmse <- RMSE(college.test$Outstate, lm.preds)
lm.r2 <- R2(college.test$Outstate, lm.preds)
data.frame(Model = c("GAM", "Linear Model"),
           RMSE = c(gam.rmse, lm.rmse),
           RSquared = percent(c(gam.r2, lm.r2), accuracy = 0.1)) %>%
  arrange(RMSE) %>%
  kable(format.args = list(big.mark = ","), align = c("l", "c", "c"), digits = 0) %>%
  kable_styling(bootstrap_options = c("striped"))

| Model | RMSE | RSquared |
|---|---|---|
| GAM | 1,973 | 75.9% |
| Linear Model | 2,138 | 71.6% |
For which variables, if any, is there evidence of a non-linear relationship with the response?
Based on the nonparametric ANOVA tests, there is strong evidence of a non-linear relationship between Outstate and Expend.
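The table below was presumably produced by summarizing the fitted GAM; for a gam-package fit, summary() reports the "Anova for Nonparametric Effects" shown here (a sketch, assuming the final model object stored by caret):
# Summarize the final GAM; the summary of a gam-package fit includes
# the "Anova for Nonparametric Effects" table reproduced below
summary(gam.fit$finalModel)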
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## PrivateYes
## s(perc.alumni, df = 3) 2 0.4687 0.6261
## s(PhD, df = 3) 2 2.0337 0.1320
## s(Grad.Rate, df = 3) 2 1.5558 0.2121
## s(Room.Board, df = 3) 2 2.0234 0.1334
## s(Expend, df = 3) 2 26.6981 1.1e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1