In this exercise, you will further analyze the Wage data set considered throughout this chapter.
Perform polynomial regression to predict Wage using Age. Use cross-validation to select the optimal degree for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR)       # Wage and College data sets
library(caret)      # train(), trainControl(), createDataPartition(), RMSE(), R2()
library(tidyverse)  # ggplot2, dplyr, and the pipe
library(scales)     # percent()
library(knitr)      # kable()
library(kableExtra) # kable_styling()

set.seed(1)
RMSE <- c()
for(i in 1:10){
  Wage$age.degree <- poly(Wage$age, i)
  lm.fit <- train(wage ~ age.degree, data = Wage, method = "lm",
                  metric = "RMSE", trControl = trainControl(method = "cv", number = 10))
  RMSE[i] <- lm.fit$results$RMSE
}
data.frame(degree = 1:10, RMSE = RMSE) %>%
  ggplot(aes(x = degree, y = RMSE)) +
  geom_line(size = 1, color = "#1c9099") +
  geom_point(color = "#1c9099") +
  scale_x_continuous(breaks = 1:10, minor_breaks = NULL) +
  labs(x = "\n Degree", y = "Cross-Validation RMSE\n")
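The degree with the lowest cross-validation RMSE can also be read off programmatically rather than from the plot:

which.min(RMSE)  # degree whose CV RMSE is smallest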
A polynomial degree of 6 gives the lowest error in the cross-validation plot. However, based on the ANOVA results below, the 5th- and 6th-degree terms are not significant and the 4th-degree term is only borderline (p ≈ 0.05), so a polynomial of degree 3 is sufficient. Below is a plot of the cubic polynomial fit of Age used to predict Wage.
fit.1 <- lm(wage ~ poly(age, 1, raw = TRUE), data = Wage)
fit.2 <- lm(wage ~ poly(age, 2, raw = TRUE), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5, raw = TRUE), data = Wage)
fit.6 <- lm(wage ~ poly(age, 6, raw = TRUE), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1, raw = TRUE)
## Model 2: wage ~ poly(age, 2, raw = TRUE)
## Model 3: wage ~ poly(age, 3, raw = TRUE)
## Model 4: wage ~ poly(age, 4, raw = TRUE)
## Model 5: wage ~ poly(age, 5, raw = TRUE)
## Model 6: wage ~ poly(age, 6, raw = TRUE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.6636 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8936 0.001675 **
## 4 2995 4771604 1 6070 3.8117 0.050989 .
## 5 2994 4770322 1 1283 0.8054 0.369565
## 6 2993 4766389 1 3932 2.4692 0.116201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
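Because poly() uses orthogonal polynomials by default, the same p-values can be obtained more compactly from a single degree-6 fit: the square of each term's t-statistic equals the corresponding F-statistic in the ANOVA table above.

coef(summary(lm(wage ~ poly(age, 6), data = Wage)))  # t-values; t^2 matches the F column above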
ggplot(Wage, aes(age, wage)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, formula = y ~ poly(x, 3, raw = TRUE)) +
  labs(title = "Predicting Wage with a Cubic Polynomial Function of Age",
       x = "Age", y = "Wage")
Fit a step function to predict Wage using Age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

RMSE <- c()
for(i in 2:10){   # a step function needs at least 2 cuts, so RMSE[1] stays NA
  Wage$age.cut <- cut(Wage$age, i)
  lm.fit <- train(wage ~ age.cut, data = Wage, method = "lm",
                  metric = "RMSE", trControl = trainControl(method = "cv", number = 10))
  RMSE[i] <- lm.fit$results$RMSE
}
data.frame(cuts = 1:10, RMSE = RMSE) %>%
  ggplot(aes(x = cuts, y = RMSE)) +
  geom_line(size = 1, color = "#1c9099") +
  geom_point(color = "#1c9099") +
  scale_x_continuous(breaks = 1:10, minor_breaks = NULL) +
  labs(x = "\n Number of Cuts", y = "Cross-Validation RMSE \n")
Using 10-fold cross-validation, the optimal number of cuts that gives the lowest CV error is k = 7. Below is the plot of the fitted 7-step function of Age.
ggplot(Wage, aes(age, wage)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, formula = y ~ cut(x, 7)) +
  labs(title = "Predicting Wage with a 7-Step Function of Age", x = "Age", y = "Wage")
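For reference, the seven equal-width age intervals created by cut() (and the number of observations in each) can be inspected directly:

table(cut(Wage$age, 7))  # interval labels and counts behind the 7-step fit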
This question relates to the College data set. We split the data into a training set and a test set and, using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set to identify a satisfactory model that uses just a subset of the predictors.
set.seed(2)
train <- College$Outstate %>%
  createDataPartition(p = 0.6, list = FALSE)
college.train <- College[train, ]
college.test <- College[-train, ]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
lm.forward <- train(Outstate ~ ., data = college.train,
                    method = "leapForward", metric = "RMSE",
                    tuneGrid = data.frame(nvmax = 1:17),
                    trControl = ctrl)
plot(lm.forward)
A subset of six variables minimizes the cross-validation error. The best six variables are Private, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate.
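One way to reproduce the selected model size and the coefficients shown below is to query the caret object directly; with method = "leapForward" the object stored in finalModel is a regsubsets fit, so coef() with the chosen size applies:

lm.forward$bestTune                  # number of predictors chosen by cross-validation
coef(lm.forward$finalModel, id = 6)  # coefficients of the six-variable model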
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3426.1279395 2625.0467512 1.0544391 31.6186696 62.2164054
## Expend Grad.Rate
## 0.1835718 27.5736003
Next, we fit a GAM on the training data, using Outstate as the response and the six selected variables as predictors, with smoothing splines whose degrees of freedom are tuned by cross-validation.

gam.fit <- train(Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate,
                 data = college.train, method = "gamSpline", metric = "RMSE",
                 trControl = ctrl, tuneGrid = data.frame(df = seq(2, 10, by = 1)))
plot(gam.fit)
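In addition to the cross-validation profile from plot(gam.fit), the term-by-term partial plots of the fitted GAM (discussed below) can be drawn from the final model; a minimal sketch, assuming caret stores the gam-package fit in finalModel:

library(gam)            # plot.Gam() for the fitted smoothing-spline terms
par(mfrow = c(2, 3))    # one panel per predictor
plot(gam.fit$finalModel, se = TRUE)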
Based on the cross-validation plot, smoothing splines with three degrees of freedom give the lowest CV error. The GAM plots show that there is evidence of non-linearity in PhD, Grad.Rate, and Expend. We will use ANOVA to test for nonparametric effects.
Evaluate the model obtained on the test set, and explain the results obtained.
The Generalized Additive Model (GAM) achieved a lower test error and a higher test \(R^2\) than the linear model, indicating that the smoothing splines provide a better fit for predicting out-of-state tuition.
gam.preds <- gam.fit %>%
  predict(college.test)
gam.rmse <- RMSE(college.test$Outstate, gam.preds)
gam.r2 <- R2(college.test$Outstate, gam.preds)

lm.preds <- lm.forward %>%
  predict(college.test)
lm.rmse <- RMSE(college.test$Outstate, lm.preds)
lm.r2 <- R2(college.test$Outstate, lm.preds)

data.frame(Model = c("GAM", "Linear Model"),
           RMSE = c(gam.rmse, lm.rmse),
           RSquared = percent(c(gam.r2, lm.r2), accuracy = 0.1)) %>%
  arrange(RMSE) %>%
  kable(format.args = list(big.mark = ","), align = c("l", "c", "c"), digits = 0) %>%
  kable_styling(bootstrap_options = c("striped"))
| Model        | RMSE  | RSquared |
|:-------------|:-----:|:--------:|
| GAM          | 1,973 |  75.9%   |
| Linear Model | 2,138 |  71.6%   |
For which variables, if any, is there evidence of a non-linear relationship with the response?
Based on the nonparametric ANOVA tests, there is strong evidence of a non-linear relationship between Outstate and Expend; none of the other smoothed terms shows a significant nonparametric effect.
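The "Anova for Nonparametric Effects" table below is part of the summary of the fitted gam object and can be printed with, for example:

summary(gam.fit$finalModel)  # includes the nonparametric ANOVA table shown below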
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## PrivateYes
## s(perc.alumni, df = 3) 2 0.4687 0.6261
## s(PhD, df = 3) 2 2.0337 0.1320
## s(Grad.Rate, df = 3) 2 1.5558 0.2121
## s(Room.Board, df = 3) 2 2.0234 0.1334
## s(Expend, df = 3) 2 26.6981 1.1e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1