These exercises are from the ISLR2 book, starting at PDF page 334.
# Load dependencies
pacman::p_load(ISLR2, boot, dplyr, leaps, mgcv)
In this exercise, you will further analyze the Wage data set considered throughout this chapter.
Perform polynomial regression to predict wage using
age. Use cross-validation to select the optimal degree d
for the polynomial. What degree was chosen, and how does this compare to
the results of hypothesis testing using ANOVA? Make a plot of the
resulting polynomial fit to the data.
Using 10-fold cross-validation, a degree-4 polynomial yielded the lowest estimated prediction error. However, an ANOVA comparison between the cubic (degree-3) and quartic (degree-4) polynomial models produced a p-value of approximately 0.05, so the quartic term is only borderline significant. Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data.
# Read in the data from ISLR2 package
d1 = Wage
head(d1)
## year age maritl race education region
## 231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
## 86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
## 376662 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
## jobclass health health_ins logwage wage
## 231655 1. Industrial 1. <=Good 2. No 4.318063 75.04315
## 86582 2. Information 2. >=Very Good 2. No 4.255273 70.47602
## 161300 1. Industrial 1. <=Good 1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good 1. Yes 5.041393 154.68529
## 11443 2. Information 1. <=Good 1. Yes 4.318063 75.04315
## 376662 2. Information 2. >=Very Good 1. Yes 4.845098 127.11574
# 10-fold cross-validation to find the optimal polynomial degree
set.seed(42)
cv.error = rep(0, 5)
for (i in 1:5) {
  lm_model = glm(wage ~ poly(age, i), data = d1)
  cv.error[i] = cv.glm(d1, lm_model, K = 10)$delta[1]
}
# Print the errors
cv.error
## [1] 1676.334 1601.952 1597.313 1594.688 1595.061
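As a sanity check, the chosen degree can be read off programmatically; with the errors above, which.min() lands on the quartic:
# Degree with the smallest cross-validation error
which.min(cv.error)
## [1] 4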
# Use ANOVA to help decide on the polynomial degree
fit.1 = lm(wage ~ age, data= d1)
fit.2 = lm(wage ~ poly(age, 2), data= d1)
fit.3 = lm(wage ~ poly(age, 3), data= d1)
fit.4 = lm(wage ~ poly(age, 4), data= d1)
fit.5 = lm(wage ~ poly(age, 5), data= d1)
# ANOVA test
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
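Because poly() generates orthogonal polynomials, these p-values can also be recovered from the degree-5 fit alone: the square of each coefficient's t-statistic equals the corresponding F-statistic in the ANOVA table. A quick check (output omitted here):
# Squared t-statistics from fit.5 reproduce the ANOVA F-statistics
coef(summary(fit.5))[-1, "t value"]^2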
age_range = range(d1$age)
age_grid = seq(from= age_range[1], to= age_range[2])
preds = predict(fit.4, list(age= age_grid), se.fit= TRUE)
# Bands at +/- 2 standard errors (approximately a 95% confidence interval)
se_bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
# Plot the data and add the fit from the degree-4 polynomial
par(mfrow= c(1, 1), mar= c(4.5, 4.5, 1, 1), oma= c(0, 0, 4, 0))
plot(d1$age, d1$wage, xlim= age_range, cex= .5, col= "darkgrey")
title("Degree-4 Polynomial", outer= T)
lines(age_grid, preds$fit, lwd= 2, col= "blue")
matlines(age_grid, se_bands, lwd= 1, col= "blue", lty= 3)
Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
By cross-validation, 8 is the optimal number of cuts: the loop below stores the error for i intervals at index i - 1, so the minimum at index 7 of the error vector corresponds to cutting age into 8 intervals.
# 10-fold cross-validation to find the optimal number of cuts
set.seed(42)
cv.error = rep(0, 9)  # one error for each of 2 through 10 cuts
for (i in 2:10) {
  d1$age_cut = cut(d1$age, i)
  lm_model = glm(wage ~ age_cut, data = d1)
  cv.error[i-1] = cv.glm(d1, lm_model, K = 10)$delta[1]
}
# Optimal number of cuts (the error for i cuts is stored at index i - 1)
best_k = which.min(cv.error) + 1
# Print the errors
cv.error
## [1] 1733.565 1684.082 1637.465 1632.337 1623.836 1610.572 1602.163 1612.770
## [9] 1606.141
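Mapping the index of the smallest error back to a cut count confirms the choice:
best_k
## [1] 8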
# Final model, refit with the CV-selected number of cuts
fit.8 = glm(wage ~ cut(age, 8), data = d1)
age_range = range(d1$age)
age_grid = seq(from= age_range[1], to= age_range[2])
preds = predict(fit.8, list(age= age_grid), se.fit = TRUE)
# Bands at +/- 2 standard errors (approximately a 95% confidence interval)
se_bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
# Plot the data and add the fit from the optimal number of cuts
par(mfrow= c(1, 1), mar= c(4.5, 4.5, 1, 1), oma= c(0, 0, 4, 0))
plot(d1$age, d1$wage, xlim= age_range, cex= .5, col= "darkgrey")
title("7 Cut Model", outer= T)
lines(age_grid, preds$fit, lwd= 2, col= "blue")
matlines(age_grid, se_bands, lwd= 1, col= "blue", lty= 3)
This question relates to the College data set.
Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
# Read in the data from ISLR2 package
d2 = College
glimpse(d2)
## Rows: 777
## Columns: 18
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes…
## $ Apps <dbl> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, 173…
## $ Accept <dbl> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 1425…
## $ Enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484,…
## $ Top10perc <dbl> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44, 23…
## $ Top25perc <dbl> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73, 46…
## $ F.Undergrad <dbl> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 1830…
## $ P.Undergrad <dbl> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, 638…
## $ Outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1559…
## $ Room.Board <dbl> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 3380…
## $ Books <dbl> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, 400…
## $ Personal <dbl> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800, 60…
## $ PhD <dbl> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79, 36…
## $ Terminal <dbl> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87, 6…
## $ S.F.Ratio <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11.5, …
## $ perc.alumni <dbl> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, 26,…
## $ Expend <dbl> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 11644,…
## $ Grad.Rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 55…
set.seed(1)
# Sample half of the row indices for the training set
train = sample(nrow(d2), nrow(d2) %/% 2)
test = -train
# Split data into test and train
college_train = d2[train, ]
college_test = d2[test,]
# Forward stepwise selection with regsubsets
reg_model = regsubsets(Outstate ~ ., data= college_train, nvmax= 17, method= 'forward')
reg_summary = summary(reg_model)
reg_summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = college_train, nvmax = 17,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " "*" " " " "
## 9 ( 1 ) "*" " " " " " " "*" " " " "
## 10 ( 1 ) "*" " " " " " " "*" " " " "
## 11 ( 1 ) "*" " " "*" " " "*" " " " "
## 12 ( 1 ) "*" "*" "*" " " "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " " " "
## 6 ( 1 ) " " "*" " " " " " " "*" " "
## 7 ( 1 ) " " "*" " " "*" " " "*" " "
## 8 ( 1 ) " " "*" " " "*" " " "*" " "
## 9 ( 1 ) " " "*" " " "*" " " "*" "*"
## 10 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 13 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 14 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
# Plot Cp, BIC, and adjusted R^2 against model size; dashed lines mark a
# band of 0.2 standard deviations around each criterion's optimum
par(mfrow = c(1, 3))
plot(reg_summary$cp, xlab= 'Number of Variables', ylab= "Mallows' Cp", type= 'l')
min_cp = min(reg_summary$cp)
std_cp = sd(reg_summary$cp)
abline(h= min_cp + 0.2*std_cp, col= 'red', lty= 2)
abline(h= min_cp - 0.2*std_cp, col= 'red', lty= 2)
plot(reg_summary$bic, xlab= 'Number of Variables', ylab= "BIC", type= 'l')
min_bic = min(reg_summary$bic)
std_bic = sd(reg_summary$bic)
abline(h= min_bic + 0.2*std_bic, col= 'red', lty= 2)
abline(h= min_bic - 0.2*std_bic, col= 'red', lty= 2)
plot(reg_summary$adjr2, xlab= 'Number of Variables', ylab= "Adjusted R^2", type= 'l')
max_adjr2 = max(reg_summary$adjr2)
std_adjr2 = sd(reg_summary$adjr2)
abline(h= max_adjr2 + 0.2*std_adjr2, col= 'red', lty= 2)
abline(h= max_adjr2 - 0.2*std_adjr2, col= 'red', lty= 2)
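As a numeric check on the plots, the model size favored by each criterion can be computed directly; Cp and BIC are minimized while adjusted R^2 is maximized (values not printed here):
which.min(reg_summary$cp)
which.min(reg_summary$bic)
which.max(reg_summary$adjr2)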
# Coefficients of the 6-variable model, where the criteria begin to level off
coeffs = coef(reg_model, id= 6)
names(coeffs)
## [1] "(Intercept)" "PrivateYes" "Room.Board" "Terminal" "perc.alumni"
## [6] "Expend" "Grad.Rate"
Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
Note that the GAM below uses PhD in place of the selected Terminal variable; both are strongly correlated measures of faculty credentials. Expend shows a strong non-linear effect, indicating that institutions with higher per-student instructional expenditures charge significantly higher out-of-state tuition. From the Room.Board plot I see that as room and board costs increase, out-of-state tuition tends to increase. The PhD plot shows a relatively flat curve, suggesting little association. The perc.alumni plot shows a positive, increasing pattern, which could suggest that schools with more generous alumni tend to charge higher tuition, possibly due to prestige. From the Grad.Rate plot I see a positive relationship, indicating that higher graduation rates might be associated with higher tuition, possibly due to better student support or prestige.
# GAM with smooth terms; k sets the basis dimension (flexibility cap) of each smooth
gam_model = gam(Outstate ~ Private + s(Room.Board, k= 3) + s(PhD, k= 3) + s(perc.alumni, k= 3) + s(Expend, k= 6) + s(Grad.Rate, k= 3), data= college_train)
# Plot
par(mfrow = c(2, 3))
plot(gam_model, se= T, col= 'lightblue')
Evaluate the model obtained on the test set, and explain the results obtained.
The linear model resulted in a test RMSE of 1960.8, while the GAM resulted in a test RMSE of 1847.2. Thus, the GAM outperformed the linear model, reducing test error by approximately $114 and demonstrating the value of modeling non-linear effects when predicting out-of-state tuition.
set.seed(1)
lm_fit = lm(Outstate ~ Private + Room.Board + Terminal + perc.alumni + Expend + Grad.Rate, data= college_train)
# Predictions
lm_pred = predict(lm_fit, college_test)
gam_pred = predict(gam_model, college_test)
# Test Error RMSE
sqrt(mean((college_test$Outstate - lm_pred)^2))
## [1] 1960.831
sqrt(mean((college_test$Outstate - gam_pred)^2))
## [1] 1847.212
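The same comparison can be framed as test-set R^2, the fraction of tuition variance each model explains on the held-out colleges (a quick sketch; values not shown above):
# Test-set R^2 = 1 - RSS/TSS on the held-out data
tss = mean((college_test$Outstate - mean(college_test$Outstate))^2)
1 - mean((college_test$Outstate - lm_pred)^2) / tss
1 - mean((college_test$Outstate - gam_pred)^2) / tss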
summary(gam_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Outstate ~ Private + s(Room.Board, k = 3) + s(PhD, k = 3) + s(perc.alumni,
## k = 3) + s(Expend, k = 6) + s(Grad.Rate, k = 3)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8740.0 225.6 38.737 < 2e-16 ***
## PrivateYes 2408.1 280.9 8.571 2.67e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Room.Board) 1.416 1.658 38.604 < 2e-16 ***
## s(PhD) 1.000 1.000 1.837 0.176
## s(perc.alumni) 1.000 1.000 21.085 6.43e-06 ***
## s(Expend) 4.829 4.985 32.819 < 2e-16 ***
## s(Grad.Rate) 1.000 1.000 24.237 1.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.796 Deviance explained = 80.1%
## GCV = 3.7993e+06 Scale est. = 3.6892e+06 n = 388
For which variables, if any, is there evidence of a non-linear relationship with the response?
The Expend variable shows the strongest evidence of a non-linear effect (edf of about 4.8 in the summary above). At lower expenditure levels, tuition increases with spending, suggesting a positive relationship. Mid-range values show a flattening or slight dip, implying that increased spending does not always translate into higher tuition. At higher expenditure levels, tuition rises again, suggesting a return to a positive association. Room.Board shows mild curvature (edf of about 1.4), while the edf values of 1.0 for PhD, perc.alumni, and Grad.Rate indicate effectively linear relationships with the response.
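The effective degrees of freedom quantify this directly; an edf near 1 means the smooth has collapsed to an essentially linear term (these are the same edf values reported in the summary above):
# edf of each smooth term; values near 1 indicate an effectively linear fit
summary(gam_model)$s.table[, "edf"]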