These exercises are from the ISLR2 book, starting at PDF page 334.
# Load dependencies
pacman::p_load(ISLR2, boot, dplyr, leaps, mgcv)
In this exercise, you will further analyze the Wage data set considered throughout this chapter.
Perform polynomial regression to predict wage using
age. Use cross-validation to select the optimal degree d
for the polynomial. What degree was chosen, and how does this compare to
the results of hypothesis testing using ANOVA? Make a plot of the
resulting polynomial fit to the data.
Using 10-fold cross-validation, a degree-4 polynomial yielded the lowest estimated prediction error. However, an ANOVA comparison between the cubic (degree-3) and quartic (degree-4) polynomial models produced a p-value of approximately 0.05, so the quartic term is only borderline significant. Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data.
# Read in the data from ISLR2 package
d1 = Wage
head(d1)
## year age maritl race education region
## 231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
## 86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
## 376662 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
## jobclass health health_ins logwage wage
## 231655 1. Industrial 1. <=Good 2. No 4.318063 75.04315
## 86582 2. Information 2. >=Very Good 2. No 4.255273 70.47602
## 161300 1. Industrial 1. <=Good 1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good 1. Yes 5.041393 154.68529
## 11443 2. Information 1. <=Good 1. Yes 4.318063 75.04315
## 376662 2. Information 2. >=Very Good 1. Yes 4.845098 127.11574
# 10-fold cross-validation to find the optimal polynomial degree
set.seed(42)
cv.error = rep(0, 5)
for (i in 1:5) {
  lm_model = glm(wage ~ poly(age, i), data = d1)
  cv.error[i] = cv.glm(d1, lm_model, K = 10)$delta[1]
}
# Print the errors
cv.error
## [1] 1676.334 1601.952 1597.313 1594.688 1595.061
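As a sanity check, the chosen degree can be read off programmatically; with the errors above, which.min() lands on the quartic:
# Degree with the smallest cross-validation error
which.min(cv.error)
## [1] 4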
# Use ANOVA to help decide on the polynomial degree
fit.1 = lm(wage ~ age, data= d1)
fit.2 = lm(wage ~ poly(age, 2), data= d1)
fit.3 = lm(wage ~ poly(age, 3), data= d1)
fit.4 = lm(wage ~ poly(age, 4), data= d1)
fit.5 = lm(wage ~ poly(age, 5), data= d1)
# ANOVA test
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
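Because poly() generates orthogonal polynomials, these p-values can also be recovered from the degree-5 fit alone: the square of each coefficient's t-statistic equals the corresponding F-statistic in the ANOVA table. A quick check (output omitted here):
# Squared t-statistics from fit.5 reproduce the ANOVA F-statistics
coef(summary(fit.5))[-1, "t value"]^2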
age_range = range(d1$age)
age_grid = seq(from= age_range[1], to= age_range[2])
preds = predict(fit.4, list(age= age_grid), se.fit= TRUE)
# Bands at +/- 2 standard errors (approximately a 95% confidence interval)
se_bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
# Plot the data and add the fit from the degree-4 polynomial
par(mfrow= c(1, 1), mar= c(4.5, 4.5, 1, 1), oma= c(0, 0, 4, 0))
plot(d1$age, d1$wage, xlim= age_range, cex= .5, col= "darkgrey")
title("Degree-4 Polynomial", outer= T)
lines(age_grid, preds$fit, lwd= 2, col= "blue")
matlines(age_grid, se_bands, lwd= 1, col= "blue", lty= 3)
Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
By cross-validation, 8 is the optimal number of cuts: the loop below stores the error for i intervals at index i - 1, so the minimum at index 7 of the error vector corresponds to cutting age into 8 intervals.
# 10-fold cross-validation to find the optimal number of cuts
set.seed(42)
cv.error = rep(0, 9)  # one error for each of 2 through 10 cuts
for (i in 2:10) {
  d1$age_cut = cut(d1$age, i)
  lm_model = glm(wage ~ age_cut, data = d1)
  cv.error[i-1] = cv.glm(d1, lm_model, K = 10)$delta[1]
}
# Optimal number of cuts (the error for i cuts is stored at index i - 1)
best_k = which.min(cv.error) + 1
# Print the errors
cv.error
## [1] 1733.565 1684.082 1637.465 1632.337 1623.836 1610.572 1602.163 1612.770
## [9] 1606.141
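Mapping the index of the smallest error back to a cut count confirms the choice:
best_k
## [1] 8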
# Final model, refit with the CV-selected number of cuts
fit.8 = glm(wage ~ cut(age, 8), data = d1)
age_range = range(d1$age)
age_grid = seq(from= age_range[1], to= age_range[2])
preds = predict(fit.8, list(age= age_grid), se.fit = TRUE)
# Bands at +/- 2 standard errors (approximately a 95% confidence interval)
se_bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
# Plot the data and add the fit from the optimal number of cuts
par(mfrow= c(1, 1), mar= c(4.5, 4.5, 1, 1), oma= c(0, 0, 4, 0))
plot(d1$age, d1$wage, xlim= age_range, cex= .5, col= "darkgrey")
title("7 Cut Model", outer= T)
lines(age_grid, preds$fit, lwd= 2, col= "blue")
matlines(age_grid, se_bands, lwd= 1, col= "blue", lty= 3)
This question relates to the College data set.
Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
# Read in the data from ISLR2 package
d2 = College
glimpse(d2)
## Rows: 777
## Columns: 18
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes…
## $ Apps <dbl> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, 582, 173…
## $ Accept <dbl> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 498, 1425…
## $ Enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484,…
## $ Top10perc <dbl> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38, 44, 23…
## $ Top25perc <dbl> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64, 73, 46…
## $ F.Undergrad <dbl> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 799, 1830…
## $ P.Undergrad <dbl> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110, 44, 638…
## $ Outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1559…
## $ Room.Board <dbl> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 4400, 3380…
## $ Books <dbl> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, 500, 400…
## $ Personal <dbl> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, 1800, 60…
## $ PhD <dbl> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60, 79, 36…
## $ Terminal <dbl> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 84, 87, 6…
## $ S.F.Ratio <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3, 11.5, …
## $ perc.alumni <dbl> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21, 32, 26,…
## $ Expend <dbl> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487, 11644,…
## $ Grad.Rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 55…
set.seed(1)
# Sample half of the row indices for the training set
train = sample(nrow(d2), nrow(d2) %/% 2)
test = -train
# Split data into test and train
college_train = d2[train, ]
college_test = d2[test,]
# Forward stepwise selection with regsubsets
reg_model = regsubsets(Outstate ~ ., data= college_train, nvmax= 17, method= 'forward')
reg_summary = summary(reg_model)
reg_summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = college_train, nvmax = 17,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " "*" " " " "
## 9 ( 1 ) "*" " " " " " " "*" " " " "
## 10 ( 1 ) "*" " " " " " " "*" " " " "
## 11 ( 1 ) "*" " " "*" " " "*" " " " "
## 12 ( 1 ) "*" "*" "*" " " "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " " " "
## 6 ( 1 ) " " "*" " " " " " " "*" " "
## 7 ( 1 ) " " "*" " " "*" " " "*" " "
## 8 ( 1 ) " " "*" " " "*" " " "*" " "
## 9 ( 1 ) " " "*" " " "*" " " "*" "*"
## 10 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 13 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 14 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
# Plot Cp, BIC, and adjusted R^2 against model size; dashed lines mark a
# band of 0.2 standard deviations around each criterion's optimum
par(mfrow = c(1, 3))
plot(reg_summary$cp, xlab= 'Number of Variables', ylab= "Mallows' Cp", type= 'l')
min_cp = min(reg_summary$cp)
std_cp = sd(reg_summary$cp)
abline(h= min_cp + 0.2*std_cp, col= 'red', lty= 2)
abline(h= min_cp - 0.2*std_cp, col= 'red', lty= 2)
plot(reg_summary$bic, xlab= 'Number of Variables', ylab= "BIC", type= 'l')
min_bic = min(reg_summary$bic)
std_bic = sd(reg_summary$bic)
abline(h= min_bic + 0.2*std_bic, col= 'red', lty= 2)
abline(h= min_bic - 0.2*std_bic, col= 'red', lty= 2)
plot(reg_summary$adjr2, xlab= 'Number of Variables', ylab= "Adjusted R^2", type= 'l')
max_adjr2 = max(reg_summary$adjr2)
std_adjr2 = sd(reg_summary$adjr2)
abline(h= max_adjr2 + 0.2*std_adjr2, col= 'red', lty= 2)
abline(h= max_adjr2 - 0.2*std_adjr2, col= 'red', lty= 2)
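As a numeric check on the plots, the model size favored by each criterion can be computed directly; Cp and BIC are minimized while adjusted R^2 is maximized (values not printed here):
which.min(reg_summary$cp)
which.min(reg_summary$bic)
which.max(reg_summary$adjr2)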
# Coefficients of the 6-variable model, where the criteria begin to level off
coeffs = coef(reg_model, id= 6)
names(coeffs)
## [1] "(Intercept)" "PrivateYes" "Room.Board" "Terminal" "perc.alumni"
## [6] "Expend" "Grad.Rate"
Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
Note that the GAM below uses PhD in place of the selected Terminal variable; both are strongly correlated measures of faculty credentials. Expend shows a strong non-linear effect, indicating that institutions with higher per-student instructional expenditures charge significantly higher out-of-state tuition. From the Room.Board plot I see that as room and board costs increase, out-of-state tuition tends to increase. The PhD plot shows a relatively flat curve, suggesting little association. The perc.alumni plot shows a positive, increasing pattern, which could suggest that schools with more generous alumni tend to charge higher tuition, possibly due to prestige. From the Grad.Rate plot I see a positive relationship, indicating that higher graduation rates might be associated with higher tuition, possibly due to better student support or prestige.
# GAM with smooth terms; k sets the basis dimension (flexibility cap) of each smooth
gam_model = gam(Outstate ~ Private + s(Room.Board, k= 3) + s(PhD, k= 3) + s(perc.alumni, k= 3) + s(Expend, k= 6) + s(Grad.Rate, k= 3), data= college_train)
# Plot
par(mfrow = c(2, 3))
plot(gam_model, se= T, col= 'lightblue')
Evaluate the model obtained on the test set, and explain the results obtained.
The linear model resulted in a test RMSE of 1960.8, while the GAM resulted in a test RMSE of 1847.2. Thus, the GAM outperformed the linear model, reducing test error by approximately $114 and demonstrating the value of modeling non-linear effects when predicting out-of-state tuition.
set.seed(1)
lm_fit = lm(Outstate ~ Private + Room.Board + Terminal + perc.alumni + Expend + Grad.Rate, data= college_train)
# Predictions
lm_pred = predict(lm_fit, college_test)
gam_pred = predict(gam_model, college_test)
# Test Error RMSE
sqrt(mean((college_test$Outstate - lm_pred)^2))
## [1] 1960.831
sqrt(mean((college_test$Outstate - gam_pred)^2))
## [1] 1847.212
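The same comparison can be framed as test-set R^2, the fraction of tuition variance each model explains on the held-out colleges (a quick sketch; values not shown above):
# Test-set R^2 = 1 - RSS/TSS on the held-out data
tss = mean((college_test$Outstate - mean(college_test$Outstate))^2)
1 - mean((college_test$Outstate - lm_pred)^2) / tss
1 - mean((college_test$Outstate - gam_pred)^2) / tss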
summary(gam_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Outstate ~ Private + s(Room.Board, k = 3) + s(PhD, k = 3) + s(perc.alumni,
## k = 3) + s(Expend, k = 6) + s(Grad.Rate, k = 3)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8740.0 225.6 38.737 < 2e-16 ***
## PrivateYes 2408.1 280.9 8.571 2.67e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Room.Board) 1.416 1.658 38.604 < 2e-16 ***
## s(PhD) 1.000 1.000 1.837 0.176
## s(perc.alumni) 1.000 1.000 21.085 6.43e-06 ***
## s(Expend) 4.829 4.985 32.819 < 2e-16 ***
## s(Grad.Rate) 1.000 1.000 24.237 1.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.796 Deviance explained = 80.1%
## GCV = 3.7993e+06 Scale est. = 3.6892e+06 n = 388
For which variables, if any, is there evidence of a non-linear relationship with the response?
The Expend variable shows the strongest evidence of a non-linear effect (edf of about 4.8 in the summary above). At lower expenditure levels, tuition increases with spending, suggesting a positive relationship. Mid-range values show a flattening or slight dip, implying that increased spending does not always translate into higher tuition. At higher expenditure levels, tuition rises again, suggesting a return to a positive association. Room.Board shows mild curvature (edf of about 1.4), while the edf values of 1.0 for PhD, perc.alumni, and Grad.Rate indicate effectively linear relationships with the response.
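The effective degrees of freedom quantify this directly; an edf near 1 means the smooth has collapsed to an essentially linear term (these are the same edf values reported in the summary above):
# edf of each smooth term; values near 1 indicate an effectively linear fit
summary(gam_model)$s.table[, "edf"]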