In this exercise, we will further analyze the Wage data set considered throughout this chapter.
(a) Perform polynomial regression to predict
wage using age. Use cross-validation to select
the optimal degree \(d\) for the
polynomial. What degree was chosen, and how does this compare to the
results of hypothesis testing using ANOVA? Make a plot of the resulting
polynomial fit to the data.
library(ISLR)   # Wage and College data sets (ISLR2 also works)
library(boot)   # cv.glm() for k-fold cross-validation
Wage <- Wage    # local copy of the data set
set.seed(13)
cv.errors <- rep(NA, 5)
for (i in 1:5) {
  fit1 <- glm(wage ~ poly(age, i), data = Wage)
  cv.errors[i] <- cv.glm(Wage, fit1, K = 10)$delta[1]
}
cv.errors
## [1] 1676.237 1601.415 1596.904 1593.841 1593.962
plot(1:5, cv.errors, xlab = 'Degree', ylab = 'Test MSE', type = 'b')
best_d <- which.min(cv.errors)
points(best_d, cv.errors[best_d], col = 'blue', cex = 2, pch = 20)
Cross-validation over degrees 1 to 5 selects degree 4 as the optimal polynomial. This is partially consistent with the hypothesis tests from ANOVA: as shown below, the cubic term is the last one that is statistically significant at the 5% level, while the quartic term falls just short of the significance threshold with a p-value of 0.051.
poly1 <- lm(wage ~ age, data = Wage)
poly2 <- lm(wage ~ poly(age, 2), data = Wage)
poly3 <- lm(wage ~ poly(age, 3), data = Wage)
poly4 <- lm(wage ~ poly(age, 4), data = Wage)
poly5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(poly1, poly2, poly3, poly4, poly5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following plot shows the degree-4 polynomial fit of wage on age. The grey points are the observed data, the solid blue line is the fitted polynomial curve, and the dashed lines mark an approximate 95% confidence interval (fitted values plus or minus two standard errors).
fit2 <- lm(wage ~ poly(age, 4), data = Wage)
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
preds <- predict(fit2, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
plot(wage ~ age, data = Wage, col = "darkgrey")
title("Degree-4 Polynomial")
lines(age.grid, preds$fit, col = "blue", lwd = 2)
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
set.seed(13)
cv.errors <- rep(NA, 10)
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  fit3 <- glm(wage ~ age.cut, data = Wage)
  cv.errors[i] <- cv.glm(Wage, fit3, K = 10)$delta[1]
}
cv.errors
## [1] NA 1733.475 1683.584 1637.555 1630.754 1622.164 1611.313 1601.867
## [9] 1610.493 1606.294
plot(2:10, cv.errors[-1], type = "b", xlab = "Number of Cuts", ylab = "Test MSE")
best_num_cut <- which.min(cv.errors)
points(best_num_cut, cv.errors[best_num_cut], col = "blue", pch = 20, cex = 2)
Cross-validation selects 8 as the optimal number of cuts, so we fit the step function accordingly. The plot below shows the predicted wage as a function of age with 8 cuts.
fit4 <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit4, data.frame(age = age.grid))
plot(wage ~ age, data = Wage, col = "darkgray")
lines(age.grid, preds, col = "blue", lwd = 2)
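As an illustrative aside, we can also inspect the eight age intervals that cut() creates and how many observations fall into each bin:
# Illustrative: the eight equal-width age bins underlying the step function
table(cut(Wage$age, 8))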
This question relates to the College data set.
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(leaps)   # regsubsets() for subset selection
set.seed(13)
College <- College   # local copy of the data set
# Split into training and test sets (70% / 30%)
index <- sample(nrow(College), 0.7 * nrow(College), replace = FALSE)
train <- College[index, ]
test <- College[-index, ]
# Subset selection with regsubsets(); the default search is exhaustive
# (best subset), so forward stepwise would require method = "forward"
fit.full <- regsubsets(Outstate ~ ., data = train, nvmax = 13)
full.summary <- summary(fit.full)
full.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, nvmax = 13)
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" "*" "*" " " " " " " " "
## 9 ( 1 ) "*" "*" "*" " " "*" " " " "
## 10 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " " " "
## 6 ( 1 ) " " "*" " " " " " " "*" " "
## 7 ( 1 ) " " "*" " " "*" " " "*" " "
## 8 ( 1 ) " " "*" " " " " " " "*" " "
## 9 ( 1 ) " " "*" " " " " " " "*" " "
## 10 ( 1 ) " " "*" " " " " " " "*" " "
## 11 ( 1 ) " " "*" " " "*" " " "*" " "
## 12 ( 1 ) " " "*" " " "*" " " "*" "*"
## 13 ( 1 ) " " "*" " " "*" " " "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
# Plot performance statistics to choose the best subset size
par(mfrow = c(2, 2))
plot(full.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(full.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
p <- which.max(full.summary$adjr2)
points(p, full.summary$adjr2[p], col = "blue", cex = 2, pch = 20)
plot(full.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
p <- which.min(full.summary$cp)
points(p, full.summary$cp[p], col = "blue", cex = 2, pch = 20)
plot(full.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
p <- which.min(full.summary$bic)
points(p, full.summary$bic[p], col = "blue", cex = 2, pch = 20)
According to the BIC criterion the best subset contains 10 predictors, while Cp and the adjusted R-squared point to 12. However, the plots also show that a simpler model would be sufficient, since the rate of improvement drops off sharply after six variables. For this reason, we use the six predictors of the six-variable model from the summary above: Private, Room.Board, Terminal, perc.alumni, Expend, and Grad.Rate.
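As a quick sanity check, the coefficient estimates of this six-variable model can be pulled directly from the regsubsets object with coef():
# Coefficients of the six-variable model chosen above
coef(fit.full, id = 6)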
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)   # gam() and s() smoothing terms
fit6 <- gam(Outstate ~ Private + s(Room.Board, 4) + s(Terminal, 4) +
              s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), data = train)
par(mfrow = c(2, 3))
plot(fit6, se = TRUE, col = "blue")
These plots show the estimated relationships between the six predictors and out-of-state tuition. The blue lines are the fitted smooth functions, and the dashed lines mark the upper and lower bounds of the 95% confidence intervals. The relationship between Outstate and Expend is clearly non-linear, and the fit is highly uncertain where expenditure per student (Expend) is large. Similarly, the confidence interval for Terminal is wide where the percentage of faculty with terminal degrees is low.
(c) Evaluate the model obtained on the test set, and explain the results obtained.
# Test-set R^2: proportion of variance in Outstate explained by the GAM
preds <- predict(fit6, newdata = test)
TSS <- sum((test$Outstate - mean(test$Outstate))^2)
RSS <- sum((test$Outstate - preds)^2)
r_Sq <- 1 - (RSS / TSS)
r_Sq
## [1] 0.7884569
The test-set R-squared indicates that the model explains about 78.8% of the variance in out-of-state tuition using the six selected predictors.
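For a complementary view of test-set accuracy, the root-mean-squared prediction error can be computed from the same predictions; it is expressed in dollars of out-of-state tuition:
# Test-set RMSE of the GAM predictions (in dollars)
sqrt(mean((test$Outstate - preds)^2))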
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(fit6)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 4) + s(Terminal,
## 4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4),
## data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7200.04 -1092.48 33.39 1263.13 7902.03
##
## (Dispersion Parameter for gaussian family taken to be 3553109)
##
## Null Deviance: 8877747960 on 542 degrees of freedom
## Residual Deviance: 1851168109 on 520.9995 degrees of freedom
## AIC: 9754.76
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2334251519 2334251519 656.960 < 2.2e-16 ***
## s(Room.Board, 4) 1 1959113460 1959113460 551.380 < 2.2e-16 ***
## s(Terminal, 4) 1 487796256 487796256 137.287 < 2.2e-16 ***
## s(perc.alumni, 4) 1 370965339 370965339 104.406 < 2.2e-16 ***
## s(Expend, 4) 1 771425299 771425299 217.113 < 2.2e-16 ***
## s(Grad.Rate, 4) 1 122579247 122579247 34.499 7.607e-09 ***
## Residuals 521 1851168109 3553109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 4) 3 1.8527 0.1366
## s(Terminal, 4) 3 1.3560 0.2555
## s(perc.alumni, 4) 3 0.9154 0.4332
## s(Expend, 4) 3 29.4063 <2e-16 ***
## s(Grad.Rate, 4) 3 1.1201 0.3404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Examining the p-values under "Anova for Nonparametric Effects", only Expend shows a statistically significant non-linear effect. This agrees with the visual assessment in part (b), where Expend was the only predictor whose plot suggested a clearly non-linear relationship with Outstate.
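One possible follow-up, sketched here as an extra check, is to refit the GAM with a linear term for Expend and compare it to the smooth fit with an F test; a small p-value again points to a non-linear effect of Expend.
# Same GAM but with Expend entering linearly, compared to the smooth fit
fit.lin <- gam(Outstate ~ Private + s(Room.Board, 4) + s(Terminal, 4) +
                 s(perc.alumni, 4) + Expend + s(Grad.Rate, 4), data = train)
anova(fit.lin, fit6, test = "F")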