Chapter 07 (page 297): 6, 10

6. In this exercise, you will further analyze the “Wage” data set considered throughout this chapter.

a. Perform polynomial regression to predict “wage” using “age”. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA ? Make a plot of the resulting polynomial fit to the data.

We will perform K-fold cross-validation with \(K=10\).

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.0.5

library(boot)
library("gam")

## Warning: package 'gam' was built under R version 4.0.5

## Loading required package: splines

## Loading required package: foreach

## Warning: package 'foreach' was built under R version 4.0.5

## Loaded gam 1.20

set.seed(1)
deltas <- rep(NA, 10)
for (i in 1:10) {
    fit <- glm(wage ~ poly(age, i), data = Wage)
    deltas[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(1:10, deltas, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min <- which.min(deltas)
points(which.min(deltas), deltas[which.min(deltas)], col = "red", cex = 2, pch = 20)

We may see that \(d=4\) is the optimal degree for the polynomial. We now use ANOVA to test the null hypothesis that a model \(M1\) is sufficient to explain the data against the alternative hypothesis that a more complex \(M2\) is required.

fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)

## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We may see, by examining the p-values, that either a cubic or quartic polynomial appear to provide a reasonable fit to the data, but lower or higher order models are not justified. It remains to plot the resulting polynomial fit to the data.

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

b. Fit a step function to predict “wage” using “age”, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

We will perform K-fold cross-validation with K=10.

cvs <- rep(NA, 10)
for (i in 2:10) {
    Wage$age.cut <- cut(Wage$age, i)
    fit <- glm(wage ~ age.cut, data = Wage)
    cvs[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(2:10, cvs[-1], xlab = "Cuts", ylab = "Test MSE", type = "l")
d.min <- which.min(cvs)
points(which.min(cvs), cvs[which.min(cvs)], col = "red", cex = 2, pch = 20)

We may see that the error is minimum for 8 cuts. Now, we fit the entire data with a step function using 8 cuts and plot it.

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, data.frame(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

10. This question relates to the “College” data set.

a. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

library(leaps)

## Warning: package 'leaps' was built under R version 4.0.5

set.seed(1)
attach(College)
train <- sample(length(Outstate), length(Outstate) / 2)
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]
fit <- regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
fit.summary <- summary(fit)
par(mfrow = c(1, 3))
plot(fit.summary$cp, xlab = "Number of variables", ylab = "Cp", type = "l")
min.cp <- min(fit.summary$cp)
std.cp <- sd(fit.summary$cp)
abline(h = min.cp + 0.2 * std.cp, col = "red", lty = 2)
abline(h = min.cp - 0.2 * std.cp, col = "red", lty = 2)
plot(fit.summary$bic, xlab = "Number of variables", ylab = "BIC", type='l')
min.bic <- min(fit.summary$bic)
std.bic <- sd(fit.summary$bic)
abline(h = min.bic + 0.2 * std.bic, col = "red", lty = 2)
abline(h = min.bic - 0.2 * std.bic, col = "red", lty = 2)
plot(fit.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R2", type = "l", ylim = c(0.4, 0.84))
max.adjr2 <- max(fit.summary$adjr2)
std.adjr2 <- sd(fit.summary$adjr2)
abline(h = max.adjr2 + 0.2 * std.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "red", lty = 2)

Cp, BIC and adjr2 show that size 6 is the minimum size for the subset for which the scores are within 0.2 standard devitations of optimum.

fit <- regsubsets(Outstate ~ ., data = College, method = "forward")
coeffs <- coef(fit, id = 6)
names(coeffs)

## [1] "(Intercept)" "PrivateYes"  "Room.Board"  "PhD"         "perc.alumni"
## [6] "Expend"      "Grad.Rate"

b. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2), data=College.train)
par(mfrow = c(2, 3))
plot(fit, se = T, col = "blue")

c. Evaluate the model obtained on the test set, and explain the results obtained.

preds <- predict(fit, College.test)
err <- mean((College.test$Outstate - preds)^2)
err

## [1] 3349290

tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2)
rss <- 1 - err / tss
rss

## [1] 0.7660016

We obtain a test R^2 of 0.77 using GAM with 6 predictors.

d. For which variables, if any, is there evidence of a non-linear relationship with the response ?

summary(fit)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, 
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, 
##     df = 2), data = College.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7402.89 -1114.45   -12.67  1282.69  7470.60 
## 
## (Dispersion Parameter for gaussian family taken to be 3711182)
## 
##     Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1384271126 on 373 degrees of freedom
## AIC: 6987.021 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private                  1 1778718277 1778718277 479.286 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1577115244 1577115244 424.963 < 2.2e-16 ***
## s(PhD, df = 2)           1  322431195  322431195  86.881 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  336869281  336869281  90.771 < 2.2e-16 ***
## s(Expend, df = 5)        1  530538753  530538753 142.957 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1   86504998   86504998  23.309 2.016e-06 ***
## Residuals              373 1384271126    3711182                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F     Pr(F)    
## (Intercept)                                         
## Private                                             
## s(Room.Board, df = 2)        1  1.9157    0.1672    
## s(PhD, df = 2)               1  0.9699    0.3253    
## s(perc.alumni, df = 2)       1  0.1859    0.6666    
## s(Expend, df = 5)            4 20.5075 2.665e-15 ***
## s(Grad.Rate, df = 2)         1  0.5702    0.4506    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA shows a strong evidence of non-linear relationship between “Outstate” and “Expend”“, and a moderately strong non-linear relationship (using p-value of 0.05) between”Outstate" and “Grad.Rate”" or “PhD”.

Chapter 07 (page 297): 6, 10

Dom Rodriguez

7/26/2021

6. In this exercise, you will further analyze the “Wage” data set considered throughout this chapter.

We will perform K-fold cross-validation with \(K=10\).

We may see that \(d=4\) is the optimal degree for the polynomial. We now use ANOVA to test the null hypothesis that a model \(M1\) is sufficient to explain the data against the alternative hypothesis that a more complex \(M2\) is required.

We may see, by examining the p-values, that either a cubic or quartic polynomial appear to provide a reasonable fit to the data, but lower or higher order models are not justified. It remains to plot the resulting polynomial fit to the data.

b. Fit a step function to predict “wage” using “age”, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

We will perform K-fold cross-validation with K=10.

We may see that the error is minimum for 8 cuts. Now, we fit the entire data with a step function using 8 cuts and plot it.

10. This question relates to the “College” data set.

a. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

Cp, BIC and adjr2 show that size 6 is the minimum size for the subset for which the scores are within 0.2 standard devitations of optimum.

b. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

c. Evaluate the model obtained on the test set, and explain the results obtained.

We obtain a test R^2 of 0.77 using GAM with 6 predictors.

d. For which variables, if any, is there evidence of a non-linear relationship with the response ?

ANOVA shows a strong evidence of non-linear relationship between “Outstate” and “Expend”“, and a moderately strong non-linear relationship (using p-value of 0.05) between”Outstate" and “Grad.Rate”" or “PhD”.