R Markdown

Exercise 6

  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR)
library(boot)
set.seed(1)
degree <- 10
cv.errs <- rep(NA, degree)
for (i in 1:degree) {
  fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}
plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

The test MSE is minimized at degree 9, but the MSE at degree 4 is already nearly as small. This agrees with the ANOVA comparison, which suggests that a cubic or quartic polynomial is sufficient; the plot below shows the cubic fit.
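
A minimal sketch of that ANOVA comparison (the original chunk is not shown above), using the standard sequence of nested polynomial fits:

fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)  # sequential F-tests for each added degree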

plot(wage ~ age, data = Wage, col = "darkgrey")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

  2. Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
cv.errs <- rep(NA, degree)
for (i in 2:degree) {
  Wage$age.cut <- cut(Wage$age, i)
  fit <- glm(wage ~ age.cut, data = Wage)
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

Eight cuts produce the minimum test MSE.

plot(wage ~ age, data = Wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

res <- cut(c(1,5,2,3,8), 2)
res
## [1] (0.993,4.5] (4.5,8.01]  (0.993,4.5] (0.993,4.5] (4.5,8.01] 
## Levels: (0.993,4.5] (4.5,8.01]
length(res)
## [1] 5
class(res[1])
## [1] "factor"

cut(x, k) turns a continuous quantitative variable into a discrete qualitative variable by dividing the range of x evenly into k intervals; each interval becomes a factor level. The output is a factor of the same length as x, where each element gives the level (interval) into which the corresponding element of x falls.
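
As a further illustration on the data used above, the 8-level factor behind the step-function fit can be tabulated to see how many observations fall into each interval (a minimal sketch):

table(cut(Wage$age, 8))  # counts of observations per equal-width age interval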

Exercise 10

  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(ISLR)
set.seed(12345)
train <- sample(nrow(College), nrow(College) * 0.7)  # randomly select 70% of the rows for training
train_set <- College[train, ]
test_set <- College[-train, ]
# Plot a selection metric against subset size and mark (green line) the smallest
# model whose metric is within sd(metric)/sqrt(length(metric)) of the best value
# (red line). Use reverse = TRUE for metrics that are maximized, e.g. adjusted R^2.
plot_metric <- function(metric, yaxis_label, reverse = FALSE) {
  plot(metric, xlab = "Number of Variables", ylab = yaxis_label, xaxt = "n", type = "l")
  axis(side = 1, at = 1:length(metric))
  
  if (reverse) {
    # metric is maximized: threshold sits just below the maximum
    metric_1se <- max(metric) - (sd(metric) / sqrt(length(metric)))
    min_subset <- which(metric > metric_1se)
  } else {
    # metric is minimized: threshold sits just above the minimum
    metric_1se <- min(metric) + (sd(metric) / sqrt(length(metric)))
    min_subset <- which(metric < metric_1se)
  }
  
  abline(h = metric_1se, col = "red", lty = 2)
  abline(v = min_subset[1], col = "green", lty = 2)
}

par(mfrow=c(1, 3))
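
The chunk that actually runs forward stepwise selection and fills the three panels set up above is not shown; a minimal sketch, assuming the leaps package and the plot_metric() helper defined earlier:

library(leaps)
# Forward stepwise selection over all 17 predictors of Outstate
fwd.fit <- regsubsets(Outstate ~ ., data = train_set, nvmax = ncol(College) - 1,
                      method = "forward")
fwd.summary <- summary(fwd.fit)
plot_metric(fwd.summary$cp, "Cp")
plot_metric(fwd.summary$bic, "BIC")
plot_metric(fwd.summary$adjr2, "Adjusted R-squared", reverse = TRUE)
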
  2. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
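
The GAM fit itself is not shown above; a minimal sketch, assuming the gam package and a six-predictor subset (Private, Room.Board, PhD, perc.alumni, Expend, Grad.Rate -- an assumed subset chosen to match the six predictors reported below, not taken from the original output):

library(gam)
# Smoothing splines for the quantitative predictors; Private enters as a linear term
gam.fit <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) +
                 s(Expend) + s(Grad.Rate), data = train_set)
par(mfrow = c(2, 3))
plot(gam.fit, se = TRUE, col = "blue")
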
calc_mse <- function(y, y_hat) {
  return(mean((y - y_hat)^2))
}

calc_rmse <- function(y, y_hat) {
  return(sqrt(calc_mse(y, y_hat)))
}

calc_r2 <- function(y, y_hat) {
  y_bar <- mean(y)
  rss <- sum((y - y_hat)^2)
  tss <- sum((y - y_bar)^2)
  return(1 - (rss / tss))
}
  3. Evaluate the model obtained on the test set, and explain the results obtained.
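A sketch of how that evaluation would be computed with the helper functions above, assuming the gam.fit object from the previous sketch:

preds <- predict(gam.fit, newdata = test_set)
calc_rmse(test_set$Outstate, preds)  # test RMSE
calc_r2(test_set$Outstate, preds)    # test R-squared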

The GAM with 6 predictors achieves a test RMSE of 1984.38 and a test R-squared of 0.761.

  4. For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

The ANOVA for non-parametric effects shows strong evidence of a non-linear relationship between the response and Expend and PhD, and moderately strong evidence of a non-linear relationship between the response and Room.Board and Grad.Rate.
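
That evidence is read from the "Anova for Nonparametric Effects" table printed by summary() on the fitted GAM (again assuming the gam.fit sketch above):

summary(gam.fit)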