Problem 6

  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
ctrl <- trainControl(method = "cv", number = 6)

CV_RMSE <- c()

set.seed(151)

for (i in 1:6) {
  model_temp <- train(y = Wage$wage,
                      x = poly(Wage$age, i, raw = T, simple = T),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i] <- model_temp$results$RMSE
}


data.frame(degree = 1:6, CV_RMSE = CV_RMSE) %>%
  mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>%
  ggplot(aes(x = degree, y = CV_RMSE)) +
  geom_line(col = "blue") +
  geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
  scale_x_continuous(breaks = seq(1, 6), minor_breaks = NULL) +
  scale_color_manual(values = c("darkgreen", "red")) +
  theme(legend.position = "none") +
  labs(title = "Wage Dataset - Polynomial Regression",
       subtitle = "Selecting the 'age' polynomial degree with cross-validation",
       x = "Degree",
       y = "CV RMSE")

fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
fit6 <- lm(wage ~ poly(age, 6), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5, fit6)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.6636 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8936  0.001675 ** 
## 4   2995 4771604  1      6070   3.8117  0.050989 .  
## 5   2994 4770322  1      1283   0.8054  0.369565    
## 6   2993 4766389  1      3932   2.4692  0.116201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ poly(x, 3, raw = T)") + 
  labs(title = "Wage Dataset - Polynomial Regression",
       subtitle = "Predicting 'wage' with a cubic polynomial of 'age'")

Polynomial regression with cross-validation shows that a degree three polynomial is best. This is supported as well by the ANOVA table. The ANOVA table says that a fourth degree is close to being statistically significant, however this would make an overly complicated model so I chose three. The graph appears to work well.

  1. Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.
CV_RMSE <- c()

set.seed(150)

for (i in 2:10) {
  model_temp <- train(y = Wage$wage,
                      x = data.frame(cut(Wage$age, i)),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i-1] <- model_temp$results$RMSE
}


data.frame(cuts = 2:10, CV_RMSE = CV_RMSE) %>%
  mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>%
  ggplot(aes(x = cuts, y = CV_RMSE)) +
  geom_line(col = "blue") +
  geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
  scale_x_continuous(breaks = seq(2, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("darkgreen", "red")) +
  theme(legend.position = "none") +
  labs(title = "Wage Dataset - Step Function",
       subtitle = "Selecting number of 'age' cut-points with cross-validation",
       x = "Intervals",
       y = "CV RMSE")

ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ cut(x, 8)") + 
  labs(title = "Wage Dataset - Step Function",
       subtitle = "Predicting 'wage' with a 8-interval step function of 'age'")

Problem 10

  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
set.seed(152)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))

train <- College[train_index, ]
nrow(train) / nrow(College)
## [1] 0.7001287
test <- College[-train_index, ]
nrow(test) / nrow(College)
## [1] 0.2998713

70% of the data will be used for training with about 30% left over for testing.

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 1, 
                     selectionFunction = "oneSE")

set.seed(153)

model_forward <- train(Outstate ~ .,
                       data = train,
                       method = "leapForward",
                       metric = "MSE",
                       maximize = F,
                       trControl = ctrl,
                       tuneGrid = data.frame(nvmax = 1:17))

model_forward
## Linear Regression with Forward Selection 
## 
## 544 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 490, 490, 491, 489, 489, 491, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    1     2984.019  0.5122385  2361.966
##    2     2548.220  0.6343888  1914.578
##    3     2339.352  0.6870853  1773.240
##    4     2243.736  0.7112909  1716.763
##    5     2136.765  0.7360749  1645.983
##    6     2109.361  0.7399644  1637.262
##    7     2118.858  0.7371518  1647.370
##    8     2146.511  0.7307774  1667.069
##    9     2174.846  0.7227112  1665.939
##   10     2174.785  0.7238102  1660.478
##   11     2165.704  0.7264317  1651.141
##   12     2129.936  0.7354217  1635.032
##   13     2129.434  0.7354281  1635.305
##   14     2117.606  0.7383181  1623.533
##   15     2102.094  0.7420614  1611.258
##   16     2101.862  0.7421832  1611.050
##   17     2106.312  0.7412256  1614.380
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 5.

The five variable model is best. The model increases the Rsquared value significantly until the fifth variable. Following that the increase in the amount of variation explained is nominal and not worth increasing the complexity of the model.

  1. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
model_gam_1 <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Expend), data = train)

par(mfrow = c(2, 3))
plot(model_gam_1, se = T, col = "blue")

The perc.alumni (Percent of alumni who donate) appears to be the only variable with a linear relationship.

  1. Evaluate the model obtained on the test set, and explain the results obtained.
test_TSS <- sum((test$Outstate - mean(test$Outstate))^2)
test_RSS <- sum((predict(model_gam_1, newdata = test) - test$Outstate)^2)

1 - test_RSS/test_TSS
## [1] 0.7558089

Evaluating the model with the test set we get 75.58% accuracy.

  1. For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(model_gam_1)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + 
##     s(Expend), data = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -8733.25 -1125.38    83.78  1217.77  6984.26 
## 
## (Dispersion Parameter for gaussian family taken to be 3746260)
## 
##     Null Deviance: 9164917055 on 543 degrees of freedom
## Residual Deviance: 1970530823 on 525.9994 degrees of freedom
## AIC: 9797.631 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 2427606078 2427606078  648.01 < 2.2e-16 ***
## s(Room.Board)    1 1630907677 1630907677  435.34 < 2.2e-16 ***
## s(PhD)           1  680367251  680367251  181.61 < 2.2e-16 ***
## s(perc.alumni)   1  445688176  445688176  118.97 < 2.2e-16 ***
## s(Expend)        1  856412415  856412415  228.60 < 2.2e-16 ***
## Residuals      526 1970530823    3746260                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df Npar F   Pr(F)    
## (Intercept)                              
## Private                                  
## s(Room.Board)        3  2.721 0.04383 *  
## s(PhD)               3  1.870 0.13357    
## s(perc.alumni)       3  0.587 0.62386    
## s(Expend)            3 32.772 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Room and Board costs as well as Instructional expenditure per student have non-linear relationships with Out-of-State tuition.