Exercise 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

**(A) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree *d* for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.**

Polynomial Regression

data("Wage")
set.seed(532)
poly.mse=c()

for(degree in 1:7){
  poly.fit=glm(wage~poly(age,degree,raw=TRUE),data=Wage)
  mse=cv.glm(poly.fit,data = Wage,K=10)$delta[1]
  poly.mse=c(poly.mse,mse)
}
plot(poly.mse,main='Wage Datatset: Polynomial Regression With Cross-Validation',xlab='Degree of Polynomial',ylab='Cross Validation Error',type='b')
x=which.min(poly.mse)
points(x,poly.mse[x],pch=20,cex=2,col='red')

# Nested polynomial fits of degree 1 through 7, compared with sequential
# ANOVA F-tests
anova.fit1=lm(wage~poly(age,1,raw=TRUE),data=Wage)
anova.fit2=lm(wage~poly(age,2,raw=TRUE),data=Wage)
anova.fit3=lm(wage~poly(age,3,raw=TRUE),data=Wage)
anova.fit4=lm(wage~poly(age,4,raw=TRUE),data=Wage)
anova.fit5=lm(wage~poly(age,5,raw=TRUE),data=Wage)
anova.fit6=lm(wage~poly(age,6,raw=TRUE),data=Wage)
anova.fit7=lm(wage~poly(age,7,raw=TRUE),data=Wage)
anova(anova.fit1,anova.fit2,anova.fit3,anova.fit4,anova.fit5,anova.fit6,anova.fit7)
## Analysis of Variance Table
## 
## Model 1: wage ~ poly(age, 1, raw = TRUE)
## Model 2: wage ~ poly(age, 2, raw = TRUE)
## Model 3: wage ~ poly(age, 3, raw = TRUE)
## Model 4: wage ~ poly(age, 4, raw = TRUE)
## Model 5: wage ~ poly(age, 5, raw = TRUE)
## Model 6: wage ~ poly(age, 6, raw = TRUE)
## Model 7: wage ~ poly(age, 7, raw = TRUE)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.6926 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8956  0.001673 ** 
## 4   2995 4771604  1      6070   3.8125  0.050966 .  
## 5   2994 4770322  1      1283   0.8055  0.369516    
## 6   2993 4766389  1      3932   2.4697  0.116165    
## 7   2992 4763834  1      2555   1.6049  0.205311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Wage Dataset: Polynomial Regression With Cross-Validation plot identifies the sixth-degree polynomial, highlighted in red, as having the lowest cross-validation error. This selection does not fully concur with the analysis-of-variance results, which indicate that either a cubic or a quartic polynomial provides a reasonable fit to the data, while lower-order and higher-order models are not justified.
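As a quick programmatic cross-check (a sketch assuming the fits above are still in the workspace), the sequential F-test p-values can be pulled directly from the ANOVA table:

# Extract the sequential F-test p-values; the first row (degree 1) has no test
anova.table <- anova(anova.fit1, anova.fit2, anova.fit3, anova.fit4,
                     anova.fit5, anova.fit6, anova.fit7)
data.frame(degree = 2:7, p.value = round(anova.table$`Pr(>F)`[-1], 4))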

library(ggplot2)

# Scatterplot of the data with the CV-selected degree-6 polynomial fit overlaid
ggplot(Wage, aes(age,wage)) + 
  geom_point(color="green") + 
  stat_smooth(method = "lm", formula = y ~ poly(x, 6), size = 1)
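As a usage sketch, the selected fit can also produce point predictions with confidence intervals; fit6 below is a hypothetical name for a refit of the CV-selected degree:

# Refit the degree-6 polynomial and predict wage at a few representative ages
fit6 <- lm(wage ~ poly(age, 6), data = Wage)
predict(fit6, newdata = data.frame(age = c(25, 40, 60)), interval = "confidence")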

**(B) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.**

library(caret)   # train() and trainControl() for cross-validation
library(dplyr)   # %>% pipe used below

set.seed(532)

ctrl <- trainControl(method = "cv", number = 10)
CV_RMSE <- c()

# 10-fold CV RMSE for step functions with 2 through 20 intervals of age
for (i in 2:20) {
  model_temp <- train(y = Wage$wage,
                      x = data.frame(cut(Wage$age, i)),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i-1] <- model_temp$results$RMSE
}


data.frame(cuts = 2:20, CV_RMSE = CV_RMSE) %>%
  mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>%
  ggplot(aes(x = cuts, y = CV_RMSE)) +
  geom_line(col = "grey55") +
  geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
  scale_x_continuous(breaks = seq(2, 20), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "green")) +
  theme(legend.position = "none") +
  labs(title = "Wage Dataset: Step Function",
       x = "Intervals",
       y = "CV RMSE")

To fit the step function, age was cut into between 2 and 20 intervals, and ten-fold cross-validation was used to select the best number of cuts for the model. The optimal number of intervals is 8, so wage is now modeled with an eight-interval step function of age.

# Final model: wage as an eight-interval step function of age
ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = y ~ cut(x, 8)) + 
  labs(title = "Wage Dataset: Step Function (Eight Intervals)")
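For reference, a minimal sketch (step.fit is a hypothetical name) showing the interval boundaries cut() creates and the fitted step heights:

# Observations per interval, and coefficients of the eight-interval step fit;
# the intercept is the mean wage in the first interval, the rest are offsets
table(cut(Wage$age, 8))
step.fit <- lm(wage ~ cut(age, 8), data = Wage)
coef(step.fit)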

Exercise 10

This question relates to the College data set.

**(A) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.**

library(rsample)   # initial_split(), training(), testing()
library(leaps)     # regsubsets(), used by caret's "leapForward" method

data(College)
set.seed(234)
College.split = initial_split(College, strata = "Outstate", prop = .7)
College.train = training(College.split)
College.test = testing(College.split)

# 10-fold CV with the one-standard-error rule: select the most parsimonious
# model within one SE of the best resampled RMSE
ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 1, 
                     selectionFunction = "oneSE")

model.fwd <- train(Outstate ~ .,
                   data = College.train,
                   method = "leapForward",
                   metric = "RMSE",
                   maximize = FALSE,
                   trControl = ctrl,
                   tuneGrid = data.frame(nvmax = 1:17))
model.fwd
## Linear Regression with Forward Selection 
## 
## 542 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 487, 487, 488, 489, 489, 488, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    1     2821.222  0.5421468  2263.313
##    2     2432.681  0.6511000  1868.439
##    3     2226.759  0.7038217  1723.816
##    4     2081.243  0.7405207  1643.016
##    5     2025.848  0.7518705  1595.431
##    6     2015.033  0.7547385  1596.790
##    7     1993.910  0.7601704  1583.245
##    8     2009.634  0.7558769  1598.434
##    9     2029.106  0.7502363  1601.002
##   10     2034.763  0.7506942  1602.750
##   11     2037.931  0.7506845  1602.482
##   12     1995.502  0.7625791  1577.820
##   13     1997.814  0.7624667  1580.192
##   14     2007.636  0.7598428  1578.395
##   15     2007.645  0.7596646  1577.778
##   16     2001.546  0.7608492  1574.528
##   17     1996.867  0.7618954  1570.084
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 5.

The forward stepwise selection process, combined with the one-standard-error rule, identified a five-variable model as satisfactory.
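As an optional visual check (assuming model.fwd from above is in the workspace), caret's ggplot method plots the resampled RMSE against subset size:

# CV RMSE by number of predictors; the one-SE rule settles on nvmax = 5
ggplot(model.fwd) +
  labs(title = "College Dataset: Forward Selection CV RMSE by Subset Size")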

**(B) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.**

The variables in the selected solution are:

coef(model.fwd$finalModel, id = 5)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -2567.2683333  2642.6569135     0.9689926    37.4598121    64.1658168 
##        Expend 
##     0.2808643
library(gam)   # gam() with smoothing splines s()

# GAM for out-of-state tuition: linear term for the qualitative Private,
# smoothing splines for the four quantitative predictors
model.gam <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) +
                   s(perc.alumni) + s(Expend), data = College.train)

par(mfrow = c(2, 3))   # arrange the five partial-effect plots in a grid
plot(model.gam, se = TRUE, col = "red")

The variable Private is a qualitative variable indicating whether a university is private or public; most of the institutions in the data set are private. The effect of room and board cost (Room.Board) appears to be linear, which is somewhat expected considering out-of-state students require room and board whether it is provided through the university or in the private sector. As out-of-state tuition rises, so does the percentage of faculty with PhDs (PhD). The model suggests that schools that cost more have a higher percentage of alumni (perc.alumni) who donate. The effect of instructional expenditure per student (Expend) begins to plateau when out-of-state tuition is approximately $16,000, suggesting that out-of-state tuition is not a driver of a university's decisions on instructional expenditure.
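A quick sketch (assuming College.train from part (A) is in the workspace) to verify the private/public balance in the training split:

# Count private vs. public institutions in the training data
table(College.train$Private)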

**(C) Evaluate the model obtained on the test set, and explain the results obtained.**

The GAM's MSE and R^2 on College.test are:

gam.pred=predict(model.gam, College.test)   # predict with the GAM from part (B)
(gam.mse=mean((College.test$Outstate-gam.pred)^2))
## [1] 5023810
gam.tss = sum((College.test$Outstate - mean(College.test$Outstate))^2)
gam.rss = sum((gam.pred - College.test$Outstate)^2)
(1 - gam.rss/gam.tss)
## [1] 0.6930349

The GAM explains 69.3% of the variance in the test data.
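For context, a small comparison sketch (lm.pred is a hypothetical name) of test-set RMSE for the GAM versus the five-variable linear model from part (A):

# Test RMSE of the linear model selected in part (A) vs. the GAM
lm.pred <- predict(model.fwd, College.test)
sqrt(mean((College.test$Outstate - lm.pred)^2))   # linear model
sqrt(gam.mse)                                     # GAM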

**(D) For which variables, if any, is there evidence of a non-linear relationship with the response?**

summary(model.gam)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + 
##     s(Expend), data = College.train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -5440.1 -1136.9   103.7  1202.5  7010.9 
## 
## (Dispersion Parameter for gaussian family taken to be 3448194)
## 
##     Null Deviance: 8712201403 on 541 degrees of freedom
## Residual Deviance: 1806853709 on 524 degrees of freedom
## AIC: 9716.746 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 2257060436 2257060436  654.56 < 2.2e-16 ***
## s(Room.Board)    1 1880675047 1880675047  545.41 < 2.2e-16 ***
## s(PhD)           1  672093668  672093668  194.91 < 2.2e-16 ***
## s(perc.alumni)   1  409119952  409119952  118.65 < 2.2e-16 ***
## s(Expend)        1  757963449  757963449  219.81 < 2.2e-16 ***
## Residuals      524 1806853709    3448194                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df  Npar F    Pr(F)    
## (Intercept)                                
## Private                                    
## s(Room.Board)        3  2.7045  0.04481 *  
## s(PhD)               3  2.3431  0.07225 .  
## s(perc.alumni)       3  1.7452  0.15675    
## s(Expend)            3 26.9990 2.22e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At a significance level of p < 0.05, the Anova for Nonparametric Effects identifies strong evidence of a non-linear relationship for Expend (p < 2.2e-16), along with weaker evidence for Room.Board (p = 0.045). The Anova for Parametric Effects shows that all variables make significant linear contributions, but significance there speaks to the linear (parametric) part of each term, not to non-linearity.
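A direct way to confirm the non-linearity in Expend (a sketch assuming the objects above; model.gam.lin is a hypothetical name) is an F-test between nested GAMs that differ only in whether Expend enters linearly or through a smoothing spline:

# Same GAM but with a linear Expend term; a significant F-test favors the spline
model.gam.lin <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) +
                       s(perc.alumni) + Expend, data = College.train)
anova(model.gam.lin, model.gam, test = "F")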