7.9 EXERCISE: 6 (Applied)

In this exercise, you will further analyze the Wage data set considered throughout this chapter

data("Wage")
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

(6.a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

degree = 10
cv.errs = rep(NA, degree)

for (i in 1:degree) {
  fit = glm(wage ~ poly(age, i), data = Wage) #poly regression to predict wage using age
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1] #cv to select the degree d for the polynomial
}

#Make a plot of the resulting polynomial fit to the data.
plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], cex = 1.5, pch = 4)

#Predict with 4 degree model:
plot(wage ~ age, data = Wage, col = "lightgrey")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 4), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, lwd = 2)

(6.a) Answer: The min test mse lands on degree 9. But we can see from the graph that choosing degree 4 would be reasonable. The results of hypothesis testing using ANOVA suggested that either a cubic or a quartic polynomial appear to provide a reasonable fit to the data

(6.b) Fit a step function to predict wage using age, and perform cross validation to choose the optimal number of cuts. Make a plot of the fit obtained.

cv.errs <- rep(NA, degree)

for (i in 2:degree) {
  Wage$age.cut <- cut(Wage$age, i)
  fit <- glm(wage ~ age.cut, data = Wage) #Fit a step function to predict wage using age
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1] #perform cv to choose the optimal number of cuts
}

#Make a plot of the fit obtained
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], cex = 1.5, pch = 4)

#Predict with optimal cuts step function
plot(wage ~ age, data = Wage, col = "lightgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, list(age = age.grid))
lines(age.grid, preds, lwd = 2)

7.9 EXERCISE: 10 (Applied)

This question relates to the College data set.

data("College")
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

(10.a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

set.seed(1)
training_indicator = sample(1: nrow(College), nrow(College)/2)
train = College[training_indicator,]
test = College[-training_indicator,]

lm.model= lm(Outstate ~ ., data = train)
step <-step(lm.model, direction="forward", test="Chisq", trace = F);  step$call
## lm(formula = Outstate ~ Private + Apps + Accept + Enroll + Top10perc + 
##     Top25perc + F.Undergrad + P.Undergrad + Room.Board + Books + 
##     Personal + PhD + Terminal + S.F.Ratio + perc.alumni + Expend + 
##     Grad.Rate, data = train)

(10.b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

gam.model <- gam(Outstate ~ Private +
                            s(Apps,5) +
                            s(Accept,5) +
                            s(Enroll,5) +
                            s(Top10perc,5) +
                            s(Top25perc,5) +
                            s(F.Undergrad,5) +
                            s(P.Undergrad,5) +
                            s(Room.Board,5) +
                            s(Books,5) +
                            s(Personal,5) +
                            s(PhD,5) +
                            s(Terminal,5) +
                            s(S.F.Ratio,5) +
                            s(perc.alumni,5) +
                            s(Expend,5) +
                            s(Grad.Rate,5),
                 data = train)

par(mfrow = c(3,3))
plot(gam.model, se = TRUE, col = 'orange')

(10.b) Findings: Based on the shape of the fit curves:
Linear-ish: Apps, Top10perc, P.Undergrad, Room.Board, Books, Personal, perc.alumni, Grad.Rate
Non-Linear-ish: Accept, Enroll, Top25perc, F.Undergrad,PhD, Terminal, S.F.Ratio, Expend

(10.c) Evaluate the model obtained on the test set, and explain the results obtained.

preds <- predict(gam.model, newdata = test)

RSS <- sum((test$Outstate - preds)^2) 
TSS <- sum((test$Outstate - mean(test$Outstate)) ^ 2)
1 - (RSS / TSS) 
## [1] 0.7638056

(10.c) Answer: The R2 statistic on test set is 0.7638056

(10.d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.model)
## 
## Call: gam(formula = Outstate ~ Private + s(Apps, 5) + s(Accept, 5) + 
##     s(Enroll, 5) + s(Top10perc, 5) + s(Top25perc, 5) + s(F.Undergrad, 
##     5) + s(P.Undergrad, 5) + s(Room.Board, 5) + s(Books, 5) + 
##     s(Personal, 5) + s(PhD, 5) + s(Terminal, 5) + s(S.F.Ratio, 
##     5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), 
##     data = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -5413.80  -984.52   -15.03  1004.23  6115.07 
## 
## (Dispersion Parameter for gaussian family taken to be 3267943)
## 
##     Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 999988594 on 305.9994 degrees of freedom
## AIC: 6994.85 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Private             1 1815313375 1815313375 555.4911 < 2.2e-16 ***
## s(Apps, 5)          1  773984614  773984614 236.8415 < 2.2e-16 ***
## s(Accept, 5)        1  140834746  140834746  43.0958 2.230e-10 ***
## s(Enroll, 5)        1   29415823   29415823   9.0013 0.0029196 ** 
## s(Top10perc, 5)     1 1017988895 1017988895 311.5075 < 2.2e-16 ***
## s(Top25perc, 5)     1   50169179   50169179  15.3519 0.0001101 ***
## s(F.Undergrad, 5)   1   47737829   47737829  14.6079 0.0001603 ***
## s(P.Undergrad, 5)   1     579001     579001   0.1772 0.6741071    
## s(Room.Board, 5)    1  487117806  487117806 149.0595 < 2.2e-16 ***
## s(Books, 5)         1   43039427   43039427  13.1702 0.0003332 ***
## s(Personal, 5)      1   21603228   21603228   6.6107 0.0106094 *  
## s(PhD, 5)           1   24479649   24479649   7.4908 0.0065640 ** 
## s(Terminal, 5)      1   16934864   16934864   5.1821 0.0235110 *  
## s(S.F.Ratio, 5)     1   63577261   63577261  19.4548 1.428e-05 ***
## s(perc.alumni, 5)   1   68238473   68238473  20.8812 7.098e-06 ***
## s(Expend, 5)        1  312774312  312774312  95.7098 < 2.2e-16 ***
## s(Grad.Rate, 5)     1   34194856   34194856  10.4637 0.0013506 ** 
## Residuals         306  999988594    3267943                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df  Npar F     Pr(F)    
## (Intercept)                                    
## Private                                        
## s(Apps, 5)              4  1.1174 0.3482870    
## s(Accept, 5)            4  8.1274 3.067e-06 ***
## s(Enroll, 5)            4  7.9879 3.891e-06 ***
## s(Top10perc, 5)         4  1.2507 0.2895282    
## s(Top25perc, 5)         4  2.6485 0.0335137 *  
## s(F.Undergrad, 5)       4  5.1001 0.0005436 ***
## s(P.Undergrad, 5)       4  0.9063 0.4605175    
## s(Room.Board, 5)        4  1.5956 0.1753438    
## s(Books, 5)             4  0.4345 0.7836476    
## s(Personal, 5)          4  0.7742 0.5426734    
## s(PhD, 5)               4  2.5078 0.0421025 *  
## s(Terminal, 5)          4  2.3243 0.0565528 .  
## s(S.F.Ratio, 5)         4  2.1349 0.0764087 .  
## s(perc.alumni, 5)       4  0.6570 0.6223521    
## s(Expend, 5)            4 13.4877 3.882e-10 ***
## s(Grad.Rate, 5)         4  0.4181 0.7955745    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(10.d) Answer: Evidence of a non-linear relationship with the response can be seen in the summary’s “Anova for Nonparametric Effects” section. All of the terms with significant to close to significant p-values match the non-linear terms determined by the visuals in (10.b). Accept, Enroll, Top25perc, F.Undergrad, PhD, and Expend have a strong non-linear relationship with Outstate. Terminal and S.F.Ratio have moderate non-linear relationship with Outstate.