In this exercise, you will further analyze the Wage data set considered throughout this chapter
data("Wage")
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
(6.a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
degree = 10
cv.errs = rep(NA, degree)
for (i in 1:degree) {
fit = glm(wage ~ poly(age, i), data = Wage) #poly regression to predict wage using age
cv.errs[i] <- cv.glm(Wage, fit)$delta[1] #cv to select the degree d for the polynomial
}
#Make a plot of the resulting polynomial fit to the data.
plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], cex = 1.5, pch = 4)
#Predict with 4 degree model:
plot(wage ~ age, data = Wage, col = "lightgrey")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 4), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, lwd = 2)
(6.a) Answer: The min test mse lands on degree 9. But we can see from the graph that choosing degree 4 would be reasonable. The results of hypothesis testing using ANOVA suggested that either a cubic or a quartic polynomial appear to provide a reasonable fit to the data
(6.b) Fit a step function to predict wage using age, and perform cross validation to choose the optimal number of cuts. Make a plot of the fit obtained.
cv.errs <- rep(NA, degree)
for (i in 2:degree) {
Wage$age.cut <- cut(Wage$age, i)
fit <- glm(wage ~ age.cut, data = Wage) #Fit a step function to predict wage using age
cv.errs[i] <- cv.glm(Wage, fit)$delta[1] #perform cv to choose the optimal number of cuts
}
#Make a plot of the fit obtained
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], cex = 1.5, pch = 4)
#Predict with optimal cuts step function
plot(wage ~ age, data = Wage, col = "lightgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, list(age = age.grid))
lines(age.grid, preds, lwd = 2)
This question relates to the College data set.
data("College")
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
(10.a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
set.seed(1)
training_indicator = sample(1: nrow(College), nrow(College)/2)
train = College[training_indicator,]
test = College[-training_indicator,]
lm.model= lm(Outstate ~ ., data = train)
step <-step(lm.model, direction="forward", test="Chisq", trace = F); step$call
## lm(formula = Outstate ~ Private + Apps + Accept + Enroll + Top10perc +
## Top25perc + F.Undergrad + P.Undergrad + Room.Board + Books +
## Personal + PhD + Terminal + S.F.Ratio + perc.alumni + Expend +
## Grad.Rate, data = train)
(10.b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
gam.model <- gam(Outstate ~ Private +
s(Apps,5) +
s(Accept,5) +
s(Enroll,5) +
s(Top10perc,5) +
s(Top25perc,5) +
s(F.Undergrad,5) +
s(P.Undergrad,5) +
s(Room.Board,5) +
s(Books,5) +
s(Personal,5) +
s(PhD,5) +
s(Terminal,5) +
s(S.F.Ratio,5) +
s(perc.alumni,5) +
s(Expend,5) +
s(Grad.Rate,5),
data = train)
par(mfrow = c(3,3))
plot(gam.model, se = TRUE, col = 'orange')
(10.b) Findings: Based on the shape of the fit curves:
Linear-ish: Apps, Top10perc, P.Undergrad, Room.Board, Books, Personal, perc.alumni, Grad.Rate
Non-Linear-ish: Accept, Enroll, Top25perc, F.Undergrad,PhD, Terminal, S.F.Ratio, Expend
(10.c) Evaluate the model obtained on the test set, and explain the results obtained.
preds <- predict(gam.model, newdata = test)
RSS <- sum((test$Outstate - preds)^2)
TSS <- sum((test$Outstate - mean(test$Outstate)) ^ 2)
1 - (RSS / TSS)
## [1] 0.7638056
(10.c) Answer: The R2 statistic on test set is 0.7638056
(10.d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.model)
##
## Call: gam(formula = Outstate ~ Private + s(Apps, 5) + s(Accept, 5) +
## s(Enroll, 5) + s(Top10perc, 5) + s(Top25perc, 5) + s(F.Undergrad,
## 5) + s(P.Undergrad, 5) + s(Room.Board, 5) + s(Books, 5) +
## s(Personal, 5) + s(PhD, 5) + s(Terminal, 5) + s(S.F.Ratio,
## 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
## data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5413.80 -984.52 -15.03 1004.23 6115.07
##
## (Dispersion Parameter for gaussian family taken to be 3267943)
##
## Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 999988594 on 305.9994 degrees of freedom
## AIC: 6994.85
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1815313375 1815313375 555.4911 < 2.2e-16 ***
## s(Apps, 5) 1 773984614 773984614 236.8415 < 2.2e-16 ***
## s(Accept, 5) 1 140834746 140834746 43.0958 2.230e-10 ***
## s(Enroll, 5) 1 29415823 29415823 9.0013 0.0029196 **
## s(Top10perc, 5) 1 1017988895 1017988895 311.5075 < 2.2e-16 ***
## s(Top25perc, 5) 1 50169179 50169179 15.3519 0.0001101 ***
## s(F.Undergrad, 5) 1 47737829 47737829 14.6079 0.0001603 ***
## s(P.Undergrad, 5) 1 579001 579001 0.1772 0.6741071
## s(Room.Board, 5) 1 487117806 487117806 149.0595 < 2.2e-16 ***
## s(Books, 5) 1 43039427 43039427 13.1702 0.0003332 ***
## s(Personal, 5) 1 21603228 21603228 6.6107 0.0106094 *
## s(PhD, 5) 1 24479649 24479649 7.4908 0.0065640 **
## s(Terminal, 5) 1 16934864 16934864 5.1821 0.0235110 *
## s(S.F.Ratio, 5) 1 63577261 63577261 19.4548 1.428e-05 ***
## s(perc.alumni, 5) 1 68238473 68238473 20.8812 7.098e-06 ***
## s(Expend, 5) 1 312774312 312774312 95.7098 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 34194856 34194856 10.4637 0.0013506 **
## Residuals 306 999988594 3267943
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Apps, 5) 4 1.1174 0.3482870
## s(Accept, 5) 4 8.1274 3.067e-06 ***
## s(Enroll, 5) 4 7.9879 3.891e-06 ***
## s(Top10perc, 5) 4 1.2507 0.2895282
## s(Top25perc, 5) 4 2.6485 0.0335137 *
## s(F.Undergrad, 5) 4 5.1001 0.0005436 ***
## s(P.Undergrad, 5) 4 0.9063 0.4605175
## s(Room.Board, 5) 4 1.5956 0.1753438
## s(Books, 5) 4 0.4345 0.7836476
## s(Personal, 5) 4 0.7742 0.5426734
## s(PhD, 5) 4 2.5078 0.0421025 *
## s(Terminal, 5) 4 2.3243 0.0565528 .
## s(S.F.Ratio, 5) 4 2.1349 0.0764087 .
## s(perc.alumni, 5) 4 0.6570 0.6223521
## s(Expend, 5) 4 13.4877 3.882e-10 ***
## s(Grad.Rate, 5) 4 0.4181 0.7955745
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(10.d) Answer: Evidence of a non-linear relationship with the response can be seen in the summary’s “Anova for Nonparametric Effects” section. All of the terms with significant to close to significant p-values match the non-linear terms determined by the visuals in (10.b). Accept, Enroll, Top25perc, F.Undergrad, PhD, and Expend have a strong non-linear relationship with Outstate. Terminal and S.F.Ratio have moderate non-linear relationship with Outstate.