In this exercise, you will further analyze the Wage data set considered throughout this chapter.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region
## 1. < HS Grad :268 2. Middle Atlantic :3000
## 2. HS Grad :971 1. New England : 0
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## jobclass health health_ins logwage
## 1. Industrial :1544 1. <=Good : 858 1. Yes:2083 Min. :3.000
## 2. Information:1456 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447
## Median :4.653
## Mean :4.654
## 3rd Qu.:4.857
## Max. :5.763
##
## wage
## Min. : 20.09
## 1st Qu.: 85.38
## Median :104.92
## Mean :111.70
## 3rd Qu.:128.68
## Max. :318.34
##
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
Using cross validation to select the optimal degree for the polynomial, a degree of 5 returned the lowest error value as seen below. The ANOVA hypothesis test showed that a dgree is actually sufficent since the p value for a degree of 4 was not less that our threshold prefernce of 0.05. We can see from the plot below where we graphed both lines, there is not a whole lot of different in the two models until we get to the age around 65, and even though there is a difference, it is only slight.
library(boot)
set.seed(1)
cv.error=rep(0, 5)
for (i in 1:5) {
wage.fit=glm(wage~poly(age, i, raw=T), data=Wage)
cv.error[i] = cv.glm(Wage, wage.fit, K=10)$delta[1]
}
cv.error
## [1] 1676.826 1600.763 1598.399 1595.651 1594.977
plot(1:5, cv.error, xlab='Degree of Polynomial', ylab='CV Error', type = 'l')
cv.min=which.min(cv.error)
points(cv.min, cv.error[cv.min], col='red', cex=2, pch=20)
Anova testing
set.seed(1)
fit.1=lm(wage~age, data=Wage)
fit.2=lm(wage~poly(age,2, raw=T), data=Wage)
fit.3=lm(wage~poly(age,3, raw=T), data=Wage)
fit.4=lm(wage~poly(age,4, raw=T), data=Wage)
fit.5=lm(wage~poly(age,5, raw=T), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2, raw = T)
## Model 3: wage ~ poly(age, 3, raw = T)
## Model 4: wage ~ poly(age, 4, raw = T)
## Model 5: wage ~ poly(age, 5, raw = T)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(wage~age, data=Wage, col="darkgrey")
title= ("Polynomial Degree of 3 & 5 for Age Against Wage")
agelims=range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
wage.poly3 = lm(wage~poly(age,3), data = Wage)
preds.3=predict(wage.poly3, newdata=list(age=age.grid), se=TRUE)
lines(age.grid, preds.3$fit, col="blue", lwd=2)
wage.poly5 = lm(wage~poly(age,5), data = Wage)
preds.5 = predict(wage.poly5, newdata=list(age=age.grid), se=TRUE)
lines(age.grid, preds.5$fit, col="red", lwd=2)
legend("topright", legend=c("Degree 3", "Degree 5"),col=c('blue', 'red'), lty=1, lwd=2, cex=.8)
(b) Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.
From the plot below, I can see that the optimal number of cuts for this step funtion is 8 as it porvides the lowest error rate.
set.seed(1)
cut.cverr = rep(NA, 10)
for (i in 2:10) {
Wage$age.cut = cut(Wage$age, i)
fit.cut = glm(wage~age.cut, data = Wage)
cut.cverr[i] = cv.glm(Wage, fit.cut, K = 10)$delta[1]
}
cut.cverr
## [1] NA 1734.489 1684.271 1635.552 1632.080 1623.415 1614.996
## [8] 1601.318 1613.954 1606.331
plot(2:10, cut.cverr[-1], xlab='Cuts', ylab='CV Error', type = 'l')
title("Number of Cuts for Step Function")
cut.min=which.min(cut.cverr)
points(cut.min, cut.cverr[cut.min], col='red', cex=2, pch=20)
plot(wage~age, data = Wage, col = "gray")
title("Step function - 8 Cuts")
step.fit = glm(wage~cut(age,8), data = Wage)
step.preds = predict(step.fit, newdata=list(age=age.grid), se=TRUE)
lines(age.grid, step.preds$fit, col = "darkgreen", lwd=2)
This question relates to the College data set.
library(ISLR)
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors. By looking at the plots and min/max values of Cp, BIC, AIC, and AdjRsquared, I think the model with 12 variable would be the best.
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: ggplot2
library(leaps)
## Warning: package 'leaps' was built under R version 3.6.3
set.seed(1)
attach(College)
inTrain = createDataPartition(College$Outstate, p=0.75, list=FALSE)
College.train = College[inTrain,]
College.test = College[-inTrain,]
fwd.fit = regsubsets(Outstate~., data=College.train, nvmax=17, method="forward")
summary(fwd.fit)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College.train, nvmax = 17,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " "*" " " " "
## 9 ( 1 ) "*" " " " " " " "*" " " " "
## 10 ( 1 ) "*" " " "*" " " "*" " " " "
## 11 ( 1 ) "*" "*" "*" " " "*" " " " "
## 12 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 13 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 14 ( 1 ) "*" "*" "*" " " "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" " " "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " " " "*" " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " "
## 9 ( 1 ) " " "*" " " "*" "*" "*" " "
## 10 ( 1 ) " " "*" " " "*" "*" "*" " "
## 11 ( 1 ) " " "*" " " "*" "*" "*" " "
## 12 ( 1 ) " " "*" " " "*" "*" "*" " "
## 13 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 14 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 15 ( 1 ) " " "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) " " "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" " "
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
fwd.summ=summary(fwd.fit)
par(mfrow=c(2,2))
plot(fwd.summ$cp, pch=20, xlab="Number of Variables", ylab="Cp")
plot(fwd.summ$bic, pch=20, xlab="Number of Variables", ylab="BIC")
RMSE=sqrt(fwd.summ$rss/nrow(College.train))
plot(RMSE, pch=20, xlab="Number of Variables", ylab ="RMSE")
plot(fwd.summ$adjr2, pch=20, xlab="Number of Variables", ylab="Adjusted R2 Values")
fwd.Cp.min = which.min(fwd.summ$cp)
fwd.BIC.min = which.min(fwd.summ$bic)
fwd.RMSE.min = which.min(RMSE)
fwd.AdjR2.max = which.max(fwd.summ$adjr2)
fwd.Cp.min
## [1] 12
fwd.BIC.min
## [1] 12
fwd.RMSE.min
## [1] 17
fwd.AdjR2.max
## [1] 13
The coefficients for the model with the 12 variables are shown below.
coef(fwd.fit, id=12)
## (Intercept) PrivateYes Apps Accept Top10perc
## -3075.4001461 2518.0610153 -0.3194939 0.7141407 26.2838880
## F.Undergrad Room.Board Personal PhD Terminal
## -0.1361100 0.8850004 -0.2666335 18.5499991 22.2163298
## perc.alumni Expend Grad.Rate
## 39.3407369 0.1988058 23.0359234
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
## Warning: package 'gam' was built under R version 3.6.3
## Loading required package: splines
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.6.3
## Loaded gam 1.16.1
gam.fit=gam(Outstate~Private + s(Apps) +s(Accept) + s(Top10perc) + s(F.Undergrad) +s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data=College.train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored
par(mfrow=c(3,4))
plot(gam.fit, se=T, col="red")
preds_train=predict(gam.fit, College.train)
train_err=mean((College.train$Outstate-preds_train)^2)
train_err
## [1] 2872298
trainss <- mean((College.train$Outstate - mean(College.train$Outstate))^2)
trainrss <- 1 - train_err / trainss
trainrss
## [1] 0.8185167
(c) Evaluate the model obtained on the test set, and explain the results obtained.
The test error for the test dataset was higher than the train error, adn the Rsquare value for the test set was a little lower that for the training data, but that is to be expected. The model explains 79.6% of the variance of the test data, so I think it is a fairly good model.
preds_test=predict(gam.fit, College.test)
test_err=mean((College.test$Outstate-preds_test)^2)
test_err
## [1] 3501624
tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2)
testrss <- 1 - test_err / tss
testrss
## [1] 0.7962022
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
By looking at the Anova for Nonparametric Effects portion of the summary, it appears that there is evidence to support that there is a nonlinear realtionship between the variable Outstate and the variables Accept, F.Undergrad, and Expend. This can be detected by the low p values associated with these variables.
summary(gam.fit)
##
## Call: gam(formula = Outstate ~ Private + s(Apps) + s(Accept) + s(Top10perc) +
## s(F.Undergrad) + s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) +
## s(perc.alumni) + s(Expend) + s(Grad.Rate), data = College.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6168.47 -1068.78 69.24 1184.79 4685.09
##
## (Dispersion Parameter for gaussian family taken to be 3117891)
##
## Null Deviance: 9242844542 on 583 degrees of freedom
## Residual Deviance: 1677424214 on 537.9997 degrees of freedom
## AIC: 10435.77
##
## Number of Local Scoring Iterations: 4
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2816412723 2816412723 903.3071 < 2.2e-16 ***
## s(Apps) 1 880323253 880323253 282.3458 < 2.2e-16 ***
## s(Accept) 1 103318067 103318067 33.1372 1.444e-08 ***
## s(Top10perc) 1 944343476 944343476 302.8790 < 2.2e-16 ***
## s(F.Undergrad) 1 266907115 266907115 85.6050 < 2.2e-16 ***
## s(Room.Board) 1 757315862 757315862 242.8937 < 2.2e-16 ***
## s(Personal) 1 29881031 29881031 9.5837 0.002065 **
## s(PhD) 1 122773140 122773140 39.3770 7.193e-10 ***
## s(Terminal) 1 28828755 28828755 9.2462 0.002475 **
## s(perc.alumni) 1 146022029 146022029 46.8336 2.111e-11 ***
## s(Expend) 1 445638596 445638596 142.9295 < 2.2e-16 ***
## s(Grad.Rate) 1 44993627 44993627 14.4308 0.000162 ***
## Residuals 538 1677424214 3117891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Apps) 3 1.9442 0.121409
## s(Accept) 3 8.6403 1.311e-05 ***
## s(Top10perc) 3 1.1073 0.345597
## s(F.Undergrad) 3 4.3119 0.005104 **
## s(Room.Board) 3 1.6830 0.169600
## s(Personal) 3 1.7160 0.162640
## s(PhD) 3 0.4305 0.731248
## s(Terminal) 3 1.7893 0.148139
## s(perc.alumni) 3 1.9508 0.120374
## s(Expend) 3 20.5891 1.212e-12 ***
## s(Grad.Rate) 3 2.0664 0.103652
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1