Analyze the Wage data.
library(ISLR)   # Wage and College data
library(boot)   # cv.glm()
data(Wage)
Perform polynomial regression to predict wage using age, and use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the fit obtained.
Cross-Validation
set.seed(15)
cv <- rep(0, 5)
for (i in 1:5) {
  # 10-fold CV error for a degree-i polynomial
  poly.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv[i] <- cv.glm(Wage, poly.fit, K = 10)$delta[1]
}
cv
## [1] 1676.192 1599.786 1596.256 1593.755 1595.106
Plot the Cross-Validation Errors
plot(1:5, cv, type = "b", xlab = "Degree of Polynomial", ylab = "CV Test MSE")
points(which.min(cv), cv[which.min(cv)], col = "purple", pch = 16, cex = 2)
Interpretation
Cross-validation selects a degree-4 polynomial (estimated test MSE of about 1593.8), although the errors for degrees 3 through 5 are very close, so a lower-degree fit would perform nearly as well.
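The selected degree can also be pulled out programmatically rather than read off the plot; a minimal sketch:
best.d <- which.min(cv)   # degree with the lowest estimated test MSE
best.d
cv[best.d]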
ANOVA
fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation
The quadratic and cubic terms are clearly significant (p < 2.2e-16 and p = 0.0017), the quartic term is borderline (p = 0.051), and the degree-5 term is not significant (p = 0.37). The ANOVA therefore points to a cubic or quartic polynomial, which is consistent with the degree-4 model chosen by cross-validation.
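Because poly() generates orthogonal polynomials, the same p-values can be read off the coefficient table of the degree-5 fit, and the squares of those t-statistics equal the ANOVA F-statistics; a quick check:
coef(summary(fit5))   # t-statistics squared match the F-statistics above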
Plot the Degree-4 Fit
agelimits <- range(Wage$age)
agegrid <- seq(from = agelimits[1], to = agelimits[2])
# Predictions and approximate 95% confidence bands for the degree-4 fit
preds <- predict(fit4, newdata = list(age = agegrid), se = TRUE)
bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
plot(Wage$age, Wage$wage, xlim = agelimits, cex = .5, col = "purple",
     xlab = "Age", ylab = "Wage")
lines(agegrid, preds$fit, lwd = 2, col = "green")
matlines(agegrid, bands, lwd = 1, col = "green", lty = 3)
Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
Cross Validation
set.seed(2)
cv2 <- rep(NA, 10)
for (i in 2:10) {
  # 10-fold CV error for a step function with i intervals
  Wage$age.cut <- cut(Wage$age, i)
  step.fit <- glm(wage ~ age.cut, data = Wage)
  cv2[i] <- cv.glm(Wage, step.fit, K = 10)$delta[1]
}
cv2
## [1] NA 1734.448 1681.623 1635.869 1632.864 1621.643 1613.148 1601.850
## [9] 1610.852 1603.607
Cross Validation Plot
plot(2:10, cv2[-1], type = "b", xlab = "Number of Cuts", ylab = "CV Test MSE")
points(which.min(cv2), cv2[which.min(cv2)], col = "purple", pch = 16, cex = 2)
Interpretation
Cross-validation is minimized at 8 cuts (estimated test MSE of about 1601.9), so the step function below uses 8 intervals.
Plot the Fit
# Fit the 8-interval step function and overlay it on the data
glm2 <- glm(wage ~ cut(age, 8), data = Wage)
preds2 <- predict(glm2, data.frame(age = agegrid))
plot(Wage$age, Wage$wage, cex = .5, col = "purple", xlab = "Age", ylab = "Wage")
lines(agegrid, preds2, col = "green", lwd = 2)
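For reference, the 8 age intervals produced by cut() (and the number of observations falling in each) can be inspected directly; a minimal sketch, assuming the Wage data are loaded as above:
table(cut(Wage$age, 8))   # interval boundaries and counts per bin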
Use the College dataset.
data("College")
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as predictors, perform forward stepwise selection on the training set to identify a model that uses just a subset of the predictors.
library(leaps)   # regsubsets()
train <- sample(1:nrow(College), nrow(College) / 2)
test <- -train
forward <- regsubsets(Outstate ~ ., data = College, subset = train, method = "forward")
summary(forward)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " " " "*" " "
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " " " "
## 6 ( 1 ) " " "*" " " " " " " "*" " "
## 7 ( 1 ) " " "*" " " "*" " " "*" " "
## 8 ( 1 ) " " "*" " " "*" " " "*" " "
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
coef(forward, id = 6)
## (Intercept) PrivateYes Room.Board Terminal perc.alumni
## -3933.0756018 2391.3013093 1.1163216 32.0693193 49.1741555
## Expend Grad.Rate
## 0.2071043 32.3687749
Interpretation
A six-variable model captures most of the signal: Private, Room.Board, Terminal, perc.alumni, Expend, and Grad.Rate. These six predictors are carried forward into the GAM below.
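The choice of six variables can be sanity-checked by plotting Cp, BIC, and adjusted R-squared against subset size; a minimal sketch (fwd.summary is a name introduced here):
fwd.summary <- summary(forward)
par(mfrow = c(1, 3))
plot(fwd.summary$cp, type = "b", xlab = "Number of Variables", ylab = "Cp")
plot(fwd.summary$bic, type = "b", xlab = "Number of Variables", ylab = "BIC")
plot(fwd.summary$adjr2, type = "b", xlab = "Number of Variables", ylab = "Adjusted R-squared")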
Fit a GAM on the training data, using out-of-state tuition as the response and the six predictors selected above. Plot the findings.
library(gam)   # gam() and smoothing-spline terms s()
gam.oos <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) +
                 s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
par(mfrow = c(2, 3))
plot(gam.oos, se = TRUE, col = "purple")
Evaluate model on the test set.
preds3 <- predict(gam.oos, College[test, ])
rss <- sum((College[test, ]$Outstate - preds3)^2)   # residual sum of squares on the test set
TSS <- sum((College[test, ]$Outstate - mean(College[test, ]$Outstate))^2)
1 - (rss / TSS)                                     # test-set R-squared
## [1] 0.7755414
The GAM explains roughly 77.6% of the variance in out-of-state tuition on the test set (test R-squared of about 0.776).
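For context, the same test R-squared can be computed for an ordinary linear model on the six selected predictors, reusing TSS from above; a minimal sketch (lm.fit and lm.preds are names introduced here):
lm.fit <- lm(Outstate ~ Private + Room.Board + Terminal + perc.alumni + Expend + Grad.Rate,
             data = College, subset = train)
lm.preds <- predict(lm.fit, College[test, ])
1 - sum((College[test, ]$Outstate - lm.preds)^2) / TSS   # test R-squared for the linear fit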
For which variables is there evidence of a non-linear relationship with the response?
summary(gam.oos)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal,
## 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
## data = College, subset = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6277.5 -1084.1 123.7 1223.2 7139.7
##
## (Dispersion Parameter for gaussian family taken to be 3370637)
##
## Null Deviance: 6286829154 on 387 degrees of freedom
## Residual Deviance: 1216799198 on 360.9997 degrees of freedom
## AIC: 6960.989
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1597519236 1597519236 473.952 < 2.2e-16 ***
## s(Room.Board, 5) 1 1370398372 1370398372 406.570 < 2.2e-16 ***
## s(Terminal, 5) 1 261867894 261867894 77.691 < 2.2e-16 ***
## s(perc.alumni, 5) 1 254976797 254976797 75.647 < 2.2e-16 ***
## s(Expend, 5) 1 542142991 542142991 160.843 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 73167730 73167730 21.707 4.473e-06 ***
## Residuals 361 1216799198 3370637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 5) 4 2.1533 0.07381 .
## s(Terminal, 5) 4 1.2345 0.29581
## s(perc.alumni, 5) 4 0.7622 0.55043
## s(Expend, 5) 4 22.5609 < 2e-16 ***
## s(Grad.Rate, 5) 4 1.6358 0.16463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation
The ANOVA for nonparametric effects shows strong evidence of a non-linear relationship between Outstate and Expend (p < 2e-16). Room.Board shows only weak, marginal evidence of non-linearity (p = 0.074), while Terminal, perc.alumni, and Grad.Rate show no evidence that a non-linear fit improves on a linear one.
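The Expend result can be double-checked by comparing the fitted GAM against an otherwise identical model that enters Expend linearly; a minimal sketch (gam.lin is a name introduced here):
gam.lin <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) +
                 Expend + s(Grad.Rate, 5), data = College, subset = train)
anova(gam.lin, gam.oos, test = "F")   # does the smooth term for Expend improve the fit?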