Problem 6

Analyze Wage data.

data(Wage)

Part A

Polynomial Regression to predict wage using age. Cross-validation to select optimal degree (d) for polynomial. What degree was chosen and how does this compare to the results of hypothesis testing using ANOVA? Make a plot to show the fit obtained.

Cross-Validation

set.seed(15)
cv <- rep(0,5)
for ( i in 1:5){
  poly.fit = glm(wage ~ poly(age, i), data = Wage)
  cv[i] = cv.glm(Wage, poly.fit, K=10)$delta[1]
}
cv
## [1] 1676.192 1599.786 1596.256 1593.755 1595.106

Plot the Cross Validation Fit

plot(cv, type = "b", xlab = "Optimal Degree", ylab = "Test MSE")
points(which.min(cv), cv[4], col = "purple", pch = 16, cex = 2)

Interpretation

The plot shows that 4 degrees is the best option using cross-validation.

ANOVA

fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation

Three of the four p-values are below the typical significant value of 0.05. This means there is some significance to those values.

Plot for ANOVA

agelimits <- range(Wage$age)
agegrid <- seq(from = agelimits[1], to = agelimits[2])
preds <- predict(fit4, newdata = list(age = agegrid), se = TRUE)
bands <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)

plot(Wage$age, Wage$wage, xlim = agelimits, cex = .5, col = "purple")
lines(agegrid, preds$fit, lwd = 2, col = "green")
matlines(agegrid, bands, lwd = 1, col = "green", lty = 3)

Part B

Fit step function to predict wage using age and perform cross-validation to choose the optimal number of cuts. Make a plot.

Cross Validation

set.seed(2)
cv2 <- rep(NA, 10)
for (i in 2:10){
  Wage$age.cut = cut(Wage$age, i)
  glm = glm(wage ~ age.cut, data = Wage)
  cv2[i] = cv.glm(Wage, glm, K=10)$delta[1]
}
cv2
##  [1]       NA 1734.448 1681.623 1635.869 1632.864 1621.643 1613.148 1601.850
##  [9] 1610.852 1603.607

Cross Validation Plot

plot(2:10, cv2[-1], type = "b", xlab = "Optimal Cuts", ylab = "Test MSE")
points(which.min(cv2), cv2[which.min(cv2)], col = "purple", pch = 16, cex = 2)

Plot the Fit

glm2 <- glm(wage ~ cut(age, 8), data = Wage)
preds2 <- predict(glm2, data.frame(age = agegrid))
plot(Wage$age, Wage$wage, col = "purple")
lines(agegrid, preds2, col = "green", lwd = 2)

Problem 10

Use College dataset.

data("College")
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

Part A

Split the data into training and testing. Use out-of-state tuition as response and other variables as predictors. Perform foward stepwise selection on training set to identify a model that uses just a subset of the predictors.

train <- sample(1: nrow(College), nrow(College)/2)
test <- -train
forward <- regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
summary(forward)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 ) "*"        " "  " "    " "    " "       "*"       " "        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 6  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
## 7  ( 1 ) " "         "*"        " "   "*"      " " "*"      " "      
## 8  ( 1 ) " "         "*"        " "   "*"      " " "*"      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         " "    " "      
## 2  ( 1 ) "*"         " "    " "      
## 3  ( 1 ) "*"         "*"    " "      
## 4  ( 1 ) "*"         "*"    " "      
## 5  ( 1 ) "*"         "*"    "*"      
## 6  ( 1 ) "*"         "*"    "*"      
## 7  ( 1 ) "*"         "*"    "*"      
## 8  ( 1 ) "*"         "*"    "*"
coef(forward, id = 6)
##   (Intercept)    PrivateYes    Room.Board      Terminal   perc.alumni 
## -3933.0756018  2391.3013093     1.1163216    32.0693193    49.1741555 
##        Expend     Grad.Rate 
##     0.2071043    32.3687749

Interpretation

There are 6 variables that are listed and should be used in the subset moving forward.

Part B

Fit a GAM on the train data. Use the same response but subset of predictors from part A. Plot findings.

gam.oos <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
par(mfrow = c(2,3))
plot(gam.oos, se = TRUE, col = "purple")

Part C

Evaluate model on the test set.

preds3 <- predict(gam.oos, College[test, ])
eval <- sum((College[test, ]$Outstate - preds3)^2)
TSS <- sum((College[test, ]$Outstate - mean(College[test, ]$Outstate))^2)
1 - (eval / TSS)
## [1] 0.7755414

Test Statistic = 0.7755

Part D

For which variables are there evidence of a non-linear relationship with the response?

summary(gam.oos)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 
##     5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), 
##     data = College, subset = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -6277.5 -1084.1   123.7  1223.2  7139.7 
## 
## (Dispersion Parameter for gaussian family taken to be 3370637)
## 
##     Null Deviance: 6286829154 on 387 degrees of freedom
## Residual Deviance: 1216799198 on 360.9997 degrees of freedom
## AIC: 6960.989 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 1597519236 1597519236 473.952 < 2.2e-16 ***
## s(Room.Board, 5)    1 1370398372 1370398372 406.570 < 2.2e-16 ***
## s(Terminal, 5)      1  261867894  261867894  77.691 < 2.2e-16 ***
## s(perc.alumni, 5)   1  254976797  254976797  75.647 < 2.2e-16 ***
## s(Expend, 5)        1  542142991  542142991 160.843 < 2.2e-16 ***
## s(Grad.Rate, 5)     1   73167730   73167730  21.707 4.473e-06 ***
## Residuals         361 1216799198    3370637                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df  Npar F   Pr(F)    
## (Intercept)                                  
## Private                                      
## s(Room.Board, 5)        4  2.1533 0.07381 .  
## s(Terminal, 5)          4  1.2345 0.29581    
## s(perc.alumni, 5)       4  0.7622 0.55043    
## s(Expend, 5)            4 22.5609 < 2e-16 ***
## s(Grad.Rate, 5)         4  1.6358 0.16463    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation

There are strong relationships between the response and Room and Board as well as Epex. This indicates the possibility of a nonlinear relationship with the response OutState. Grad Rate and PhD also have a moderate probability of having a nonlinear relationship.