Problem 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region    
##  1. < HS Grad      :268   2. Middle Atlantic   :3000  
##  2. HS Grad        :971   1. New England       :   0  
##  3. Some College   :650   3. East North Central:   0  
##  4. College Grad   :685   4. West North Central:   0  
##  5. Advanced Degree:426   5. South Atlantic    :   0  
##                           6. East South Central:   0  
##                           (Other)              :   0  
##            jobclass               health      health_ins      logwage     
##  1. Industrial :1544   1. <=Good     : 858   1. Yes:2083   Min.   :3.000  
##  2. Information:1456   2. >=Very Good:2142   2. No : 917   1st Qu.:4.447  
##                                                            Median :4.653  
##                                                            Mean   :4.654  
##                                                            3rd Qu.:4.857  
##                                                            Max.   :5.763  
##                                                                           
##       wage       
##  Min.   : 20.09  
##  1st Qu.: 85.38  
##  Median :104.92  
##  Mean   :111.70  
##  3rd Qu.:128.68  
##  Max.   :318.34  
## 
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
Using cross-validation to select the optimal degree for the polynomial, a degree of 5 returned the lowest error, as seen below. The ANOVA hypothesis test showed that a degree of 3 is actually sufficient, since the p-value for the degree-4 term was not below our 0.05 threshold. The plot below graphs both fits: the two models differ very little until around age 65, and even there the difference is only slight.

library(boot)
set.seed(1)
cv.error = rep(0, 5)

# Fit polynomials of degree 1 through 5 and record the 10-fold CV error for each
for (i in 1:5) {
  wage.fit = glm(wage ~ poly(age, i, raw=T), data=Wage)
  cv.error[i] = cv.glm(Wage, wage.fit, K=10)$delta[1]
}

cv.error
## [1] 1676.826 1600.763 1598.399 1595.651 1594.977
plot(1:5, cv.error, xlab='Degree of Polynomial', ylab='CV Error', type = 'l')
cv.min=which.min(cv.error)
points(cv.min, cv.error[cv.min], col='red', cex=2, pch=20)
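
Computed from the printed values above, the successive drops in CV error are about 76.1, 2.4, 2.7, and 0.7, so almost all of the improvement comes from moving beyond the linear fit:

# Improvement in CV error from each additional degree; gains after degree 2 are tiny
diff(cv.error)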

ANOVA testing

set.seed(1)
fit.1=lm(wage~age, data=Wage)
fit.2=lm(wage~poly(age,2, raw=T), data=Wage)
fit.3=lm(wage~poly(age,3, raw=T), data=Wage)
fit.4=lm(wage~poly(age,4, raw=T), data=Wage)
fit.5=lm(wage~poly(age,5, raw=T), data=Wage)

anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2, raw = T)
## Model 3: wage ~ poly(age, 3, raw = T)
## Model 4: wage ~ poly(age, 4, raw = T)
## Model 5: wage ~ poly(age, 5, raw = T)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(wage~age, data=Wage, col="darkgrey")
title("Polynomial Degrees 3 & 5 for Age Against Wage")
agelims=range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
wage.poly3 = lm(wage~poly(age,3), data = Wage)
preds.3=predict(wage.poly3, newdata=list(age=age.grid), se=TRUE)
lines(age.grid, preds.3$fit, col="blue", lwd=2)
wage.poly5 = lm(wage~poly(age,5), data = Wage)
preds.5 = predict(wage.poly5, newdata=list(age=age.grid), se=TRUE)
lines(age.grid, preds.5$fit, col="red", lwd=2)
legend("topright", legend=c("Degree 3", "Degree 5"),col=c('blue', 'red'), lty=1, lwd=2, cex=.8)

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

From the plot below, I can see that the optimal number of cuts for this step function is 8, as it provides the lowest CV error.

set.seed(1)
cut.cverr = rep(NA, 10)
# Try 2 through 10 cuts and record the 10-fold CV error for each step function
for (i in 2:10) {
    Wage$age.cut = cut(Wage$age, i)
    fit.cut = glm(wage ~ age.cut, data = Wage)
    cut.cverr[i] = cv.glm(Wage, fit.cut, K = 10)$delta[1]
}

cut.cverr
##  [1]       NA 1734.489 1684.271 1635.552 1632.080 1623.415 1614.996
##  [8] 1601.318 1613.954 1606.331
plot(2:10, cut.cverr[-1], xlab='Cuts', ylab='CV Error', type = 'l')
title("Number of Cuts for Step Function")
cut.min=which.min(cut.cverr)
points(cut.min, cut.cverr[cut.min], col='red', cex=2, pch=20)

plot(wage~age, data = Wage, col = "gray")
title("Step function - 8 Cuts") 
step.fit = glm(wage~cut(age,8), data = Wage)
step.preds = predict(step.fit, newdata=list(age=age.grid), se=TRUE)
lines(age.grid, step.preds$fit, col = "darkgreen", lwd=2)
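
For reference, the interval boundaries that cut() chose for the 8 bins (the same call used in the fit above) can be inspected directly:

# The 8 age intervals used by the step function, with the number of observations in each
table(cut(Wage$age, 8))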

Problem 10

This question relates to the College data set.

library(ISLR)
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

By looking at the plots and the min/max values of Cp, BIC, RMSE, and adjusted R-squared below, I think the model with 12 variables would be the best: Cp and BIC are both minimized at 12 variables, while RMSE keeps decreasing all the way to 17 and adjusted R-squared peaks at 13 with only a negligible gain past 12.

library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## Loading required package: ggplot2
library(leaps)
## Warning: package 'leaps' was built under R version 3.6.3
set.seed(1)

inTrain = createDataPartition(College$Outstate, p=0.75, list=FALSE)
College.train = College[inTrain,]
College.test = College[-inTrain,]
fwd.fit = regsubsets(Outstate~., data=College.train, nvmax=17, method="forward")
summary(fwd.fit)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College.train, nvmax = 17, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
##           PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 3  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 )  "*"        " "  " "    " "    "*"       " "       " "        
## 9  ( 1 )  "*"        " "  " "    " "    "*"       " "       " "        
## 10  ( 1 ) "*"        " "  "*"    " "    "*"       " "       " "        
## 11  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       " "        
## 12  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       "*"        
## 13  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       "*"        
## 14  ( 1 ) "*"        "*"  "*"    " "    "*"       "*"       "*"        
## 15  ( 1 ) "*"        "*"  "*"    " "    "*"       "*"       "*"        
## 16  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
## 17  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         " "        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 7  ( 1 )  " "         "*"        " "   "*"      "*" " "      " "      
## 8  ( 1 )  " "         "*"        " "   "*"      "*" " "      " "      
## 9  ( 1 )  " "         "*"        " "   "*"      "*" "*"      " "      
## 10  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 11  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 12  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 13  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
## 14  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
## 15  ( 1 ) " "         "*"        "*"   "*"      "*" "*"      "*"      
## 16  ( 1 ) " "         "*"        "*"   "*"      "*" "*"      "*"      
## 17  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         "*"    " "      
## 2  ( 1 )  " "         "*"    " "      
## 3  ( 1 )  " "         "*"    " "      
## 4  ( 1 )  "*"         "*"    " "      
## 5  ( 1 )  "*"         "*"    " "      
## 6  ( 1 )  "*"         "*"    "*"      
## 7  ( 1 )  "*"         "*"    "*"      
## 8  ( 1 )  "*"         "*"    "*"      
## 9  ( 1 )  "*"         "*"    "*"      
## 10  ( 1 ) "*"         "*"    "*"      
## 11  ( 1 ) "*"         "*"    "*"      
## 12  ( 1 ) "*"         "*"    "*"      
## 13  ( 1 ) "*"         "*"    "*"      
## 14  ( 1 ) "*"         "*"    "*"      
## 15  ( 1 ) "*"         "*"    "*"      
## 16  ( 1 ) "*"         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"
fwd.summ=summary(fwd.fit)
par(mfrow=c(2,2))
plot(fwd.summ$cp, pch=20, xlab="Number of Variables", ylab="Cp")
plot(fwd.summ$bic, pch=20, xlab="Number of Variables", ylab="BIC")
RMSE=sqrt(fwd.summ$rss/nrow(College.train))
plot(RMSE, pch=20, xlab="Number of Variables", ylab ="RMSE")
plot(fwd.summ$adjr2, pch=20, xlab="Number of Variables", ylab="Adjusted R2 Values")

fwd.Cp.min = which.min(fwd.summ$cp)
fwd.BIC.min = which.min(fwd.summ$bic)
fwd.RMSE.min = which.min(RMSE)
fwd.AdjR2.max = which.max(fwd.summ$adjr2)
fwd.Cp.min
## [1] 12
fwd.BIC.min
## [1] 12
fwd.RMSE.min
## [1] 17
fwd.AdjR2.max
## [1] 13

The coefficients for the 12-variable model are shown below.

coef(fwd.fit, id=12)
##   (Intercept)    PrivateYes          Apps        Accept     Top10perc 
## -3075.4001461  2518.0610153    -0.3194939     0.7141407    26.2838880 
##   F.Undergrad    Room.Board      Personal           PhD      Terminal 
##    -0.1361100     0.8850004    -0.2666335    18.5499991    22.2163298 
##   perc.alumni        Expend     Grad.Rate 
##    39.3407369     0.1988058    23.0359234
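
regsubsets() has no predict() method, so as a sketch (not part of the original output), test-set predictions for the 12-variable model can be formed by multiplying the test model matrix by the selected coefficients; this also gives a linear-model baseline to compare against the GAM's test error in part (c):

# Build the test model matrix and keep only the columns of the selected predictors
test.mat = model.matrix(Outstate ~ ., data = College.test)
coef.12 = coef(fwd.fit, id = 12)
preds.12 = test.mat[, names(coef.12)] %*% coef.12
mean((College.test$Outstate - preds.12)^2)  # test MSE of the 12-variable linear model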

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)
## Warning: package 'gam' was built under R version 3.6.3
## Loading required package: splines
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.6.3
## Loaded gam 1.16.1
gam.fit = gam(Outstate ~ Private + s(Apps) + s(Accept) + s(Top10perc) +
    s(F.Undergrad) + s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) +
    s(perc.alumni) + s(Expend) + s(Grad.Rate), data=College.train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored
par(mfrow=c(3,4))
plot(gam.fit, se=T, col="red")

preds_train = predict(gam.fit, College.train)
train_err = mean((College.train$Outstate - preds_train)^2)  # training MSE
train_err
## [1] 2872298
train_tss = mean((College.train$Outstate - mean(College.train$Outstate))^2)
train_r2 = 1 - train_err / train_tss  # training R-squared
train_r2
## [1] 0.8185167

(c) Evaluate the model obtained on the test set, and explain the results obtained.
The test error was higher than the training error, and the R-squared value for the test set was a little lower than for the training data, but that is to be expected. The model explains 79.6% of the variance in the test data, so I think it is a fairly good model.

preds_test = predict(gam.fit, College.test)
test_err = mean((College.test$Outstate - preds_test)^2)  # test MSE
test_err
## [1] 3501624
test_tss = mean((College.test$Outstate - mean(College.test$Outstate))^2)
test_r2 = 1 - test_err / test_tss  # test R-squared
test_r2
## [1] 0.7962022
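
To put that error on the scale of the response, the square root of the test MSE is about 1871, i.e., the GAM's out-of-state tuition predictions are typically off by a little under $1,900:

sqrt(test_err)  # test RMSE in dollars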

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
By looking at the "Anova for Nonparametric Effects" portion of the summary, there is evidence of a nonlinear relationship between the response Outstate and the variables Accept, F.Undergrad, and Expend, as indicated by the small p-values associated with their smooth terms.

summary(gam.fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Apps) + s(Accept) + s(Top10perc) + 
##     s(F.Undergrad) + s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) + 
##     s(perc.alumni) + s(Expend) + s(Grad.Rate), data = College.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -6168.47 -1068.78    69.24  1184.79  4685.09 
## 
## (Dispersion Parameter for gaussian family taken to be 3117891)
## 
##     Null Deviance: 9242844542 on 583 degrees of freedom
## Residual Deviance: 1677424214 on 537.9997 degrees of freedom
## AIC: 10435.77 
## 
## Number of Local Scoring Iterations: 4 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Private          1 2816412723 2816412723 903.3071 < 2.2e-16 ***
## s(Apps)          1  880323253  880323253 282.3458 < 2.2e-16 ***
## s(Accept)        1  103318067  103318067  33.1372 1.444e-08 ***
## s(Top10perc)     1  944343476  944343476 302.8790 < 2.2e-16 ***
## s(F.Undergrad)   1  266907115  266907115  85.6050 < 2.2e-16 ***
## s(Room.Board)    1  757315862  757315862 242.8937 < 2.2e-16 ***
## s(Personal)      1   29881031   29881031   9.5837  0.002065 ** 
## s(PhD)           1  122773140  122773140  39.3770 7.193e-10 ***
## s(Terminal)      1   28828755   28828755   9.2462  0.002475 ** 
## s(perc.alumni)   1  146022029  146022029  46.8336 2.111e-11 ***
## s(Expend)        1  445638596  445638596 142.9295 < 2.2e-16 ***
## s(Grad.Rate)     1   44993627   44993627  14.4308  0.000162 ***
## Residuals      538 1677424214    3117891                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df  Npar F     Pr(F)    
## (Intercept)                                 
## Private                                     
## s(Apps)              3  1.9442  0.121409    
## s(Accept)            3  8.6403 1.311e-05 ***
## s(Top10perc)         3  1.1073  0.345597    
## s(F.Undergrad)       3  4.3119  0.005104 ** 
## s(Room.Board)        3  1.6830  0.169600    
## s(Personal)          3  1.7160  0.162640    
## s(PhD)               3  0.4305  0.731248    
## s(Terminal)          3  1.7893  0.148139    
## s(perc.alumni)       3  1.9508  0.120374    
## s(Expend)            3 20.5891 1.212e-12 ***
## s(Grad.Rate)         3  2.0664  0.103652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
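
As a follow-up sketch (not part of the original analysis), one could refit the GAM with, say, Expend entering linearly and check with an F-test whether the smooth term is worth its extra degrees of freedom; a small p-value would again favor the nonlinear fit:

# Same model as gam.fit, but with a linear term for Expend
gam.expend.lin = gam(Outstate ~ Private + s(Apps) + s(Accept) + s(Top10perc) +
    s(F.Undergrad) + s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) +
    s(perc.alumni) + Expend + s(Grad.Rate), data = College.train)
anova(gam.expend.lin, gam.fit, test = "F")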