Data Mining HW 5

Problem 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

library(ISLR)
attach(Wage)

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(boot)
set.seed(5)
errors <- rep(0, 10)
for (i in 1:10) {
    # 10-fold CV estimate of test MSE for a degree-i polynomial in age
    glm.fit <- glm(wage ~ poly(age, i), data = Wage)
    errors[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
errors
##  [1] 1675.881 1600.625 1595.877 1594.165 1594.823 1593.792 1593.204 1597.144
##  [9] 1593.845 1593.971
plot(1:10, errors, xlab = "Degree", ylab = "CV ERROR", type = "l")
err.min <- which.min(errors)
points(err.min, errors[err.min], col = "red", cex = 2, pch = 20)

The lowest CV error occurs at degree 7. However, from degree 3 onward the errors fluctuate within a very narrow band, so degree 3 is the better choice: the CV error hardly decreases, and sometimes increases, beyond that point. In other words, a polynomial of degree higher than 3 is unnecessary.
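To quantify how little the higher-degree fits gain, we can compare the CV error at degree 3 with the minimum at degree 7, using the errors vector computed above:

# Percentage increase in CV error at degree 3 over the minimum;
# from the output above this is only about 0.2%
100 * (errors[3] - min(errors)) / min(errors)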

fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As described in Chapter 7 of the textbook, the p-value comparing the cubic and degree-4 polynomials is approximately 5% (0.051), while the degree-5 polynomial seems unnecessary because its p-value is 0.37. Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data. When a higher degree does not yield a meaningfully better model, it is generally better to choose the lowest degree that adequately fits the data, so we proceed with the degree-3 polynomial.
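As a side note before plotting the chosen fit: because poly() generates orthogonal polynomials, the same p-values can be read directly off the t-tests of the largest model; each sequential F-statistic above is the square of the corresponding t-statistic.

coef(summary(fit5))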

# Overlay the degree-3 polynomial fit on the raw data
plot(wage ~ age, data = Wage, col = "green")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

cvs <- rep(NA, 10)
# A step function needs at least two intervals, so start at i = 2;
# cvs[1] stays NA and is dropped when plotting.
for (i in 2:10) {
    Wage$age.cut <- cut(Wage$age, i)
    fit <- glm(wage ~ age.cut, data = Wage)
    cvs[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(2:10, cvs[-1], xlab = "# of Cuts", ylab = "CV ERROR", type = "l")
err.min <- which.min(cvs)
points(err.min, cvs[err.min], col = "red", cex = 2, pch = 20)

The CV error is smallest at 8 cuts, so we refit the step function with 8 intervals and plot it over the data.

plot(wage ~ age, data = Wage, col = "green")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, data.frame(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
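For reference, cut() splits age into equal-width intervals; the eight bins and their observation counts can be inspected directly:

table(cut(Wage$age, 8))  # the eight equal-width age intervals and their counts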

Problem 10

This question relates to the College data set.

library(ISLR)
attach(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

library(leaps)
set.seed(5)
train <- sample(1:nrow(College), nrow(College) / 2)
test <- -train
fit <- regsubsets(Outstate ~ ., data = College, subset = train, method = "forward")
fit.summary <- summary(fit)
fit.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 ) "*"        " "  " "    " "    " "       " "       "*"        
## 8  ( 1 ) "*"        " "  "*"    " "    " "       " "       "*"        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 6  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
## 7  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
## 8  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         "*"    " "      
## 2  ( 1 ) " "         "*"    " "      
## 3  ( 1 ) " "         "*"    "*"      
## 4  ( 1 ) " "         "*"    "*"      
## 5  ( 1 ) "*"         "*"    "*"      
## 6  ( 1 ) "*"         "*"    "*"      
## 7  ( 1 ) "*"         "*"    "*"      
## 8  ( 1 ) "*"         "*"    "*"

Forward stepwise selection reports the best model of each size up to eight predictors, the largest size searched by regsubsets() by default. We use the eight-variable model; its coefficients are shown below.

coef(fit, id = 8)
##   (Intercept)    PrivateYes        Accept   F.Undergrad    Room.Board 
## -3538.5470811  2303.1188862     0.1883234    -0.1323393     0.8040597 
##      Terminal   perc.alumni        Expend     Grad.Rate 
##    38.2256905    58.5769407     0.2584365    31.6591662
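Since the search stopped at eight variables only because of the default nvmax, one could instead let an information criterion pick the subset size. A sketch (refitting with all 17 predictors allowed and minimizing BIC on the training data):

fit.full <- regsubsets(Outstate ~ ., data = College, subset = train,
                       method = "forward", nvmax = 17)
which.min(summary(fit.full)$bic)  # subset size with the lowest BIC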

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.16.1
library(splines)
fit <- gam(Outstate ~ Private + s(Accept, df = 2) + s(F.Undergrad, df = 2) +
               s(Room.Board, df = 2) + s(Terminal, df = 2) +
               s(perc.alumni, df = 2) + s(Expend, df = 5) +
               s(Grad.Rate, df = 2),
           data = College, subset = train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
par(mfrow = c(2, 4))
plot(fit, se = T, col = "blue")

From the plots we can see that Accept and Expend show clearly non-linear relationships with Outstate (which is also why s(Expend) was given df = 5). The other quantitative variables have a visually linear relationship with Outstate, or at least close to linear.
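Before the formal test in part (d), a quick way to probe one visually non-linear term is to compare this GAM with a version in which Expend enters linearly; the anova() method for gam fits performs the F-test (a sketch, with fit.lin a hypothetical name):

fit.lin <- gam(Outstate ~ Private + s(Accept, df = 2) + s(F.Undergrad, df = 2) +
                   s(Room.Board, df = 2) + s(Terminal, df = 2) +
                   s(perc.alumni, df = 2) + Expend + s(Grad.Rate, df = 2),
               data = College, subset = train)
# F-test: does the df = 5 smooth for Expend improve on a straight line?
anova(fit.lin, fit, test = "F")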

(c) Evaluate the model obtained on the test set, and explain the results obtained.

preds <- predict(fit, College[test, ])
RSS <- sum((College[test, ]$Outstate - preds)^2)
TSS <- sum((College[test, ]$Outstate - mean(College[test, ]$Outstate)) ^ 2)
1 - (RSS / TSS) 
## [1] 0.7791276

The test \(R^2\) is nearly 78%; that is, the model explains about 78% of the variance in out-of-state tuition for colleges it has not seen, so it performs fairly well on the test set.
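A complementary summary on the original dollar scale is the test RMSE, computed from the same predictions:

sqrt(mean((College[test, ]$Outstate - preds)^2))  # test RMSE in dollars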

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Accept, df = 2) + s(F.Undergrad, 
##     df = 2) + s(Room.Board, df = 2) + s(Terminal, df = 2) + s(perc.alumni, 
##     df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2), data = College, 
##     subset = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -6336.8 -1064.3   108.9  1226.3  4599.7 
## 
## (Dispersion Parameter for gaussian family taken to be 3283534)
## 
##     Null Deviance: 6473143184 on 387 degrees of freedom
## Residual Deviance: 1211624712 on 369.0002 degrees of freedom
## AIC: 6943.334 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private                  1 1910138560 1910138560 581.733 < 2.2e-16 ***
## s(Accept, df = 2)        1  277440270  277440270  84.494 < 2.2e-16 ***
## s(F.Undergrad, df = 2)   1  101151273  101151273  30.806 5.457e-08 ***
## s(Room.Board, df = 2)    1  973004054  973004054 296.328 < 2.2e-16 ***
## s(Terminal, df = 2)      1  462678890  462678890 140.909 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  383798356  383798356 116.886 < 2.2e-16 ***
## s(Expend, df = 5)        1  438798060  438798060 133.636 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1   83202459   83202459  25.339 7.531e-07 ***
## Residuals              369 1211624712    3283534                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F     Pr(F)    
## (Intercept)                                         
## Private                                             
## s(Accept, df = 2)            1 11.8163 0.0006541 ***
## s(F.Undergrad, df = 2)       1  0.8129 0.3678461    
## s(Room.Board, df = 2)        1  0.6948 0.4050831    
## s(Terminal, df = 2)          1  1.6636 0.1979245    
## s(perc.alumni, df = 2)       1  0.4052 0.5248057    
## s(Expend, df = 5)            4 14.0485 1.115e-10 ***
## s(Grad.Rate, df = 2)         1  1.1680 0.2805296    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This confirms our earlier observations: the nonparametric ANOVA gives strong evidence of a non-linear relationship with the response for Accept and, especially, Expend (note the tiny p-value for s(Expend, df = 5)), while the remaining smooth terms, including s(Grad.Rate), show no significant non-linearity.
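Given these p-values, a natural follow-up (a sketch only, not evaluated here) is a more parsimonious GAM that keeps smooth terms for Accept and Expend and enters the other predictors linearly:

fit.pars <- gam(Outstate ~ Private + s(Accept, df = 2) + F.Undergrad +
                    Room.Board + Terminal + perc.alumni +
                    s(Expend, df = 5) + Grad.Rate,
                data = College, subset = train)
# Its summary() and test R^2 could then be compared with the larger model above.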