6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
library(ISLR)
attach(Wage)
(a) Perform polynomial regression to predict wage using age.
# Polynomial Regression and Step Functions
fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
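Note that poly() uses an orthogonal-polynomial basis by default, so the coefficients above are not the coefficients of age, age^2, and so on; the fitted values, however, are identical to those from the raw power basis. A minimal check (not part of the original output):
# Refit with the raw basis and confirm the fitted values match
fit.raw <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
max(abs(fitted(fit) - fitted(fit.raw)))  # effectively zero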
Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR)
library(boot)
set.seed(1)
cv.error <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.error[i] <- cv.glm(Wage, glm.fit)$delta[1]  # LOOCV error estimate
}
cv.error
## [1] 1676.235 1600.529 1595.960 1594.596 1594.879 1594.119 1594.145 1594.975
## [9] 1593.356 1594.232
The cross-validation output shows that the error drops sharply from degree 1 to degree 4; beyond that there is little further reduction in the estimated test error.
plot(1:10, cv.error, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.error)
points(deg.min, cv.error[deg.min], col = 'red', cex = 2, pch = 19)
The plot shows that the minimum test MSE occurs at degree 9, but the curve is essentially flat beyond degree 3 or 4, so the degree-9 minimum may simply reflect noise.
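A quick check of this (an addition to the original analysis): rerunning the loop with 10-fold CV instead of LOOCV gives slightly different numbers, but if the curve again stays flat past degree 3 or 4, that supports treating the degree-9 minimum as noise.
set.seed(1)
cv.error10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.error10[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]  # 10-fold CV
}
round(cv.error10, 1)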
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
fit.6 <- lm(wage ~ poly(age, 6), data = Wage)
fit.7 <- lm(wage ~ poly(age, 7), data = Wage)
fit.8 <- lm(wage ~ poly(age, 8), data = Wage)
fit.9 <- lm(wage ~ poly(age, 9), data = Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5,fit.6,fit.7,fit.8,fit.9)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.8118 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.9038 0.001666 **
## 4 2995 4771604 1 6070 3.8156 0.050870 .
## 5 2994 4770322 1 1283 0.8062 0.369318
## 6 2993 4766389 1 3932 2.4718 0.116014
## 7 2992 4763834 1 2555 1.6062 0.205123
## 8 2991 4763707 1 127 0.0796 0.777829
## 9 2990 4756703 1 7004 4.4028 0.035963 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value comparing model 1 (linear) to model 2 (quadratic) is essentially zero (< 2.2e-16), so a linear fit is not sufficient. The p-value comparing model 2 to model 3 is 0.0017, so the quadratic fit is also insufficient. The p-value comparing the cubic to the degree-4 polynomial is about 0.05, while the degree-5 polynomial appears unnecessary (p = 0.37). Hence a cubic or degree-4 polynomial is a reasonable fit, and lower or higher orders are not justified (the isolated significant p-value for the degree-9 term is hard to act on given the non-significant intermediate terms). ANOVA therefore favors a much more parsimonious model than the degree-9 minimum found by cross-validation, whose advantage in CV error is negligible.
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit.4, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
par(mfrow = c(1, 1), mar = c(4.5, 4.5, 1, 1), oma = c(0, 0, 4, 0))
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Degree-4 Polynomial", outer = TRUE)
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "red", lty = 3)
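For comparison with the cross-validation result, the degree-9 fit can be overlaid on the same plot (a sketch added here; fit.9 is the model from the ANOVA comparison above). If the two curves are close over most of the age range, that further supports preferring the simpler degree-4 model.
preds9 <- predict(fit.9, newdata = list(age = age.grid))
lines(age.grid, preds9, lwd = 2, col = "darkgreen")
legend("topright", legend = c("Degree 4", "Degree 9"),
       col = c("blue", "darkgreen"), lwd = 2)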
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
library(ISLR)
library(boot)
set.seed(1)
cuts <- 10                           # maximum number of intervals to try
cv.error <- rep(NA, cuts)
for (i in 2:cuts) {                  # cut() requires at least 2 intervals
  Wage$agecut <- cut(age, i)
  glm.fit <- glm(wage ~ agecut, data = Wage)
  cv.error[i] <- cv.glm(Wage, glm.fit)$delta[1]
}
cv.error
## [1] NA 1734.064 1682.763 1635.894 1631.450 1623.291 1611.604 1601.006
## [9] 1610.213 1604.804
plot(2:cuts, cv.error[-1], xlab = "Cuts", ylab = "Test MSE", type = "l")
cut.min <- which.min(cv.error)
points(cut.min, cv.error[cut.min], col = "red", cex = 2, pch = 19)
The CV errors indicate that the optimal number of cuts is 8.
plot(wage ~ age, data = Wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
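To see where the eight intervals fall and how many observations each contains, a quick check (not part of the original output):
# Interval boundaries chosen by cut() and the bin counts
table(cut(Wage$age, 8))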
10. This question relates to the College data set.
library(ISLR)
attach(College)
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
# Splitting data into training (75%) and test (25%) sets
set.seed(1)
trainingind <- sample(nrow(College), 0.75 * nrow(College))
head(trainingind)
## [1] 679 129 509 471 299 270
train<-College[trainingind, ]
test<-College[-trainingind, ]
dim(College)
## [1] 777 18
dim(train)
## [1] 582 18
dim(test)
## [1] 195 18
# Performing forward stepwise selection
set.seed(1)
library(leaps)
regfit.fwd <- regsubsets(Outstate ~ ., data = train, nvmax = 17, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, nvmax = 17, method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " "*" " " " "
## 9 ( 1 ) "*" " " " " " " "*" " " " "
## 10 ( 1 ) "*" " " "*" " " "*" " " " "
## 11 ( 1 ) "*" "*" "*" " " "*" " " " "
## 12 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 13 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " " " "*" " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " "
## 9 ( 1 ) " " "*" " " "*" "*" "*" " "
## 10 ( 1 ) " " "*" " " "*" "*" "*" " "
## 11 ( 1 ) " " "*" " " "*" "*" "*" " "
## 12 ( 1 ) " " "*" " " "*" "*" "*" " "
## 13 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 14 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 15 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" " " "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" " "
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
reg.summary <- summary(regfit.fwd)
which.max(reg.summary$adjr2)
## [1] 13
coef(regfit.fwd,13)
## (Intercept) PrivateYes Apps Accept Top10perc
## -1830.2240567 2314.2340816 -0.3031429 0.7206354 30.0607613
## F.Undergrad Room.Board Personal PhD Terminal
## -0.1338858 0.9298037 -0.3683801 10.1747305 23.4453401
## S.F.Ratio perc.alumni Expend Grad.Rate
## -43.0149304 48.5868427 0.1727903 21.3217639
Based on adjusted R-squared, the best forward-stepwise model uses 13 predictors: PrivateYes, Apps, Accept, Top10perc, F.Undergrad, Room.Board, Personal, PhD, Terminal, S.F.Ratio, perc.alumni, Expend, and Grad.Rate, with coefficients as shown above. The plot below visualizes this choice.
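As a sanity check, adjusted R-squared can be plotted against model size, mirroring the CV plots earlier (a small addition; reg.summary comes from the code above):
plot(reg.summary$adjr2, xlab = "Number of predictors",
     ylab = "Adjusted R-squared", type = "l")
best <- which.max(reg.summary$adjr2)
points(best, reg.summary$adjr2[best], col = "red", cex = 2, pch = 19)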
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20.1
gam.mod <- gam(Outstate ~ Private + s(Apps, 5) + s(Accept, 5) + s(Top10perc, 5) +
                 s(F.Undergrad, 5) + s(Room.Board, 5) + s(Personal, 5) +
                 s(PhD, 5) + s(Terminal, 5) + s(S.F.Ratio, 5) +
                 s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
               data = train)
par(mfrow = c(3, 4))
plot(gam.mod, se = TRUE)
The fitted smooths show out-of-state tuition generally increasing with Room.Board, perc.alumni, and Grad.Rate, consistent with their positive coefficients in the forward-stepwise fit above, while several terms (most visibly Expend) bend away from a straight line; part (d) tests this non-linearity formally.
(c) Evaluate the model obtained on the test set, and explain the results obtained.
preds <- predict(gam.mod, newdata = test)
RSS <- sum((test$Outstate - preds)^2)
TSS <- sum((test$Outstate - mean(test$Outstate))^2)
Rsquared <- 1 - (RSS / TSS)
Rsquared
## [1] 0.7783117
The test R-squared is 0.778, meaning the GAM explains about 77.8% of the variability of out-of-state tuition on the test set.
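A complementary error measure is the test RMSE, which is on the dollar scale of Outstate (a small addition using the preds computed above):
mse <- mean((test$Outstate - preds)^2)
sqrt(mse)  # typical prediction error in dollars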
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.mod)
##
## Call: gam(formula = Outstate ~ Private + s(Apps, 5) + s(Accept, 5) +
## s(Top10perc, 5) + s(F.Undergrad, 5) + s(Room.Board, 5) +
## s(Personal, 5) + s(PhD, 5) + s(Terminal, 5) + s(S.F.Ratio,
## 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
## data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5956.97 -1082.77 75.16 1113.16 6750.38
##
## (Dispersion Parameter for gaussian family taken to be 3100868)
##
## Null Deviance: 9798277171 on 581 degrees of freedom
## Residual Deviance: 1612449052 on 519.9993 degrees of freedom
## AIC: 10411.35
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2560016475 2560016475 825.5807 < 2.2e-16 ***
## s(Apps, 5) 1 1069142945 1069142945 344.7883 < 2.2e-16 ***
## s(Accept, 5) 1 166528289 166528289 53.7038 8.962e-13 ***
## s(Top10perc, 5) 1 1107015390 1107015390 357.0018 < 2.2e-16 ***
## s(F.Undergrad, 5) 1 322284617 322284617 103.9337 < 2.2e-16 ***
## s(Room.Board, 5) 1 571444919 571444919 184.2855 < 2.2e-16 ***
## s(Personal, 5) 1 50347895 50347895 16.2367 6.425e-05 ***
## s(PhD, 5) 1 57299250 57299250 18.4785 2.052e-05 ***
## s(Terminal, 5) 1 25174827 25174827 8.1186 0.0045549 **
## s(S.F.Ratio, 5) 1 84735247 84735247 27.3263 2.493e-07 ***
## s(perc.alumni, 5) 1 135824401 135824401 43.8021 9.060e-11 ***
## s(Expend, 5) 1 473699785 473699785 152.7636 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 41377937 41377937 13.3440 0.0002855 ***
## Residuals 520 1612449052 3100868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Apps, 5) 4 1.9824 0.095880 .
## s(Accept, 5) 4 16.9375 4.583e-13 ***
## s(Top10perc, 5) 4 1.9224 0.105354
## s(F.Undergrad, 5) 4 4.5613 0.001253 **
## s(Room.Board, 5) 4 1.6018 0.172487
## s(Personal, 5) 4 2.9606 0.019460 *
## s(PhD, 5) 4 2.2780 0.059855 .
## s(Terminal, 5) 4 1.6920 0.150444
## s(S.F.Ratio, 5) 4 3.7650 0.004970 **
## s(perc.alumni, 5) 4 2.3135 0.056509 .
## s(Expend, 5) 4 21.1692 3.331e-16 ***
## s(Grad.Rate, 5) 4 0.9614 0.428241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The nonparametric ANOVA shows strong evidence of a non-linear relationship with Outstate for Accept, F.Undergrad, Personal, S.F.Ratio, and Expend (p < 0.05), and weaker evidence for Apps, PhD, and perc.alumni (p < 0.1).
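As a follow-up sketch (an extension beyond the original answer, with the variable split chosen from the table above), one could keep smooth terms only where the nonparametric ANOVA found strong evidence of non-linearity, use linear terms elsewhere, and test whether the simpler model suffices:
# Smooths only for Accept, F.Undergrad, Personal, S.F.Ratio, Expend;
# linear terms for the remaining predictors
gam.lin <- gam(Outstate ~ Private + Apps + s(Accept, 5) + Top10perc +
                 s(F.Undergrad, 5) + Room.Board + s(Personal, 5) + PhD +
                 Terminal + s(S.F.Ratio, 5) + perc.alumni + s(Expend, 5) +
                 Grad.Rate, data = train)
anova(gam.lin, gam.mod, test = "F")
A large p-value here would indicate that linear terms are adequate for the variables that showed no evidence of non-linearity.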