In this exercise, you will further analyze the Wage data set considered throughout this chapter.
library(ISLR)
attach(Wage)
# View summary for the Wage dataset
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
k-Fold Cross-Validation
library(boot)
# set the seed for the random number generator to aid reproducibility
set.seed(17)
cv.error.10 <- rep(0, 10) # vector to store the computed cross-validation errors
for (i in 1:10) { # loop over polynomial degrees 1 through 10
  glm.fit <- glm(wage ~ poly(age, i), data = Wage) # fit a degree-i polynomial regression
  cv.error.10[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1] # 10-fold CV error for degree i
}
cv.error.10 #print out the resulting cross validation error
## [1] 1675.109 1601.423 1595.850 1594.073 1593.960 1597.031 1594.023 1594.544
## [9] 1592.573 1593.557
Plot
# illustrate the results with a line plot connecting the CV-error points
plot(x = 1:10, y = cv.error.10, xlab = "power of age", ylab = "CV error",
     type = "b", pch = 19, lwd = 2, bty = "n",
     ylim = c(min(cv.error.10) - sd(cv.error.10), max(cv.error.10) + sd(cv.error.10)))
# dotted line at min + 1 SD, a rough one-standard-error rule favoring less complexity
abline(h = min(cv.error.10) + sd(cv.error.10), lty = "dotted")
# mark the minimum CV error
points(x = which.min(cv.error.10), y = min(cv.error.10), col = "red", pch = "X", cex = 1.5)
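Cross-validation technically selects degree 9 (CV error 1592.573), but every degree from 3 up is within a few units of that minimum, so the extra complexity buys essentially nothing; a degree-4 polynomial, which the ANOVA below also supports, is the more defensible choice and is used for the final plot. The minimizing degree can be read off directly from the vector printed above:
which.min(cv.error.10) # degree with the smallest CV error, 9 here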
ANOVA
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
fit.6 <- lm(wage ~ poly(age, 6), data = Wage)
fit.7 <- lm(wage ~ poly(age, 7), data = Wage)
fit.8 <- lm(wage ~ poly(age, 8), data = Wage)
fit.9 <- lm(wage ~ poly(age, 9), data = Wage)
fit.10 <- lm(wage ~ poly(age, 10), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7, fit.8, fit.9, fit.10)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.7638 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.9005 0.001669 **
## 4 2995 4771604 1 6070 3.8143 0.050909 .
## 5 2994 4770322 1 1283 0.8059 0.369398
## 6 2993 4766389 1 3932 2.4709 0.116074
## 7 2992 4763834 1 2555 1.6057 0.205199
## 8 2991 4763707 1 127 0.0796 0.777865
## 9 2990 4756703 1 7004 4.4014 0.035994 *
## 10 2989 4756701 1 3 0.0017 0.967529
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero (\(<10^{-15}\)), indicating that a linear fit is not sufficient. Similarly, the p-value comparing the quadratic Model 2 to the cubic Model 3 is very low (0.001669), so the quadratic fit is also insufficient. The p-value comparing the cubic and degree-4 polynomials, Model 3 and Model 4, is approximately 5%, while the degree-5 polynomial Model 5 seems unnecessary because its p-value is 0.369398. Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, but lower- or higher-degree models are not justified.
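Because poly() generates orthogonal polynomials, there is a shortcut worth noting: the squared t-statistics from the single largest fit reproduce the F-statistics in the ANOVA table above (for example, the degree-2 t-statistic squared gives 143.76). A quick check (output omitted):
# squared t-statistics from the degree-10 fit match the ANOVA F column
coef(summary(fit.10))[, "t value"]^2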
Plot of the resulting Polynomial fit
# we now create a grid of values for age at which we want predictions
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit.4, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
# plot the data and add the fit from the degree-4 polynomial
par(mfrow = c(1, 1), mar = c(4.5, 4.5, 1, 1), oma = c(0, 0, 2, 0))
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Degree-4 Polynomial", outer = TRUE)
lines(age.grid, preds$fit, lwd = 2, col = "darkblue")
matlines(age.grid, se.bands, lwd = 1, col = "lightblue", lty = 3)
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
table(cut(age,4))
##
## (17.9,33.5] (33.5,49] (49,64.5] (64.5,80.1]
## 750 1399 779 72
fit_cut <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit_cut))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.158392 1.476069 63.789970 0.000000e+00
## cut(age, 4)(33.5,49] 24.053491 1.829431 13.148074 1.982315e-38
## cut(age, 4)(49,64.5] 23.664559 2.067958 11.443444 1.040750e-29
## cut(age, 4)(64.5,80.1] 7.640592 4.987424 1.531972 1.256350e-01
The age<33.5 category is left out, so the intercept coefficient of $94,160 can be interpreted as the average salary for those under 33.5 years of age, and the other coefficients can be interpreted as the average additional salary for those in the other age groups. We can produce predictions and plots just as we did in the case of the polynomial fit.
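As a sanity check on that interpretation, the intercept of a regression on a single factor equals the mean response in the baseline bin, so computing the group mean directly should reproduce the 94.158 estimate (output omitted):
# mean wage in the baseline (17.9,33.5] age bin equals the intercept
mean(Wage$wage[Wage$age <= 33.5])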
# set the seed for the random number generator to aid reproducibility
set.seed(18)
cv.error.15 <- rep(NA, 15) # NA (not 0) so which.min() below ignores the unused first slot
for (i in 2:15) { # start at 2: a single interval would be just a constant fit
  Wage$age.cut <- cut(Wage$age, i)
  lm.fit <- glm(wage ~ age.cut, data = Wage)
  cv.error.15[i] <- cv.glm(Wage, lm.fit, K = 10)$delta[1]
}
# the first element of cv.error.15 is NA because the loop started at 2
plot(2:15, cv.error.15[-1], xlab = "Number of cuts", ylab = "CV error",
     type = "b", pch = 19, lwd = 2, bty = "n")
# dotted line at min + 1 SD, a rough one-standard-error rule favoring fewer cuts
abline(h = min(cv.error.15, na.rm = TRUE) + sd(cv.error.15, na.rm = TRUE), lty = "dotted")
# highlight the minimum CV error (which.min() skips the NA in position 1)
points(x = which.min(cv.error.15), y = min(cv.error.15, na.rm = TRUE), col = "red", pch = "X", cex = 1.5)
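The selected number of cuts can then be read off directly (output not shown); the plot below uses the four-cut fit from the table above for illustration.
which.min(cv.error.15) # number of cuts with the smallest CV error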
lm.fit <- glm(wage ~ cut(age, 4), data = Wage)
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
lm.pred <- predict(lm.fit, data.frame(age = age.grid), se = TRUE)
plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
lines(age.grid, lm.pred$fit, col = "red", lwd = 2)
matlines(age.grid, cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                         lm.pred$fit - 2 * lm.pred$se.fit),
         col = "red", lty = "dashed")
This question relates to the College data set.
library(ISLR)
attach(College)
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
# set the seed for the random number generator to aid reproducibility
set.seed(123)
# sample random row indices to split the data 70/30 into training and test sets
indices <- sample(nrow(College), 0.70 * nrow(College))
train <- College[indices, ] # 70% of the data
test <- College[-indices, ] # 30% of the data
# fit a forward stepwise model
library(leaps)
regfit.fwd <- regsubsets(Outstate ~., data = train, nvmax = 17, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, nvmax = 17, method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " "*" " " " " " " " "
## 8 ( 1 ) "*" " " "*" " " " " " " "*"
## 9 ( 1 ) "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" "*" " " "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " " " "*" " " " "
## 7 ( 1 ) " " "*" " " " " "*" " " " "
## 8 ( 1 ) " " "*" " " " " "*" " " " "
## 9 ( 1 ) " " "*" " " " " "*" " " " "
## 10 ( 1 ) " " "*" " " " " "*" " " " "
## 11 ( 1 ) " " "*" " " " " "*" "*" " "
## 12 ( 1 ) " " "*" " " "*" "*" "*" " "
## 13 ( 1 ) " " "*" " " "*" "*" "*" " "
## 14 ( 1 ) " " "*" " " "*" "*" "*" " "
## 15 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 16 ( 1 ) " " "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" " "
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
# examine these to try to select the best overall model
reg.summary <- summary(regfit.fwd)
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
# R^2
reg.summary$rsq
## [1] 0.4341109 0.6054366 0.6774467 0.7123154 0.7313616 0.7401159 0.7429579
## [8] 0.7508792 0.7540080 0.7563797 0.7581694 0.7598391 0.7611480 0.7622390
## [15] 0.7629536 0.7630786 0.7631066
# which model size maximizes adjusted R^2?
which.max(reg.summary$adjr2)
## [1] 15
# which model sizes minimize RSS, Cp and BIC?
which.min(reg.summary$rss)
## [1] 17
which.min(reg.summary$cp)
## [1] 14
which.min(reg.summary$bic)
## [1] 9
# plotting RSS, adjusted R^2, Cp and BIC for all model sizes will help us decide which model to select
par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
points(which.min(reg.summary$rss), min(reg.summary$rss), col = "red", cex = 2, pch = 20)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
points(which.max(reg.summary$adjr2), max(reg.summary$adjr2), col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(reg.summary$cp), min(reg.summary$cp), col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), min(reg.summary$bic), col = "red", cex = 2, pch = 20)
par(mfrow=c(1,4), mar=c(0,0,0,0))
plot(regfit.fwd, scale="r2")
plot(regfit.fwd, scale="adjr2")
plot(regfit.fwd, scale="Cp")
plot(regfit.fwd, scale="bic")
BIC is minimized at 9 variables. Reading row 9 of the selection table, those variables are Private, Accept, Top25perc, F.Undergrad, Room.Board, PhD, perc.alumni, Expend and Grad.Rate. For the GAM below I use a closely related nine-variable set (Private, Apps, Accept, F.Undergrad, Room.Board, Terminal, perc.alumni, Expend and Grad.Rate), swapping in Apps and Terminal, which enter at the 10- and 11-variable steps, in place of Top25perc and PhD; Terminal and PhD are nearly interchangeable measures of faculty credentials.
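The membership of any model size can also be read off programmatically with coef(), which avoids misreading the asterisk table (output omitted):
# coefficients of the nine-variable model identified by BIC
coef(regfit.fwd, 9)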
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
# load in library
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.16.1
# fit a GAM model using the lm() function and natural splines
gam1 <- lm(Outstate ~ Private + ns(Apps, 2) + ns(Accept, 3) + ns(F.Undergrad, 4) +
             ns(Room.Board, 5) + ns(Terminal, 2) + ns(perc.alumni, 3) +
             ns(Expend, 4) + ns(Grad.Rate, 5), data = train)
# plot the model
par(mfrow = c(1, 3))
plot.Gam(gam1, se = TRUE, col = "#1c9099")
# fit a GAM model with the gam() function and smoothing splines
gam2 <- gam(Outstate ~ Private + s(Apps, 2) + s(Accept, 3) + s(F.Undergrad, 4) +
              s(Room.Board, 5) + s(Terminal, 2) + s(perc.alumni, 3) +
              s(Expend, 4) + s(Grad.Rate, 5), data = train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
# plot the model
par(mfrow = c(1, 3))
plot(gam2, se = TRUE, col = "#1c9099")
Based on the shape of the fitted curves, Expend and Grad.Rate appear to have strongly non-linear relationships with Outstate.
# fit a reduced model based on the plots above, keeping smooth terms only for Terminal and Expend
gam3 <- gam(Outstate ~ Private + Apps + Accept + F.Undergrad + s(Terminal, 4) +
              perc.alumni + s(Expend, 4) + Grad.Rate + Room.Board, data = train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
# plot the model
par(mfrow = c(1, 3))
plot(gam3, se = TRUE, col = "#1c9099")
# view summary
summary(gam3)
##
## Call: gam(formula = Outstate ~ Private + Apps + Accept + F.Undergrad +
## s(Terminal, 4) + perc.alumni + s(Expend, 4) + Grad.Rate +
## Room.Board, data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6986.4 -1172.0 138.4 1288.3 7859.6
##
## (Dispersion Parameter for gaussian family taken to be 3586630)
##
## Null Deviance: 9035173364 on 542 degrees of freedom
## Residual Deviance: 1890153766 on 526.9999 degrees of freedom
## AIC: 9754.076
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2598966645 2598966645 724.626 < 2.2e-16 ***
## Apps 1 1034656950 1034656950 288.476 < 2.2e-16 ***
## Accept 1 126879256 126879256 35.376 4.956e-09 ***
## F.Undergrad 1 189270680 189270680 52.771 1.356e-12 ***
## s(Terminal, 4) 1 835129066 835129066 232.845 < 2.2e-16 ***
## perc.alumni 1 314186305 314186305 87.599 < 2.2e-16 ***
## s(Expend, 4) 1 677679602 677679602 188.946 < 2.2e-16 ***
## Grad.Rate 1 123612652 123612652 34.465 7.685e-09 ***
## Room.Board 1 150367042 150367042 41.924 2.176e-10 ***
## Residuals 527 1890153766 3586630
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## Apps
## Accept
## F.Undergrad
## s(Terminal, 4) 3 0.4589 0.7111
## perc.alumni
## s(Expend, 4) 3 30.8764 <2e-16 ***
## Grad.Rate
## Room.Board
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(c) Evaluate the model obtained on the test set, and explain the results obtained.
# We can perform a series of ANOVA tests in order to determine which of these three models is best
anova(gam1, gam2, gam3, test = "F")
## Analysis of Variance Table
##
## Model 1: Outstate ~ Private + ns(Apps, 2) + ns(Accept, 3) + ns(F.Undergrad,
## 4) + ns(Room.Board, 5) + ns(Terminal, 2) + ns(perc.alumni,
## 3) + ns(Expend, 4) + ns(Grad.Rate, 5)
## Model 2: Outstate ~ Private + s(Apps, 2) + s(Accept, 3) + s(F.Undergrad,
## 4) + s(Room.Board, 5) + s(Terminal, 2) + s(perc.alumni, 3) +
## s(Expend, 4) + s(Grad.Rate, 5)
## Model 3: Outstate ~ Private + Apps + Accept + F.Undergrad + s(Terminal,
## 4) + perc.alumni + s(Expend, 4) + Grad.Rate + Room.Board
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 513 1774544966
## 2 513 1765461129 1.452e-04 9083837 18178.209 9.627e-06 ***
## 3 527 1890153766 -1.400e+01 -124692637 2.588 0.001275 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: ggplot2
# We can make predictions from gam objects, just like from lm objects, using the
# predict() method for class "gam". Here we make predictions on the test set.
preds <- predict(gam3, newdata = test)
# test-set RMSE, R^2 and MAE for gam1
postResample(predict(gam1, test), test$Outstate)
## RMSE Rsquared MAE
## 1665.9535207 0.8164515 1291.3729441
# test-set RMSE, R^2 and MAE for gam2
postResample(predict(gam2, test), test$Outstate)
## RMSE Rsquared MAE
## 1724.9172087 0.8034085 1326.5103422
# test-set RMSE, R^2 and MAE for gam3
postResample(predict(gam3, test), test$Outstate)
## RMSE Rsquared MAE
## 1749.6211825 0.7978709 1332.9535628
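On the held-out 30% of the data, all three GAMs explain roughly 80% of the variance in out-of-state tuition. The natural-spline fit gam1 performs best (RMSE about 1666, R^2 about 0.82), with the smoothing-spline gam2 close behind, while the more parsimonious gam3 gives up only a little accuracy (RMSE about 1750, R^2 about 0.80) in exchange for a model with just two smooth terms. Given the small gap, gam3 is a reasonable final choice.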
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam2)
##
## Call: gam(formula = Outstate ~ Private + s(Apps, 2) + s(Accept, 3) +
## s(F.Undergrad, 4) + s(Room.Board, 5) + s(Terminal, 2) + s(perc.alumni,
## 3) + s(Expend, 4) + s(Grad.Rate, 5), data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6949.3 -1139.5 134.8 1217.1 7903.7
##
## (Dispersion Parameter for gaussian family taken to be 3441446)
##
## Null Deviance: 9035173364 on 542 degrees of freedom
## Residual Deviance: 1765461129 on 512.9999 degrees of freedom
## AIC: 9745.018
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2584551575 2584551575 751.008 < 2.2e-16 ***
## s(Apps, 2) 1 1065415570 1065415570 309.584 < 2.2e-16 ***
## s(Accept, 3) 1 137788700 137788700 40.038 5.433e-10 ***
## s(F.Undergrad, 4) 1 175425141 175425141 50.974 3.218e-12 ***
## s(Room.Board, 5) 1 666776478 666776478 193.749 < 2.2e-16 ***
## s(Terminal, 2) 1 339237348 339237348 98.574 < 2.2e-16 ***
## s(perc.alumni, 3) 1 294136734 294136734 85.469 < 2.2e-16 ***
## s(Expend, 4) 1 588203384 588203384 170.917 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 65174292 65174292 18.938 1.630e-05 ***
## Residuals 513 1765461129 3441446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Apps, 2) 1 1.1553 0.2829554
## s(Accept, 3) 2 7.5925 0.0005632 ***
## s(F.Undergrad, 4) 3 2.7963 0.0396751 *
## s(Room.Board, 5) 4 2.3583 0.0525731 .
## s(Terminal, 2) 1 0.6492 0.4207470
## s(perc.alumni, 3) 2 0.5480 0.5784167
## s(Expend, 4) 3 30.4854 < 2.2e-16 ***
## s(Grad.Rate, 5) 4 2.3586 0.0525568 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The "Anova for Nonparametric Effects" table shows that Expend has a strong non-linear relationship with Outstate (p < 2.2e-16), and Accept also shows clear non-linearity (p = 0.0006). F.Undergrad shows moderate evidence (p = 0.04), while Room.Board and Grad.Rate are borderline (p = 0.053). This coincides with the curvature seen in the plots from part (b).
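As a further check on the strongest effect, we can refit gam2 with Expend entered linearly and compare the nested models with an F-test; a significant result confirms that the smooth term for Expend is needed. A minimal sketch, assuming the same train data (the name gam2.lin.expend is mine; output omitted):
# same model as gam2 but with a linear term for Expend
gam2.lin.expend <- gam(Outstate ~ Private + s(Apps, 2) + s(Accept, 3) + s(F.Undergrad, 4) +
                         s(Room.Board, 5) + s(Terminal, 2) + s(perc.alumni, 3) +
                         Expend + s(Grad.Rate, 5), data = train)
anova(gam2.lin.expend, gam2, test = "F")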