In this exercise, you will further analyze the Wage data set considered throughout this chapter.
set.seed(1)
plot(wage ~ age, data = Wage, col = "blue")
agerange = range(Wage$age)
agegrid = seq(from = agerange[1], to = agerange[2])
fitlm = lm(wage ~ poly(age, 3), data = Wage)
predic = predict(fitlm, newdata = list(age = agegrid))
lines(agegrid, predic, col = "Black", lwd = 2)
(b) Fit a step function to predict wage using age, and perform
crossvalidation to choose the optimal number of cuts. Make a plot of the
fit obtained.
train = sample(1: nrow(College), nrow(College)/2)
test = -train
fit = regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
fit.summary = summary(fit)
fit.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " "*" " " " "
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " " " "
## 6 ( 1 ) " " "*" " " " " " " "*" " "
## 7 ( 1 ) " " "*" " " "*" " " "*" " "
## 8 ( 1 ) " " "*" " " "*" " " "*" " "
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
coef(fit, id = 6)
## (Intercept) PrivateYes Room.Board Terminal perc.alumni
## -4726.8810613 2717.7019276 1.1032433 36.9990286 59.0863753
## Expend Grad.Rate
## 0.1930814 33.8303314
gamm = gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
par(mfrow = c(2,3))
plot(gamm, se = TRUE, col = 'Black')
Based off the graphs, Room/Board costs, alumni percent and out of state
tution increase. Expend and Graduation rates are nonlinear with relation
to the out of state tution.
prediction = predict(gamm,newdata = College)
mse = mean((College$Outstate - prediction)^2)
mse
## [1] 3385754
preds = predict(gamm, College[test, ])
RSS = sum((College[test, ]$Outstate - preds)^2)
TSS = sum((College[test, ]$Outstate - mean(College[test, ]$Outstate)) ^ 2)
1 - (RSS / TSS)
## [1] 0.7652114
The mean squared error is 3412013 and the R-squared value of 0.616 means that this model has a moderate strength based off the data.
summary(gamm)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal,
## 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
## data = College, subset = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7270.32 -1115.76 -58.54 1231.45 7013.47
##
## (Dispersion Parameter for gaussian family taken to be 3666101)
##
## Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1323460787 on 360.9995 degrees of freedom
## AIC: 6993.591
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1780121359 1780121359 485.562 < 2.2e-16 ***
## s(Room.Board, 5) 1 1635038271 1635038271 445.988 < 2.2e-16 ***
## s(Terminal, 5) 1 281113708 281113708 76.679 < 2.2e-16 ***
## s(perc.alumni, 5) 1 351003562 351003562 95.743 < 2.2e-16 ***
## s(Expend, 5) 1 607193060 607193060 165.624 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 89036829 89036829 24.287 1.267e-06 ***
## Residuals 361 1323460787 3666101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 5) 4 2.0626 0.0852 .
## s(Terminal, 5) 4 1.5754 0.1803
## s(perc.alumni, 5) 4 0.4126 0.7996
## s(Expend, 5) 4 21.2640 8.882e-16 ***
## s(Grad.Rate, 5) 4 0.8399 0.5006
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Out of State tution and Expend has a strong nonlinear relationship. Along with Out of State tution, Room/ Board also has a strong nonlinear relationship with Expend.