head(Wage)
## year age maritl race education region
## 231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
## 86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
## 376662 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
## jobclass health health_ins logwage wage
## 231655 1. Industrial 1. <=Good 2. No 4.318063 75.04315
## 86582 2. Information 2. >=Very Good 2. No 4.255273 70.47602
## 161300 1. Industrial 1. <=Good 1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good 1. Yes 5.041393 154.68529
## 11443 2. Information 1. <=Good 1. Yes 4.318063 75.04315
## 376662 2. Information 2. >=Very Good 1. Yes 4.845098 127.11574
attach(Wage)  # so age and wage can be referenced directly below
When selecting the optimal degree d for the polynomial via 10-fold cross-validation, degree 9 actually gives the lowest estimated test error. A ninth-degree polynomial seems excessively flexible, so we choose degree 5, which gives the lowest error among the first five degrees. The ANOVA of the nested models supports keeping the degree modest: the quartic term is only marginally significant (p ≈ 0.051) and the quintic term is not significant (p ≈ 0.37).
# Initial scatterplot of the age-wage relationship
plot(age, wage)
library(boot)
set.seed(1)
# 10-fold CV error for polynomial degrees 1 through 10
cv.err = rep(0, 10)
for (i in 1:10){
  glm.wage = glm(wage ~ poly(age, i), data = Wage)
  cv.err[i] = cv.glm(Wage, glm.wage, K = 10)$delta[1]
}
plot(1:10, cv.err, pch = 19, type = 'b', xlab = 'degree', ylab = 'error')
which.min(cv.err)
## [1] 9
which.min(cv.err[1:5])
## [1] 5
min.deg = 5
# Refit the chosen degree-5 polynomial and plot its fit over a grid of ages
min.glm = glm(wage ~ poly(age, min.deg), data = Wage)
plot(age, wage)
age.range = range(age)
age.predict = seq(from = age.range[1], to = age.range[2], length.out = 100)
wage.predict = predict(min.glm, newdata = list(age = age.predict))
lines(age.predict, wage.predict, col = 'blue')
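As an optional extension (our addition, not part of the original fit), approximate 95% confidence bands can be overlaid using the standard errors returned by predict(); for a Gaussian GLM the link and response scales coincide, so the errors apply directly.

# Sketch: overlay approximate 95% confidence bands on the degree-5 fit
pred.se = predict(min.glm, newdata = list(age = age.predict), se.fit = TRUE)
lines(age.predict, pred.se$fit + 2 * pred.se$se.fit, col = 'blue', lty = 2)
lines(age.predict, pred.se$fit - 2 * pred.se$se.fit, col = 'blue', lty = 2)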
# Fit nested polynomial models of degrees 1-5 and compare them with F-tests
lm.1 = lm(wage ~ age, data = Wage)
lm.2 = lm(wage ~ poly(age, 2), data = Wage)
lm.3 = lm(wage ~ poly(age, 3), data = Wage)
lm.4 = lm(wage ~ poly(age, 4), data = Wage)
lm.5 = lm(wage ~ poly(age, 5), data = Wage)
anova(lm.1, lm.2, lm.3, lm.4, lm.5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
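As a sanity check, because poly() generates orthogonal polynomials, these F-statistics can be recovered directly from the largest fit: the squared t-statistics of the degree-5 model's polynomial terms equal the ANOVA F-values (e.g. 11.98^2 ≈ 143.6 for the linear term).

# The squared t-statistics of the polynomial terms in the degree-5 fit
# reproduce the ANOVA F-statistics above
coef(summary(lm.5))[, 't value']^2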
After performing 10-fold cross-validation over step functions with 2 to 10 cuts, the optimal number of cuts appears to be 7.
set.seed(1)
# 10-fold CV error for step functions with 2 through 10 cuts.
# model.frame() pre-computes the cut factor so each glm sees a plain predictor.
step.err = rep(0, 9)
for (i in 2:10){
  wage.cut = model.frame(wage ~ cut(age, i), data = Wage)
  names(wage.cut) = c('wage', 'age')
  wage.glm = glm(wage ~ age, data = wage.cut)
  step.err[i - 1] = cv.glm(wage.cut, wage.glm, K = 10)$delta[1]
}
plot(2:10, step.err, pch = 20, type = 'b', xlab = 'number of cuts', ylab = 'error')
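As a quick sketch of the result (our addition), the chosen 7-cut step function can be refitted on the full data and overlaid on the scatterplot:

# Fit the chosen 7-cut step function and overlay its fitted steps.
# Plotting the in-sample fitted values avoids re-applying cut() to new data,
# where the breaks would be recomputed from the new range.
step.fit = glm(wage ~ cut(age, 7), data = Wage)
plot(age, wage)
ord = order(age)
lines(age[ord], fitted(step.fit)[ord], col = 'red', lwd = 2)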
Turning to the College data, the plot of training RSS from forward stepwise selection suggests that a model with 8 predictors is satisfactory.
set.seed(1)
# 70/30 train/test split of the College data
index = sample(1:nrow(College), floor(0.7 * nrow(College)))
college.train = College[index, ]
college.test = College[-index, ]
library(leaps)
# Forward stepwise selection on the training set (regsubsets defaults to
# models of up to 8 predictors)
slt.for = regsubsets(Outstate ~ ., data = college.train, method = 'forward')
plot(summary(slt.for)$rss / nrow(college.train), type = 'b',
    xlab = 'number of predictors', ylab = 'training MSE')
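Training RSS necessarily decreases as predictors are added, so the penalized criteria that regsubsets also reports (Cp, BIC) are worth plotting alongside it before settling on a size; a sketch of how:

# Cp and BIC from the same forward-selection path
slt.sum = summary(slt.for)
par(mfrow = c(1, 2))
plot(slt.sum$cp, type = 'b', xlab = 'number of predictors', ylab = 'Cp')
plot(slt.sum$bic, type = 'b', xlab = 'number of predictors', ylab = 'BIC')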
which(summary(slt.for)$which[8,])
## (Intercept) PrivateYes Room.Board Personal PhD Terminal
## 1 2 10 12 13 14
## perc.alumni Expend Grad.Rate
## 16 17 18
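The coefficient estimates for the selected eight-variable model can be extracted directly:

# Coefficient estimates of the 8-predictor forward-selection model
coef(slt.for, id = 8)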
Looking at the fitted smooth functions below, most relationships with Outstate appear fairly linear and increasing, while others, such as Personal, trend downward.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
# GAM with smoothing splines for the seven quantitative predictors
college.gam = gam(Outstate ~ Private + s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) +
s(perc.alumni) + s(Expend) + s(Grad.Rate), data = college.train)
par(mfrow = c(3,3))
plot(college.gam, se=T, col='orange')
Evaluating the model on the test set, the GAM explains about 77% of the variance in out-of-state tuition (test R² ≈ 0.769).
# Test MSE of the GAM
mean((predict(college.gam, college.test) - college.test$Outstate)^2)
## [1] 3187899
# Test R^2 = 1 - RSS/TSS on the held-out colleges
test.tss = sum((college.test$Outstate - mean(college.test$Outstate))^2)
test.rss = sum((predict(college.gam, newdata = college.test) - college.test$Outstate)^2)
1 - test.rss/test.tss
## [1] 0.7690495
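For context (a comparison of our own, not part of the original analysis), a purely linear model with the same eight predictors provides a baseline test R² to measure the GAM against:

# Sketch: linear baseline with the same eight predictors
college.lm = lm(Outstate ~ Private + Room.Board + Personal + PhD + Terminal +
    perc.alumni + Expend + Grad.Rate, data = college.train)
lm.rss = sum((predict(college.lm, newdata = college.test) - college.test$Outstate)^2)
1 - lm.rss/test.tss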
Looking at the summary of our GAM, the ANOVA for nonparametric effects shows strong evidence of a non-linear relationship only for Expend (p < 2e-16), with weak evidence for Room.Board (p ≈ 0.058); the smooth terms for Personal, PhD, Terminal, perc.alumni, and Grad.Rate show no significant non-linearity.
summary(college.gam)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(Personal) +
## s(PhD) + s(Terminal) + s(perc.alumni) + s(Expend) + s(Grad.Rate),
## data = college.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7319.46 -1161.67 80.25 1289.53 7094.06
##
## (Dispersion Parameter for gaussian family taken to be 3526025)
##
## Null Deviance: 9260683704 on 542 degrees of freedom
## Residual Deviance: 1808848954 on 512.9995 degrees of freedom
## AIC: 9758.202
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2429063410 2429063410 688.896 < 2.2e-16 ***
## s(Room.Board) 1 1738946395 1738946395 493.175 < 2.2e-16 ***
## s(Personal) 1 94642288 94642288 26.841 3.180e-07 ***
## s(PhD) 1 595571621 595571621 168.907 < 2.2e-16 ***
## s(Terminal) 1 40005452 40005452 11.346 0.000813 ***
## s(perc.alumni) 1 397261386 397261386 112.665 < 2.2e-16 ***
## s(Expend) 1 802746515 802746515 227.663 < 2.2e-16 ***
## s(Grad.Rate) 1 82649866 82649866 23.440 1.708e-06 ***
## Residuals 513 1808848954 3526025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board) 3 2.513 0.0578 .
## s(Personal) 3 1.783 0.1494
## s(PhD) 3 1.480 0.2191
## s(Terminal) 3 1.755 0.1548
## s(perc.alumni) 3 1.132 0.3356
## s(Expend) 3 32.538 <2e-16 ***
## s(Grad.Rate) 3 1.694 0.1673
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
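One hedged way to corroborate the Expend finding is to refit the GAM with Expend entering linearly and compare the two fits with an approximate F-test via anova() (the linear-Expend model below is our own sketch):

# Sketch: does the smooth term for Expend improve on a linear term?
college.gam.lin = gam(Outstate ~ Private + s(Room.Board) + s(Personal) + s(PhD) +
    s(Terminal) + s(perc.alumni) + Expend + s(Grad.Rate), data = college.train)
anova(college.gam.lin, college.gam, test = 'F')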