Q6. In this exercise, you will further analyze the “Wage” data set considered throughout this chapter.
plot(1:10, deltas, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min <- which.min(deltas)
points(which.min(deltas), deltas[which.min(deltas)], col = "red", cex = 2, pch = 20)
We may see that d=9 is the optimal degree for the polynomial. We now use ANOVA to test the null hypothesis that a model M1 is sufficient to explain the data against the alternative hypothesis that a more complex M2 is required.
fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We may see, by examining the p-values, that either a cubic or quartic polynomial appear to provide a reasonable fit to the data, but lower or higher order models are not justified. It remains to plot the resulting polynomial fit to the data.
plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
cvs <- rep(NA, 10)
for (i in 2:10) {
Wage$age.cut <- cut(Wage$age, i)
fit <- glm(wage ~ age.cut, data = Wage)
cvs[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(2:10, cvs[-1], xlab = "Cuts", ylab = "Test MSE", type = "l")
d.min <- which.min(cvs)
points(which.min(cvs), cvs[which.min(cvs)], col = "red", cex = 2, pch = 20)
We may see that the error is minimum for 8 cuts. Now, we fit the entire data with a step function using 8 cuts and plot it.
plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, data.frame(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
Q7. The “Wage” data set contains a number of other features nor explored in this chapter, such as marital status (“marit1”), job class (“jobclass”), and others. Explore the relationships between some of these other predictors and “wage”, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.
set.seed(1)
summary(Wage$maritl)
## 1. Never Married 2. Married 3. Widowed 4. Divorced
## 648 2074 19 204
## 5. Separated
## 55
summary(Wage$jobclass)
## 1. Industrial 2. Information
## 1544 1456
par(mfrow = c(1, 2))
plot(Wage$maritl, Wage$wage)
plot(Wage$jobclass, Wage$wage)
We may conclude that a married couple earns more money on average, and also that informational jobs earns more on average. We will now use GAM to predict “wage” using natural spline functions of “year”, “age”, “education”, “jobclass” and “maritl”
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
fit0 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education, data = Wage)
fit1 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass, data = Wage)
fit2 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + maritl, data = Wage)
fit3 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass + maritl, data = Wage)
anova(fit0, fit1, fit2, fit3)
## Analysis of Deviance Table
##
## Model 1: wage ~ lo(year, span = 0.7) + s(age, 5) + education
## Model 2: wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass
## Model 3: wage ~ lo(year, span = 0.7) + s(age, 5) + education + maritl
## Model 4: wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass +
## maritl
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2987.1 3691855
## 2 2986.1 3679689 1 12166 0.0014637 **
## 3 2983.1 3597526 3 82163 9.53e-15 ***
## 4 2982.1 3583675 1 13852 0.0006862 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We may conclude that the model “fit3” is significantly better.
par(mfrow = c(3, 3))
plot(fit3, se = T, col = "blue")