Assignment 6

Q6. In this exercise, you will further analyze the “Wage” data set considered throughout this chapter.

Perform polynomial regression to predict “wage” using “age”. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA ? Make a plot of the resulting polynomial fit to the data.

plot(1:10, deltas, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min <- which.min(deltas)
points(which.min(deltas), deltas[which.min(deltas)], col = "red", cex = 2, pch = 20)

We may see that d=9 is the optimal degree for the polynomial. We now use ANOVA to test the null hypothesis that a model M1 is sufficient to explain the data against the alternative hypothesis that a more complex M2 is required.

fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)

## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We may see, by examining the p-values, that either a cubic or quartic polynomial appear to provide a reasonable fit to the data, but lower or higher order models are not justified. It remains to plot the resulting polynomial fit to the data.

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

Fit a step function to predict “wage” using “age”, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

cvs <- rep(NA, 10)
for (i in 2:10) {
    Wage$age.cut <- cut(Wage$age, i)
    fit <- glm(wage ~ age.cut, data = Wage)
    cvs[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(2:10, cvs[-1], xlab = "Cuts", ylab = "Test MSE", type = "l")
d.min <- which.min(cvs)
points(which.min(cvs), cvs[which.min(cvs)], col = "red", cex = 2, pch = 20)

We may see that the error is minimum for 8 cuts. Now, we fit the entire data with a step function using 8 cuts and plot it.

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, data.frame(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

Q7. The “Wage” data set contains a number of other features nor explored in this chapter, such as marital status (“marit1”), job class (“jobclass”), and others. Explore the relationships between some of these other predictors and “wage”, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.

set.seed(1)
summary(Wage$maritl)

## 1. Never Married       2. Married       3. Widowed      4. Divorced 
##              648             2074               19              204 
##     5. Separated 
##               55

summary(Wage$jobclass)

##  1. Industrial 2. Information 
##           1544           1456

par(mfrow = c(1, 2))
plot(Wage$maritl, Wage$wage)
plot(Wage$jobclass, Wage$wage)

We may conclude that a married couple earns more money on average, and also that informational jobs earns more on average. We will now use GAM to predict “wage” using natural spline functions of “year”, “age”, “education”, “jobclass” and “maritl”

library(gam)

## Loading required package: splines

## Loading required package: foreach

## Loaded gam 1.20

fit0 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education, data = Wage)
fit1 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass, data = Wage)
fit2 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + maritl, data = Wage)
fit3 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass + maritl, data = Wage)
anova(fit0, fit1, fit2, fit3)

## Analysis of Deviance Table
## 
## Model 1: wage ~ lo(year, span = 0.7) + s(age, 5) + education
## Model 2: wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass
## Model 3: wage ~ lo(year, span = 0.7) + s(age, 5) + education + maritl
## Model 4: wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass + 
##     maritl
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1    2987.1    3691855                          
## 2    2986.1    3679689  1    12166 0.0014637 ** 
## 3    2983.1    3597526  3    82163  9.53e-15 ***
## 4    2982.1    3583675  1    13852 0.0006862 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We may conclude that the model “fit3” is significantly better.

par(mfrow = c(3, 3))
plot(fit3, se = T, col = "blue")

Assignment 6

Emmanuel Amoah

11/7/2021