6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

Performing k-fold cross-validation with k = 10.

#Packages used in this section (assumed: ISLR provides the Wage data,
#MASS provides the Boston data, caret drives the cross-validation,
#and splines supplies bs(); dplyr and ggplot2 handle wrangling and plots)
library(ISLR)
library(MASS)
library(caret)
library(dplyr)
library(ggplot2)
library(splines)

#Set the seed
set.seed(123)

#Vector to capture the cross-validation RMSE for each degree
cv.wage.rmses <- c()

#CV Wage Control: 10-fold cross-validation
cv.wage.ctrl <- trainControl(method = "cv", number = 10)

#Loop over polynomial degrees 1 to 10, training a linear model on the
#raw polynomial basis of age and recording the CV RMSE
for (i in 1:10) {
  cv.wage.train <- train(y = Wage$wage,
                         x = poly(Wage$age, i, raw = T, simple = T),
                         method = "lm",
                         metric = "RMSE",
                         trControl = cv.wage.ctrl)
  cv.wage.rmses[i] <- cv.wage.train$results$RMSE
}

#Build the data frame using the degree of polynomial and Root Mean Square Errors
df.wage.rmses <- data.frame(degree = 1:10, cv.wage.rmses = cv.wage.rmses)
df.wage.rmses <- mutate(df.wage.rmses, min_cv_wage_rmse = as.numeric(min(cv.wage.rmses) == cv.wage.rmses))

#Plot the RMSE values against the degree of polynomial
#Mark the lowest RMSE
ggplot(df.wage.rmses, aes(x = degree, y = cv.wage.rmses)) +
  geom_line() +
  geom_point(size = 2, aes(col = factor(min_cv_wage_rmse))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Polynomial Regression for Wage",
       x = "Degree of Polynomial",
       y = "Cross Validation RMSE")

From the above k-fold RMSEs, we find that the lowest RMSE occurs at the third-degree polynomial.
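
As a quick programmatic check (a one-liner against the cv.wage.rmses vector computed above):

#Degree with the lowest cross-validation RMSE
which.min(cv.wage.rmses)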

Next, let me use ANOVA to test the null hypothesis that a model \(M_1\) is sufficient to explain the data against the alternative hypothesis that a more complex model \(M_2\) is required.

fit.wage.1 <- lm(wage ~ age, data = Wage)
fit.wage.2 <- lm(wage ~ poly(age, 2, raw = T), data = Wage)
fit.wage.3 <- lm(wage ~ poly(age, 3, raw = T), data = Wage)
fit.wage.4 <- lm(wage ~ poly(age, 4, raw = T), data = Wage)
fit.wage.5 <- lm(wage ~ poly(age, 5, raw = T), data = Wage)

anova(fit.wage.1, fit.wage.2, fit.wage.3, fit.wage.4, fit.wage.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2, raw = T)
## Model 3: wage ~ poly(age, 3, raw = T)
## Model 4: wage ~ poly(age, 4, raw = T)
## Model 5: wage ~ poly(age, 5, raw = T)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analyzing the p-values from the above ANOVA table, the cubic term is the highest-order term that is clearly significant (the quartic term is only marginal, with a p-value of about 0.051), so the third-degree polynomial agrees with the cross-validation result.
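
For reference, the coefficient table of the chosen cubic model can be inspected directly (using the fit.wage.3 object fitted above):

#Coefficient estimates, standard errors, t-values and p-values
coef(summary(fit.wage.3))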

So, plotting the third-degree polynomial fit, it looks reasonable.

ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", formula = "y ~ poly(x, 3, raw = T)") + 
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Third-Order Degree Polynomial for Wage with wage vs age",
       x = "Age",
       y = "Wage")

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

Performing k-fold cross-validation with k = 10 to determine the optimal number of cuts.

#Set the seed
set.seed(123)

#Vector to capture the cross-validation RMSE for each number of cuts
cv.wage.cuts <- c()

#CV Wage Control: 10-fold cross-validation
cv.wage.ctrl <- trainControl(method = "cv", number = 10)

#Loop over 2 to 10 cuts: cut() bins age into i intervals,
#and the step function is fit as a linear model on those bins
for (i in 2:10) {
  cv.wage.train <- train(y = Wage$wage,
                         x = data.frame(cut(Wage$age, i)),
                         method = "lm",
                         metric = "RMSE",
                         trControl = cv.wage.ctrl)
  cv.wage.cuts[i - 1] <- cv.wage.train$results$RMSE
}

#Build the data frame using the cuts and Root Mean Square Errors
df.wage.cuts <- data.frame(cuts = 2:10, cv.wage.cuts = cv.wage.cuts)
df.wage.cuts <- mutate(df.wage.cuts, min_cv_wage_rmse = as.numeric(min(cv.wage.cuts) == cv.wage.cuts))

#Plot the RMSE values against the number of cuts
#Mark the lowest RMSE
ggplot(df.wage.cuts, aes(x = cuts, y = cv.wage.cuts)) +
  geom_line() +
  geom_point(size = 2, aes(col = factor(min_cv_wage_rmse))) +
  scale_x_continuous(breaks = seq(2, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Step Function for Wage",
       x = "Cuts",
       y = "Cross Validation RMSE")

From the above k-fold RMSEs, we find that the lowest RMSE occurs at 8 cuts. So I will fit a step function with 8 cuts on the entire data and plot it; a sketch of the underlying fit is shown below, followed by the plot.
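
This is a minimal sketch of the underlying fit, the same model that geom_smooth() fits internally in the plot below (fit.wage.step is a name introduced here):

#Explicit fit of the step function: cut() bins age into 8 intervals
fit.wage.step <- lm(wage ~ cut(age, 8), data = Wage)
summary(fit.wage.step)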

ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", formula = "y ~ cut(x, 8)") + 
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Step Function with 8 cuts for Wage with wage vs age",
       x = "Age",
       y = "Wage")

9. This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.

(a) Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.

set.seed(1)
poly.boston.fit <- lm(nox ~ poly(dis, 3), data = Boston)
summary(poly.boston.fit)
## 
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.121130 -0.040619 -0.009738  0.023385  0.194904 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131 
## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16

From the above summary, all of the polynomial terms are significant. I use ggplot to overlay the fitted curve on the data points.

ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", formula = "y ~ poly(x, 3)") + 
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Polynomial Regression for Boston Dataset",
       x = "Weighted mean of distances to five Boston employment centers",
       y = "Nitrogen oxides concentration in parts per 10 million")

The above plot indicates that the cubic polynomial fits well except at the tail, where the data are sparse.

(b) Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.

First, I store the residual sums of squares in a vector and use ggplot to plot the degrees against their associated residual sums of squares.

#Vector to hold the residual sum of squares
poly.boston.residuals <- rep(NA, 10)
#Loop over polynomial degrees 1 to 10 to get the residual sum of squares
for (i in 1:10) {
    poly.boston.fit <- lm(nox ~ poly(dis, i), data = Boston)
    poly.boston.residuals[i] <- sum(poly.boston.fit$residuals^2)
}
#Build the data frame for the residual sum of squares
df.boston.rss <- data.frame(degree = 1:10, poly.boston.residuals = poly.boston.residuals)
df.boston.rss <- mutate(df.boston.rss, min_poly_residuals = as.numeric(min(poly.boston.residuals) == poly.boston.residuals))

#Using ggplot to plot
ggplot(df.boston.rss, aes(x = degree, y = poly.boston.residuals)) +
  geom_line() +
  geom_point(size = 2, aes(col = factor(min_poly_residuals))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Polynomial regression for Boston Dataset",
       x = "Degree of Polynomial",
       y = "Residual Sum of Squares")

From the above plot, we can see that the residual sum of squares decreases as the polynomial degree increases, with the minimum at the 10th-degree polynomial. This is expected: on the training data, the RSS can only decrease (or stay the same) as the model gains flexibility.
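
As a programmatic check against the df.boston.rss data frame built above:

#Degree with the lowest training RSS
df.boston.rss$degree[which.min(df.boston.rss$poly.boston.residuals)]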

(c) Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.

In order to find the optimal degree for the polynomial, I use the repeatedcv approach: k-fold cross-validation with k = 10 and 5 repeats.

#Set the seed
set.seed(123)

#Vector to capture the cross-validation RMSE for each degree
cv.boston.rmses <- c()

#CV Boston Control
cv.boston.ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

#Loop to run the train for the polynomial regression
for (i in 1:10) {
  cv.boston.train <- train(y = Boston$nox,
                        x = poly(Boston$dis, i, raw = T, simple = T),
                        method = "lm",
                        metric = "RMSE",
                        trControl = cv.boston.ctrl)
  cv.boston.rmses[i] <- cv.boston.train$results$RMSE
}

#Build the data frame using the degree of polynomial and Root Mean Square Errors
df.boston.rmses <- data.frame(degree = 1:10, cv.boston.rmses = cv.boston.rmses)
df.boston.rmses <- mutate(df.boston.rmses, min_cv_boston_rmse = as.numeric(min(cv.boston.rmses) == cv.boston.rmses))

#Plot the RMSE values against the degree of polynomial
#Mark the lowest RMSE
ggplot(df.boston.rmses, aes(x = degree, y = cv.boston.rmses)) +
  geom_line() +
  geom_point(size = 2, aes(col = factor(min_cv_boston_rmse))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Polynomial Regression for Boston",
       x = "Degree of Polynomial",
       y = "Cross Validation RMSE")

From the above plot, we find that the lowest RMSE occurs at the third-degree polynomial. We can also notice spikes (some instability) at the higher polynomial degrees. So let’s try removing the outliers in dis by keeping only observations with dis < 10.

#Set the seed
set.seed(123)

#Filter the Boston data where dis < 10
boston1 <- filter(Boston, dis < 10)

#Vector to capture the cross-validation RMSE for each degree
cv.boston1.rmses <- c()

#CV Boston Control
cv.boston1.ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

#Loop to run the train for the polynomial regression
for (i in 1:10) {
  cv.boston1.train <- train(y = boston1$nox,
                        x = poly(boston1$dis, i, raw = T, simple = T),
                        method = "lm",
                        metric = "RMSE",
                        trControl = cv.boston1.ctrl)
  cv.boston1.rmses[i] <- cv.boston1.train$results$RMSE
}

#Build the data frame using the degree of polynomial and Root Mean Square Errors
df.boston1.rmses <- data.frame(degree = 1:10, cv.boston1.rmses = cv.boston1.rmses)
df.boston1.rmses <- mutate(df.boston1.rmses, min_cv_boston1_rmse = as.numeric(min(cv.boston1.rmses) == cv.boston1.rmses))

#Plot the RMSE values against the degree of polynomial
#Mark the lowest RMSE
ggplot(df.boston1.rmses, aes(x = degree, y = cv.boston1.rmses)) +
  geom_line() +
  geom_point(size = 2, aes(col = factor(min_cv_boston1_rmse))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Polynomial Regression for Boston (dis < 10)",
       x = "Degree of Polynomial",
       y = "Cross Validation RMSE")

From the above plot, we find that filtering out the outliers has considerably reduced the instability. Now the lowest RMSE is at the 8th-degree polynomial.

(d) Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.

summary(Boston$dis)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.130   2.100   3.207   3.795   5.188  12.127

We see that dis ranges from about 1.1 to about 12.1. To cover this range with roughly evenly spaced knots, we can place them at 4, 7 and 11. (Note that when knots are supplied explicitly, bs() derives the degrees of freedom from them: 3 interior knots plus a cubic basis yield the 6 spline terms in the summary below, so the df = 4 argument is effectively ignored.)
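
As an aside, if we supplied only df and let bs() choose the knots itself, it would place them at quantiles of dis; a quick way to inspect this (a sketch, assuming the splines package is loaded):

#With a cubic basis, df = 4 leaves room for 4 - 3 = 1 interior knot,
#which bs() places at the median of dis
attr(bs(Boston$dis, df = 4), "knots")

Below, though, I proceed with the explicitly chosen knots.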

sp.boston.fit <- lm(nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)
summary(sp.boston.fit)
## 
## Call:
## lm(formula = nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.124567 -0.040355 -0.008702  0.024740  0.192920 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            0.73926    0.01331  55.537  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))1 -0.08861    0.02504  -3.539  0.00044 ***
## bs(dis, df = 4, knots = c(4, 7, 11))2 -0.31341    0.01680 -18.658  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))3 -0.26618    0.03147  -8.459 3.00e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))4 -0.39802    0.04647  -8.565  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))5 -0.25681    0.09001  -2.853  0.00451 ** 
## bs(dis, df = 4, knots = c(4, 7, 11))6 -0.32926    0.06327  -5.204 2.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06185 on 499 degrees of freedom
## Multiple R-squared:  0.7185, Adjusted R-squared:  0.7151 
## F-statistic: 212.3 on 6 and 499 DF,  p-value: < 2.2e-16

From the above summary, the p-values indicate that all of the spline terms are significant.

ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 4)") + 
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Polynomial Regression for Boston Dataset",
       x = "Weighted mean of distances to five Boston employment centers",
       y = "Nitrogen oxides concentration in parts per 10 million")

The above plot indicates that the spline fits most of the data well, except at extreme values of dis (dis > 10), where observations are sparse.

(e) Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.

Let’s try degrees of freedom from 3 to 16, collecting the residual sum of squares for each so we can plot them.

#Vector to hold the residual sum of squares
bs.boston.residuals <- rep(NA, 16)
#Loop over degrees of freedom 3 to 16 to get the residual sum of squares
for (i in 3:16) {
    bs.boston.fit <- lm(nox ~ bs(dis, df = i), data = Boston)
    bs.boston.residuals[i] <- sum(bs.boston.fit$residuals^2)
}
#Since we started the degree of freedom from 3, remove the first 2 items from the 
#residual sum of squares vector
bs.boston.residuals <- bs.boston.residuals[-c(1, 2)]
#Build the data frame for the residual sum of squares
df.boston.rss <- data.frame(df = 3:16, bs.boston.residuals = bs.boston.residuals)
df.boston.rss <- mutate(df.boston.rss, min_bs_residuals = as.numeric(min(bs.boston.residuals) == bs.boston.residuals))

#Using ggplot to plot
ggplot(df.boston.rss, aes(x = df, y = bs.boston.residuals)) +
  geom_line() +
  geom_point(size = 2, aes(col = factor(min_bs_residuals))) +
  scale_x_continuous(breaks = seq(3, 16), minor_breaks = NULL) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Splines for Boston Dataset",
       x = "Degree of freedom",
       y = "Residual Sum of Squares")

From the above plot, we find that the lowest residual sum of squares occurs at 14 degrees of freedom, after which the residual sum of squares increases slightly.
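
As a programmatic check against the df.boston.rss data frame rebuilt above for the spline fits:

#Degrees of freedom with the lowest training RSS
df.boston.rss$df[which.min(df.boston.rss$bs.boston.residuals)]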

(f) Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.

In order to find the optimal degrees of freedom for the regression spline, I use the repeatedcv approach: k-fold cross-validation with k = 10 and 5 repeats.

#Set the seed
set.seed(123)

#Vector to capture the cross-validation RMSE for each degree of freedom
cv.boston.rmses <- c()

#CV Boston Control
cv.boston.ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

#Loop over degrees of freedom 3 to 16 for the regression spline
for (i in 3:16) {
  cv.boston.train <- train(y = Boston$nox,
                        x = bs(Boston$dis, df = i),
                        method = "lm",
                        metric = "RMSE",
                        trControl = cv.boston.ctrl)
  cv.boston.rmses[i] <- cv.boston.train$results$RMSE
}
#Since we started the degree of freedom from 3, remove the first 2 items from the 
#rmse vector
cv.boston.rmses <- cv.boston.rmses[-c(1, 2)]

#Build the data frame using the degree of freedom and Root Mean Square Errors
df.boston.rmses <- data.frame(df = 3:16, cv.boston.rmses = cv.boston.rmses)
df.boston.rmses <- mutate(df.boston.rmses, min_cv_boston_rmse = as.numeric(min(cv.boston.rmses) == cv.boston.rmses))

#Plot the RMSE values against the degrees of freedom
#Mark the lowest RMSE
ggplot(df.boston.rmses, aes(x = df, y = cv.boston.rmses)) +
  geom_line() +
  geom_point(size = 2, aes(col = factor(min_cv_boston_rmse))) +
  scale_x_continuous(breaks = seq(3, 16), minor_breaks = NULL) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Splines for Boston",
       x = "Degree of freedom",
       y = "Cross Validation RMSE")

The lowest RMSE is at 7 degrees of freedom. The remaining instability in the RMSEs could be due to the outliers in dis.
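
To close, a minimal sketch of refitting the selected spline on the full data and plotting it (fit.boston.best is a name introduced here; the df value follows the cross-validation result above):

#Refit the regression spline with the CV-selected degrees of freedom
fit.boston.best <- lm(nox ~ bs(dis, df = 7), data = Boston)

ggplot(Boston, aes(x = dis, y = nox)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", formula = "y ~ bs(x, df = 7)") + 
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Regression Spline with df = 7 for Boston",
       x = "Weighted mean of distances to five Boston employment centers",
       y = "Nitrogen oxides concentration in parts per 10 million")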