Chapter 7

Exercises

8. Fit some of the non-linear models investigated in this chapter to the Auto data set. Is there evidence for non-linear relationships in this data set? Create some informative plots to justify your answer.

Polynomial Regression

First I will check to see if there is a non-linear relationship between a car's weight and its mpg.

require(ISLR); require(tidyverse); require(caret)
require(ggthemes); require(broom); require(knitr)
theme_set(theme_tufte(base_size = 14) + theme(legend.position = 'top'))
set.seed(5)
options(knitr.kable.NA = '')
data('Auto')
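
Before fitting anything, a quick exploratory plot already hints at curvature. This is a minimal sketch; the loess smoother is only for eyeballing the shape and is not one of the models fitted below.

ggplot(Auto, aes(weight, mpg)) +
    geom_point(alpha = 0.5) +                 # raw observations
    geom_smooth(method = 'loess') +           # flexible smoother to reveal curvature
    labs(x = 'Weight', y = 'Miles Per Gallon')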

Selecting Best Fit By K-Fold Cross-Validation

I start by doing 8-fold cross-validation to find out which polynomial degree fits the data best without overfitting.

folds <- sample(rep(1:8, nrow(Auto)/8))
errors <- matrix(NA, 8, 8)

for (k in 1:8) {                  # k indexes the held-out fold
    for (i in 1:8) {              # i is the polynomial degree
        model <- lm(mpg ~ poly(weight, i), data = Auto[folds != k,])
        pred <- predict(model, Auto[folds == k,])
        resid <- (Auto$mpg[folds == k] - pred)^2
        errors[k, i] <- sqrt(mean(resid))
    }
}
errors <- apply(errors, 2, mean)  # average RMSE across folds for each degree



data_frame(RMSE = errors) %>%
    mutate(Polynomial = row_number()) %>%
    ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) + 
    geom_col() +
    scale_x_continuous(breaks = 1:8) +
    guides(fill = FALSE) +
    coord_cartesian(ylim = c(min(errors), max(errors)))

I find that the best cross-validated model uses polynomial terms of weight up to \(weight^7\).

Fitting the Final Model

After finding the best model I fit it on the whole data set and look at the p-value for each coefficient.

poly_model <- lm(mpg ~ poly(weight, which.min(errors)), data = Auto)
tidy(poly_model) %>%
    mutate(term = ifelse(row_number() == 1, 'Intercept',
                         paste0('$Weight^', row_number()-1,'$'))) %>%
    kable(digits = 3)
term estimate std.error statistic p.value
Intercept 23.446 0.211 110.950 0.000
\(Weight^1\) -128.444 4.184 -30.699 0.000
\(Weight^2\) 23.159 4.184 5.535 0.000
\(Weight^3\) 0.220 4.184 0.053 0.958
\(Weight^4\) -2.808 4.184 -0.671 0.503
\(Weight^5\) 3.830 4.184 0.915 0.361
\(Weight^6\) -3.336 4.184 -0.797 0.426
\(Weight^7\) 5.400 4.184 1.291 0.198

Even though the model uses orders of weight up to 7, the only statistically significant coefficients are \(Weight^1\) and \(Weight^2\). This strongly suggests that there is a second-degree non-linear relationship between a car's weight and its mpg.
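
One way to back this up is a nested-model F-test of a plain quadratic against the degree-7 fit; since the orthogonal poly() bases are nested, anova() applies directly. A minimal sketch (the quad_model name is mine, not part of the original workflow):

quad_model <- lm(mpg ~ poly(weight, 2), data = Auto)  # quadratic fit for comparison
anova(quad_model, poly_model) %>%                     # F-test: do degrees 3-7 add anything?
    tidy() %>%
    kable(digits = 3)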

Auto %>%
    mutate(Predictions = predict(poly_model, Auto)) %>%
    ggplot() + xlab('Weight') + ylab('Miles Per Gallon') +
    geom_point(aes(weight, mpg, col = 'blue')) +
    geom_line(aes(weight, Predictions, col = 'goldenrod2'), size = 1.5) +
    scale_color_manual(name = 'Value Type',
                       labels = c('Observed', 'Predicted'),
                       values = c('#56B4E9', '#E69F00'))

We see that even though the seventh-order model fits the training data very well, the sharp changes at the edges of the sample space would probably lead to considerably worse out-of-sample performance than a simple second-order model.
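
To visualize that concern, one can overlay the simpler quadratic fit (quad_model from the sketch above) on the same data; a rough sketch:

Auto %>%
    mutate(`Degree 7` = predict(poly_model, Auto),
           `Degree 2` = predict(quad_model, Auto)) %>%
    gather(Model, Prediction, `Degree 7`, `Degree 2`) %>%
    ggplot() +
    geom_point(aes(weight, mpg), alpha = 0.4) +
    geom_line(aes(weight, Prediction, col = Model), size = 1) +
    labs(x = 'Weight', y = 'Miles Per Gallon')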

Stepwise Linear Model

Finding Optimal Number of Cuts Via K-Fold Cross-Validation

Next I will fit a stepwise linear model to check if there is a non-linear relationship between a car's horsepower and its mpg. This means that the model fits a separate first-order polynomial of horsepower within each chosen interval. Just like before, I will find the optimal predictive model via 8-fold cross-validation.
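
To make the construction concrete, here is a small illustration (not part of the cross-validation itself) of how cut() splits horsepower into intervals and how the interaction term gives each interval its own intercept and slope; the choice of three intervals here is purely for demonstration.

table(cut(Auto$horsepower, 3))                                 # how many cars fall in each interval
coef(lm(mpg ~ horsepower * cut(horsepower, 3), data = Auto))   # per-interval intercepts and slopes

With that construction in mind, the cross-validation proceeds just as before.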

folds <- sample(rep(1:8, nrow(Auto)/8))
errors <- matrix(NA, 8, 8)

for (k in 1:8) {                  # k indexes the held-out fold
    for (i in 1:8) {              # i + 1 is the number of intervals
        Auto$cut <- cut(Auto$horsepower, i + 1)
        model <- lm(mpg ~ horsepower * cut, data = Auto[folds != k,])
        pred <- predict(model, Auto[folds == k,])
        resid <- (Auto$mpg[folds == k] - pred)^2
        errors[k, i] <- sqrt(mean(resid))
        Auto$cut <- NULL
    }
}
errors <- apply(errors, 2, mean)  # average RMSE across folds for each number of intervals

data_frame(RMSE = errors) %>%
    mutate(Steps = row_number() +1) %>%
    ggplot(aes(Steps, RMSE, fill = Steps == (which.min(errors) + 1))) + 
    geom_col() +
    scale_x_continuous(breaks = 2:(length(errors)+1)) +
    guides(fill = FALSE) +
    coord_cartesian(ylim = c(min(errors), max(errors)))

The optimal predictive power is found with the horsepower variable split into three intervals.

Fitting the Final Model

min_error <- which.min(errors) + 1
step_model <- lm(mpg ~ horsepower * cut(horsepower, min_error), data = Auto)
tidy(step_model) %>%
    kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 52.478 1.553 33.793 0
horsepower -0.311 0.019 -16.544 0
cut(horsepower, min_error)(107,169] -15.625 3.616 -4.321 0
cut(horsepower, min_error)(169,230] -35.949 7.701 -4.668 0
horsepower:cut(horsepower, min_error)(107,169] 0.166 0.031 5.432 0
horsepower:cut(horsepower, min_error)(169,230] 0.294 0.043 6.810 0

The model finds the horsepower covariate by itself, as well as each stepwise term, to be statistically significant.
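
To read the table in terms of actual slopes, one can add each interaction coefficient to the base horsepower slope. A small follow-up sketch; the approximate values in the comment are taken from the table above:

cf <- coef(step_model)
inter <- cf[grepl(':', names(cf))]                     # interaction coefficients only
slopes <- c(cf['horsepower'], cf['horsepower'] + inter)
round(slopes, 3)
# roughly -0.311 in the lowest interval, -0.145 and -0.017 in the higher ones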

Auto %>%
    mutate(Predictions = predict(step_model, Auto)) %>%
    ggplot() + xlab('Horsepower') + ylab('Miles Per Gallon') + 
    geom_point(aes(horsepower, mpg, col = 'blue')) +
    geom_line(aes(horsepower, Predictions, col = 'goldenrod2'), size = 1.5) +
    scale_color_manual(name = 'Value Type',
                       labels = c('Observed', 'Predicted'),
                       values = c('#56B4E9', '#E69F00'))

Splines

Easier Model Selection With Caret by Max Kuhn

Now I will predict mpg with a spline function of displacement, as well as with a plain linear model. Then I will perform an ANOVA to see if the spline model fits the data significantly better than the linear model. If so, there is a strong indication that the relationship between displacement and mpg is non-linear.

spline_model <- train(mpg ~ displacement, data = Auto,
                      method = 'gamSpline',
                      trControl = trainControl(method = 'cv', number = 8),
                      tuneGrid = expand.grid(df = seq(1, 12, 1)))
disp_lm <- train(mpg ~ displacement, data = Auto, method = 'lm',
                 trControl = trainControl(method = 'cv', number = 8))
plot(spline_model)

anova(disp_lm$finalModel, 
      spline_model$finalModel) %>%
    tidy %>%
    mutate(model = c('Linear', 'Spline'),
           percent = sumsq/lag(rss) * 100) %>% 
    select(model, rss, percent, sumsq, p.value) %>%
    kable(digits = 3)
model rss percent sumsq p.value
Linear 8378.822
Spline 6629.277 20.881 1749.545 0

The cross-validation approach leads us to choose a spline basis with 12 degrees of freedom.
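
The selected tuning value can be read directly off the caret object; a quick check:

spline_model$bestTune   # degrees of freedom chosen by cross-validation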

We see that the spline model fits the data considerably better: it decreases the residual sum of squares by about 20.9%, and its p-value is essentially zero. We can therefore say with good confidence that the relationship is non-linear.

Auto %>%
    mutate(pred = predict(spline_model, Auto)) %>%
    ggplot() + 
    geom_point(aes(displacement, mpg, col = 'blue')) +
    geom_line(aes(displacement, pred, col = 'goldenrod2'), size = 1.5) +
    scale_color_manual(name = 'Value Type',
                       labels = c('Observed', 'Predicted'),
                       values = c('#56B4E9', '#E69F00')) +
    labs(x = 'Displacement', y = 'Miles Per Gallon', title = 'Spline Relationship') +
    theme(legend.position = 'top')

Here we see the relationship between the variables and the model’s predictions.

Generalized Additive Model

Cross-Validating Via Caret

To end things with a bang, I fit a generalized additive model to mpg as a function of horsepower, weight, displacement, acceleration and year.

gam_model <- train(mpg ~ horsepower  + 
                       weight + 
                       displacement +
                       acceleration +
                       year, data = Auto,
                   method = 'gam',
                   trControl = trainControl(method = 'cv', number = 10))

lm_model <- train(mpg ~ horsepower  + 
                       weight + 
                       displacement +
                       acceleration +
                      year +
                      cylinders, data = Auto,
                   method = 'lm',
                   trControl = trainControl(method = 'cv', number = 10))

tidy(gam_model$finalModel) %>%
    kable(digits = 3)
term edf ref.df statistic p.value
s(year) 8.427 8.903 47.705 0.000
s(displacement) 1.753 2.184 3.171 0.045
s(horsepower) 2.181 2.779 7.851 0.000
s(acceleration) 2.382 3.074 2.246 0.088
s(weight) 2.545 3.219 18.996 0.000

The model creates a new spline basis for each of the predictors and uses those to predict the mpg of a car. The smooth terms for year, displacement, horsepower and weight are significant at the 5% level, while acceleration is only marginally so (p = 0.088).
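
Since caret's 'gam' method wraps mgcv, the fitted smooths can also be inspected directly; a minimal sketch, assuming mgcv is installed:

library(mgcv)
plot(gam_model$finalModel, pages = 1, shade = TRUE)   # each smooth term with its confidence band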

anova(lm_model$finalModel, gam_model$finalModel) %>%
    tidy %>%
    mutate(model = c('Linear', 'GAM'),
           percent = sumsq/lag(rss) * 100) %>% 
    select(model, rss, percent, sumsq, p.value) %>%
    kable(digits = 3)
model rss percent sumsq p.value
Linear 4543.347
GAM 2636.670 41.966 1906.677 0

Next I compare the model to a linear model fitted above with the same covariates plus cylinders. As with the single-spline model, the GAM decreases the residual sum of squares, this time by a whopping 42%. The ANOVA also deems the GAM a significant improvement over the linear model.
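
Since both models were trained with cross-validation in caret, their held-out performance can be compared as well. A rough sketch using caret's resamples() helper; for a strictly paired comparison the two models would need identical folds via trainControl(index = ...):

# Held-out RMSE and R-squared across the CV folds of both models.
summary(resamples(list(GAM = gam_model, Linear = lm_model)))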

Auto %>%
    mutate(pred = predict(gam_model, Auto)) %>%
    gather(variable, value, 
           weight, horsepower, 
           displacement, acceleration,
           year) %>%
    mutate(variable = str_to_title(variable)) %>%
    ggplot() + 
    labs(x = '', y = 'Miles Per Gallon') + 
    geom_point(aes(value, mpg, col = 'blue')) +
    geom_point(aes(value, pred, col = 'goldenrod2'), size = 1.5) +
    scale_color_manual(name = 'Value Type',
                       labels = c('Observed', 'Predicted'),
                       values = c('#56B4E9', '#E69F00')) +
    facet_wrap(~variable, scales = 'free', strip.position = 'bottom') +
    theme(legend.position = 'top')

Here we can see the prediction space for each of the covariates. Keep in mind that the orange dots in the plots are not functions of one variable but of five, four of which are not shown in each panel.