First, I will check whether there is a non-linear relationship between a car's weight and its mpg.
require(ISLR); require(tidyverse); require(caret)
require(ggthemes); require(broom); require(knitr)
theme_set(theme_tufte(base_size = 14) + theme(legend.position = 'top'))
set.seed(5)
options(knitr.kable.NA = '')
data('Auto')
I start with 8-fold cross-validation to find out which polynomial degree fits the data best without overfitting.
folds <- sample(rep(1:8, nrow(Auto) / 8))
errors <- matrix(NA, 8, 8)

for (k in 1:8) {
  for (i in 1:8) {
    # fit a degree-i polynomial of weight on every fold except fold k
    model <- lm(mpg ~ poly(weight, i), data = Auto[folds != k, ])
    # held-out root-mean-squared error on fold k
    pred <- predict(model, Auto[folds == k, ])
    errors[k, i] <- sqrt(mean((Auto$mpg[folds == k] - pred)^2))
  }
}
errors <- apply(errors, 2, mean)
tibble(RMSE = errors) %>%
  mutate(Polynomial = row_number()) %>%
  ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) +
  geom_col() +
  scale_x_continuous(breaks = 1:8) +
  guides(fill = 'none') +
  coord_cartesian(ylim = c(min(errors), max(errors)))
I find that the best model includes polynomial terms of weight up to \(weight^7\).
After finding the best model, I train it on the whole dataset and look at the p-values of each coefficient.
poly_model <- lm(mpg ~ poly(weight, which.min(errors)), data = Auto)
tidy(poly_model) %>%
  mutate(term = ifelse(row_number() == 1, 'Intercept',
                       paste0('$Weight^', row_number() - 1, '$'))) %>%
  kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| Intercept | 23.446 | 0.211 | 110.950 | 0.000 |
| \(Weight^1\) | -128.444 | 4.184 | -30.699 | 0.000 |
| \(Weight^2\) | 23.159 | 4.184 | 5.535 | 0.000 |
| \(Weight^3\) | 0.220 | 4.184 | 0.053 | 0.958 |
| \(Weight^4\) | -2.808 | 4.184 | -0.671 | 0.503 |
| \(Weight^5\) | 3.830 | 4.184 | 0.915 | 0.361 |
| \(Weight^6\) | -3.336 | 4.184 | -0.797 | 0.426 |
| \(Weight^7\) | 5.400 | 4.184 | 1.291 | 0.198 |
Even though the model uses orders of weight up to 7, the only statistically significant coefficients are \(Weight^1\) and \(Weight^2\). This strongly suggests that there is a second-degree non-linear relationship between a car's weight and its mpg.
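A quick sanity check on this claim is to compare the full seventh-order fit against a plain quadratic with an F-test. This is a minimal sketch, assuming the `poly_model` object from above is still in scope; the `quad_model` name is just for illustration.
# fit a plain second-degree polynomial for comparison
quad_model <- lm(mpg ~ poly(weight, 2), data = Auto)
# F-test: do the third- through seventh-order terms add anything beyond degree 2?
anova(quad_model, poly_model)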
Auto %>%
  mutate(Predictions = predict(poly_model, Auto)) %>%
  ggplot() + xlab('Weight') + ylab('Miles Per Gallon') +
  geom_point(aes(weight, mpg, col = 'blue')) +
  geom_line(aes(weight, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
We see that even though the seventh-order model fits the training data very well, its sharp swings at the edges of the sample space would probably lead to considerably worse out-of-sample performance than a simple second-order model.
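A rough way to quantify that intuition is to look back at the cross-validated RMSE values themselves; this sketch assumes the averaged `errors` vector from the loop above is still in scope.
# cross-validated RMSE of the quadratic versus the seventh-order polynomial
round(errors[c(2, 7)], 3)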
Next I will fit a stepwise linear model to check whether there is a non-linear relationship between a car's horsepower and its mpg. This means the model fits a separate first-order polynomial of the covariate within each chosen interval. Just like before, I will find the optimal predictive model via 8-fold cross-validation.
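Before running the cross-validation, it may help to see what `cut()` actually does: it turns a numeric variable into a factor of intervals, and interacting that factor with horsepower gives each interval its own slope. A small illustration (three intervals chosen purely as an example):
# cut() bins horsepower into equal-width intervals and counts observations in each
table(cut(Auto$horsepower, 3))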
folds <- sample(rep(1:8, nrow(Auto) / 8))
errors <- matrix(NA, 8, 8)

for (k in 1:8) {
  for (i in 1:8) {
    # cut horsepower into i + 1 equal-width intervals and give each its own slope
    Auto$cut <- cut(Auto$horsepower, i + 1)
    model <- lm(mpg ~ horsepower * cut, data = Auto[folds != k, ])
    # held-out root-mean-squared error on fold k
    pred <- predict(model, Auto[folds == k, ])
    errors[k, i] <- sqrt(mean((Auto$mpg[folds == k] - pred)^2))
    Auto$cut <- NULL
  }
}
errors <- apply(errors, 2, mean)
tibble(RMSE = errors) %>%
  mutate(Steps = row_number() + 1) %>%
  ggplot(aes(Steps, RMSE, fill = Steps == (which.min(errors) + 1))) +
  geom_col() +
  scale_x_continuous(breaks = 2:(length(errors) + 1)) +
  guides(fill = 'none') +
  coord_cartesian(ylim = c(min(errors), max(errors)))
The best predictive performance is obtained by cutting the horsepower variable into three intervals.
min_error <- which.min(errors) + 1
step_model <- lm(mpg ~ horsepower * cut(horsepower, min_error), data = Auto)
tidy(step_model) %>%
  kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 52.478 | 1.553 | 33.793 | 0 |
| horsepower | -0.311 | 0.019 | -16.544 | 0 |
| cut(horsepower, min_error)(107,169] | -15.625 | 3.616 | -4.321 | 0 |
| cut(horsepower, min_error)(169,230] | -35.949 | 7.701 | -4.668 | 0 |
| horsepower:cut(horsepower, min_error)(107,169] | 0.166 | 0.031 | 5.432 | 0 |
| horsepower:cut(horsepower, min_error)(169,230] | 0.294 | 0.043 | 6.810 | 0 |
The model finds the horsepower covariate by itself, as well as each stepwise transformation, to be statistically significant.
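To make the "separate slope per interval" interpretation concrete, the slope within each interval is the base horsepower coefficient plus the corresponding interaction term. A minimal sketch, assuming `step_model` as fitted above:
cf <- coef(step_model)
# slope of mpg on horsepower in the lowest interval
cf['horsepower']
# slopes in the remaining intervals: base slope plus each interaction coefficient
cf['horsepower'] + cf[grep('^horsepower:', names(cf))]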
Auto %>%
  mutate(Predictions = predict(step_model, Auto)) %>%
  ggplot() + xlab('Horsepower') + ylab('Miles Per Gallon') +
  geom_point(aes(horsepower, mpg, col = 'blue')) +
  geom_line(aes(horsepower, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
Now I will predict mpg with a spline function of displacement as well as with a linear model. Then I will perform an ANOVA test to see whether the spline model fits the data significantly better than the linear model. If so, there is strong evidence that the relationship between displacement and mpg is non-linear.
spline_model <- train(mpg ~ displacement, data = Auto,
                      method = 'gamSpline',
                      trControl = trainControl(method = 'cv', number = 8),
                      tuneGrid = expand.grid(df = seq(1, 12, 1)))
disp_lm <- train(mpg ~ displacement, data = Auto, method = 'lm',
                 trControl = trainControl(method = 'cv', number = 8))
plot(spline_model)
anova(disp_lm$finalModel,
      spline_model$finalModel) %>%
  tidy() %>%
  mutate(model = c('Linear', 'Spline'),
         percent = sumsq / lag(rss) * 100) %>%
  select(model, rss, percent, sumsq, p.value) %>%
  kable(digits = 3)
| model | rss | percent | sumsq | p.value |
|---|---|---|---|---|
| Linear | 8378.822 | | | |
| Spline | 6629.277 | 20.881 | 1749.545 | 0 |
The cross-validation approach leads us to choose a spline basis with 12 degrees of freedom.
We see that the spline model fits the data considerably better: it decreases the residual sum of squares by 20.881% and its p-value is essentially zero. We can therefore say with good confidence that the relationship is non-linear.
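The 20.881% figure is simply the drop in residual sum of squares relative to the linear model, using the values from the table above:
\[
\frac{8378.822 - 6629.277}{8378.822} \times 100 \approx 20.9\%
\]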
Auto %>%
  mutate(pred = predict(spline_model, Auto)) %>%
  ggplot() +
  geom_point(aes(displacement, mpg, col = 'blue')) +
  geom_line(aes(displacement, pred, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  labs(x = 'Displacement', y = 'Miles Per Gallon', title = 'Spline Relationship') +
  theme(legend.position = 'top')
Here we see the relationship between the variables and the model’s predictions.
To end things with a bang, I fit a generalized additive model (GAM) to mpg as a function of horsepower, weight, displacement, acceleration, and year.
gam_model <- train(mpg ~ horsepower +
                     weight +
                     displacement +
                     acceleration +
                     year, data = Auto,
                   method = 'gam',
                   trControl = trainControl(method = 'cv', number = 10))
lm_model <- train(mpg ~ horsepower +
                    weight +
                    displacement +
                    acceleration +
                    year +
                    cylinders, data = Auto,
                  method = 'lm',
                  trControl = trainControl(method = 'cv', number = 10))
tidy(gam_model$finalModel) %>%
kable(digits = 3)
| term | edf | ref.df | statistic | p.value |
|---|---|---|---|---|
| s(year) | 8.427 | 8.903 | 47.705 | 0.000 |
| s(displacement) | 1.753 | 2.184 | 3.171 | 0.045 |
| s(horsepower) | 2.181 | 2.779 | 7.851 | 0.000 |
| s(acceleration) | 2.382 | 3.074 | 2.246 | 0.088 |
| s(weight) | 2.545 | 3.219 | 18.996 | 0.000 |
The model creates a spline basis for each of the predictors and uses those smooth terms to predict the mpg of a car. We see that every smooth term except acceleration (p = 0.088) is significant at the 5% level.
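Since caret's `gam` method fits the model with mgcv, the estimated smooth for each predictor can also be inspected visually; a small sketch, assuming `gam_model$finalModel` is the fitted mgcv object:
# plot every estimated smooth term on one page, with shaded confidence bands
plot(gam_model$finalModel, pages = 1, shade = TRUE)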
anova(lm_model$finalModel, gam_model$finalModel) %>%
  tidy() %>%
  mutate(model = c('Linear', 'GAM'),
         percent = sumsq / lag(rss) * 100) %>%
  select(model, rss, percent, sumsq, p.value) %>%
  kable(digits = 3)
| model | rss | percent | sumsq | p.value |
|---|---|---|---|---|
| Linear | 4543.347 | | | |
| GAM | 2636.670 | 41.966 | 1906.677 | 0 |
Here I compare the GAM to a basic linear model. Just like the single-spline model, the GAM decreases the residual sum of squares, this time by a whopping 42%! The ANOVA test also deems the GAM to be a significant improvement over the linear model.
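Since both models were tuned with cross-validation, a quick additional check is to compare their resampled performance directly from the caret objects; a sketch, assuming `gam_model` and `lm_model` as fitted above:
# cross-validated RMSE and R-squared for each model
gam_model$results[, c('RMSE', 'Rsquared')]
lm_model$results[, c('RMSE', 'Rsquared')]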
Auto %>%
  mutate(pred = predict(gam_model, Auto)) %>%
  gather(variable, value,
         weight, horsepower,
         displacement, acceleration,
         year) %>%
  mutate(variable = str_to_title(variable)) %>%
  ggplot() +
  labs(x = '', y = 'Miles Per Gallon') +
  geom_point(aes(value, mpg, col = 'blue')) +
  geom_point(aes(value, pred, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  facet_wrap(~variable, scales = 'free', strip.position = 'bottom') +
  theme(legend.position = 'top')
Here we can see the prediction space for each of the covariates. Keep in mind that the orange dots are not functions of a single variable but of all five, four of which are not shown in any given panel.