First, I will check whether there is a non-linear relationship between a car's weight and its mpg.
require(ISLR); require(tidyverse); require(caret)
require(ggthemes); require(broom); require(knitr)
theme_set(theme_tufte(base_size = 14) + theme(legend.position = 'top'))
set.seed(5)
options(knitr.kable.NA = '')
data('Auto')
I start with 8-fold cross-validation to find out which polynomial degree fits the data best without overfitting.
folds <- sample(rep(1:8, nrow(Auto) / 8))
errors <- matrix(NA, 8, 8)

for (k in 1:8) {
  for (i in 1:8) {
    # fit a degree-i polynomial of weight on every fold except fold k
    model <- lm(mpg ~ poly(weight, i), data = Auto[folds != k, ])
    # held-out root-mean-squared error on fold k
    pred <- predict(model, Auto[folds == k, ])
    errors[k, i] <- sqrt(mean((Auto$mpg[folds == k] - pred)^2))
  }
}
errors <- apply(errors, 2, mean)
tibble(RMSE = errors) %>%
  mutate(Polynomial = row_number()) %>%
  ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) +
  geom_col() +
  scale_x_continuous(breaks = 1:8) +
  guides(fill = 'none') +
  coord_cartesian(ylim = c(min(errors), max(errors)))
I find that the best model includes polynomial terms of weight up to \(weight^7\).
After finding the best model, I train it on the whole dataset and look at the p-values of each coefficient.
poly_model <- lm(mpg ~ poly(weight, which.min(errors)), data = Auto)
tidy(poly_model) %>%
  mutate(term = ifelse(row_number() == 1, 'Intercept',
                       paste0('$Weight^', row_number() - 1, '$'))) %>%
  kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| Intercept | 23.446 | 0.211 | 110.950 | 0.000 |
| \(Weight^1\) | -128.444 | 4.184 | -30.699 | 0.000 |
| \(Weight^2\) | 23.159 | 4.184 | 5.535 | 0.000 |
| \(Weight^3\) | 0.220 | 4.184 | 0.053 | 0.958 |
| \(Weight^4\) | -2.808 | 4.184 | -0.671 | 0.503 |
| \(Weight^5\) | 3.830 | 4.184 | 0.915 | 0.361 |
| \(Weight^6\) | -3.336 | 4.184 | -0.797 | 0.426 |
| \(Weight^7\) | 5.400 | 4.184 | 1.291 | 0.198 |
Even though the model uses orders of weight up to 7, the only statistically significant coefficients are \(Weight^1\) and \(Weight^2\). This strongly suggests that there is a second-degree non-linear relationship between a car's weight and its mpg.
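A quick sanity check on this claim is to compare the full seventh-order fit against a plain quadratic with an F-test. This is a minimal sketch, assuming the `poly_model` object from above is still in scope; the `quad_model` name is just for illustration.
# fit a plain second-degree polynomial for comparison
quad_model <- lm(mpg ~ poly(weight, 2), data = Auto)
# F-test: do the third- through seventh-order terms add anything beyond degree 2?
anova(quad_model, poly_model)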
Auto %>%
  mutate(Predictions = predict(poly_model, Auto)) %>%
  ggplot() + xlab('Weight') + ylab('Miles Per Gallon') +
  geom_point(aes(weight, mpg, col = 'blue')) +
  geom_line(aes(weight, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
We see that even though the seventh-order model fits the training data very well, its sharp swings at the edges of the sample space would probably lead to considerably worse out-of-sample performance than a simple second-order model.
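A rough way to quantify that intuition is to look back at the cross-validated RMSE values themselves; this sketch assumes the averaged `errors` vector from the loop above is still in scope.
# cross-validated RMSE of the quadratic versus the seventh-order polynomial
round(errors[c(2, 7)], 3)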
Next I will fit a stepwise linear model to check whether there is a non-linear relationship between a car's horsepower and its mpg. This means the model fits a separate first-order polynomial of the covariate within each chosen interval. Just like before, I will find the optimal predictive model via 8-fold cross-validation.
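Before running the cross-validation, it may help to see what `cut()` actually does: it turns a numeric variable into a factor of intervals, and interacting that factor with horsepower gives each interval its own slope. A small illustration (three intervals chosen purely as an example):
# cut() bins horsepower into equal-width intervals and counts observations in each
table(cut(Auto$horsepower, 3))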
folds <- sample(rep(1:8, nrow(Auto) / 8))
errors <- matrix(NA, 8, 8)

for (k in 1:8) {
  for (i in 1:8) {
    # cut horsepower into i + 1 equal-width intervals and give each its own slope
    Auto$cut <- cut(Auto$horsepower, i + 1)
    model <- lm(mpg ~ horsepower * cut, data = Auto[folds != k, ])
    # held-out root-mean-squared error on fold k
    pred <- predict(model, Auto[folds == k, ])
    errors[k, i] <- sqrt(mean((Auto$mpg[folds == k] - pred)^2))
    Auto$cut <- NULL
  }
}
errors <- apply(errors, 2, mean)
tibble(RMSE = errors) %>%
  mutate(Steps = row_number() + 1) %>%
  ggplot(aes(Steps, RMSE, fill = Steps == (which.min(errors) + 1))) +
  geom_col() +
  scale_x_continuous(breaks = 2:(length(errors) + 1)) +
  guides(fill = 'none') +
  coord_cartesian(ylim = c(min(errors), max(errors)))
The best predictive performance is obtained by cutting the horsepower variable into three intervals.
min_error <- which.min(errors) + 1
step_model <- lm(mpg ~ horsepower * cut(horsepower, min_error), data = Auto)
tidy(step_model) %>%
  kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 52.478 | 1.553 | 33.793 | 0 |
| horsepower | -0.311 | 0.019 | -16.544 | 0 |
| cut(horsepower, min_error)(107,169] | -15.625 | 3.616 | -4.321 | 0 |
| cut(horsepower, min_error)(169,230] | -35.949 | 7.701 | -4.668 | 0 |
| horsepower:cut(horsepower, min_error)(107,169] | 0.166 | 0.031 | 5.432 | 0 |
| horsepower:cut(horsepower, min_error)(169,230] | 0.294 | 0.043 | 6.810 | 0 |
The model finds the horsepower covariate by itself, as well as each stepwise transformation, to be statistically significant.
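To make the "separate slope per interval" interpretation concrete, the slope within each interval is the base horsepower coefficient plus the corresponding interaction term. A minimal sketch, assuming `step_model` as fitted above:
cf <- coef(step_model)
# slope of mpg on horsepower in the lowest interval
cf['horsepower']
# slopes in the remaining intervals: base slope plus each interaction coefficient
cf['horsepower'] + cf[grep('^horsepower:', names(cf))]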
Auto %>%
  mutate(Predictions = predict(step_model, Auto)) %>%
  ggplot() + xlab('Horsepower') + ylab('Miles Per Gallon') +
  geom_point(aes(horsepower, mpg, col = 'blue')) +
  geom_line(aes(horsepower, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
Now I will predict mpg with a spline function of displacement as well as with a linear model. Then I will perform an ANOVA test to see whether the spline model fits the data significantly better than the linear model. If so, there is strong evidence that the relationship between displacement and mpg is non-linear.
spline_model <- train(mpg ~ displacement, data = Auto,
                      method = 'gamSpline',
                      trControl = trainControl(method = 'cv', number = 8),
                      tuneGrid = expand.grid(df = seq(1, 12, 1)))
disp_lm <- train(mpg ~ displacement, data = Auto, method = 'lm',
                 trControl = trainControl(method = 'cv', number = 8))
plot(spline_model)
anova(disp_lm$finalModel,
      spline_model$finalModel) %>%
  tidy() %>%
  mutate(model = c('Linear', 'Spline'),
         percent = sumsq / lag(rss) * 100) %>%
  select(model, rss, percent, sumsq, p.value) %>%
  kable(digits = 3)
| model | rss | percent | sumsq | p.value |
|---|---|---|---|---|
| Linear | 8378.822 | | | |
| Spline | 6629.277 | 20.881 | 1749.545 | 0 |
The cross-validation approach leads us to choose a spline basis with 12 degrees of freedom.
We see that the spline model fits the data considerably better: it decreases the residual sum of squares by 20.881% and its p-value is essentially zero. We can therefore say with good confidence that the relationship is non-linear.
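The 20.881% figure is simply the drop in residual sum of squares relative to the linear model, using the values from the table above:
\[
\frac{8378.822 - 6629.277}{8378.822} \times 100 \approx 20.9\%
\]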
Auto %>%
  mutate(pred = predict(spline_model, Auto)) %>%
  ggplot() +
  geom_point(aes(displacement, mpg, col = 'blue')) +
  geom_line(aes(displacement, pred, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  labs(x = 'Displacement', y = 'Miles Per Gallon', title = 'Spline Relationship') +
  theme(legend.position = 'top')
Here we see the relationship between the variables and the model’s predictions.
To end things with a bang, I fit a generalized additive model (GAM) to mpg as a function of horsepower, weight, displacement, acceleration, and year.
gam_model <- train(mpg ~ horsepower +
                     weight +
                     displacement +
                     acceleration +
                     year, data = Auto,
                   method = 'gam',
                   trControl = trainControl(method = 'cv', number = 10))
lm_model <- train(mpg ~ horsepower +
                    weight +
                    displacement +
                    acceleration +
                    year +
                    cylinders, data = Auto,
                  method = 'lm',
                  trControl = trainControl(method = 'cv', number = 10))
tidy(gam_model$finalModel) %>%
kable(digits = 3)
| term | edf | ref.df | statistic | p.value |
|---|---|---|---|---|
| s(year) | 8.427 | 8.903 | 47.705 | 0.000 |
| s(displacement) | 1.753 | 2.184 | 3.171 | 0.045 |
| s(horsepower) | 2.181 | 2.779 | 7.851 | 0.000 |
| s(acceleration) | 2.382 | 3.074 | 2.246 | 0.088 |
| s(weight) | 2.545 | 3.219 | 18.996 | 0.000 |
The model creates a spline basis for each of the predictors and uses those smooth terms to predict the mpg of a car. We see that every smooth term except acceleration (p = 0.088) is significant at the 5% level.
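Since caret's `gam` method fits the model with mgcv, the estimated smooth for each predictor can also be inspected visually; a small sketch, assuming `gam_model$finalModel` is the fitted mgcv object:
# plot every estimated smooth term on one page, with shaded confidence bands
plot(gam_model$finalModel, pages = 1, shade = TRUE)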
anova(lm_model$finalModel, gam_model$finalModel) %>%
  tidy() %>%
  mutate(model = c('Linear', 'GAM'),
         percent = sumsq / lag(rss) * 100) %>%
  select(model, rss, percent, sumsq, p.value) %>%
  kable(digits = 3)
| model | rss | percent | sumsq | p.value |
|---|---|---|---|---|
| Linear | 4543.347 | | | |
| GAM | 2636.670 | 41.966 | 1906.677 | 0 |
Here I compare the GAM to a basic linear model. Just like the single-spline model, the GAM decreases the residual sum of squares, this time by a whopping 42%! The ANOVA test also deems the GAM to be a significant improvement over the linear model.
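Since both models were tuned with cross-validation, a quick additional check is to compare their resampled performance directly from the caret objects; a sketch, assuming `gam_model` and `lm_model` as fitted above:
# cross-validated RMSE and R-squared for each model
gam_model$results[, c('RMSE', 'Rsquared')]
lm_model$results[, c('RMSE', 'Rsquared')]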
Auto %>%
  mutate(pred = predict(gam_model, Auto)) %>%
  gather(variable, value,
         weight, horsepower,
         displacement, acceleration,
         year) %>%
  mutate(variable = str_to_title(variable)) %>%
  ggplot() +
  labs(x = '', y = 'Miles Per Gallon') +
  geom_point(aes(value, mpg, col = 'blue')) +
  geom_point(aes(value, pred, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  facet_wrap(~variable, scales = 'free', strip.position = 'bottom') +
  theme(legend.position = 'top')
Here we can see the prediction space for each of the covariates. Keep in mind that the orange dots are not functions of a single variable but of all five, four of which are not shown in any given panel.