This document is a response to a challenge issued by John Toczek to see ‘how high’ I could get the \(R^2\) of a linear model for mpg fit to the mtcars dataset.
library(dplyr); library(magrittr); library(ggplot2); library(knitr); library(broom)
data(mtcars)
head(mtcars) %>% kable()
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Here’s a naive linear model:
lm1 <- lm(mpg ~ ., data = mtcars)
lm1 %>% glance() %>% kable()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.8690158 | 0.8066423 | 2.650197 | 13.93246 | 4e-07 | 11 | -69.85491 | 163.7098 | 181.2986 | 147.4944 | 21 |
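As a quick check on the table above, the adj.r.squared value can be recovered from r.squared and the residual degrees of freedom. This uses the standard adjusted \(R^2\) formula for a model with an intercept, with \(n = 32\) cars in mtcars and 21 residual degrees of freedom; the symbols \(\bar{R}^2\) and \(\mathrm{df}_{\mathrm{res}}\) are my notation, not the original's:

\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{\mathrm{df}_{\mathrm{res}}} = 1 - (1 - 0.8690158)\,\frac{31}{21} \approx 0.8066
\]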
But we can drive up the \(R^2\) by adding interaction terms. Here we add all pairwise interactions among wt, qsec, vs, am, gear, and carb:
lm(mpg~cyl + disp + hp + drat + (wt + qsec + vs + am + gear + carb)^2, data = mtcars) %>% glance() %>% kable()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.9754612 | 0.8732162 | 2.145998 | 9.540427 | 0.0048738 | 26 | -43.05777 | 140.1155 | 179.6904 | 27.63185 | 6 |
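To see how much the `(...)^2` formula syntax expands the model, here is a minimal sketch that counts the terms it generates. The names `fit2`, `n_terms`, `n_coef`, and `res_df` are mine for illustration and do not appear in the original code.

```r
data(mtcars)

# In an R formula, (a + b + ...)^2 expands to the main effects
# plus every pairwise interaction among the listed variables.
fit2 <- lm(mpg ~ cyl + disp + hp + drat +
             (wt + qsec + vs + am + gear + carb)^2,
           data = mtcars)

# Number of predictor terms, excluding the intercept
n_terms <- length(attr(terms(fit2), "term.labels"))

# Number of coefficients, including the intercept
n_coef <- length(coef(fit2))

# Residual degrees of freedom left over from the 32 rows
res_df <- df.residual(fit2)

c(terms = n_terms, coefficients = n_coef, residual_df = res_df)
```

The six variables inside the parentheses contribute 6 main effects and 15 pairwise interactions, which together with the 4 plain predictors should account for the small df.residual reported in the table.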
To be clear, this is bad statistics: with only 32 cars and 6 residual degrees of freedom left, the model is badly overfit. It is done only to show how interaction terms can drive up the fit.
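The inflation is mechanical: in-sample \(R^2\) cannot go down when predictors are added to a least-squares fit, even if they carry no information. The sketch below illustrates this with random noise columns. It is my own illustration and not part of the original challenge; the names `noise`, `padded`, `r2_base`, and `r2_padded`, the seed, and the choice of 20 columns are all arbitrary assumptions.

```r
data(mtcars)
set.seed(1)  # arbitrary seed, only for reproducibility

# 20 columns of pure Gaussian noise, one row per car (hypothetical predictors)
noise <- as.data.frame(matrix(rnorm(nrow(mtcars) * 20), nrow = nrow(mtcars)))
names(noise) <- paste0("noise", 1:20)
padded <- cbind(mtcars, noise)

# R-squared of the naive model versus the same model padded with noise
r2_base   <- summary(lm(mpg ~ ., data = mtcars))$r.squared
r2_padded <- summary(lm(mpg ~ ., data = padded))$r.squared

c(base = r2_base, padded = r2_padded)
```

Because the padded model nests the naive one, `r2_padded` is at least as large as `r2_base`, which is why a high \(R^2\) on its own says little about whether a model is any good.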