This analysis uses the `mtcars` data set to explore the relationship between a set of car features and MPG. Specifically, it answers two questions about the influence of transmission type on MPG. The data are first split into a subset of numeric variables and a subset of factor variables for basic exploration. Features are then selected with a stepwise algorithm based on AIC. The third part of the analysis compares regression models, interprets the coefficients, and discusses potential problems.
library(rmarkdown)
library(knitr)
library(dplyr)
library(ggplot2)
library(magrittr)
library(stargazer)
library(ggfortify)
We check the correlation of each pair of numeric variables. Keep in mind that a variable that is highly correlated with MPG does not necessarily cause a high or low MPG.
data("mtcars")
kable(head(mtcars))
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
sapply(mtcars, class)
## mpg cyl disp hp drat wt qsec
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## vs am gear carb
## "numeric" "numeric" "numeric" "numeric"
mtcars[ , c('cyl', 'vs', 'am')] %<>% lapply(function(x) as.factor(x))
# Use lapply() here, not sapply(): sapply() would simplify the result to a matrix and lose the factor class.
mtcars_numeric <- select_if(mtcars, is.numeric)
kable(cor(mtcars_numeric))
| | mpg | disp | hp | drat | wt | qsec | gear | carb |
|---|---|---|---|---|---|---|---|---|
| mpg | 1.0000000 | -0.8475514 | -0.7761684 | 0.6811719 | -0.8676594 | 0.4186840 | 0.4802848 | -0.5509251 |
| disp | -0.8475514 | 1.0000000 | 0.7909486 | -0.7102139 | 0.8879799 | -0.4336979 | -0.5555692 | 0.3949769 |
| hp | -0.7761684 | 0.7909486 | 1.0000000 | -0.4487591 | 0.6587479 | -0.7082234 | -0.1257043 | 0.7498125 |
| drat | 0.6811719 | -0.7102139 | -0.4487591 | 1.0000000 | -0.7124406 | 0.0912048 | 0.6996101 | -0.0907898 |
| wt | -0.8676594 | 0.8879799 | 0.6587479 | -0.7124406 | 1.0000000 | -0.1747159 | -0.5832870 | 0.4276059 |
| qsec | 0.4186840 | -0.4336979 | -0.7082234 | 0.0912048 | -0.1747159 | 1.0000000 | -0.2126822 | -0.6562492 |
| gear | 0.4802848 | -0.5555692 | -0.1257043 | 0.6996101 | -0.5832870 | -0.2126822 | 1.0000000 | 0.2740728 |
| carb | -0.5509251 | 0.3949769 | 0.7498125 | -0.0907898 | 0.4276059 | -0.6562492 | 0.2740728 | 1.0000000 |
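The correlation matrix is easier to read when reduced to the single row of interest. The following is a minimal sketch, not part of the original analysis, that ranks the numeric predictors by the absolute value of their correlation with `mpg`. The names `cars_raw`, `cor_with_mpg` and `ranked` are illustrative, and the data are read from `datasets::mtcars` so that the factor conversion above is left untouched.

```r
# Sketch: rank the numeric predictors by the strength of their correlation with mpg.
# A separate copy of the data is used so the converted `mtcars` object is not overwritten.
cars_raw <- datasets::mtcars

# Same numeric columns as in the correlation table (cyl, vs, am are treated as factors).
numeric_vars <- cars_raw[, c("mpg", "disp", "hp", "drat", "wt", "qsec", "gear", "carb")]

# Take the mpg row of the correlation matrix and drop mpg's correlation with itself.
cor_with_mpg <- cor(numeric_vars)["mpg", -1]

# Order by absolute value so strong negative correlations rank alongside strong positive ones.
ranked <- cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)]
round(ranked, 3)
```

As with the full table, the ranking describes association only and says nothing about causation.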
Violin plots are created to show the distribution of MPG by transmission type (am). Two additional violin plots show the same distribution when either engine type (vs) or number of cylinders (cyl) is also taken into consideration.
(Please see “Violin Plot of MPG on Transmission Type (1) - (3)” in Appendix.)
mtcars_factor <- select(mtcars, mpg, cyl, vs, am)
levels(mtcars_factor$vs) <- c('V-shape', 'Straight')
levels(mtcars_factor$am) <- c('Automatic', 'Manual')
theme_format <- theme(plot.margin = unit(c(1,1,1,1), "cm"),
plot.title = element_text(hjust = 0.5, size = 16),
axis.text = element_text(size = 12),
axis.title = element_text(size = 14))
am_only <- ggplot(mtcars_factor, aes(x = am, y = mpg)) +
geom_violin(trim = FALSE) + geom_boxplot(width = 0.2) +
labs(x = "Transmission Type", y = "MPG", title = "Violin Plot of MPG on Transmission Type (1)") +
theme_format
am_vs <- ggplot(mtcars_factor, aes(x = am, y = mpg, fill = vs)) + geom_violin(trim = FALSE) +
labs(x = "Transmission Type", y = "MPG", title = "Violin Plot of MPG on Transmission Type (2)") +
theme_format + scale_fill_manual(values = c("lightsteelblue1", "mistyrose"), name = "Engine Type")
am_cyl <- ggplot(mtcars_factor, aes(x = am, y = mpg, fill = cyl)) + geom_violin(trim = FALSE) +
labs(x = "Transmission Type", y = "MPG", title = "Violin Plot of MPG on Transmission Type (3)") +
theme_format + scale_fill_manual(values = c("thistle1", "lightsteelblue3", "navajowhite"),
name = "Number of Cylinders")
am_only
am_vs
am_cyl
In this analysis, only OLS models will be built and compared.
First, a baseline model is selected by a stepwise search on AIC, starting from the regression of MPG on all variables. This gives us a basic model that performs best by AIC when interaction terms are not taken into consideration. We then run two more regressions of MPG on the selected variables, each with a different interaction term added.
basic_model <- step(lm(mpg ~ ., data = mtcars), trace = 0)
all.vars(formula(basic_model))
## [1] "mpg"  "wt"   "qsec" "am"
inter_am_qsec <- lm(mpg ~ am * qsec + wt, data = mtcars)
inter_am_wt <- update(inter_am_qsec, mpg ~ am * wt + qsec)
stargazer(basic_model, inter_am_qsec, inter_am_wt,
title = "Regression Results", align = TRUE, type = 'html', dep.var.labels = "MPG",
covariate.labels = c('Weight', 'Manual Transmission * 1/4 Mile Time', '1/4 Mile Time',
'Manual Transmission * Weight', 'Manual Transmission'))
| Dependent variable: MPG | (1) | (2) | (3) |
|---|---|---|---|
| Weight | -3.917*** | -3.777*** | -2.937*** |
| | (0.711) | (0.671) | (0.666) |
| Manual Transmission * 1/4 Mile Time | | 1.060** | |
| | | (0.487) | |
| 1/4 Mile Time | 1.226*** | 0.817** | 1.017*** |
| | (0.289) | (0.330) | (0.252) |
| Manual Transmission * Weight | | | -4.141*** |
| | | | (1.197) |
| Manual Transmission | 2.936** | -15.614* | 14.079*** |
| | (1.411) | (8.624) | (3.435) |
| Constant | 9.618 | 16.529** | 9.723 |
| | (6.960) | (7.267) | (5.899) |
| Observations | 32 | 32 | 32 |
| R² | 0.850 | 0.872 | 0.896 |
| Adjusted R² | 0.834 | 0.853 | 0.880 |
| Residual Std. Error | 2.459 (df = 28) | 2.309 (df = 27) | 2.084 (df = 27) |
| F Statistic | 52.750*** (df = 3; 28) | 46.029*** (df = 4; 27) | 58.061*** (df = 4; 27) |
| Note: | \*p<0.1; \*\*p<0.05; \*\*\*p<0.01 | | |
Based on the results above, the third model (with the interaction between transmission type and weight) is selected for MPG prediction because it has both the largest F-statistic and the largest adjusted R-squared. All regressors in this model also have statistically significant coefficients.
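The choice can be cross-checked with criteria that account for model size. The sketch below is a complementary check rather than the procedure used in this report: it refits the three models under the illustrative names `basic_fit`, `inter_qsec` and `inter_wt`, compares them by AIC (lower is better), and runs a nested F-test for the weight interaction. The formula of the basic model is written out by hand from the variables reported by `step()` above.

```r
# Sketch: compare the three candidate models by AIC and test the added interaction.
# Works on a separate copy of the data, with am converted to a factor as in the analysis.
cars <- datasets::mtcars
cars$am <- factor(cars$am)

basic_fit  <- lm(mpg ~ wt + qsec + am, data = cars)   # variables kept by step()
inter_qsec <- lm(mpg ~ am * qsec + wt, data = cars)   # adds am:qsec
inter_wt   <- lm(mpg ~ am * wt + qsec, data = cars)   # adds am:wt

# One row per model, with its degrees of freedom and AIC.
AIC(basic_fit, inter_qsec, inter_wt)

# basic_fit is nested in inter_wt, so an F-test on the extra term is valid.
anova(basic_fit, inter_wt)
```

The two interaction models are not nested in each other, so they can be compared by AIC but not with `anova()`.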
Four types of residual plots are created to reveal potential pitfalls in the model. Judging from the plots, the model does not appear to have serious problems such as heteroskedasticity or high-leverage points. However, the reliability of this assessment is questionable, since we only have 32 observations at hand.
(Please see “Residual Plots” in Appendix.)
autoplot(inter_am_wt, label.size = 3)
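The visual check of leverage can be backed by numbers. The following sketch uses base R only and assumes two common rules of thumb that are not taken from this report: leverage above 2p/n and Cook's distance above 4/n are treated as worth a second look. The names `fit`, `lev` and `cook` are illustrative.

```r
# Sketch: numeric counterpart to the leverage and Cook's distance panels.
cars <- datasets::mtcars
cars$am <- factor(cars$am)
fit <- lm(mpg ~ am * wt + qsec, data = cars)

lev  <- hatvalues(fit)        # diagonal of the hat matrix, one value per car
cook <- cooks.distance(fit)   # influence of each car on the fitted values

p <- length(coef(fit))        # number of estimated coefficients
n <- nrow(cars)               # number of observations

# Rule-of-thumb cutoffs (assumptions, not hard thresholds).
high_leverage <- names(lev)[lev > 2 * p / n]
influential   <- names(cook)[cook > 4 / n]

high_leverage
influential
```

Cars flagged here are candidates for closer inspection, not automatic outliers. With only 32 observations the cutoffs themselves are rough.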
The model formula is:
\(\hat{MPG} = 9.723 + 14.079 \times Manual\_Transmission - 2.937 \times Weight + 1.017 \times (1/4\_Mile\_Time) - 4.141 \times Manual\_Transmission \times Weight\)
Holding other things constant, the predicted MPG for a car with automatic transmission is given by:
\(\hat{MPG}_{automatic} = 9.723 - 2.937 \times Weight + 1.017 \times (1/4\_Mile\_Time)\)
while that for a car with manual transmission is:
\(\hat{MPG}_{manual} = 23.802 - 7.078 \times Weight + 1.017 \times (1/4\_Mile\_Time)\)
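As a worked check, the manual-transmission equation follows from setting \(Manual\_Transmission = 1\) in the model formula and collecting terms:

\[
\begin{aligned}
\hat{MPG}_{manual} &= 9.723 + 14.079 \times 1 - 2.937 \times Weight + 1.017 \times (1/4\_Mile\_Time) - 4.141 \times 1 \times Weight \\
&= (9.723 + 14.079) + (-2.937 - 4.141) \times Weight + 1.017 \times (1/4\_Mile\_Time) \\
&= 23.802 - 7.078 \times Weight + 1.017 \times (1/4\_Mile\_Time)
\end{aligned}
\]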
Question 1: Quantify the MPG difference between automatic and manual transmissions.
The difference is measured by:
\(\hat{MPG}_{automatic} - \hat{MPG}_{manual} = -14.079 + 4.141 \times Weight\)
Question 2: Is an automatic or manual transmission better for MPG?
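That depends on the weight of the car. Weight is measured in units of 1,000 lbs, so the two predictions are equal at \(14.079 / 4.141 \times 1000 \approx 3{,}399.90\) lbs. When a car is no heavier than 3,399.90 lbs, a manual transmission has an edge in terms of MPG. If its weight exceeds 3,399.90 lbs, then an automatic transmission is better for MPG.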
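The break-even weight can also be computed directly from the fitted coefficients instead of the rounded table values. This is a minimal sketch under the same model specification. The names `fit`, `break_even` and `mpg_gap` are illustrative, and the factor labels are set so that the coefficient names are easy to read.

```r
# Sketch: break-even weight and manual-minus-automatic MPG gap from the fitted model.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am * wt + qsec, data = cars)
b <- coef(fit)

# Weight (in 1,000 lbs) at which both transmission types have the same predicted MPG.
break_even <- unname(-b["amManual"] / b["amManual:wt"])

# Predicted MPG of a manual car minus that of an automatic car with the same wt and qsec.
mpg_gap <- function(wt) unname(b["amManual"] + b["amManual:wt"] * wt)

break_even * 1000       # break-even weight in lbs
mpg_gap(c(2.5, 4.0))    # gap for a light car and for a heavy car
```

Because it uses unrounded coefficients, the result should be close to, but not digit-for-digit identical with, the hand-computed threshold.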
Question 3: What about the uncertainties?
A potentially fatal problem in this analysis is that the number of observations (32) is very small relative to the number of candidate features (10). A small change in any single observation can strongly affect both model selection and the robustness of the fitted model. Because of the lack of observations, we cannot determine with certainty whether the features left out of the final model are actually influential.
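One way to make this sensitivity concrete is a leave-one-out check. The sketch below is an illustration only and is not part of the original analysis: it refits the selected model 32 times, each time dropping one car, and records how the break-even weight between transmission types moves. The helper `break_even_wt` and the vector `loo` are hypothetical names introduced here.

```r
# Sketch: leave-one-out sensitivity of the break-even weight (in 1,000 lbs).
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# Hypothetical helper: refit the selected model on data `d` and return the
# weight at which manual and automatic cars have equal predicted MPG.
break_even_wt <- function(d) {
  b <- coef(lm(mpg ~ am * wt + qsec, data = d))
  unname(-b["am1"] / b["am1:wt"])
}

# Drop each car in turn and recompute the break-even weight.
loo <- sapply(seq_len(nrow(cars)), function(i) break_even_wt(cars[-i, ]))
names(loo) <- rownames(cars)

range(loo)                      # spread of the estimate across the 32 refits
head(sort(loo), 3)              # cars whose removal lowers the threshold most

# Conventional 95% confidence intervals for the full-data coefficients.
confint(lm(mpg ~ am * wt + qsec, data = cars))
```

A wide range here, or wide confidence intervals, would support the caution expressed above. The check holds the model formula fixed, so it does not capture the additional instability of the stepwise selection itself.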