28/10/2021

Synopsis

Completed as the course project for JHU’s Regression Models course. The assignment is as follows:

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

1. Is an automatic or manual transmission better for MPG?

2. Quantify the MPG difference between automatic and manual transmissions.

Executive Summary

This analysis arrives at the following conclusions after conducting multiple regression analysis:

  1. We cannot say whether an automatic or manual transmission is better for mpg, as the coefficient on the transmission variable (am) is statistically insignificant once other factors are controlled for.
  2. We cannot (accurately) quantify the difference in mpg between the two transmission types, again because the coefficient on the transmission variable is statistically insignificant.

Data Processing

All of the variables in the mtcars data set are stored as numeric, but several of them (vs, am, cyl, gear and carb) are really categorical, so their classes were changed to factors accordingly.

## Change appropriate variables to factor class
mtcars <- within(mtcars, {
   vs <- factor(vs, labels = c("V-Shaped", "Straight"))
   am <- factor(am, labels = c("Automatic", "Manual"))
   cyl  <- ordered(cyl)
   gear <- ordered(gear)
   carb <- ordered(carb)
})
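
One consequence of using ordered() is worth a quick check: R assigns polynomial contrasts to ordered factors by default, which is where coefficient names such as cyl.L and cyl.Q come from when the variable enters a regression. A minimal sketch, using standalone vectors (cyl_ord and cyl_fac are names introduced here for illustration) rather than the converted data frame:

```r
# Compare the default contrasts for an ordered and an unordered version of cyl.
data(mtcars)

cyl_ord <- ordered(mtcars$cyl)   # ordered factor, as in the conversion above
cyl_fac <- factor(mtcars$cyl)    # unordered factor, shown for comparison

print(levels(cyl_ord))           # the three cylinder counts in the data
print(contrasts(cyl_ord))        # polynomial contrasts: columns .L and .Q
print(contrasts(cyl_fac))        # treatment contrasts: one column per non-baseline level
```

With polynomial contrasts the coefficients describe a linear and a quadratic trend across the ordered levels, not a difference from a baseline level, which matters when reading the regression output.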

Exploratory Data Analysis

Here we plot miles per gallon by transmission type to see if there’s a noticeable difference between manual and automatic transmissions. In this sample, manual cars appear to get better mpg than automatic cars.
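
The original plotting code is not shown, so the following is only one possible way to draw such a comparison: a base R boxplot, plus the group means it summarises. The object names (cars, group_means) are introduced here for illustration.

```r
# Boxplot of mpg by transmission type, with the group means printed alongside.
data(mtcars)
cars <- within(mtcars, {
  am <- factor(am, labels = c("Automatic", "Manual"))
})

boxplot(mpg ~ am, data = cars,
        xlab = "Transmission", ylab = "Miles per gallon",
        main = "MPG by transmission type")

# Mean mpg within each transmission group
group_means <- tapply(cars$mpg, cars$am, mean)
print(round(group_means, 2))
```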

Simple Regression Analysis I

We start by running a simple OLS regression of mpg on transmission type, without controlling for any other factors in the dataset.

Single Regression Model
Predictor      B       SE     t       p
(Intercept)    17.15   1.12   15.25   <0.001
amManual       7.24    1.76   4.11    <0.001
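
A table like the one above can be produced along these lines. This is a sketch, not necessarily the call the analysis used; fit_simple and coef_table are names chosen here.

```r
# Simple OLS regression of mpg on transmission type.
data(mtcars)
cars <- within(mtcars, {
  am <- factor(am, labels = c("Automatic", "Manual"))
})

fit_simple <- lm(mpg ~ am, data = cars)

# Coefficient table: estimate, standard error, t statistic, p-value
coef_table <- summary(fit_simple)$coefficients
print(round(coef_table, 4))
```

Printing more decimal places than the table shows makes it clear that the p-values are very small rather than exactly zero.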

Simple Regression Analysis II

The simple regression model finds that cars with manual transmissions get an average of 7.245 more miles per gallon than cars with automatic transmissions. This is, of course, the difference in means we just saw visualized in the exploratory data analysis section.

This result is significant at the 0.1% level (t = 4.11 on 30 degrees of freedom, p ≈ 0.0003). However, the R² is only 0.3598, meaning that only ~36% of the variance in miles per gallon is explained by the transmission type.
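
Both claims can be checked directly: the slope equals the difference in group means, and the R² and p-value come from the model summary. A short sketch, with variable names chosen here for illustration:

```r
# Check that the slope on am equals the difference in group means,
# and pull out R-squared and the p-value for the slope.
data(mtcars)

fit_simple <- lm(mpg ~ factor(am), data = mtcars)
fit_summary <- summary(fit_simple)

slope      <- coef(fit_simple)[[2]]                              # manual minus automatic
diff_means <- unname(diff(tapply(mtcars$mpg, mtcars$am, mean)))  # same quantity, computed by hand
r_squared  <- fit_summary$r.squared
p_slope    <- fit_summary$coefficients[2, "Pr(>|t|)"]

print(c(slope = slope, diff_means = diff_means))
print(c(r_squared = r_squared, p_slope = p_slope))
```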

This simple regression could be (and is likely) suffering from omitted variable bias. To address this bias, we’ll include other variables from the dataset which may have an effect on the miles per gallon that a given car gets.

Analysis Of Variance I

To decide which variables to add to the regression, we perform an analysis of variance (ANOVA) to see which variables explain a significant share of the variance in the mpg variable.

According to the ANOVA, the number of cylinders, the displacement, and the weight of the vehicle each explain a significant amount of the variance in the mpg variable at the 5% level.

Analysis Of Variance II

ANOVA Test
Var         df   sumsq    meansq   F-stat   p
cyl         2    824.78   412.39   51.38    0.00
disp        1    57.64    57.64    7.18     0.02
hp          1    18.50    18.50    2.31     0.15
drat        1    11.91    11.91    1.48     0.24
wt          1    55.79    55.79    6.95     0.02
qsec        1    1.52     1.52     0.19     0.67
vs          1    0.30     0.30     0.04     0.85
am          1    16.57    16.57    2.06     0.17
gear        2    5.02     2.51     0.31     0.74
carb        5    13.60    2.72     0.34     0.88
Residuals   15   120.40   8.03     NA       NA
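
The exact call behind this table is not shown. The sketch below assumes it was a sequential ANOVA of mpg on all other variables in their column order, which matches the degrees of freedom above. One caveat with this approach: R's default sums of squares are sequential, so each row is adjusted only for the variables listed before it, and the results change with the order of the terms. The second call, which moves am to the front, is included only to illustrate that.

```r
# Sequential (Type I) ANOVA of mpg on every other variable in mtcars.
data(mtcars)
cars <- within(mtcars, {
  vs   <- factor(vs, labels = c("V-Shaped", "Straight"))
  am   <- factor(am, labels = c("Automatic", "Manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

# Terms enter in column order: cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
anova_table <- anova(lm(mpg ~ ., data = cars))
print(anova_table)

# Same model with am entered first: the am row changes, the residual row does not
anova_am_first <- anova(lm(mpg ~ am + ., data = cars))
print(anova_am_first["am", ])
```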

Multiple Regression Analysis I

The multiple regression model has an R² of 0.8376, suggesting that the model explains ~84% of the variance in the mpg variable - a substantial improvement on the simple model.

In this model, the coefficient on the dummy variable for transmission type is small and statistically insignificant (B = 0.14, p = 0.916). This is consistent with the simple model having suffered from omitted variable bias: once cylinders, displacement, and weight are held fixed, transmission type does not have a statistically significant association with miles per gallon.
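
One way to make the comparison between the two models formal is a nested-model F test. This is a sketch under the assumption that the multiple model is mpg ~ am + cyl + disp + wt, as the predictor names in the coefficient table suggest; the test itself is an addition here and not part of the original analysis.

```r
# Partial F test: does adding cyl, disp and wt improve on the am-only model?
data(mtcars)
cars <- within(mtcars, {
  am  <- factor(am, labels = c("Automatic", "Manual"))
  cyl <- ordered(cyl)
})

fit_simple <- lm(mpg ~ am, data = cars)
fit_multi  <- lm(mpg ~ am + cyl + disp + wt, data = cars)

comparison <- anova(fit_simple, fit_multi)
print(comparison)

# p-value for the joint contribution of the added terms
p_added <- comparison[2, "Pr(>F)"]
```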

Multiple Regression Analysis II

Multiple Regression Model
Predictor      B       SE     t       p
(Intercept)    30.28   3.29   9.20    0.000
amManual       0.14    1.33   0.11    0.916
cyl.L          -4.47   1.87   -2.39   0.025
cyl.Q          0.94    1.05   0.89    0.382
disp           0.00    0.01   0.12    0.906
wt             -3.25   1.25   -2.60   0.015
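
A fit of this form can be reproduced roughly as follows. The formula is inferred from the predictor names in the table (am, cyl as an ordered factor, disp, wt) and is an assumption, not a confirmed copy of the original code.

```r
# Multiple regression of mpg on transmission, cylinders, displacement and weight.
data(mtcars)
cars <- within(mtcars, {
  am  <- factor(am, labels = c("Automatic", "Manual"))
  cyl <- ordered(cyl)   # ordered factor, hence the cyl.L and cyl.Q terms
})

fit_multi <- lm(mpg ~ am + cyl + disp + wt, data = cars)

multi_table <- summary(fit_multi)$coefficients
print(round(multi_table, 3))

multi_r2 <- summary(fit_multi)$r.squared
print(round(multi_r2, 4))
```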

Residual Analysis I

The residual plot has no clear pattern, implying the multiple regression model fits the data well - but if we’re nitpicking, there are a few (minor) causes for concern:

  1. The residuals are not symmetric around zero. Positive residuals were nearly as large as 6, but no negative residual fell below about -3.5.
  2. The residuals appear to get (marginally) larger as the prediction goes from small to large, implying there may be heteroskedasticity.

In this case, these causes for concern are so minor that they’re not worth acting upon. They’ve only been pointed out to demonstrate some understanding of how to analyze a residual plot.
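
For reference, a residuals-versus-fitted plot of the kind discussed here can be drawn with base R. The sketch assumes the same model formula as the coefficient table suggests; the original plotting code is not shown.

```r
# Residuals against fitted values for the multiple regression model.
data(mtcars)
cars <- within(mtcars, {
  am  <- factor(am, labels = c("Automatic", "Manual"))
  cyl <- ordered(cyl)
})

fit_multi <- lm(mpg ~ am + cyl + disp + wt, data = cars)
res <- resid(fit_multi)

plot(fitted(fit_multi), res,
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)   # reference line at zero

# Range of the residuals, to compare the positive and negative extremes
print(round(range(res), 2))
```

A fan shape that widens from left to right would point to heteroskedasticity, while a curve would point to a missing non-linear term.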

Residual Analysis II

Conclusion

The coefficient on the transmission variable is not statistically significant once we control for other factors that affect miles per gallon, such as the weight of the car and the number of cylinders.

As a result, we cannot say whether an automatic or manual transmission is better for mpg, and we cannot quantify the difference in mpg between the two transmission types with any confidence.
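
A confidence interval makes this point concrete. From the table values alone (B = 0.14, SE = 1.33, 26 residual degrees of freedom), a 95% interval is roughly 0.14 ± 2.7 mpg, so it comfortably includes zero. The sketch below computes the interval from the fitted model, again assuming the formula mpg ~ am + cyl + disp + wt:

```r
# 95% confidence interval for the transmission coefficient.
data(mtcars)
cars <- within(mtcars, {
  am  <- factor(am, labels = c("Automatic", "Manual"))
  cyl <- ordered(cyl)
})

fit_multi <- lm(mpg ~ am + cyl + disp + wt, data = cars)

ci_am <- confint(fit_multi, "amManual", level = 0.95)
print(round(ci_am, 2))
```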

It’s worth noting that it’s not difficult to “p-hack” this model with a combination of regressors that indicates that transmission type has a significant effect on miles per gallon. In reality, the transmission type likely has some sort of effect on miles per gallon, but neither the model I’ve constructed here nor a model that uses all of the regressors (not included in the analysis) shows a statistically significant effect.
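
To illustrate how sensitive the conclusion is to the choice of regressors, the sketch below refits the model under a few alternative specifications and records the p-value on the transmission coefficient each time. The specifications and the helper am_p_value() are chosen here purely for illustration and are not part of the original analysis.

```r
# p-value on amManual under several alternative sets of regressors.
data(mtcars)
cars <- within(mtcars, {
  am <- factor(am, labels = c("Automatic", "Manual"))
})

# Hypothetical helper: fit mpg on the given terms, return the p-value for amManual
am_p_value <- function(rhs) {
  fit <- lm(reformulate(rhs, response = "mpg"), data = cars)
  summary(fit)$coefficients["amManual", "Pr(>|t|)"]
}

specs <- list(c("am", "hp"),
              c("am", "wt"),
              c("am", "wt", "qsec"))

p_values <- sapply(specs, am_p_value)
names(p_values) <- sapply(specs, paste, collapse = " + ")
print(round(p_values, 4))
```

Whether transmission type looks significant depends heavily on which controls are included, which is why a specification picked after seeing the results should be treated with caution.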