Synopsis
Completed as the course project for JHU’s Regression Models course. The assignment is as follows:
You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
1. Is an automatic or manual transmission better for MPG?
2. Quantify the MPG difference between automatic and manual transmissions.
After fitting a multiple regression model, the analysis answers the two questions as follows:
- We cannot say whether an automatic or manual transmission is better for mpg, as the coefficient on the transmission variable (am) is not statistically significant.
- We cannot (accurately) quantify the difference in mpg between the two transmission types, again because the coefficient on the transmission variable is not statistically significant.
If interested, you can read more about the mtcars dataset here.
Data Processing
Let’s take a quick look at the dataset we’ll be using for the analysis, mtcars.
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## [1] 32 11
## $mpg
## [1] "numeric"
##
## $cyl
## [1] "numeric"
##
## $disp
## [1] "numeric"
##
## $hp
## [1] "numeric"
##
## $drat
## [1] "numeric"
##
## $wt
## [1] "numeric"
##
## $qsec
## [1] "numeric"
##
## $vs
## [1] "numeric"
##
## $am
## [1] "numeric"
##
## $gear
## [1] "numeric"
##
## $carb
## [1] "numeric"
The dataset has 11 columns and 32 rows, a somewhat small sample size. Each of the 32 rows corresponds to a different car model. All of the variables are coded as numeric, when in reality several of them (cyl, vs, am, gear and carb) are categorical and are better treated as factors. Let’s fix that.
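The conversion code itself is not shown in this write-up. The following is a minimal sketch of one way to do it, assuming the codings implied by the output further down: an ordered cyl factor (which would produce the cyl.L and cyl.Q terms) and Automatic/Manual labels for am. The labels chosen for vs and the unordered treatment of gear and carb are assumptions; the original may have used different calls.

```r
# Sketch only: the original conversion code is not shown, so the exact
# calls and labels below are assumptions.
data(mtcars)  # reload the built-in dataset

# In mtcars, am is coded 0 = automatic, 1 = manual;
# vs is coded 0 = V-shaped engine, 1 = straight engine.
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$vs <- factor(mtcars$vs, labels = c("V-shaped", "Straight"))

# An ordered factor gets polynomial contrasts (.L, .Q) inside lm()
mtcars$cyl  <- factor(mtcars$cyl, ordered = TRUE)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)

# Confirm the new column classes
sapply(mtcars, function(column) class(column)[1])
```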
Exploratory Data Analysis
Let’s plot miles per gallon by transmission type and see if there’s a noticeable difference between manual and automatic transmissions.
On average, manual transmissions tend to get more miles per gallon than automatic transmissions. Now, let’s turn to regression analysis to explore the relationship further.
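The plot is not reproduced in this text. A rough base-R sketch of a comparable figure follows, together with the group means behind the "on average" statement. The choice of a boxplot and the axis labels are assumptions, not necessarily what the original figure used.

```r
# Sketch only: plot type and labels are assumptions.
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Distribution of mpg within each transmission type
boxplot(mpg ~ am, data = mtcars,
        xlab = "Transmission type",
        ylab = "Miles per gallon",
        main = "MPG by transmission type")

# Mean mpg per transmission type
groupMeans <- tapply(mtcars$mpg, mtcars$am, mean)
round(groupMeans, 3)
```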
Regression Analysis
Let’s start by running a simple OLS regression without controlling for other factors in the dataset.
## Simple regression model
simpleRegressionModel <- lm(mpg ~ am, mtcars)
summary(simpleRegressionModel)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The simple regression model finds that cars with manual transmissions get an average of 7.245 more miles per gallon than cars with automatic transmissions. This is, of course, the difference in means we just saw visualized in the exploratory data analysis section. With a p-value of 0.000285, this result is significant at the 0.1% level. However, the R^2 is only 0.3598, meaning that only ~36% of the variance in miles per gallon is explained by the transmission type.
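To make the link between the slope and the difference in means concrete, here is a small check. It assumes am has been converted to a factor with the labels Automatic and Manual; the helper names (simpleFit, groupMeans, and so on) are illustrative only.

```r
# Sketch only: variable names are illustrative.
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

simpleFit  <- lm(mpg ~ am, data = mtcars)
groupMeans <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept = mean mpg of automatics;
# amManual  = mean mpg of manuals minus mean mpg of automatics
meanDifference <- unname(groupMeans["Manual"] - groupMeans["Automatic"])
slopeEstimate  <- unname(coef(simpleFit)["amManual"])
all.equal(slopeEstimate, meanDifference)

# 95% confidence interval for the unadjusted difference
confint(simpleFit, "amManual")
```

With a single two-level factor as the only regressor, OLS reproduces the two group means, which is why the slope and the raw difference agree.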
This simple regression could be (and is likely) suffering from omitted variable bias. To address this bias, we’ll include other variables from the dataset which may have an effect on the miles per gallon that a given car gets. In an effort to determine which variables we’ll add to the regression, we’ll perform an analysis of variance (ANOVA) test to see which variables have significant effects on the variance of the mpg variable.
## Analysis of variance test
mpgVarianceAnalysis <- aov(mpg ~ ., mtcars)
summary(mpgVarianceAnalysis)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          2  824.8   412.4  51.377 1.94e-07 ***
## disp         1   57.6    57.6   7.181   0.0171 *
## hp           1   18.5    18.5   2.305   0.1497
## drat         1   11.9    11.9   1.484   0.2419
## wt           1   55.8    55.8   6.950   0.0187 *
## qsec         1    1.5     1.5   0.190   0.6692
## vs           1    0.3     0.3   0.038   0.8488
## am           1   16.6    16.6   2.064   0.1714
## gear         2    5.0     2.5   0.313   0.7361
## carb         5   13.6     2.7   0.339   0.8814
## Residuals   15  120.4     8.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the ANOVA test, the number of cylinders, the displacement, and the weight of the vehicle all explain a significant amount of the variance in the mpg variable. Let’s add those variables to our multiple regression model.
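One caveat when reading this table: summary() on an aov fit reports sequential sums of squares, so each term is tested after the terms listed above it, and the result for a given variable can change with the order of the formula. The sketch below illustrates this with a reduced two-predictor model (am and wt), chosen purely for illustration and not part of the original analysis.

```r
# Sketch only: the two-predictor model is an illustrative assumption.
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Same two predictors, entered in a different order
amFirst <- anova(lm(mpg ~ am + wt, data = mtcars))
amLast  <- anova(lm(mpg ~ wt + am, data = mtcars))

# Sum of squares attributed to am under each ordering
ssAmFirst <- amFirst["am", "Sum Sq"]
ssAmLast  <- amLast["am", "Sum Sq"]
c(am_entered_first = ssAmFirst, am_entered_last = ssAmLast)
```

When am is entered first it is credited with all the variation it shares with weight; entered last, it is credited only with what weight leaves unexplained.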
## Multiple regression model
multipleRegressionModel <- lm(mpg ~ am + cyl + disp + wt, mtcars)
summary(multipleRegressionModel)
##
## Call:
## lm(formula = mpg ~ am + cyl + disp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5029 -1.2829 -0.4825 1.4954 5.7889
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.275005   3.290562   9.201 1.17e-09 ***
## amManual     0.141212   1.326751   0.106   0.9161
## cyl.L       -4.467788   1.872177  -2.386   0.0246 *
## cyl.Q        0.935362   1.052358   0.889   0.3822
## disp         0.001632   0.013757   0.119   0.9065
## wt          -3.249176   1.249098  -2.601   0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.652 on 26 degrees of freedom
## Multiple R-squared: 0.8376, Adjusted R-squared: 0.8064
## F-statistic: 26.82 on 5 and 26 DF, p-value: 1.73e-09
This model has an R^2 of 0.8376, suggesting that the model explains ~84% of the variance in the mpg variable - a substantial improvement on the simple model.
In this model, the coefficient on the dummy variable for transmission type is small (0.14 mpg) and insignificant (p = 0.92) - consistent with the simple model having suffered from omitted variable bias. In other words, once the number of cylinders, displacement, and weight are held fixed, the multiple regression model finds no statistically significant effect of transmission type on miles per gallon.
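A confidence interval is another way to see how little the adjusted model pins down the transmission effect. The sketch below refits the model, assuming am is a labelled factor and cyl an ordered factor as the coefficient names suggest; multipleFit and amInterval are illustrative names.

```r
# Sketch only: factor codings are inferred from the coefficient names.
data(mtcars)
mtcars$am  <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$cyl <- factor(mtcars$cyl, ordered = TRUE)

multipleFit <- lm(mpg ~ am + cyl + disp + wt, data = mtcars)

# 95% confidence interval for the adjusted manual-vs-automatic difference
amInterval <- confint(multipleFit, "amManual", level = 0.95)
amInterval
```

An interval that spans zero means the data are compatible with either transmission type being better once the other regressors are accounted for.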
Residual Plot
The residual plot has no clear pattern, implying the multiple regression model fits the data well - but if we’re nitpicking, there are a few (minor) causes for concern:
- There’s some imbalance in the y-axis. Positive residuals were nearly as large as 6, but the most negative residual was only about -4.5 (see the residual summary in the model output above).
- The residuals appear to get (marginally) larger as the prediction goes from small to large, implying there may be heteroskedasticity.
In this case, these causes for concern are so minor that they’re not worth acting upon. They’ve only been pointed out to demonstrate some understanding of how to analyze a residual plot.
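The residual plot is not reproduced in this text. A minimal base-R sketch of a residuals-versus-fitted plot for the same model follows; the plotting choices are assumptions and may differ from the original figure.

```r
# Sketch only: plotting choices are assumptions.
data(mtcars)
mtcars$am  <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$cyl <- factor(mtcars$cyl, ordered = TRUE)

multipleFit <- lm(mpg ~ am + cyl + disp + wt, data = mtcars)

# Residuals against fitted values, with a reference line at zero
plot(fitted(multipleFit), resid(multipleFit),
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)

# Smallest and largest residual, to check the vertical imbalance
residualRange <- range(resid(multipleFit))
residualRange
```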
Conclusion
The coefficient on the transmission variable is not statistically significant once we control for other factors that affect miles per gallon, such as the weight of the car and the number of cylinders.
As a result, we cannot say whether an automatic or manual transmission is better for mpg, and we cannot quantify the difference in mpg between the two transmission types with any confidence.
It’s worth noting that it’s not difficult to “p-hack” this model by choosing a combination of regressors that makes transmission type appear to have a significant effect on miles per gallon. In reality, the transmission type likely has some sort of effect on miles per gallon, but neither the model I’ve constructed here nor a model that uses all of the regressors (not included in this analysis) shows a statistically significant effect.
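For readers who want to check the all-regressors model mentioned above, here is a hedged sketch. It assumes the five categorical columns are converted with plain factor() calls (so the transmission coefficient is named am1 rather than amManual); the original's exact specification is not shown.

```r
# Sketch only: the full-model specification is an assumption.
data(mtcars)
factorColumns <- c("cyl", "vs", "am", "gear", "carb")
mtcars[factorColumns] <- lapply(mtcars[factorColumns], factor)

# Regress mpg on every other column
fullFit <- lm(mpg ~ ., data = mtcars)

# Estimate, standard error, t value and p-value for the transmission term
amRow <- coef(summary(fullFit))["am1", ]
amRow
amPValue <- unname(amRow["Pr(>|t|)"])
```

Fitting many such specifications and reporting only the one where am looks significant is exactly the p-hacking the paragraph above warns against.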