The purpose of this course project was to determine what impact transmission type (automatic or manual) has on fuel economy (miles driven per gallon of fuel consumed, MPG) in the mtcars data set. Based on the analysis presented below, vehicles with manual transmissions have a 0.05 to 5.83 MPG advantage over vehicles with automatic transmissions (95% confidence interval), after adjusting for vehicle weight and quarter-mile time.
The goal of this project is to determine how vehicle gas mileage (miles per gallon, abbreviated here as MPG) varies as a function of other vehicle characteristics (transmission type, weight, etc.) using the data available in the mtcars data set. Additionally, the problem statement requests answers to the questions:
The mtcars data set contains 32 observations on 11 variables (Henderson and Velleman, 1981) [1]:
| Column # | Variable | Description |
|---|---|---|
| [, 1] | mpg | Miles/(US) gallon |
| [, 2] | cyl | Number of cylinders |
| [, 3] | disp | Displacement (cu.in.) |
| [, 4] | hp | Gross horsepower |
| [, 5] | drat | Rear axle ratio |
| [, 6] | wt | Weight (1000 lbs) |
| [, 7] | qsec | 1/4 mile time (seconds) |
| [, 8] | vs | V/S (0 = V-engine, 1 = straight-engine) |
| [, 9] | am | Transmission (0 = automatic, 1 = manual) |
| [,10] | gear | Number of forward gears |
| [,11] | carb | Number of carburetors |
Given the focus of this project, we’re only interested in the relationship between mpg and the other 10 variables.
Quick look at the structure of the data set:
# Data set structure
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Summary of each variable
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
# How many unique values for each variable?
apply(X = mtcars, MARGIN = 2, FUN = function(x) length(unique(x)))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 25 3 27 22 22 29 30 2 2 3 6
From the quick summary, we note that several variables (cyl, vs, am, gear, and carb) only take on integer values, and that vs and am only have two possible values (0 or 1) to indicate type. For this analysis we’re particularly interested in transmission type (am), so we convert that variable to a factor (in a new data.frame d that is used for the plots below) so we can more easily distinguish it from the other variables:
# Make copy of mtcars called d
d <- mtcars
# Change specified column from numeric to factor
d$am <- as.factor(d$am)
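As an optional variation on the conversion above, the factor levels can be given descriptive labels so that plot legends read "automatic" and "manual" instead of 0 and 1. This is a minimal sketch; the names `d_labeled` and `am_counts` are hypothetical and are not used elsewhere in the report.

```r
# Hypothetical alternative to d: label the factor levels explicitly
d_labeled <- mtcars
d_labeled$am <- factor(d_labeled$am,
                       levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Number of cars with each transmission type
am_counts <- table(d_labeled$am)
print(am_counts)
```

The rest of the report keeps the 0/1 coding, so this labeled copy is purely illustrative.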
Next, we can look at scatterplots of mpg vs. each of the other variables. The multipanel plot below was created using the multiplot function from the R Graphics Cookbook [2] (the code for the multiplot function has been excluded from the generated report to save on report length). Note that in the legend of each plot “0” indicates automatic and “1” indicates manual transmission.
# Load ggplot2 (required for the ggplot calls below)
library(ggplot2)
# Create scatterplots with mpg on the y-axis
p1 <- ggplot(d, aes(x = cyl, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p2 <- ggplot(d, aes(x = disp, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p3 <- ggplot(d, aes(x = hp, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p4 <- ggplot(d, aes(x = drat, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p5 <- ggplot(d, aes(x = wt, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p6 <- ggplot(d, aes(x = qsec, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p7 <- ggplot(d, aes(x = vs, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
# Switch back to mtcars to get ggplot to correctly draw smoothed regression line for am
p8 <- ggplot(mtcars, aes(x = am, y = mpg)) + geom_point() + geom_smooth(method = "lm")
p9 <- ggplot(d, aes(x = gear, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p10 <- ggplot(d, aes(x = carb, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
# Create multi-panel plot
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, cols = 3)
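As a rough numeric companion to the scatterplots, the linear association between mpg and each of the other variables can be summarized with Pearson correlations. This is only a sketch: it treats the 0/1 and count variables as numeric, and the names `mpg_cor` and `mpg_cor_sorted` are hypothetical.

```r
# Pearson correlation of mpg with each of the other 10 variables
others <- setdiff(names(mtcars), "mpg")
mpg_cor <- sapply(mtcars[, others], function(x) cor(mtcars$mpg, x))

# Order from strongest to weakest linear association (by absolute value)
mpg_cor_sorted <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_cor_sorted, 2))
```

A correlation only measures pairwise linear association, so it does not account for relationships among the predictors themselves.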
From the scatterplots above, we can see that:
As a first pass, consider a model with all variables used as predictors (model 1):
# Fit model 1 using all variables as predictors
m1 <- lm(mpg ~ ., mtcars)
# Summary
summary(m1)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
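None of the coefficients in model 1 is significant at the 5% level; wt comes closest (p = 0.063). Because the overall F-test is nevertheless highly significant (p-value 3.793e-07), this pattern suggests that the predictors are strongly correlated with one another rather than unrelated to mpg. Clearly we can do better than model 1. While we could do ad hoc testing of different combinations of variables, a more systematic (and automated) approach is to use the step function to choose a model based on AIC values (the output of this step is suppressed to reduce report length by setting trace = 0).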
# Use step function to find best model
m2 <- step(m1, direction = "both", trace = 0)
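One plausible reason so few coefficients stand out in the full model is collinearity among the predictors. A minimal sketch for checking this is to compute variance inflation factors (VIFs) by hand in base R: each predictor is regressed on the remaining predictors and VIF = 1 / (1 - R²). The names `predictors` and `vif` are hypothetical, and this check is not part of the original analysis.

```r
# Predictors used in the full model (everything except mpg)
predictors <- setdiff(names(mtcars), "mpg")

# VIF for each predictor: regress it on all the other predictors
vif <- sapply(predictors, function(v) {
  rest <- setdiff(predictors, v)
  fit <- lm(reformulate(rest, response = v), data = mtcars)
  1 / (1 - summary(fit)$r.squared)
})

print(round(sort(vif, decreasing = TRUE), 2))
```

Larger values indicate that a predictor is largely explained by the others, which inflates the standard error of its coefficient; values above roughly 5 to 10 are often treated as a warning sign, although that cutoff is a convention rather than a rule.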
The best fit found using the step function is:
# Summary
summary(m2)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
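To see how the reduced model compares with the full one, the two fits can be compared directly. The sketch below refits both models under the hypothetical names `m_full` and `m_step` (so it runs on its own), compares their AIC values, and runs a nested-model F-test. Note that `AIC()` and the value that `step` reports internally differ by an additive constant for a given data set, so the ranking of models is the same.

```r
# Refit the full model and the model selected by step()
m_full <- lm(mpg ~ ., data = mtcars)
m_step <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Lower AIC indicates a better trade-off between fit and complexity
aic_values <- AIC(m_full, m_step)
print(aic_values)

# F-test: do the 7 extra predictors in the full model improve the fit?
f_test <- anova(m_step, m_full)
print(f_test)
```

A large p-value in the F-test means the extra predictors in the full model do not explain significantly more variation than the three-variable model.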
Diagnostic plots for model 2 are shown below:
par(mfrow = c(2,2))
plot(m2)
The most extreme observations in each panel are shown with labels (Chrysler Imperial, Fiat 128, Toyota Corolla, Merc 230). However, despite the presence of these points, the diagnostic plots don’t reveal any systematic error in model 2.
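The visual check can be supplemented with numeric influence measures. The following is a sketch only: it refits the model under the hypothetical name `m_step`, and the 2p/n leverage cutoff is a common rule of thumb, not something taken from the original analysis.

```r
# Refit the selected model so this block runs on its own
m_step <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Leverage (diagonal of the hat matrix) and Cook's distance per car
lev <- hatvalues(m_step)
cooks <- cooks.distance(m_step)

# Rule-of-thumb cutoff: leverage above 2 * p / n
n <- nrow(mtcars)
p <- length(coef(m_step))
high_lev <- names(lev)[lev > 2 * p / n]
print(high_lev)

# The three observations with the largest Cook's distance
print(round(sort(cooks, decreasing = TRUE)[1:3], 3))
```

Leverage measures how unusual a car's predictor values are, while Cook's distance combines leverage with residual size to indicate how much a single observation shifts the fitted coefficients.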
From the fit of model 2 and the exploratory data analysis, it appears that manual transmissions have better gas mileage than automatic transmissions. We can construct a confidence interval (CI) to test whether or not the difference between the two transmission types is significant, and quantify how much of an impact transmission type has on MPG. The confidence interval is given by:
\[\text{CI} = \text{Estimate} \pm \left(t_{\text{quantile}} \times \text{Standard Error}\right)\]
Coding that relationship and using a 95% confidence level:
# Estimate
est <- coef(m2)["am"]
# Standard error (get from model summary)
se <- coef(summary(m2))["am", "Std. Error"]
# Make t quantile
tquant <- qt(p = 0.975, df = m2$df.residual)
# Calculate CI
CI <- est + c(-1,1) * tquant * se
# Print CI
CI
## [1] 0.04573031 5.82594408
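As a cross-check, base R's `confint` function computes the same t-based interval directly from a fitted `lm` object. The sketch below refits the model under the hypothetical name `m_step` so that it is self-contained, then compares the manual calculation with the built-in one.

```r
# Refit the selected model so this block runs on its own
m_step <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Manual interval: estimate +/- t quantile * standard error
est <- unname(coef(m_step)["am"])
se <- coef(summary(m_step))["am", "Std. Error"]
ci_manual <- est + c(-1, 1) * qt(0.975, df = m_step$df.residual) * se

# Built-in interval for the am coefficient
ci_builtin <- confint(m_step, parm = "am", level = 0.95)
print(ci_builtin)

# The two approaches should agree
print(all.equal(ci_manual, as.numeric(ci_builtin)))
```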
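Given that the CI doesn’t include 0 and that the p-value for am is below 0.05 (0.0467), we conclude that there is a statistically significant difference in MPG between transmission types at the 5% level, and that, holding weight and quarter-mile time constant, manual transmissions have a 0.05 to 5.83 MPG advantage over automatic transmissions (95% confidence interval). Because the lower bound of the interval is close to 0, the evidence for a difference is modest rather than overwhelming.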
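To put the adjusted estimate in context, it can be compared with the raw difference in group means, which ignores weight and quarter-mile time. This is a minimal sketch with hypothetical names (`group_means`, `raw_diff`, `adj_diff`); it is not part of the original analysis.

```r
# Mean mpg by transmission type (0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Unadjusted difference: manual minus automatic
raw_diff <- unname(group_means["1"] - group_means["0"])

# Adjusted difference: am coefficient from the selected model
adj_diff <- unname(coef(lm(mpg ~ wt + qsec + am, data = mtcars))["am"])

print(round(c(raw = raw_diff, adjusted = adj_diff), 2))
```

The adjusted difference is smaller than the raw one, which is consistent with part of the apparent manual advantage being attributable to differences in weight and quarter-mile time between the two groups of cars.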
[1] Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
[2] http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/