Regression Modeling Course Project

Executive Summary

The purpose of this course project was to determine what impact transmission type (automatic or manual) has on fuel economy (miles driven per gallon of fuel consumed, MPG) in the mtcars data set. Based on the analysis presented below, vehicles with manual transmissions have a 0.05 to 5.83 MPG advantage over automatic transmissions (95% confidence interval), after adjusting for vehicle weight and quarter-mile time.

Introduction

The goal of this project is to determine how vehicle gas mileage (miles per gallon, abbreviated here as MPG) varies as a function of other vehicle characteristics (transmission type, weight, etc.) using the data available in the mtcars data set. Additionally, the problem statement requests answers to the questions:

Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions.

Exploratory Data Analysis

The mtcars data set contains 32 observations on 11 variables (Henderson and Velleman, 1981)¹:

Column #	Variable	Description
[, 1]	mpg	Miles/(US) gallon
[, 2]	cyl	Number of cylinders
[, 3]	disp	Displacement (cu.in.)
[, 4]	hp	Gross horsepower
[, 5]	drat	Rear axle ratio
[, 6]	wt	Weight (1000 lbs)
[, 7]	qsec	1/4 mile time
[, 8]	vs	V/S (0 = V-engine, 1 = straight-engine)
[, 9]	am	Transmission (0 = automatic, 1 = manual)
[,10]	gear	Number of forward gears
[,11]	carb	Number of carburetors

Given the focus of this project, we’re only interested in the relationship between mpg and the other 10 variables.

A quick look at the structure of the data set:

# Data set structure
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

# Summary of each variable
summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

# How many unique values for each variable?
apply(X = mtcars,MARGIN = 2,FUN = function(x) length(unique(x)))

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##   25    3   27   22   22   29   30    2    2    3    6

From the quick summary, we note that several variables (cyl, vs, am, gear, and carb) only take on integer values, and that vs and am only have two possible values (0 or 1) to indicate type. For this analysis we’re particularly interested in transmission type (am), so we change that variable to a factor (making a new data.frame d for all of our subsequent analysis) so we can more easily track the difference between it and other variables:

# Make copy of mtcars called d
d <- mtcars

# Change specified column from numeric to factor
d$am   <- as.factor(d$am)

Next, we can look at scatterplots of mpg vs. each of the other variables. The multipanel plot below was created using the multiplot function from the R Graphics Cookbook² (the code for the multiplot function has been excluded from the generated report to save on report length). Note that in the legend of each plot “0” indicates automatic and “1” indicates manual transmission.

# Load ggplot2, which provides ggplot(), geom_point(), and geom_smooth()
library(ggplot2)

# Create scatterplots with mpg on the y-axis
p1  <- ggplot(d, aes(x = cyl, y = mpg))  + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p2  <- ggplot(d, aes(x = disp, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p3  <- ggplot(d, aes(x = hp, y = mpg))   + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p4  <- ggplot(d, aes(x = drat, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p5  <- ggplot(d, aes(x = wt, y = mpg))   + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p6  <- ggplot(d, aes(x = qsec, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p7  <- ggplot(d, aes(x = vs, y = mpg))   + geom_point(aes(color = am)) + geom_smooth(method = "lm")
# Switch back to mtcars to get ggplot to correctly draw smoothed regression line for am
p8  <- ggplot(mtcars, aes(x = am, y = mpg)) + geom_point() + geom_smooth(method = "lm")
p9  <- ggplot(d, aes(x = gear, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")
p10 <- ggplot(d, aes(x = carb, y = mpg)) + geom_point(aes(color = am)) + geom_smooth(method = "lm")

# Create multi-panel plot
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, cols = 3)

From the scatterplots above, we can see that:

- MPG is inversely related to cyl, wt, disp, carb, and hp (i.e., MPG drops as any of those variables increases).
  - However, subject matter knowledge tells us that the numbers of cylinders and carburetors are essentially proxies for engine displacement, which in turn is largely responsible for the engine’s horsepower.
- MPG is somewhat positively related to gear, qsec, vs, drat, and am.
  - All of the positive relationships are much noisier than the inverse relationships. For example, several 5-gear cars have MPG ratings just as bad as those of 3-gear cars.
  - The mpg ~ am plot (bottom row, middle column) suggests that automatic transmissions (“0”s) have worse MPG than manual transmissions (“1”s).
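The visual impressions above can also be checked numerically. As a rough sketch (not part of the original report), the Pearson correlation of mpg with each other variable gives the sign and strength of each linear relationship, and the correlations among cyl, disp, and hp indicate how much those predictors overlap. The object names mpg_cor and engine_cor are assumed for illustration.

```r
# Correlation of mpg with every other variable, sorted from most negative
mpg_cor <- sort(cor(mtcars)["mpg", -1])
round(mpg_cor, 2)

# Pairwise correlations among the engine-size variables
engine_cor <- cor(mtcars[, c("cyl", "disp", "hp")])
round(engine_cor, 2)
```

Strong correlations among predictors are one reason individual coefficients can look insignificant when all variables are entered into a single model.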

Regression Analysis

Model Selection

As a first pass, consider a model with all variables used as predictors (model 1):

# Fit model 1 using all variables as predictors
m1 <- lm(mpg ~ ., mtcars)

# Summary
summary(m1)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

No variable in model 1 is significant at the 5% level; wt comes closest (p = 0.063). Clearly we can do better than model 1. While we could do ad hoc testing of different combinations of variables, a more systematic (and automated) approach is to use the step function to choose a model based on AIC values (the output of this step is suppressed to reduce report length by setting trace = 0).

# Use step function to find best model
m2 <- step(m1, direction = "both", trace = 0)

The best fit found using the step function is:

# Summary
summary(m2)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
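One way to sanity-check the stepwise result is to compare the two fits directly. The sketch below refits both models so that it runs on its own (the names full_fit, reduced_fit, and cmp are assumed), compares their AIC values, and uses a nested-model F-test to ask whether the seven dropped predictors add explanatory power.

```r
# Refit the full model and the model chosen by step()
full_fit    <- lm(mpg ~ ., data = mtcars)
reduced_fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Lower AIC indicates the preferred model
AIC(full_fit, reduced_fit)

# Nested-model F-test: a large p-value means the extra predictors
# in the full model do not significantly improve the fit
cmp <- anova(reduced_fit, full_fit)
cmp
```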

Diagnostics

Diagnostic plots for model 2 are shown below:

par(mfrow = c(2,2))
plot(m2)

The most extreme observations in each plot are shown with labels (Chrysler Imperial, Fiat 128, Toyota Corolla, Merc 230). Despite the presence of these points, the diagnostic plots don’t reveal any systematic problem with model 2.
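The plot labels can be complemented with numeric diagnostics. As a hedged sketch, hat values measure leverage and Cook's distance measures each observation's influence on the fitted coefficients. The cutoff of twice the mean hat value is a common rule of thumb rather than something the original report used, and fit, lev, and cutoff are assumed names.

```r
# Refit the selected model so the example runs on its own
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Leverage (hat values); they sum to the number of fitted coefficients
lev <- hatvalues(fit)
cutoff <- 2 * mean(lev)  # rule-of-thumb threshold (assumption)
lev[lev > cutoff]

# Three observations with the largest Cook's distance
head(sort(cooks.distance(fit), decreasing = TRUE), 3)
```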

Inference

From the fit of model 2 and the exploratory data analysis, it appears that manual transmissions have better gas mileage than automatic transmissions. We can construct a confidence interval (CI) to test whether the difference between the two transmission types is significant and to quantify how much of an impact transmission type has on MPG. The confidence interval for the am coefficient is given by:

\[CI = \text{Estimate} \pm \left(t_{\text{quantile}} \times \text{Standard Error}\right)\]

Coding that relationship at a 95% confidence level:

# Estimate
est <- coef(m2)["am"]

# Standard error (get from model summary)
se <- coef(summary(m2))["am", "Std. Error"]

# Make t quantile
tquant <- qt(p = 0.975, df = m2$df.residual)

# Calculate CI
CI <- est + c(-1,1) * tquant * se

# Print CI
CI

## [1] 0.04573031 5.82594408
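The same interval can be obtained from R's built-in confint function, which is a convenient cross-check on the manual calculation. The sketch refits the selected model so that it runs on its own; fit and ci_am are assumed names.

```r
# Refit the selected model
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# 95% confidence interval for the am coefficient
ci_am <- confint(fit, parm = "am", level = 0.95)
ci_am
```

The two bounds should agree with the interval computed by hand above.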

Conclusions

Given that the CI doesn’t include 0 and that the p-value for am is below 0.05 (p = 0.0467), we conclude that there is a significant difference in MPG between transmission types, and that, holding weight and quarter-mile time constant, manual transmissions have a 0.05 to 5.83 MPG advantage over automatic transmissions.

References

1. Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
2. http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/