Executive Summary

Motor Trend, a popular automotive magazine, is interested in the relationship between miles per gallon (mpg) and transmission type (am). To analyze this relationship, we use the 1974 Motor Trend Car Road Tests data, which ships with R as the mtcars data set. Ultimately, Motor Trend would like to answer two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. How large is the MPG difference between automatic and manual transmissions?

After conducting the exploratory and regression analysis that follows, we find that we cannot say which transmission type is better or worse without considering other variables, e.g. the weight of the car. Once these additional variables are taken into account, our models estimate that manual transmissions get 1.8 to 2.9 more miles per gallon than automatic transmissions.

This analysis was conducted as part of the Coursera Data Science Specialization.

Data Gathering and Cleaning

We load the required libraries, gather the data, and make sure the data is cleaned, transformed, and labeled appropriately.

# load required packages, suppressing startup messages and warnings
suppressMessages(library(data.table))
suppressWarnings(library(ggplot2))
suppressWarnings(library(GGally))
# load the built-in mtcars data into a data.table variable
dt <- data.table(mtcars)
# convert the categorical columns to factors before performing the analysis
dt$cyl <- factor(dt$cyl)
dt$vs <- factor(dt$vs)
dt$gear <- factor(dt$gear)
dt$am <- factor(dt$am, levels = c(0, 1), labels = c('Automatic', 'Manual'))
dt$carb <- factor(dt$carb)

Exploratory Analysis

The str function provides a compact profile of the data set.

str(dt)
## Classes 'data.table' and 'data.frame':   32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Based on this profile, several important variables are captured in the pairs visual (see Appendix). At first glance, it would appear that low weight, low horsepower, 4 cylinder ‘manual’ cars achieve the highest miles per gallon. This impression needs to be justified, and the regression analysis below does so.

Regression Analysis

We begin with a simple linear model that addresses the first question directly. When we regress mpg on am alone, manual cars show a 7.24 mile per gallon increase over automatic cars.

fit<-lm(mpg~am,data=dt); summary(fit)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amManual     7.244939   1.764422  4.106127 2.850207e-04

However, the R squared value shows that this model explains only 36% of the variance in mpg.

summary(fit)$r.squared
## [1] 0.3597989
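
The 7.24 estimate can be sanity-checked without a regression. The sketch below recomputes it as a plain difference in group means on the raw mtcars data, where am is coded 0 (automatic) and 1 (manual). The names group_means and mean_diff are illustrative and are not part of the original analysis.

```r
# mean mpg for each transmission type (0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# manual minus automatic; with a single two-level predictor this
# equals the amManual coefficient of lm(mpg ~ am)
mean_diff <- unname(group_means["1"] - group_means["0"])
round(mean_diff, 2)
```

Because am is the only predictor, the intercept is the automatic group mean and the amManual coefficient is the gap between the two group means, which is why this unadjusted figure says nothing about weight or other differences between the cars.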

If we extend the existing model to adjust for two numeric variables, the weight of the car (wt) and the quarter mile time (qsec), the estimated difference between an ‘automatic’ and a ‘manual’ transmission falls to about 2.9 miles per gallon.

fit2<-lm(mpg~wt+qsec+am,data=dt); summary(fit2)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## amManual     2.935837  1.4109045  2.080819 4.671551e-02

This model produces an R squared value of 0.85, meaning that 85% of the variance in mpg is explained. This is a promising model.

summary(fit2)$r.squared
## [1] 0.8496636
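
A point estimate alone does not show how precisely the transmission effect is measured. As a minimal sketch, the code below refits the same model on the built-in mtcars data and asks for a 95% confidence interval on the amManual coefficient using confint. The names cars, fit_adj, and ci_am are illustrative and do not appear in the original analysis.

```r
# recode am the same way as in the data preparation step
cars <- transform(
  mtcars,
  am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

# same specification as fit2: weight and quarter mile time as adjusters
fit_adj <- lm(mpg ~ wt + qsec + am, data = cars)

# 95% confidence interval for the manual-vs-automatic difference
ci_am <- confint(fit_adj, "amManual", level = 0.95)
ci_am
```

An interval that stays above zero agrees with the p-value just under 0.05 reported in the coefficient table, while its width shows how uncertain the size of the effect remains with only 32 cars.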

To take the analysis a step further, we can consider the remaining variables, including the categorical ones, in search of an even more reliable model. However, with the fairly large number of variables in the source data set, a trial-and-error approach would be tedious and time-consuming. R provides the step function for stepwise model selection, which we run here in the backward direction starting from the full linear model. The selected model keeps the number of cylinders (cyl), the horsepower (hp), and the weight (wt) alongside transmission type, and estimates a 1.8 mile per gallon increase for manual transmissions, although this coefficient is not statistically significant at the 5% level (p = 0.21). Note that these are the same variables highlighted by the exploratory visualization.

fit3<-step(lm(data=dt,mpg~.),steps=1000,direction='backward',trace=FALSE)
summary(fit3)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## cyl6        -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8        -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp          -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt          -2.49682942 0.88558779 -2.819404 9.081408e-03
## amManual     1.80921138 1.39630450  1.295714 2.064597e-01

Furthermore, the R squared value rises to 0.87, and a quick diagnostic check suggests that the slope coefficients are not driven by individual observations: for the first ten cars, the dfbetas of the cyl6 coefficient stay within about ±0.3.

summary(fit3)$r.squared; c(min(dfbetas(fit3)[1:10,2]),max(dfbetas(fit3)[1:10,2]))
## [1] 0.8658799
## [1] -0.2808982  0.2916951
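
The selection and diagnostic steps above can be sketched end to end on the built-in mtcars data. This is a minimal illustration under stated assumptions, not the exact code of the report: it extends the dfbetas check to the amManual coefficient and to all 32 cars, and it uses the commonly cited 2/sqrt(n) rule of thumb as a cutoff, which the original analysis does not use. The names cars, full_model, selected, am_dfbetas, and cutoff are illustrative.

```r
# same factor conversions as in the data preparation step
cars <- within(mtcars, {
  cyl  <- factor(cyl)
  vs   <- factor(vs)
  gear <- factor(gear)
  carb <- factor(carb)
  am   <- factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
})

# backward elimination from the full model; step() drops the term whose
# removal lowers AIC the most and stops when no removal helps
full_model <- lm(mpg ~ ., data = cars)
selected <- step(full_model, direction = "backward", trace = FALSE)
formula(selected)

# standardized change in the amManual coefficient when each car is left out
am_dfbetas <- dfbetas(selected)[, "amManual"]
range(am_dfbetas)

# assumed rule-of-thumb cutoff: flag cars with |dfbetas| > 2 / sqrt(n)
cutoff <- 2 / sqrt(nrow(cars))
which(abs(am_dfbetas) > cutoff)
```

Setting trace = TRUE instead prints the AIC of every candidate model at each step, which makes it easier to see how close the competing models are.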

Conclusion

The latter two models, fit2 and fit3, explain a relatively high percentage of the variance in miles per gallon (mpg). While other variables in the data set did confound the direct relationship between transmission type and miles per gallon, we can conclude with a fairly high degree of certainty that manual transmissions do get 1.8 to 2.9 miles per gallon more than automatic transmissions. This is, of course, a much smaller difference than the 7.24 miles per gallon we see when we do not consider confounders in the data.

Appendix

Data Table Summary:

summary(dt)
##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec       vs             am     gear   carb  
##  Min.   :1.513   Min.   :14.50   0:18   Automatic:19   3:15   1: 7  
##  1st Qu.:2.581   1st Qu.:16.89   1:14   Manual   :13   4:12   2:10  
##  Median :3.325   Median :17.71                         5: 5   3: 3  
##  Mean   :3.217   Mean   :17.85                                4:10  
##  3rd Qu.:3.610   3rd Qu.:18.90                                6: 1  
##  Max.   :5.424   Max.   :22.90                                8: 1

Model Variance Summary:

anova(fit,fit2,fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
## Model 3: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS Df Sum of Sq       F   Pr(>F)    
## 1     30 720.90                                  
## 2     28 169.29  2    551.61 47.4816 2.09e-09 ***
## 3     26 151.03  2     18.26  1.5718   0.2268    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
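
One caveat when reading this table: fit2 and fit3 are not nested, since fit2 contains qsec while fit3 contains cyl and hp, so the F test in the last row is only a rough guide. As a hedged alternative, the sketch below compares the three specifications with AIC, which does not require nesting. It refits the models on the built-in mtcars data, and the names cars, m1, m2, m3, and aic_table are illustrative.

```r
# recode the two factors used by these models
cars <- transform(
  mtcars,
  cyl = factor(cyl),
  am  = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

# the three specifications compared in the table above
m1 <- lm(mpg ~ am, data = cars)
m2 <- lm(mpg ~ wt + qsec + am, data = cars)
m3 <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# lower AIC indicates a better trade-off between fit and complexity
aic_table <- AIC(m1, m2, m3)
aic_table
```

The df column counts the estimated parameters, including the residual variance, so the penalty grows with each added term.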

Pairs (ggpairs) Visual:

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)