Motor Trend, a popular automotive magazine, is interested in the relationship between miles per gallon (mpg) and transmission type (am). To analyze it, we use the 1974 Motor Trend Car Road Tests data, which ships with R as the mtcars dataset. Ultimately, Motor Trend would like to answer two questions: which transmission type, automatic or manual, is better for mpg, and how large the mpg difference between the two is.
After the exploratory and regression analysis below, we find that the question of which transmission type is better cannot be answered without considering other variables, e.g. the weight of the car. Once those variables are taken into account, we estimate that manual transmissions get 1.8 to 2.9 more miles per gallon than automatic transmissions.
This analysis was conducted as part of the Coursera Data Science Specialization.
We will gather the data, reference the appropriate libraries, and ensure data is cleaned, transformed, and labeled appropriately.
# load required packages
suppressMessages(library(data.table))
suppressWarnings(library(ggplot2))
suppressWarnings(library(GGally))
# load the built-in mtcars data into a data.table
dt <- data.table(mtcars)
# convert the categorical variables to factors before the analysis
dt$cyl <- factor(dt$cyl)
dt$vs <- factor(dt$vs)
dt$gear <- factor(dt$gear)
dt$am <- factor(dt$am, levels = c(0, 1), labels = c('Automatic', 'Manual'))
dt$carb <- factor(dt$carb)
The str function, called as str(dt), gives a quick profile of the data.
## Classes 'data.table' and 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
## - attr(*, ".internal.selfref")=<externalptr>
Based on the above data profiling, the pairs plot of the key variables (see the ggpairs visual at the end) suggests at first glance that low-weight, low-horsepower, 4-cylinder ‘manual’ cars get the most miles per gallon. We need to justify this claim, and the regression analysis that follows does so.
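The first-glance impression can also be checked numerically before any modelling. The following is a minimal sketch, assuming a plain data frame named cars (a hypothetical stand-in for dt, rebuilt from base R's mtcars so the snippet runs on its own); it averages mpg, weight, and horsepower within each cylinder/transmission group.

```r
# Assumption: `cars` is a hypothetical stand-in for the report's `dt`,
# rebuilt from base R's mtcars so this snippet is self-contained.
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean mpg, weight and horsepower for each cylinder/transmission combination
by_group <- aggregate(cbind(mpg, wt, hp) ~ cyl + am, data = cars, FUN = mean)

# Sort so the group with the highest average mpg comes first
by_group <- by_group[order(-by_group$mpg), ]
print(by_group)
```

Group means like these are only descriptive: they do not separate the effect of the transmission from that of weight or horsepower, which is what the regression models are for.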
We begin with a simple linear model that addresses the first question directly. Looking at the relationship between mpg and am alone, manual cars average 7.24 more miles per gallon than automatic cars.
fit<-lm(mpg~am,data=dt); summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## amManual 7.244939 1.764422 4.106127 2.850207e-04
However, when we review the R squared value, we see that only 36% of the variance is explained with this model.
summary(fit)$r.squared
## [1] 0.3597989
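With a single two-level factor as the only predictor, the slope of the simple model is just the difference between the two group means. The sketch below illustrates this; the names cars, group_means, and fit_simple are hypothetical and the data are rebuilt from base R's mtcars so the snippet stands alone.

```r
# Assumption: `cars` is a hypothetical stand-in for the report's `dt`.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Average mpg within each transmission type
group_means <- tapply(cars$mpg, cars$am, mean)

# Same simple model as in the report: mpg explained by transmission only
fit_simple <- lm(mpg ~ am, data = cars)

# The amManual coefficient equals the Manual mean minus the Automatic mean
diff_means <- unname(group_means["Manual"] - group_means["Automatic"])
slope <- unname(coef(fit_simple)["amManual"])
print(c(diff_means = diff_means, slope = slope))
```

This makes clear why the 7.24 figure cannot be read as the effect of the transmission alone: it compares two groups of cars that also differ in other respects.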
If we extend the model to adjust for two numeric variables, the weight of the car (wt) and the quarter mile time (qsec), the estimated difference between an ‘automatic’ and a ‘manual’ transmission drops to about 2.9 miles per gallon.
fit2<-lm(mpg~wt+qsec+am,data=dt); summary(fit2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## amManual 2.935837 1.4109045 2.080819 4.671551e-02
This model has an R squared value of 0.85, i.e. it explains 85% of the variance. This is a promising model.
summary(fit2)$r.squared
## [1] 0.8496636
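A point estimate of 2.9 says nothing about its precision, so it can help to look at a confidence interval for the transmission coefficient as well. The following is a minimal sketch using base R's confint on the same model formula; cars and fit_adj are hypothetical names and the data are rebuilt from mtcars.

```r
# Assumption: `cars` is a hypothetical stand-in for the report's `dt`.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Same formula as fit2 in the report
fit_adj <- lm(mpg ~ wt + qsec + am, data = cars)

# 95% confidence interval for the Manual-vs-Automatic difference
ci_am <- confint(fit_adj, "amManual", level = 0.95)
print(ci_am)
```

Given the standard error reported in the coefficient table, the interval is wide: it starts just above zero, which matches the p-value of about 0.047 for amManual.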
To take the analysis a step further, we can consider the remaining variables in search of an even more reliable model. With the fairly large number of variables in the source data set, a trial-and-error approach would be tedious. R provides the step function to iterate over candidate models stepwise, starting here from the linear model on all variables. The selected model keeps the number of cylinders (cyl), the horsepower (hp), and the weight (wt) alongside am; adjusting for these, manual transmissions show a 1.8 mile per gallon increase over automatic ones. Note that these are the same variables that stood out in the exploratory visualization.
fit3<-step(lm(data=dt,mpg~.),steps=1000,direction='backward',trace=FALSE)
summary(fit3)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## cyl6 -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8 -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt -2.49682942 0.88558779 -2.819404 9.081408e-03
## amManual 1.80921138 1.39630450 1.295714 2.064597e-01
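To see which terms the backward search actually retained, the selected model's term labels can be inspected directly. The sketch below repeats the search on a plain data frame; cars, full, reduced, and kept are hypothetical names, and the factor conversions mirror those applied to dt.

```r
# Assumption: `cars` is a hypothetical stand-in for the report's `dt`,
# with the same variables converted to factors.
cars <- mtcars
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Start from the model with every predictor and drop terms by AIC
full <- lm(mpg ~ ., data = cars)
reduced <- step(full, direction = "backward", trace = 0)

# Terms that survived the backward elimination
kept <- attr(terms(reduced), "term.labels")
print(kept)

# AIC of the full and the selected model, side by side
print(AIC(full, reduced))
```

Note that step selects by AIC, not by the significance of individual coefficients, which is why a term such as am can remain in the model with a large p-value.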
Furthermore, the R squared value is even higher at 0.87 and, in checking some diagnostics (the dfbetas of the cyl6 coefficient for the first ten observations), the slope coefficients do not change dramatically.
summary(fit3)$r.squared; c(min(dfbetas(fit3)[1:10,2]),max(dfbetas(fit3)[1:10,2]))
## [1] 0.8658799
## [1] -0.2808982 0.2916951
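The check above looks at one coefficient and ten observations. A broader version of the same diagnostic, sketched here under the assumption that the coefficient of interest is amManual, computes dfbetas for all 32 cars and flags those beyond the common 2/sqrt(n) rule of thumb. The names cars, fit_sel, infl, and flagged are hypothetical.

```r
# Assumption: `cars` is a hypothetical stand-in for the report's `dt`.
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Same formula as the model selected by step() in the report
fit_sel <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Scaled change in each coefficient when each observation is left out
infl <- dfbetas(fit_sel)

# Range of influence on the transmission coefficient over all cars
am_range <- range(infl[, "amManual"])
print(am_range)

# Cars whose influence exceeds the 2 / sqrt(n) rule of thumb
flagged <- rownames(infl)[abs(infl[, "amManual"]) > 2 / sqrt(nrow(cars))]
print(flagged)
```

The threshold is a convention, not a test; flagged cars are candidates for a closer look, not automatically outliers.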
The latter two models, fit2 and fit3, explain a relatively high percentage of the variance in miles per gallon (mpg). While other variables in the data set did confound the direct relationship between transmission type and miles per gallon, we estimate that manual transmissions get 1.8 to 2.9 miles per gallon more than automatic transmissions, although the am coefficient is significant at the 5% level only in fit2 (p = 0.047) and not in fit3 (p = 0.21). This is, of course, much lower than the 7.24 miles per gallon we get when we don’t consider confounders in the data.
Data Table Summary:
summary(dt)
##       mpg        cyl         disp             hp             drat
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930
##        wt             qsec       vs             am     gear   carb
##  Min.   :1.513   Min.   :14.50   0:18   Automatic:19   3:15   1: 7
##  1st Qu.:2.581   1st Qu.:16.89   1:14   Manual   :13   4:12   2:10
##  Median :3.325   Median :17.71                         5: 5   3: 3
##  Mean   :3.217   Mean   :17.85                                4:10
##  3rd Qu.:3.610   3rd Qu.:18.90                                6: 1
##  Max.   :5.424   Max.   :22.90                                8: 1
Model Variance Summary:
anova(fit,fit2,fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
## Model 3: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 47.4816 2.09e-09 ***
## 3 26 151.03 2 18.26 1.5718 0.2268
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
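One caveat on the table above: Model 2 and Model 3 are not nested (one contains qsec, the other cyl and hp), so the F test between them is hard to interpret. An information criterion is one alternative that does not require nesting. The sketch below compares the three formulas by AIC; the names cars, m_am, m_wt_qsec, and m_step are hypothetical, and this is offered as a complementary check, not as part of the original analysis.

```r
# Assumption: `cars` is a hypothetical stand-in for the report's `dt`.
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# The three model formulas from the report
m_am      <- lm(mpg ~ am, data = cars)
m_wt_qsec <- lm(mpg ~ wt + qsec + am, data = cars)
m_step    <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Lower AIC indicates a better trade-off between fit and complexity
aic_table <- AIC(m_am, m_wt_qsec, m_step)
print(aic_table)
```

All three models are fitted to the same 32 observations, which is what makes their AIC values comparable.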
Pairs (ggpairs) Visual:
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)