Executive Summary

This report analyzes the mtcars data set included in R, looking specifically at the effect of transmission type (automatic vs. manual) on fuel economy. At first glance, cars with manual transmissions appear to get higher mpg: their mean is over 7 miles per gallon higher than that of cars with automatic transmissions. However, once we account for some important confounding variables, this difference shrinks. With weight, number of cylinders, and horsepower held constant, the estimated advantage of a manual transmission is only 1.48 mpg, and it is no longer statistically significant.

To begin, we’ll run a simple linear regression using only the mpg and am variables. We’ll convert the am variable to a factor and rename its levels for easier interpretation of the model. The coefficients for this model are as follows:

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amManual     7.244939   1.764422  4.106127 2.850207e-04
## Adjusted R-squared  0.3384589

We see that the mean for automatic transmissions, the reference level, is 17.1 mpg, while the mean for manual transmissions is 7.2 mpg higher, or 24.4 mpg. A Welch two-sample t-test on these groups gives a significant p-value of 0.0014. However, a closer look at our simple linear model shows that the adjusted R-squared is quite low: the model explains only 34% of the variability in mpg. We have some evidence that transmission matters, but the picture is incomplete. We need to look at the other variables in the data set to see whether any of them are related to both mpg and transmission.
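As a quick check on the reading of the coefficients above, the sketch below refits the simple model and compares it with the raw group means. The object names (`cars`, `fit_simple`, `group_means`, `welch`) are illustrative and not taken from the original analysis.

```r
# Self-contained sketch; object names are illustrative assumptions.
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

fit_simple <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# The intercept is the mean of the reference group (Auto);
# intercept + slope is the mean of the Manual group.
coef(fit_simple)[["(Intercept)"]]
sum(coef(fit_simple))
group_means

# Welch two-sample t-test (does not assume equal variances)
welch <- t.test(mpg ~ am, data = cars)
welch$p.value
```

With a single two-level factor as the only regressor, the regression simply reproduces the two group means, which is why the model adds little beyond the t-test.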

Let’s use exploratory data analysis (EDA) to find these relationships. We’ll start by seeing which variables are most strongly correlated with mpg in absolute terms.

##        wt       cyl      disp        hp      drat        vs        am 
## 0.8676594 0.8521620 0.8475514 0.7761684 0.6811719 0.6640389 0.5998324 
##      carb      gear      qsec 
## 0.5509251 0.4802848 0.4186840

Weight and number of cylinders are the variables most strongly correlated with mpg. Displacement is very closely correlated with number of cylinders, so we’ll skip it and look at horsepower instead. We’ll plot mpg against weight and color the points by transmission type. Then we’ll plot mpg against hp and color the points by number of cylinders.
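The variable selection described above can be sketched as follows. The names `cors`, `ranked`, and `disp_cyl` are illustrative; the code only uses the numeric mtcars columns.

```r
# Self-contained sketch; object names are illustrative assumptions.
data(mtcars)

# Correlation of every other variable with mpg (mpg itself is dropped)
cors <- cor(mtcars)[-1, "mpg"]
ranked <- sort(abs(cors), decreasing = TRUE)
ranked

# Correlation between displacement and number of cylinders,
# the pair the text treats as largely redundant
disp_cyl <- cor(mtcars$disp, mtcars$cyl)
disp_cyl
```

Dropping one of two highly correlated regressors is a judgment call; it keeps the later model simpler at the cost of not testing displacement directly.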

There are clear linear relationships between mpg and each of weight, number of cylinders, and horsepower. We also see in the first plot that most of the cars with manual transmissions weigh under 3,000 lbs, while most of the cars with automatics weigh over 3,000 lbs. It’s probably fair to assume that automatic transmissions are usually heavier than manuals, but the extent to which that is true is beyond the scope of this analysis. Our takeaway is that weight, number of cylinders, and horsepower confound the relationship between transmission and mpg.
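The weight split between the two transmission groups can also be checked numerically rather than read off the plot. This is a minimal sketch; `cars`, `wt_means`, and `wt_split` are illustrative names, and the 3,000 lb cutoff is the one used in the text (wt is recorded in units of 1,000 lbs).

```r
# Self-contained sketch; object names are illustrative assumptions.
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Mean weight (in 1,000 lbs) for each transmission type
wt_means <- tapply(cars$wt, cars$am, mean)
wt_means

# Count cars above and below 3,000 lbs within each transmission type
wt_split <- table(Transmission = cars$am, Over3000lbs = cars$wt > 3)
wt_split
```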

Let’s create a new model with these three additional regressors and see how the effect of transmission type changes. The coefficients and adjusted R-squared from this new model are below.

##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 36.14653575 3.10478079 11.642218 4.944804e-12
## amManual     1.47804771 1.44114927  1.025603 3.141799e-01
## wt          -2.60648071 0.91983749 -2.833632 8.603218e-03
## cyl         -0.74515702 0.58278741 -1.278609 2.119166e-01
## hp          -0.02495106 0.01364614 -1.828433 7.855337e-02
## Adjusted R-squared  0.8266657
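To put the two models side by side, the sketch below fits both and compares the transmission coefficient and the adjusted R-squared. The names `fit_simple`, `fit_adj`, `am_effect`, and `adj_r2` are illustrative, and cyl is treated as numeric, as the single `cyl` row in the table above suggests.

```r
# Self-contained sketch; object names are illustrative assumptions.
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

fit_simple <- lm(mpg ~ am, data = cars)
fit_adj <- lm(mpg ~ am + wt + cyl + hp, data = cars)

# Transmission coefficient before and after adjustment
am_effect <- c(simple = coef(fit_simple)[["amManual"]],
               adjusted = coef(fit_adj)[["amManual"]])
am_effect

# Adjusted R-squared for each model
adj_r2 <- c(simple = summary(fit_simple)$adj.r.squared,
            adjusted = summary(fit_adj)$adj.r.squared)
adj_r2

# F-test for the three added regressors (the models are nested)
anova(fit_simple, fit_adj)
```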

We now see that the estimated increase in mpg for a manual over an automatic has been reduced to 1.478. A rough range for this increase, one standard error on either side of the estimate, can be computed as follows:

interval <- summary(fit)$coef["amManual", 1] + c(-1, 1) * summary(fit)$coef["amManual", 2]

The lower bound of this interval is an increase of just 0.04 mpg for manuals, and the upper bound is 2.92. Because this is only a one-standard-error range, a 95% confidence interval (roughly two standard errors on either side) would extend below zero, which is consistent with the coefficient's p-value of 0.31: the effect is no longer significant. The adjusted R-squared is also much higher than in our original regression. Approximately 83% of the variability in mpg is explained by the new model.
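The interval above spans one standard error on either side of the estimate. As a hedged complement, the sketch below also computes a conventional 95% confidence interval for the transmission coefficient, both by hand from the t distribution and with `confint()`. The names `fit_adj`, `est`, `se`, `one_se`, `ci_manual`, and `ci_builtin` are illustrative and not part of the original analysis.

```r
# Self-contained sketch; object names are illustrative assumptions.
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

fit_adj <- lm(mpg ~ am + wt + cyl + hp, data = cars)
coefs <- coef(summary(fit_adj))
est <- coefs["amManual", "Estimate"]
se <- coefs["amManual", "Std. Error"]

# One-standard-error range, as used in the text
one_se <- est + c(-1, 1) * se
one_se

# 95% confidence interval from the t distribution with residual df
t_crit <- qt(0.975, df = df.residual(fit_adj))
ci_manual <- est + c(-1, 1) * t_crit * se
ci_manual

# The same interval from the built-in helper
ci_builtin <- confint(fit_adj, "amManual", level = 0.95)
ci_builtin
```

If the 95% interval contains zero, that agrees with a p-value above 0.05 for the same coefficient, since both come from the same t statistic.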

Conclusion

A manual transmission is, in a strictly practical sense, “better” for mpg, i.e., the cars with manuals in this data set have higher mpg ratings. However, from the data available and our interpretation of the models we’ve created, we cannot say this is true in a statistical sense. When we compare manuals to automatics while holding weight, number of cylinders, and horsepower constant, the difference in mpg is dramatically reduced and is no longer statistically significant.

Note: Residuals and diagnostics are plotted in the appendix. They did not return any distinct patterns or anomalies which would lead us to question the accuracy of our models.

Appendix

data(mtcars)
mtcars.num <- mtcars
# split into manual and auto DFs and calc. mean and sd of mpg
s <- split(mtcars, mtcars$am)
man <- s[[2]]
auto <- s[[1]]
mean(man$mpg)
## [1] 24.39231
mean(auto$mpg)
## [1] 17.14737
# convert am var to factor and rename levels for easier reading
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c('Auto', 'Manual')
fit <- lm(mpg ~ am, mtcars)
# correlations between mpg and the other vars, sorted by absolute value
# (row 1 is mpg itself, so it is dropped)
eda.cor <- sort(abs(cor(mtcars.num)[-1, 1]), decreasing=TRUE)

# perform a t-test to see if a simple null hypothesis can be rejected.
t.test(mpg ~ am, data=mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
##   mean in group Auto mean in group Manual 
##             17.14737             24.39231
# EDA code
# multiplot function to place two ggplots side by side
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots <- length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
library(ggplot2)

# mpg vs. weight, coloured by transmission type
p1 <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(colour=am), size=5, alpha=.8) +
  scale_color_manual(values=c('#375e97', "#fb6542", "#3f681c")) +
  theme(panel.background = element_rect(fill = "beige")) +
  xlab("Weight") + ylab("MPG") + ggtitle("Miles per Gallon by Weight")

# mpg vs. horsepower, coloured by number of cylinders, shaped by transmission type
p2 <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point(aes(colour=factor(cyl), shape=am), size=5, alpha=.8) +
  scale_color_manual(values=c('#375e97', "#fb6542", "#3f681c")) +
  theme(panel.background = element_rect(fill = "beige")) +
  xlab("Horsepower") + ylab("MPG") + ggtitle("Miles per Gallon by Horsepower")

# draw the two plots side by side
multiplot(p1, p2, cols=2)

# nested regression models and anova code. am variable is not significant
fit <- lm(mpg ~ wt, data=mtcars)
fit2 <- lm(mpg ~ wt + am, mtcars)
fit3 <- lm(mpg ~ wt + am + cyl, mtcars)
fit4 <- lm(mpg ~ wt + am + cyl + hp, mtcars)
anova(fit, fit2, fit3, fit4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + am
## Model 3: mpg ~ wt + am + cyl
## Model 4: mpg ~ wt + am + cyl + hp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 278.32                                   
## 2     29 278.32  1     0.002  0.0004 0.9850889    
## 3     28 191.05  1    87.273 13.8611 0.0009165 ***
## 4     27 170.00  1    21.049  3.3432 0.0785534 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
# plot residuals and diagnostics from final lm model to verify data is randomly scattered, 
# normally distributed, and not overly affected by outliers
plot(fit4)