library(knitr)

Executive Summary

Do cars with manual transmission perform better than cars with automatic transmission? It depends on what you’re measuring. The purpose of this analysis is to explore the relationship between miles per gallon (MPG) and other variables in the mtcars dataset. It addresses the following main questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions.

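The findings suggest that manual transmission is associated with better MPG, but the size of the effect depends on what is held constant. In a simple regression, manual cars average 7.245 MPG more than automatic cars. After adjusting for weight and horsepower, the estimated difference shrinks to about 2.08 MPG and is no longer statistically significant (p = 0.141).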

Analysis

Exploratory Data Analysis

First, let’s explore the data to get a sense of what we have. Here is a boxplot of MPG by transmission type.
According to the plot, cars with manual transmission have a higher median MPG than cars with automatic transmission. There are no visible outliers that could influence the regression.
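
A plot like the one described could be drawn along the following lines in base R. This is only a sketch: the factor labels, axis titles, and the names `transmission` and `group_medians` are illustrative assumptions, not taken from the original analysis.

```r
data(mtcars)

# In mtcars, am is coded 0 = automatic, 1 = manual
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("Automatic", "Manual"))

# Boxplot of MPG by transmission type (titles are illustrative)
boxplot(mtcars$mpg ~ transmission,
        xlab = "Transmission", ylab = "Miles per gallon",
        main = "MPG by transmission type")

# The thick line in each box is the group median
group_medians <- tapply(mtcars$mpg, transmission, median)
group_medians
```

Printing the medians alongside the plot makes it easy to confirm which group sits higher.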


Here is a histogram of MPG to check for outliers, just in case.


The distribution looks approximately normal.
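
With only 32 observations, a histogram is a rough check of normality. A hedged sketch of how the histogram could be drawn, together with a Shapiro-Wilk test as a numerical complement, is shown here. The number of breaks, the titles, and the name `sw_mpg` are assumptions made for illustration.

```r
data(mtcars)

# Histogram of MPG (break count and titles are illustrative choices)
hist(mtcars$mpg, breaks = 10,
     xlab = "Miles per gallon", main = "Histogram of MPG")

# Shapiro-Wilk test: a small p-value would be evidence against normality
sw_mpg <- shapiro.test(mtcars$mpg)
sw_mpg$p.value
```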


Finally, let’s look at the distribution or spread of values for the variable mpg.
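
One simple way to inspect the spread of mpg is a five-number summary plus the standard deviation. The sketch below uses only base R; the names `mpg_summary` and `mpg_sd` are illustrative.

```r
data(mtcars)

# Minimum, quartiles, mean, and maximum of mpg
mpg_summary <- summary(mtcars$mpg)
mpg_summary

# Standard deviation as a single-number measure of spread
mpg_sd <- sd(mtcars$mpg)
mpg_sd
```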

Simple Linear Regression

First, let’s fit a simple linear regression of mpg on am (transmission), with mpg as the outcome and am as the only predictor. In mtcars, am is coded 0 for automatic and 1 for manual.

data(mtcars)
fit <- lm(mpg ~ am, data=mtcars)


Now, summarize the fit to interpret the intercept and the am coefficient.

summary(fit)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04

On average, cars with automatic transmission get 17.147 MPG (the intercept). Cars with manual transmission get 7.245 MPG more (the am coefficient), for an average of 24.392 MPG.
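
Because am is a 0/1 indicator, the intercept and slope of this model are just the automatic-group mean and the difference between the two group means. A short sketch that checks this directly (the name `group_means` is illustrative):

```r
data(mtcars)
fit <- lm(mpg ~ am, data = mtcars)

# Mean MPG per transmission group ("0" = automatic, "1" = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
group_means

# Intercept equals the mean of the automatic group
unname(coef(fit)["(Intercept)"])

# am coefficient equals the difference between the group means
unname(coef(fit)["am"])
unname(group_means["1"] - group_means["0"])
```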


We can use the results from the summary above to calculate a 95 percent confidence interval for am (transmission):

n <- length(mtcars$mpg)
alpha <- 0.05
pe <- coef(summary(fit))["am", "Estimate"]
se <- coef(summary(fit))["am", "Std. Error"]
tstat <- qt(1 - alpha/2, df = n - 2)          # n - 2 residual df: intercept and slope are estimated
pe + c(-1,1) * (se * tstat)
## [1]  3.64151 10.84837
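
As a cross-check, the same interval can be obtained from the built-in confint function, which applies the same t-based formula to a fitted lm object. The name `ci_am` is illustrative.

```r
data(mtcars)
fit <- lm(mpg ~ am, data = mtcars)

# 95 percent confidence interval for the am coefficient
ci_am <- confint(fit, "am", level = 0.95)
ci_am
```

The bounds should agree with the manual calculation.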

The p-value for am is about 0.0003 (Pr(>|t|) = 2.850207e-04) and the 95 percent confidence interval does not include 0, so we reject the null hypothesis of no difference and conclude that mean MPG differs significantly between the two groups at alpha = 0.05.
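
A regression on a single 0/1 predictor is equivalent to a two-sample t-test that assumes equal variances, so the two approaches give the same p-value. The sketch below illustrates that equivalence; it is an aside rather than part of the original analysis, and the names `tt` and `p_lm` are illustrative.

```r
data(mtcars)
fit <- lm(mpg ~ am, data = mtcars)

# Pooled-variance two-sample t-test of mpg by transmission
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# p-value of the am coefficient from the regression
p_lm <- summary(fit)$coefficients["am", "Pr(>|t|)"]

c(t_test = tt$p.value, regression = p_lm)
```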

Multivariate Linear Regression

In order to adjust for potential confounding variables, such as weight (wt) and horsepower (hp), let’s run a multivariate regression to get a better estimate of the effect of transmission type on mpg.


We still treat mpg as the outcome and am (transmission) as the predictor of interest. Model mfit1 adds weight (wt) as a covariate, and mfit2 adds horsepower (hp) on top of that. We then use the anova function to compare the three nested models (fit, mfit1, mfit2) and test whether each added variable significantly improves the fit.

mfit1 <- lm(mpg ~ am + wt, data = mtcars)
mfit2 <- lm(mpg ~ am + wt + hp, data = mtcars)
anova(fit, mfit1, mfit2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     29 278.32  1    442.58 68.734 5.071e-09 ***
## 3     28 180.29  1     98.03 15.224 0.0005464 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values for mfit1 and mfit2 are both small (5.071e-09 and 0.0005464, respectively), so we reject the null hypothesis that the added terms contribute nothing: adding wt significantly improves on the simple linear regression model, and adding hp significantly improves on mfit1.
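
To see where the F statistics in the table come from, they can be recomputed by hand. Each one is the drop in residual sum of squares from adding a term, divided by the residual mean square of the largest model (mfit2). The variable names in this sketch (`rss`, `ms_resid`, `f_wt`, `f_hp`, `p_wt`, `p_hp`) are illustrative.

```r
data(mtcars)
fit   <- lm(mpg ~ am, data = mtcars)
mfit1 <- lm(mpg ~ am + wt, data = mtcars)
mfit2 <- lm(mpg ~ am + wt + hp, data = mtcars)

# Residual sums of squares of the three nested models
rss <- c(deviance(fit), deviance(mfit1), deviance(mfit2))

# Denominator: residual mean square of the largest model
ms_resid <- deviance(mfit2) / df.residual(mfit2)

# Each added term uses 1 degree of freedom
f_wt <- (rss[1] - rss[2]) / ms_resid
f_hp <- (rss[2] - rss[3]) / ms_resid

# Upper-tail probabilities of the F distribution
p_wt <- pf(f_wt, 1, df.residual(mfit2), lower.tail = FALSE)
p_hp <- pf(f_hp, 1, df.residual(mfit2), lower.tail = FALSE)

c(F_wt = f_wt, F_hp = f_hp, p_wt = p_wt, p_hp = p_hp)
```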


Now let’s check the residuals vs. fitted values plots for anything out of the ordinary, in particular signs of heteroskedasticity.

# residuals vs. fitted values for mfit1
plot(mfit1, which = 1)

# residuals vs. fitted values for mfit2
plot(mfit2, which = 1)



The residuals vs. fitted plots for mfit1 and mfit2 show no clear signs of heteroskedasticity. Normality of the residuals is better judged from the Normal Q-Q plots in the Appendix.
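
Visual checks can be complemented with numerical ones. The sketch below is an addition, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals of mfit2 and a hand-rolled studentized Breusch-Pagan statistic, computed as n times the R-squared from regressing the squared residuals on the predictors. All names (`res`, `sw_res`, `aux_data`, `aux_fit`, `bp_stat`, `bp_p`) are illustrative.

```r
data(mtcars)
mfit2 <- lm(mpg ~ am + wt + hp, data = mtcars)
res <- residuals(mfit2)

# Normality of residuals: a small p-value would be evidence against it
sw_res <- shapiro.test(res)

# Studentized Breusch-Pagan statistic: regress squared residuals on the
# predictors and compare n * R^2 to a chi-squared with 3 degrees of freedom
aux_data <- transform(mtcars, res_sq = res^2)
aux_fit  <- lm(res_sq ~ am + wt + hp, data = aux_data)
bp_stat  <- nrow(aux_data) * summary(aux_fit)$r.squared
bp_p     <- pchisq(bp_stat, df = 3, lower.tail = FALSE)

c(shapiro_p = sw_res$p.value, bp_statistic = bp_stat, bp_p = bp_p)
```

Large p-values in both tests would be consistent with the visual impression, though with 32 observations neither test has much power.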


Finally, here are the estimates from the final model, mfit2:

summary(mfit2)
## 
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## am           2.083710   1.376420   1.514 0.141268    
## wt          -2.878575   0.904971  -3.181 0.003574 ** 
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

The model mfit2 explains 83.99 percent of the variance in mpg (adjusted R-squared: 82.27 percent). As originally suspected, weight (wt) and horsepower (hp) confounded the relationship between miles per gallon (mpg) and transmission (am): the am estimate drops from 7.245 to 2.084 once they are included. Holding weight and horsepower constant, cars with manual transmission are estimated to get about 2.08 MPG more than cars with automatic transmission, but this difference is not statistically significant at alpha = 0.05 (p = 0.141).
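
The p-value for am in mfit2 is 0.141, which is above 0.05, so the corresponding 95 percent confidence interval must contain zero. A short sketch using confint makes that explicit (the name `ci_mfit2` is illustrative):

```r
data(mtcars)
mfit2 <- lm(mpg ~ am + wt + hp, data = mtcars)

# 95 percent confidence intervals for all coefficients of mfit2
ci_mfit2 <- confint(mfit2, level = 0.95)

# Interval for the adjusted transmission effect
ci_mfit2["am", ]
```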

Conclusion

According to the simple linear regression, manual transmission is better for MPG, by about 7.245 MPG on average. According to the multivariate linear regression, the difference shrinks to about 2.08 MPG once weight and horsepower are held constant, and it is no longer statistically significant (p = 0.141). To account for uncertainty, we compared the nested models with ANOVA (each added variable had a p-value below 0.05) and used various plots to look for outliers and signs of heteroskedasticity (we didn’t find any).

Appendix

Diagnostics

Here are the full diagnostics. The Residuals vs Fitted plots for both models were used in the Multivariate Linear Regression subsection of the Analysis section.

# diagnostics for model mfit1
par(mfrow = c(2, 2))
plot(mfit1)

# diagnostics for model mfit2
par(mfrow = c(2, 2))
plot(mfit2)
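
The diagnostic plots can be supplemented with Cook's distance to flag influential observations. The sketch below is an addition under stated assumptions: it also saves and restores the graphics settings changed by par, and the names `old_par` and `cooks` are illustrative.

```r
data(mtcars)
mfit2 <- lm(mpg ~ am + wt + hp, data = mtcars)

# Save the previous layout so it can be restored after plotting
old_par <- par(mfrow = c(2, 2))
plot(mfit2)
par(old_par)

# Cook's distance per car; larger values indicate more influence on the fit
cooks <- cooks.distance(mfit2)
head(sort(cooks, decreasing = TRUE), 3)
```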