Executive Summary

This report uses linear regression to examine the relationship between transmission type and miles per gallon (mpg) in the mtcars dataset. It asks whether manual or automatic cars get better mileage and quantifies the difference.

Regression analysis using transmission type, weight, and 1/4 mile time as explanatory variables leads to the conclusion that manual cars get on average 2.9 more mpg than automatic cars (95% confidence interval: 0.05 to 5.83) when weight and 1/4 mile time are held constant.

Exploratory Data Analysis

Let’s take a look at the dataset.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The variable am is represented as a numeric, where 0 encodes automatic and 1 encodes manual. In order to use regression analysis to compare mpg in manual vs. automatic transmissions, we will need to convert am to a factor. We can also replace the 0’s and 1’s with labels to make the output more readable.

mtcars$am <- gsub("0", "auto", mtcars$am)
mtcars$am <- gsub("1", "manual", mtcars$am)
mtcars$am <- factor(mtcars$am)

Several other variables should also be treated as factors; namely, vs, gear, and carb.

mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
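The conversions above can also be written more compactly. The following is a minimal sketch, not the report's own code: it works on a copy named `cars` (a hypothetical name) so the built-in dataset is left untouched, labels am in a single factor() call, and converts the remaining columns in one pass.

```r
# Work on a copy; `cars` and `factor_cols` are illustrative names
cars <- datasets::mtcars

# Recode 0/1 to labelled factor levels in one step
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Convert the other categorical columns together
factor_cols <- c("vs", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)

# Sanity checks: group sizes and column classes
table(cars$am)
sapply(cars[c("am", factor_cols)], class)
```

Passing levels and labels explicitly fixes the level order, so "auto" stays the reference level in later models regardless of how the labels sort.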

A boxplot will give us a visual idea of whether or not mpg might differ between automatic and manual transmission cars.

library(ggplot2)
ggplot(mtcars, aes(x = am, y = mpg, fill = am)) +
  geom_boxplot()

Manual cars appear to get better mileage on average than do automatic cars. This observation can be confirmed with statistical inference.

Refer to the Appendix for a pairs plot of the dataset.

Statistical Inference

We can use the R function t.test to test whether mean mpg differs between manual and automatic cars. By default it runs a two-sided Welch test, which does not assume equal variances in the two groups.

t.test(mpg ~ am, mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
##   mean in group auto mean in group manual 
##             17.14737             24.39231

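The 95% confidence interval for the difference in means (automatic minus manual) runs from -11.28 to -3.21 and does not contain 0, and the p-value is 0.0014. We can therefore conclude that the difference in mean mpg between manual and automatic transmissions is significant at the 0.05 level.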
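To make the test less of a black box, the Welch statistic can be reproduced by hand. This is a sketch under the assumption that the untransformed `datasets::mtcars` is used (am coded 0/1); the variable names are illustrative.

```r
cars <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]
manual <- cars$mpg[cars$am == 1]

# Variance of each group's sample mean
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic (automatic minus manual, matching the t.test output)
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)
c(t = t_stat, df = df, p = p_value)
```

The three values should agree with the t, df, and p-value lines printed by t.test.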

Regression Model

Let’s start by regressing mpg on just am.

am_model <- lm(mpg ~ am, mtcars)
summary(am_model)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The R-squared value for this model is only 0.3598, which means that transmission type alone explains about 36% of the variance in mpg.
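The "share of variance explained" reading follows directly from the definition of R-squared as one minus the ratio of residual to total sum of squares. A minimal sketch, assuming a fresh copy of the data and illustrative names:

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ factor(am), data = cars)

# Residual sum of squares: variation left over after fitting the model
rss <- sum(resid(fit)^2)

# Total sum of squares: variation around the overall mean
tss <- sum((cars$mpg - mean(cars$mpg))^2)

r_squared <- 1 - rss / tss
r_squared
```

The result should match the "Multiple R-squared" line reported by summary.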

Building a model that regresses mpg on all other variables in the dataset will explain more of the variance.

full_model <- lm(mpg ~ ., mtcars)
summary(full_model)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6533 -1.3325 -0.5166  0.7643  4.7284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 25.31994   23.88164   1.060   0.3048  
## cyl         -1.02343    1.48131  -0.691   0.4995  
## disp         0.04377    0.03058   1.431   0.1716  
## hp          -0.04881    0.03189  -1.531   0.1454  
## drat         1.82084    2.38101   0.765   0.4556  
## wt          -4.63540    2.52737  -1.834   0.0853 .
## qsec         0.26967    0.92631   0.291   0.7747  
## vs1          1.04908    2.70495   0.388   0.7032  
## ammanual     0.96265    3.19138   0.302   0.7668  
## gear4        1.75360    3.72534   0.471   0.6442  
## gear5        1.87899    3.65935   0.513   0.6146  
## carb2       -0.93427    2.30934  -0.405   0.6912  
## carb3        3.42169    4.25513   0.804   0.4331  
## carb4       -0.99364    3.84683  -0.258   0.7995  
## carb6        1.94389    5.76983   0.337   0.7406  
## carb8        4.36998    7.75434   0.564   0.5809  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.823 on 16 degrees of freedom
## Multiple R-squared:  0.8867, Adjusted R-squared:  0.7806 
## F-statistic: 8.352 on 15 and 16 DF,  p-value: 6.044e-05

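As expected, the full model has a higher R-squared value (0.8867). However, the output of summary shows that none of the coefficients is significant at the 0.05 level, and the adjusted R-squared is only 0.7806, which suggests that the model contains more regressors than 32 observations can support.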

Excluding variables that are correlated with both transmission type and mpg will bias the estimated transmission effect. However, including unnecessary regressors will inflate the standard errors of the coefficient estimates. We will use the step function in R, which by default adds or drops terms according to AIC, to determine which variables to include in our final model.

step_model <- step(full_model, direction = "backward", trace = FALSE)
summary(step_model)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ammanual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The model produced by the step algorithm includes wt, qsec, and am. The effects of all three variables are significant at the 0.05 level, and the model explains about 85% of the variance.
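One way to check that the extra regressors earn their place is to compare the am-only model with the selected model directly. The sketch below refits both on a fresh copy of the data (the names `small` and `final` are illustrative) and uses a nested-model F-test plus AIC; it is a supplementary check, not part of the step procedure itself.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

small <- lm(mpg ~ am, data = cars)
final <- lm(mpg ~ wt + qsec + am, data = cars)

# F-test for nested models: do wt and qsec jointly improve the fit?
cmp <- anova(small, final)
cmp

# AIC comparison; lower is better
AIC(small, final)
```

A small p-value in the second row of the anova table indicates that adding wt and qsec reduces the residual sum of squares by more than chance alone would explain.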

From the coefficients of this model, we can conclude that, holding weight and qsec constant, manual cars get 2.9 more miles per gallon on average than automatic cars.

The 95% confidence interval for this difference is:

confint(step_model)['ammanual', ]
##      2.5 %     97.5 % 
## 0.04573031 5.82594408
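The interval is the coefficient estimate plus or minus a t quantile times its standard error. A sketch that reproduces it by hand, refitting the selected model on a fresh copy of the data (names are illustrative):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
fit <- lm(mpg ~ wt + qsec + am, data = cars)

coefs <- summary(fit)$coefficients
est <- coefs["ammanual", "Estimate"]
se  <- coefs["ammanual", "Std. Error"]

# Critical value from the t distribution with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

ci <- est + c(-1, 1) * t_crit * se
ci
```

The lower bound sits just above 0, which is consistent with the coefficient's p-value of about 0.047 being just under the 0.05 threshold.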

Diagnostic plots drawn with base graphics show no obvious pattern in the residuals against the fitted values. The quantile-quantile plot indicates that the distribution of the residuals is roughly normal.

par(mfrow = c(2,2))
plot(step_model)
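The visual checks can be complemented with numeric ones. The following sketch, which refits the selected model on a fresh copy of the data, runs a Shapiro-Wilk test on the residuals and lists the observations with the largest Cook's distance; both are optional additions rather than part of the original analysis.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
fit <- lm(mpg ~ wt + qsec + am, data = cars)

res <- resid(fit)

# Shapiro-Wilk test: a small p-value would be evidence against normal residuals
sw <- shapiro.test(res)
sw

# The three most influential observations by Cook's distance
head(sort(cooks.distance(fit), decreasing = TRUE), 3)
```

With only 32 observations the test has limited power, so it is best read alongside the quantile-quantile plot rather than in place of it.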

Appendix

library(ggplot2)  # provides aes()
library(GGally)   # provides ggpairs()
data(mtcars)
mtcars$am <- factor(mtcars$am)
ggpairs(mtcars, mapping = aes(colour = am))