Executive Summary

This report analyzes the mtcars data set to determine whether automatic or manual transmission gives better fuel economy in miles per gallon (mpg), and to quantify the difference between the two transmission types. Besides mpg and transmission type, mtcars includes other measured variables such as gross horsepower, weight, and number of cylinders. After appropriate selection of regressors, a linear model is fit and the adjusted effect of transmission type on mpg is estimated. The analysis finds that manual transmission is associated with better mpg than automatic transmission: the unadjusted difference is 7.245 mpg, and after adjusting for number of cylinders, displacement, horsepower, and weight the estimated increase is 1.806 mpg, although this adjusted estimate is not statistically significant (p = 0.215).

Exploratory Data Analysis

The mtcars data set is loaded in R, and the categorical variables (\(cyl\), \(vs\), \(am\), \(gear\), and \(carb\)) are converted to factors. The variables of immediate interest for the question posed are \(mpg\), which is miles per gallon, and \(am\), which represents the transmission (0 for automatic, 1 for manual).

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
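
The conversion step itself is not shown in the report. The structure printed above is consistent with something like the following sketch, in which the helper name `factor_cols` is an assumption introduced only for illustration.

```r
# Assumption: the five categorical columns were converted with factor();
# the report does not show its own conversion code.
data(mtcars)  # load a fresh copy of the built-in data set
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcars[factor_cols] <- lapply(mtcars[factor_cols], factor)
str(mtcars)
```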

The question being addressed is which transmission has better mpg, automatic or manual. Therefore, a first pass at the analysis is to create a linear model with \(mpg\) as the outcome and simply \(am\) as the sole regressor.

fit0 <- lm(mpg ~ am, data = mtcars)
summary(fit0)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am1          7.244939   1.764422  4.106127 2.850207e-04
  1. The p-value for the \(am\) coefficient is about 0.000285, well below 0.05, so the coefficient is highly significant. The null hypothesis that the coefficient is zero is rejected, which indicates a difference in mean mpg between the two transmission types.

  2. The intercept of 17.147 is the mean mpg for automatic transmission (\(am\) = 0). The \(am\) coefficient of 7.245 is the increase in mean mpg for manual transmission (\(am\) = 1), which gives a mean of 24.392 mpg for manual cars.

  3. However, the residual plot for this fit (first figure in the appendix) shows an obvious pattern, indicating that the simple linear regression on \(am\) alone is not a very good model for mpg. Other regressors likely need to be included, and the estimated effect of transmission type on mpg should be adjusted for them.
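
The interpretation in point 2 can be checked directly: with a single two-level factor as regressor, the fitted coefficients reproduce the group means. The following is a minimal sketch of that check; the names `group_means` and `model_means` are illustrative and not part of the original analysis.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am)  # 0 = automatic, 1 = manual

fit0 <- lm(mpg ~ am, data = mtcars)

# Mean mpg within each transmission group
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept = automatic mean; intercept + am1 coefficient = manual mean
model_means <- c(coef(fit0)[["(Intercept)"]], sum(coef(fit0)))

print(rbind(group_means, model_means))
```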

Model Selection

A pairs plot of all the variables in the data set is the second figure in the appendix. The variables that appear to have the highest correlation with mpg are \(cyl\), \(disp\), \(hp\), and \(wt\). Nested models are fit in which these regressors are added one by one to the model containing \(am\). The nested models are compared with an ANOVA F-test, the output of which is shown here.

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + disp
## Model 4: mpg ~ am + cyl + disp + hp
## Model 5: mpg ~ am + cyl + disp + hp + wt
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 264.50  2    456.40 37.9300 2.678e-08 ***
## 3     27 230.46  1     34.04  5.6572  0.025339 *  
## 4     26 183.04  1     47.42  7.8820  0.009541 ** 
## 5     25 150.41  1     32.63  5.4236  0.028246 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
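
The code that produced this table is not shown. One way it could be generated is sketched below, assuming the factor conversion described earlier. The names `fit0` and `fit4` appear in the report, while `fit1` to `fit3`, `factor_cols`, and `nested` are assumed names for the intermediate objects.

```r
data(mtcars)
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcars[factor_cols] <- lapply(mtcars[factor_cols], factor)

# Build the nested sequence by adding one regressor at a time
fit0 <- lm(mpg ~ am, data = mtcars)
fit1 <- update(fit0, . ~ . + cyl)
fit2 <- update(fit1, . ~ . + disp)
fit3 <- update(fit2, . ~ . + hp)
fit4 <- update(fit3, . ~ . + wt)

# Sequential F-tests: each row compares a model with the one before it
nested <- anova(fit0, fit1, fit2, fit3, fit4)
print(nested)
```

Because the F-tests are sequential, each p-value depends on the order in which the regressors are added.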

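All of the p-values are < 0.05, including the one for the last model with all the regressors of interest added. Each added regressor therefore significantly reduces the residual sum of squares given the regressors already in the model, which suggests that all of them should be included and that the full model could be a good fit.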

Diagnostic plots for the final fit are shown as the third figure in the appendix. There is no noticeable pattern in the residual plot, and the quantiles of the residuals lie close to those of a normal distribution. Both observations indicate a good fit.

The summary of the final chosen fit is shown below.

summary(fit4)$coef
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 33.864276061 2.69541569 12.5636562 2.668321e-12
## am1          1.806099494 1.42107933  1.2709350 2.154510e-01
## cyl6        -3.136066556 1.46909031 -2.1346996 4.277253e-02
## cyl8        -2.717781289 2.89814941 -0.9377644 3.573375e-01
## disp         0.004087893 0.01276729  0.3201848 7.514890e-01
## hp          -0.032480178 0.01398322 -2.3227963 2.862128e-02
## wt          -2.738694608 1.17597755 -2.3288664 2.824553e-02

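The intercept of 33.864 is the predicted mpg for a car with automatic transmission and a 4-cylinder engine when displacement, horsepower, and weight are all 0, so it has no practical meaning on its own. The coefficient on \(am\) indicates an estimated increase of 1.806 mpg when the transmission is manual and all other variables are held constant. So with the new model, even after \(am\) is adjusted by the other regressors, the point estimate still favors the manual transmission. The adjusted effect is much smaller than the unadjusted 7.245 mpg, however, and its p-value of 0.215 means it is not statistically distinguishable from zero at the 0.05 level.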
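
To quantify the uncertainty in the adjusted transmission effect, a confidence interval can be reported alongside the point estimate. The following is a minimal sketch, assuming the final model is the five-regressor fit shown above; `am_est` and `am_ci` are illustrative names.

```r
data(mtcars)
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcars[factor_cols] <- lapply(mtcars[factor_cols], factor)

fit4 <- lm(mpg ~ am + cyl + disp + hp + wt, data = mtcars)

# Adjusted manual-vs-automatic difference and its 95% confidence interval
am_est <- coef(fit4)[["am1"]]
am_ci <- confint(fit4, "am1", level = 0.95)

print(am_est)
print(am_ci)
```

An interval that includes zero is consistent with the p-value of 0.215 reported for \(am\) in the coefficient table.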

Appendix

Residual plot for the simple linear model mpg ~ am.
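
The plotting code is not included in the report. A residual plot of this kind could be drawn along the following lines; the axis labels and title are assumptions.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am)
fit0 <- lm(mpg ~ am, data = mtcars)

# Residuals against fitted values; with one binary regressor the fitted
# values take only two distinct levels, one per transmission type
plot(fitted(fit0), resid(fit0),
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs fitted: mpg ~ am")
abline(h = 0, lty = 2)  # reference line at zero residual
```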

Pairs plot for the mtcars dataset.
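
A pairs plot, together with a numerical ranking of the correlations with mpg, could be obtained as sketched below. This sketch uses the raw numeric data, so it treats \(cyl\) as a number rather than a factor, which is an assumption made only so that a correlation can be computed. The names `mpg_cor` and `ranked` are illustrative.

```r
data(mtcars)  # raw data, all columns numeric

# Scatterplot matrix with a smoother in each panel
pairs(mtcars, panel = panel.smooth, main = "mtcars pairs plot")

# Correlation of every other variable with mpg, ranked by absolute value
mpg_cor <- cor(mtcars)["mpg", -1]
ranked <- sort(abs(mpg_cor), decreasing = TRUE)
print(round(ranked, 3))
```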

Diagnostic plots for the final model fit selected.
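
Standard diagnostic plots for the final model can be produced with the `plot` method for `lm` objects. The 2 x 2 layout below is an assumption about how the original figure was arranged.

```r
data(mtcars)
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcars[factor_cols] <- lapply(mtcars[factor_cols], factor)

fit4 <- lm(mpg ~ am + cyl + disp + hp + wt, data = mtcars)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit4)
par(op)  # restore the previous graphics settings
```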