Executive Summary

In this report, we analyze the mtcars data set (extracted from the 1974 Motor Trend US magazine, and comprising fuel consumption and 10 aspects of automobile design and performance for 32 automobiles) in order to ascertain and quantify the MPG difference between automatic and manual transmissions.

Our multivariable linear regression model, which adjusts for weight and quarter-mile time, leads us to conclude that a manual transmission is better for MPG: with 95% confidence, we estimate a 0.05 to 5.83 increase in MPG for manual cars versus automatic.

Exploratory Data Analysis

The mtcars data set is a data frame with 32 observations on 11 numeric variables (see Appendix Figure 1 for a description of the variables).

library( datasets ) ; data( mtcars ) ; str( mtcars )
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The following boxplot shows the difference between automatic and manual transmissions - manual cars appear to have better MPG.

par( mar=c(4,4,1,2) , mgp=c(2,1,0) )
boxplot( mpg ~ factor(am, labels=c("Automatic","Manual")), data=mtcars, xlab="Miles Per Gallon (MPG)", ylab="", col=c("red","blue"), horizontal=TRUE )

The variables of interest (mpg and am) are also markedly correlated with other variables in the data set (see Appendix Figure 2 for the scatterplot matrix from the mtcars data set). This suggests using covariate adjustment and investigating the effect of transmission type on mpg after adjusting for those variables.

round( cor( mtcars ) , 3 )[c("mpg","am"), ]   #getting a correlation matrix for `mpg` and `am`
##     mpg    cyl   disp     hp  drat     wt   qsec    vs  am  gear   carb
## mpg 1.0 -0.852 -0.848 -0.776 0.681 -0.868  0.419 0.664 0.6 0.480 -0.551
## am  0.6 -0.523 -0.591 -0.243 0.713 -0.692 -0.230 0.168 1.0 0.794  0.058
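
One way to see why adjustment is warranted is to put the group means of a covariate next to the group means of the outcome. The following is a minimal sketch and not part of the original analysis; the object name group_means is ours.

```r
# Illustrative sketch: mean mpg and mean weight by transmission type.
data(mtcars)

# am: 0 = automatic, 1 = manual (see Appendix Figure 1)
group_means <- aggregate(cbind(mpg, wt) ~ am, data = mtcars, FUN = mean)
group_means$am <- factor(group_means$am, labels = c("automatic", "manual"))
print(group_means)
```

Because am and wt are negatively correlated (-0.692 above), the manual group is lighter on average, so part of the raw MPG gap may reflect weight rather than transmission type.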

Model Selection

For simplicity and brevity, the following is out of scope for this project (it is arguably a subject for separate research):

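Our approach to fitting and selecting a model is to transform the data, fit three candidate models (a single-predictor model, a model with all variables as predictors, and a stepwise-selected model), compare them with anova, and check the selected model with diagnostic plots.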

Data Transformation

The continuous regressor variables (disp, hp, drat, wt and qsec) are centered at their means (for a more interpretable intercept).
The remaining regressor variables (cyl, vs, am, gear and carb) are converted into factors.

mtcars2 <- within( mtcars, {
  disp <- disp-mean(disp) ; hp <- hp-mean(hp) ; drat <- drat-mean(drat) ; wt <- wt-mean(wt) ;
  qsec <- qsec-mean(qsec) ; cyl <- factor(cyl) ; vs <- factor(vs,labels=c("V","S")) ;
  am <- factor(am,labels=c("automatic","manual")) ; gear <- factor(gear) ; carb <- factor(carb) } )
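
To illustrate what centering does, here is a small sketch with a single regressor. It is an illustration under our own naming (the column wt_c and the objects fit_raw and fit_ctr are hypothetical helpers, not part of the analysis): centering shifts the intercept to the prediction at the average weight and leaves the slope unchanged.

```r
# Illustrative sketch: centering a regressor changes the intercept, not the slope.
data(mtcars)
d <- transform(mtcars, wt_c = wt - mean(wt))  # wt_c: hypothetical centered copy of wt

fit_raw <- lm(mpg ~ wt,   data = d)  # intercept = predicted mpg at wt = 0
fit_ctr <- lm(mpg ~ wt_c, data = d)  # intercept = predicted mpg at the average weight

print(coef(fit_raw))
print(coef(fit_ctr))
print(mean(d$mpg))  # equals the centered intercept in this one-regressor case
```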

Model with a single predictor

Firstly, we fit and analyse a minimal linear regression model of mpg with the single predictor am (2-level factor).

m0 <- lm(mpg ~ am, mtcars2) ; summary(m0)$coef  # fitting the initial model and getting the coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## ammanual     7.244939   1.764422  4.106127 2.850207e-04
summary(m0)$r.squared   # getting R2
## [1] 0.3597989

The intercept, 17.1, is the mean mpg for automatic transmission cars, which form the reference level (level automatic of variable am). The coefficient ammanual (level manual of variable am), 7.2, is the offset from the reference level and is interpreted as the increase in mean mpg for manual transmission cars. The t-tests for both the intercept and the slope are statistically significant, since the p-values, 1.133983e-15 and 2.850207e-04, are less than 0.05 (a typical benchmark for the Type I error rate). Both confidence intervals lie entirely above zero (see Appendix Figure 4 for the models’ summary and confidence intervals). According to the \(R^2\) (R-squared) value, 0.36, this model leaves 64% of the total variability unexplained, so adding more predictors (ideally, only the necessary ones) is worth exploring.
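
The interpretation of the two coefficients can be checked directly against the group means. The sketch below is illustrative; it rebuilds a small data frame d of its own rather than reusing mtcars2, and also compares the slope's p-value with an equal-variance two-sample t-test, which is the same test in this one-factor setting.

```r
# Illustrative sketch: with one two-level factor, the coefficients are group means.
data(mtcars)
d <- data.frame(mpg = mtcars$mpg,
                am  = factor(mtcars$am, labels = c("automatic", "manual")))

group_means <- tapply(d$mpg, d$am, mean)
fit <- lm(mpg ~ am, data = d)

# Intercept vs. automatic mean; slope vs. difference in means
print(coef(fit))
print(c(automatic  = group_means[["automatic"]],
        difference = group_means[["manual"]] - group_means[["automatic"]]))

# Pooled-variance t-test: same p-value as the slope's t-test
tt <- t.test(mpg ~ am, data = d, var.equal = TRUE)
print(tt$p.value)
```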

Model with all variables as predictors

Our second model includes all variables as predictors.

m_all <- lm(mpg ~ . , mtcars2)  # fitting the full model
summary(m_all)$coef[c("(Intercept)","ammanual"),]  # getting the coefficients of interest
##              Estimate Std. Error   t value   Pr(>|t|)
## (Intercept) 17.984247   5.324119 3.3778825 0.00414153
## ammanual     1.212116   3.213545 0.3771896 0.71131573
c( R2=summary(m_all)$r.squared, adj.R2=summary(m_all)$adj.r.squared )  # getting R2 and adjusted R2
##        R2    adj.R2 
## 0.8930749 0.7790215

Apart from the intercept, none of the coefficients in this model is significant: their p-values are all greater than 0.05 (see Appendix Figure 4 for the models’ summary and confidence intervals). For the coefficient of interest, ammanual, the t-test p-value is 0.711, so we fail to reject the null hypothesis of a zero difference in mpg between automatic and manual transmission cars. The confidence interval (-5.637, 8.062) also contains zero, which is consistent with no difference. The standard error of ammanual is considerably larger than in the previous model (3.214 vs 1.764), which is the result of adding unnecessary and correlated variables. Given the large number of regressors in the model, the adjusted \(R^2\) value, 0.7790215, is a better indicator of model performance than \(R^2\) (0.8930749).
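
How much the correlated regressors inflate the standard error of the transmission coefficient can be sketched with a variance inflation factor (VIF). This is an illustration we add, not a step of the original analysis: it keeps am as a 0/1 column (which gives the same fit as the two-level factor), and the names d, r2_am, vif_am and se_check are ours.

```r
# Illustrative sketch: variance inflation factor for am in the full model.
data(mtcars)

# Categorical regressors as factors, mirroring mtcars2; am kept numeric (0/1).
d <- within(mtcars, {
  cyl <- factor(cyl); vs <- factor(vs); gear <- factor(gear); carb <- factor(carb)
})

# R^2 from regressing am on every other regressor (mpg excluded)
r2_am  <- summary(lm(am ~ . - mpg, data = d))$r.squared
vif_am <- 1 / (1 - r2_am)

full    <- lm(mpg ~ ., data = d)
se_full <- summary(full)$coefficients["am", "Std. Error"]

# Textbook identity: SE(am) = sigma / sqrt(SST_am) * sqrt(VIF_am)
se_check <- summary(full)$sigma / sqrt(sum((d$am - mean(d$am))^2)) * sqrt(vif_am)
print(c(vif_am = vif_am, se_full = se_full, se_check = se_check))
```

A VIF well above 1 means that most of the variation in am is already explained by the other regressors, which leaves little independent information for estimating its coefficient.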

Stepwise-selected model

This model is the result of a stepwise variable selection procedure. We use the step() function, which adds and drops terms to minimize AIC and returns the “best” model by that criterion. The procedure balances the risk of overfitting against the risk of underfitting: including unnecessary regressors inflates the standard errors of the other coefficients, while omitting necessary ones biases the coefficients of interest.

m2 <- step(m0, scope=list(lower=m0,upper=m_all), direction="both", trace=0)  # get a stepwise-selected model
summary(m2)$coef  # getting the coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 18.897941  0.7193542 26.270704 2.855851e-21
## ammanual     2.935837  1.4109045  2.080819 4.671551e-02
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
c( R2=summary(m2)$r.squared, adj.R2=summary(m2)$adj.r.squared )  # getting R2 and adjusted R2
##        R2    adj.R2 
## 0.8496636 0.8335561
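
By default step() ranks candidate models by AIC, as computed by extractAIC(). The sketch below is an illustration with our own object names (d, models, aic); it refits the three models of this report on a locally rebuilt data frame and compares that criterion directly.

```r
# Illustrative sketch: compare the AIC used by step() across the three models.
data(mtcars)
d <- within(mtcars, {
  cyl <- factor(cyl); vs <- factor(vs); am <- factor(am)
  gear <- factor(gear); carb <- factor(carb)
})

models <- list(
  am_only  = lm(mpg ~ am, data = d),
  stepwise = lm(mpg ~ am + wt + qsec, data = d),
  full     = lm(mpg ~ ., data = d)
)

# extractAIC() returns c(edf, AIC); the second element is what step() minimizes
aic <- sapply(models, function(m) extractAIC(m)[2])
print(round(aic, 2))
print(names(which.min(aic)))
```

Smaller is better: the criterion rewards a lower residual sum of squares and penalizes every extra parameter.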

The resulting model is mpg ~ am + wt + qsec. All the coefficients in this model are significant, with p-values less than 0.05 (see Appendix Figure 4 for the models’ summary and confidence intervals), so we reject the null hypothesis of a zero difference in mpg between automatic and manual transmission cars at fixed wt and qsec. The confidence interval for the coefficient ammanual lies above zero (0.046, 5.826), though only narrowly, suggesting that the manual transmission is better than the automatic (by 2.936 mpg on average). The \(R^2\) value is 0.85, i.e. this model explains 85% of the total variability.
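
The confidence interval quoted above follows from the usual formula, estimate ± t quantile × standard error. The sketch below reproduces it by hand. It is an illustration only: it uses the uncentered mtcars with am as a 0/1 column, which leaves the transmission coefficient and its standard error unchanged, and the names est, se, tq and manual_ci are ours.

```r
# Illustrative sketch: 95% confidence interval for the transmission effect by hand.
data(mtcars)
fit <- lm(mpg ~ am + wt + qsec, data = mtcars)

est <- coef(summary(fit))["am", "Estimate"]
se  <- coef(summary(fit))["am", "Std. Error"]
tq  <- qt(0.975, df = df.residual(fit))  # two-sided 95% quantile, 28 residual df

manual_ci <- est + c(-1, 1) * tq * se
print(rbind(by_hand = manual_ci, confint = confint(fit)["am", ]))
```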

Model Comparison

We use the anova function to compare our three models.

anova( m0, m2, m_all )  # F-tests comparing the three nested models
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + qsec
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 169.29  2    551.61 34.3604 2.509e-06 ***
## 3     15 120.40 13     48.88  0.4685    0.9114    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

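Our “winner” is Model 2 (mpg ~ am + wt + qsec), the stepwise-selected model we created last. Adding wt and qsec to Model 1 significantly reduces the residual sum of squares (p-value 2.509e-06), whereas the 13 additional terms of Model 3 do not improve the fit significantly (p-value 0.9114).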
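
The F statistics in the table can be reproduced from the residual sums of squares. When several lm fits are compared, anova() scales each drop in RSS by the residual mean square of the largest model (Model 3 here). The following sketch is illustrative, with our own object names (d, rss, rdf, sigma2_full, f_manual).

```r
# Illustrative sketch: recompute the F statistics of the nested-model comparison.
data(mtcars)
d <- within(mtcars, {
  cyl <- factor(cyl); vs <- factor(vs); am <- factor(am)
  gear <- factor(gear); carb <- factor(carb)
})
m0    <- lm(mpg ~ am, data = d)
m2    <- lm(mpg ~ am + wt + qsec, data = d)
m_all <- lm(mpg ~ ., data = d)

rss <- sapply(list(m0, m2, m_all), deviance)     # residual sums of squares
rdf <- sapply(list(m0, m2, m_all), df.residual)  # residual degrees of freedom

sigma2_full <- rss[3] / rdf[3]                   # residual mean square of Model 3
f_manual <- (-diff(rss) / -diff(rdf)) / sigma2_full

print(f_manual)
print(anova(m0, m2, m_all)$F[2:3])
```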

Diagnostic Plots

We use the following residual plots for diagnostics (to assess quality of our final linear-regression model).

par(mfrow=c(1,2), mar=c(3,4,2,2), mgp=c(2,1,0)) ; plot(m2,which=c(1:2))  # plot diagnostics for the model

The left panel, the Residuals (\(e_i\)) versus Fitted values (\(\hat Y_i\)) plot, shows points scattered roughly symmetrically around the horizontal line at 0, with no visible relationship to the fitted values. There are no systematic patterns, suggesting homoscedasticity (the assumption of equal variance).

The right panel, the Normal Q-Q plot, shows points that lie roughly on a straight line (the reference line). This is consistent with normally distributed residuals, i.e. the assumption of normality (of the errors) appears to be fulfilled.
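
The visual checks can be complemented numerically. The sketch below is an optional addition, not part of the original analysis. It confirms that the residuals are uncorrelated with the fitted values, which holds by construction for least squares with an intercept (so the plot is really a check for non-linear patterns and changing spread), and it runs a Shapiro-Wilk test as one possible formal check of normality. The model is refitted on the uncentered mtcars, which yields the same residuals.

```r
# Illustrative sketch: numeric complements to the two diagnostic plots.
data(mtcars)
fit <- lm(mpg ~ am + wt + qsec, data = mtcars)
res <- resid(fit)

# Zero up to rounding error, by construction of least squares
r_fit_res <- cor(fitted(fit), res)
print(r_fit_res)

# Shapiro-Wilk test; a large p-value gives no evidence against normality
sw <- shapiro.test(res)
print(sw)
```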

Conclusion

Model 2 (mpg ~ am + wt + qsec) best describes the effect of transmission type on fuel consumption: holding weight and quarter-mile time constant, there is a 2.9-mpg average difference in favor of manual transmission, and with 95% confidence, we estimate a 0.05 to 5.83 increase in MPG for manual cars versus automatic.


Appendix

Figure 1. Description of the mtcars data set

[, 1]   mpg     Miles/(US) gallon
[, 2]   cyl     Number of cylinders
[, 3]   disp    Displacement (cu.in.)
[, 4]   hp      Gross horsepower
[, 5]   drat    Rear axle ratio
[, 6]   wt      Weight (1000 lbs)
[, 7]   qsec    1/4 mile time
[, 8]   vs      Engine (0 = V-shaped, 1 = straight)
[, 9]   am      Transmission (0 = automatic, 1 = manual)
[,10]   gear    Number of forward gears
[,11]   carb    Number of carburetors

Figure 2. Scatterplot matrix from the mtcars data set

pairs( mtcars , pch=18 , col=c("red","blue")[mtcars$am + 1] )  #colors: red - automatic cars, blue - manual cars

Figure 3. Outlier detection with boxplots

par( mfcol=c(1,4) , mar=c(3,4,1,2) , mgp=c(1,1,0) )
boxplot(mtcars[,4] , xlab="hp"  )  ; boxplot(mtcars[,6]  , xlab="wt"  )
boxplot(mtcars[,7] , xlab="qsec")  ; boxplot(mtcars[,11] , xlab="carb")

Figure 4. Models’ summary and confidence intervals

model mpg ~ am

summary(m0)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
confint(m0)
##                2.5 %   97.5 %
## (Intercept) 14.85062 19.44411
## ammanual     3.64151 10.84837

model mpg ~ .

summary(m_all)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 17.98425    5.32412   3.378  0.00414 **
## cyl6        -2.64870    3.04089  -0.871  0.39747   
## cyl8        -0.33616    7.15954  -0.047  0.96317   
## disp         0.03555    0.03190   1.114  0.28267   
## hp          -0.07051    0.03943  -1.788  0.09393 . 
## drat         1.18283    2.48348   0.476  0.64074   
## wt          -4.52978    2.53875  -1.784  0.09462 . 
## qsec         0.36784    0.93540   0.393  0.69967   
## vsS          1.93085    2.87126   0.672  0.51151   
## ammanual     1.21212    3.21355   0.377  0.71132   
## gear4        1.11435    3.79952   0.293  0.77332   
## gear5        2.52840    3.73636   0.677  0.50890   
## carb2       -0.97935    2.31797  -0.423  0.67865   
## carb3        2.99964    4.29355   0.699  0.49547   
## carb4        1.09142    4.44962   0.245  0.80956   
## carb6        4.47757    6.38406   0.701  0.49381   
## carb8        7.25041    8.36057   0.867  0.39948   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124
confint(m_all, parm=c("(Intercept)","ammanual") )
##                 2.5 %    97.5 %
## (Intercept)  6.636157 29.332338
## ammanual    -5.637394  8.061625

model mpg ~ am + wt + qsec

summary(m2)
## 
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.8979     0.7194  26.271  < 2e-16 ***
## ammanual      2.9358     1.4109   2.081 0.046716 *  
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
confint(m2)
##                   2.5 %    97.5 %
## (Intercept) 17.42441087 20.371471
## ammanual     0.04573031  5.825944
## wt          -5.37333423 -2.459673
## qsec         0.63457320  1.817199