MPG Analysis Motor Trend

Executive Summary

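Motor Trend collected data on 32 cars to understand the impact of several factors on miles per gallon (MPG). Specifically, this report addresses two points: “Is an automatic or manual transmission better for MPG?” and “Quantify the MPG difference between automatic and manual transmissions”.

The analysis shows that manual transmissions are associated with better MPG: on average, manual cars in the data achieve about 7.2 MPG more than automatic cars. The weight of the car and the horsepower of the engine are also influencing factors. Once they are included in the model, the estimated manual advantage drops to about 2.1 MPG and is no longer statistically significant at the 5% level (p = 0.14).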

Analysis

Data preparation

## make the factor variables after creating a copy of the data
library(datasets)
mt <- mtcars
mt$cyl <- factor(mt$cyl, labels = c("4cyl", "6cyl", "8cyl"))
mt$vs <- factor(mt$vs, labels = c("VeeEngine", "StraightEngine"))
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
mt$gear <- factor(mt$gear, labels = c("3gears", "4gears", "5gears"))

Please see A1 in the Appendix for the structure of the resulting data set.
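As a quick sanity check on the labels, the recoded factors can be cross-tabulated against the original 0/1 codes. The sketch below assumes the documented `mtcars` coding (`am`: 0 = automatic, 1 = manual; `vs`: 0 = V-shaped, 1 = straight). The names `chk`, `am_tab` and `vs_tab` are illustrative and not part of the analysis itself.

```r
library(datasets)

# Recreate the recoded copy so this check runs on its own
chk <- mtcars
chk$am <- factor(chk$am, labels = c("Automatic", "Manual"))
chk$vs <- factor(chk$vs, labels = c("VeeEngine", "StraightEngine"))

# Rows are the original codes, columns the new labels;
# every car should fall on the diagonal of each table
am_tab <- table(original = mtcars$am, recoded = chk$am)
vs_tab <- table(original = mtcars$vs, recoded = chk$vs)
print(am_tab)
print(vs_tab)
```

If any count landed off the diagonal, the `labels` vector would be in the wrong order for that variable.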

Is an automatic or manual transmission better for MPG?

aggregate(mpg ~ am, data = mt, mean)

##          am      mpg
## 1 Automatic 17.14737
## 2    Manual 24.39231

The mean MPG for automatic transmissions is lower by about 7.2 MPG. The box plot (please see A2 in Appendix) shows the same pattern. Let us run a t-test to see whether this difference is significant.

t.test(mpg~am, data = mt, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

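Since the p-value is 0.001374 (which is less than 0.05), we reject the null hypothesis that the two group means are equal. The 95% confidence interval for the difference (automatic minus manual) runs from -11.28 to -3.21 MPG, so manual transmission cars in this data set have significantly higher MPG. Note that this comparison does not account for other differences between the cars, such as weight.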
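The test result can also be turned into a direct estimate of the manual advantage. The following is a minimal sketch that rebuilds the `am` factor so it runs on its own; `tt`, `diff_manual_minus_auto` and `ci_manual_minus_auto` are names introduced here for illustration.

```r
library(datasets)

mt <- mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

tt <- t.test(mpg ~ am, data = mt, conf.level = 0.95)

# tt$estimate holds the two group means (Automatic first, then Manual)
diff_manual_minus_auto <- unname(diff(tt$estimate))

# t.test reports the interval for Automatic - Manual;
# flip the sign and the order to express it as Manual - Automatic
ci_manual_minus_auto <- -rev(as.numeric(tt$conf.int))

print(round(diff_manual_minus_auto, 2))
print(round(ci_manual_minus_auto, 2))
```

Expressed this way, the interval matches the one printed by `t.test` with the sign reversed, which reads more naturally as "MPG gained with a manual transmission".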

Quantify the MPG difference between automatic and manual transmissions

fit <- lm(mpg ~ am, data = mt)

The simple model (please see A3 in Appendix) has an R-squared of 0.3598, meaning transmission type alone explains 35.98% of the variance in MPG. Its amManual coefficient, 7.245, is the unadjusted difference in mean MPG between manual and automatic cars. Based on Subject Matter Experts’ (SME) knowledge, the weight of a car, the horsepower of the engine and the displacement of the engine are influencing factors. The number of cylinders can have an influence too; however, displacement and horsepower can reflect this. We know that introducing unnecessary variables inflates standard errors. We also know that omitting relevant variables causes bias. We therefore add the candidate variables one at a time and compare the nested models.
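One way to check how much the candidate predictors overlap is to look at their pairwise correlations. This is only a rough sketch: it uses the numeric columns of the original `mtcars` data (cylinders as a number rather than a factor), and `cand` and `cor_mat` are illustrative names.

```r
library(datasets)

# Candidate predictors discussed above, taken from the original numeric data
cand <- mtcars[, c("wt", "hp", "disp", "cyl")]

# Pairwise Pearson correlations among the candidates
cor_mat <- cor(cand)
print(round(cor_mat, 2))
```

Strong correlations between cylinders, displacement and horsepower would support leaving the cylinder count out of the candidate models.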

fit2 <- update(fit, mpg ~ am + wt)
fit3 <- update(fit, mpg ~ am + wt + hp)
fit4 <- update(fit, mpg ~ am + wt + hp + disp)
anova(fit, fit2, fit3, fit4)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
## Model 4: mpg ~ am + wt + hp + disp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 66.4206 9.394e-09 ***
## 3     28 180.29  1     98.03 14.7118 0.0006826 ***
## 4     27 179.91  1      0.38  0.0576 0.8122229    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By analyzing the above nested models and examining the Pr(>F) column, we can see that adding weight and then horsepower each significantly improved the model (p < 0.001 in both cases). Adding displacement did not (p = 0.81). Our theory is that horsepower is already influenced by displacement, so displacement adds little further information and is not needed.
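For reference, when `anova()` is given a sequence of nested `lm` fits, each row's F statistic divides the drop in residual sum of squares by the residual mean square of the largest model in the sequence (here Model 4). Using the rounded numbers printed in the table:

```latex
F_k = \frac{(\mathrm{RSS}_{k-1} - \mathrm{RSS}_k) / (\mathrm{df}_{k-1} - \mathrm{df}_k)}
           {\mathrm{RSS}_4 / \mathrm{df}_4},
\qquad
\frac{\mathrm{RSS}_4}{\mathrm{df}_4} = \frac{179.91}{27} \approx 6.663

F_2 = \frac{442.58 / 1}{6.663} \approx 66.42, \qquad
F_3 = \frac{98.03 / 1}{6.663} \approx 14.71, \qquad
F_4 = \frac{0.38 / 1}{6.663} \approx 0.057
```

These reproduce the F column up to rounding; the small discrepancy in the last row arises because the table prints the sum of squares (0.38) to only two decimals.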

## fit3 is the best model (am, wt and hp)
bestfit <- fit3

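The fit3 model (please see A4 in Appendix) has an R-squared of 0.8399, meaning 83.99% of the variance is explained (adjusted R-squared 0.8227). This is a clear improvement over the simple model. In this model, holding weight and horsepower constant, a manual transmission is associated with about 2.08 more MPG than an automatic one, but this coefficient is not significant at the 5% level (p = 0.141). Each additional 1000 lbs of weight is associated with about 2.88 fewer MPG, and each additional horsepower with about 0.037 fewer MPG.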
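The R-squared values quoted for the two models can be recomputed from their residual sums of squares as 1 - RSS/TSS. The sketch below refits both models so it runs on its own; `simple_fit`, `adj_fit`, `tss` and `r2` are names introduced here for illustration.

```r
library(datasets)

mt <- mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

simple_fit <- lm(mpg ~ am, data = mt)
adj_fit    <- lm(mpg ~ am + wt + hp, data = mt)

# Total sum of squares of MPG around its mean
tss <- sum((mt$mpg - mean(mt$mpg))^2)

# R-squared as the share of total variation not left in the residuals
r2 <- function(model) 1 - sum(resid(model)^2) / tss

print(round(c(simple = r2(simple_fit), with_wt_hp = r2(adj_fit)), 4))
```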

## compare fit3 (am, wt and hp) to simple linear model (am)
anova(fit, bestfit)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + hp
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     28 180.29  2    540.61 41.979 3.745e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing the simple linear model to the fit3 model gives a p-value of 3.745e-09. We therefore reject the null hypothesis that weight and horsepower add nothing beyond transmission type; the model that includes transmission, weight and horsepower fits much better. It is possible to refine this model further and see if there are other influencing factors.
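To quantify the transmission effect in the model with weight and horsepower, the amManual coefficient can be read off together with a confidence interval. This is a minimal sketch that refits the model so it runs on its own; `adj_fit`, `am_row` and `am_ci` are illustrative names.

```r
library(datasets)

mt <- mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

adj_fit <- lm(mpg ~ am + wt + hp, data = mt)

# Estimate, standard error, t value and p-value for manual vs automatic
am_row <- coef(summary(adj_fit))["amManual", ]

# 95% confidence interval for the same coefficient
am_ci <- confint(adj_fit, parm = "amManual", level = 0.95)

print(round(am_row, 4))
print(round(am_ci, 2))
```

The interval spans zero, consistent with the p-value of 0.141 for amManual in A4: after adjusting for weight and horsepower, the data do not pin down the direction of the transmission effect at the 5% level.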

The residuals plot (please see A5 in Appendix) does not show any significant abnormality that requires more in-depth examination.
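The visual check can be complemented with a few numeric diagnostics. The sketch below is one possible approach, not part of the original analysis: it refits the model, tests the residuals for normality, and lists the most influential cars. The leverage cut-off of 2p/n is a common rule of thumb used here as an assumption, and `diag_fit`, `sw`, `lev`, `high_lev` and `cd` are illustrative names.

```r
library(datasets)

mt <- mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

diag_fit <- lm(mpg ~ am + wt + hp, data = mt)

# Shapiro-Wilk test of normality on the residuals
sw <- shapiro.test(resid(diag_fit))
print(sw)

# Leverage (hat values); flag cars above the 2 * p / n rule of thumb
lev <- hatvalues(diag_fit)
p <- length(coef(diag_fit))
n <- nrow(mt)
high_lev <- names(lev)[lev > 2 * p / n]
print(high_lev)

# Cook's distance; show the three most influential observations
cd <- cooks.distance(diag_fit)
print(head(sort(cd, decreasing = TRUE), 3))
```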

Appendix

A1: Structure of our data set

str(mt)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4cyl","6cyl",..: 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "VeeEngine","StraightEngine": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3gears","4gears",..: 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

A2: Plot of MPG for Automatic and Manual Transmission Cars

## box plot
library(ggplot2)
g2 <- ggplot(mt, aes(am,mpg))
g2 <- g2 + geom_boxplot(fill = "light grey", colour = "blue",
                        outlier.colour = "red", outlier.shape = 1) +
         labs(x = "Transmission Type") + 
         labs(y = "Miles Per Gallon (MPG)") +
         labs(title = "MPG for Automatic and Manual Transmission Cars")
print(g2)

A3: Summary of simple linear model

summary(fit)

## 
## Call:
## lm(formula = mpg ~ am, data = mt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

A4: Summary of best fit model

summary(bestfit)

## 
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## amManual     2.083710   1.376420   1.514 0.141268    
## wt          -2.878575   0.904971  -3.181 0.003574 ** 
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

A5: Plot of Diagnostics and Residuals for Best-fit model

par(mfrow = c(2,2))
plot(bestfit)