Executive Summary

We are looking at a data set of a collection of cars, mtcars, and are interested in exploring the relationship between a set of variables and miles per gallon. Particularly we want to answer the following two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions

We will verify that cars with manual transmission are better than automatic cars in terms of fuel efficiency (MPG) [QUESTION 1]. In particular, our best model determines that a car with manual transmission has, on average, 2.936 more miles per gallon than a car with automatic transmission, holding weight and 1/4 mile time constant [QUESTION 2].

Configuration and Required Packages

The global options for the document are set with echo = TRUE so that reviewers can see the code, cache = TRUE so that knitr reuses the cached results of unchanged chunks instead of re-running them, and preset figure sizes. Warnings and messages are suppressed.

# set the global options
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, warning = FALSE, message = FALSE, fig.width=12, fig.height=8)

The analysis requires the R packages knitr, ggplot2, and car.

# load the required packages
library(knitr)
library(ggplot2)
library(car)

Data Processing

This analysis uses the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

The Appendix includes a brief description of the variables in the data set.

Each line of mtcars represents one model of car, which we can see in the row names. Each column is then one attribute of that car model.

To make sure that R interprets am as a categorical variable, we convert it to a factor. We also rename the levels 0 and 1 to Automatic and Manual, respectively.

mtcars$am <- factor(mtcars$am, 
                    levels = c(0,1), 
                    labels = c("Automatic", "Manual"))
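
A quick way to confirm the conversion is to inspect the factor levels and the group counts. The sketch below works on a local copy of the data, here called cars (a name introduced only for this illustration), so it does not depend on the state of the session.

```r
# Work on a fresh copy of the base R dataset (cars is an illustrative name)
cars <- datasets::mtcars
cars$am <- factor(cars$am,
                  levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# The first level is used as the reference level in later regressions
levels(cars$am)

# Number of cars per transmission type
am_counts <- table(cars$am)
am_counts
```

Because Automatic is the first level, a model coefficient labelled amManual measures the manual minus automatic difference.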

Exploratory Analysis

The first six records of the dataset are shown below:

##print first 6 rows of mtcars
kable(head(mtcars),align = 'c')
                   mpg   cyl  disp  hp   drat  wt     qsec   vs  am         gear  carb
Mazda RX4          21.0  6    160   110  3.90  2.620  16.46  0   Manual     4     4
Mazda RX4 Wag      21.0  6    160   110  3.90  2.875  17.02  0   Manual     4     4
Datsun 710         22.8  4    108   93   3.85  2.320  18.61  1   Manual     4     1
Hornet 4 Drive     21.4  6    258   110  3.08  3.215  19.44  1   Automatic  3     1
Hornet Sportabout  18.7  8    360   175  3.15  3.440  17.02  0   Automatic  3     2
Valiant            18.1  6    225   105  2.76  3.460  20.22  1   Automatic  3     1

To get an initial idea of the existing patterns between variables in the data set we computed the scatterplot matrix which can be found in Appendix (Fig.1).

As the boxplot of MPG by am shows (Fig. 2 in Appendix), manual cars seem to be associated with higher values of MPG. However, the two distributions overlap, and the difference might also be caused by confounders. Further analysis is therefore needed to determine whether cars with manual transmission are better than automatic cars in terms of MPG.

To test the hypothesis that cars with an automatic transmission use more fuel than cars with manual transmission, we use a two-sample t-test that does not assume equal variances (Welch's t-test).

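# Welch two-sample t-test (unequal variances); the two groups are independent,
# which is the default, so no paired argument is passed to the formula method
test <- t.test(mpg ~ am, data = mtcars, var.equal = FALSE, conf.level = .95)
# collect the main results in a one-row table
result <- data.frame("t-statistic" = test$statistic,
                     "df" = test$parameter,
                     "p-value" = test$p.value,
                     "lower CL" = test$conf.int[1],
                     "upper CL" = test$conf.int[2],
                     "automatic mean" = test$estimate[1],
                     "manual mean" = test$estimate[2],
                     row.names = "")
kable(x = round(result, 3), align = 'c')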
t.statistic  df      p.value  lower.CL  upper.CL  automatic.mean  manual.mean
-3.767       18.332  0.001    -11.28    -3.21     17.147          24.392

As the p-value is 0.001, we can reject the null hypothesis that cars with manual transmission have the same mean MPG as cars with automatic transmission.
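
To make the test less of a black box, the statistic, the degrees of freedom, and the p-value can be recomputed by hand. The following is a minimal sketch that assumes a fresh numeric copy of the data (cars, an illustrative name) and uses the Welch-Satterthwaite approximation for the degrees of freedom. All variable names are introduced only for this example.

```r
# Fresh copy of the data with am still coded as 0/1 (cars is an illustrative name)
cars <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]   # automatic transmission
manual <- cars$mpg[cars$am == 1]   # manual transmission

# Squared standard error contributed by each group (variances are not pooled)
va <- var(auto) / length(auto)
vm <- var(manual) / length(manual)

# Welch t statistic: automatic minus manual, matching the group order of t.test
t_hand <- (mean(auto) - mean(manual)) / sqrt(va + vm)

# Welch-Satterthwaite approximation to the degrees of freedom
df_hand <- (va + vm)^2 /
  (va^2 / (length(auto) - 1) + vm^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_hand <- 2 * pt(-abs(t_hand), df = df_hand)

round(c(t = t_hand, df = df_hand, p = p_hand), 3)
```

The negative sign of the statistic only reflects the order of subtraction (automatic minus manual).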

Model Selection

We first consider two naive models, the model including all predictors (fit.full), and the model with the variable am as the only predictor (fit.am).

fit.full <- lm(mpg ~ ., data = mtcars)
##extract p-values from the full model
round(summary(fit.full)$coef[, 4][-1], 2)
##      cyl     disp       hp     drat       wt     qsec       vs amManual 
##     0.92     0.46     0.33     0.64     0.06     0.27     0.88     0.23 
##     gear     carb 
##     0.67     0.81
##extract adjusted r squared from the full model
round(summary(fit.full)$adj.r.squared, 2)
## [1] 0.81

Although the model has a large adjusted R2 (0.81), the p-values above show that none of the coefficients in the full model is significant at the 5% level. Multicollinearity and overfitting inflate the estimated standard errors of the coefficients.
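
The multicollinearity among the predictors can be made visible with variance inflation factors. The sketch below is one way to obtain them in base R, using the identity that, for a linear model with an intercept, the VIF of each predictor equals the corresponding diagonal element of the inverse correlation matrix of the predictors. It assumes a fresh numeric copy of the data (cars, an illustrative name) in which am is still coded as 0/1.

```r
# Fresh numeric copy of the data (am kept as 0/1 so that cor() can be applied)
cars <- datasets::mtcars
predictors <- cars[, names(cars) != "mpg"]

# VIF of each predictor: diagonal of the inverse correlation matrix
vif_full <- diag(solve(cor(predictors)))
names(vif_full) <- names(predictors)

# Largest inflation factors first
round(sort(vif_full, decreasing = TRUE), 2)
```

A common rule of thumb treats values above 5 or 10 as a sign of strong collinearity. Calling vif(fit.full) from the car package should report the same quantities.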

fit.am <- lm(mpg ~ am, data = mtcars)
##extract estimated coefficient from the am only model
round(summary(fit.am)$coef[2, ], 3)
##   Estimate Std. Error    t value   Pr(>|t|) 
##      7.245      1.764      4.106      0.000
##extract adjusted r squared from the am only model
round(summary(fit.am)$adj.r.squared, 2)
## [1] 0.34

In the am-only model, the coefficient for am is significantly different from zero. Under this naive model, a car with manual transmission has, on average, 7.245 more MPG than a car with automatic transmission. However, the small adjusted R2 (0.34) implies that the transmission variable alone explains only a small share of the variability of MPG.
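
With a single two-level factor as predictor, the regression coefficient is simply the difference between the two group means, which is why 7.245 matches the gap between the means reported by the t-test (24.392 - 17.147). A minimal sketch, using a local copy of the data and illustrative variable names:

```r
# Local copy with am converted to a factor (cars is an illustrative name)
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Mean MPG per transmission type and their difference
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# Coefficient of am in the am-only model
fit_am <- lm(mpg ~ am, data = cars)
slope <- unname(coef(fit_am)["amManual"])

round(c(mean_diff = mean_diff, slope = slope), 3)
```

The intercept of this model is, in the same way, the mean MPG of the automatic cars.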

In addition to am, other variables can then be introduced to better explain the total variability of MPG. To find the best possible fit, we use stepwise selection with two criteria, AIC and BIC. AIC is defined as \(-2 \log{L(M)}+2k\), while BIC is \(-2 \log{L(M)}+k\log{(n)}\), where \(L(M)\) is the maximum of the likelihood function of model \(M\), \(k\) is the number of parameters in \(M\), and \(n\) is the number of observations.

# best model with AIC
aic.best <- step(lm(mpg ~ ., data = mtcars), direction = "both", 
                scope = formula(fit.full), k = 2, trace = 0) 
# best model with BIC
bic.best <- step(lm(mpg ~ ., data=mtcars), direction = "both", 
                scope = formula(fit.full), k = log(32), trace = 0)
#extract predictor names from best model according to AIC
row.names(summary(aic.best)$coef) 
## [1] "(Intercept)" "wt"          "qsec"        "amManual"
#extract predictor names from best model according to BIC
row.names(summary(bic.best)$coef)
## [1] "(Intercept)" "wt"          "qsec"        "amManual"

Both the AIC and BIC criteria indicate that the best model to explain MPG includes wt and qsec in addition to am. The adjusted R2 for this model is 0.83. Under this model, a car with manual transmission has, on average, 2.936 more miles per gallon than a car with automatic transmission, holding weight and 1/4 mile time constant:

#get r2 from best model
round(summary(aic.best)$adj.r.squared, 2)
## [1] 0.83
#get model coefficients
round(summary(aic.best)$coef, 3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    9.618      6.960   1.382    0.178
## wt            -3.917      0.711  -5.507    0.000
## qsec           1.226      0.289   4.247    0.000
## amManual       2.936      1.411   2.081    0.047
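
The AIC and BIC definitions used for the stepwise selection can be checked by hand on the selected model. The sketch below refits the model on a local copy of the data (cars and fit are illustrative names) and compares the manual computation with the built-in AIC() and BIC() functions.

```r
# Refit the selected model on a local copy (cars and fit are illustrative names)
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt + qsec + am, data = cars)

ll <- logLik(fit)
k  <- attr(ll, "df")   # four coefficients plus the residual variance
n  <- nobs(fit)        # number of observations

# -2 log L(M) + 2k  and  -2 log L(M) + k log(n)
aic_hand <- -2 * as.numeric(ll) + 2 * k
bic_hand <- -2 * as.numeric(ll) + k * log(n)

c(aic_hand = aic_hand, AIC = AIC(fit), bic_hand = bic_hand, BIC = BIC(fit))
```

Note that step() relies on extractAIC(), which for linear models differs from these values by an additive constant. The constant is the same for every candidate model fitted to the same data, so the ranking of the models is not affected.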

Model Examination, Statistical Inference, and Residuals

Now we want to test whether the inclusion of the am variable in our best model is significant. To do that we perform an ANOVA to compare our model against the model including only wt and qsec.

lm.reduced <- lm(mpg ~ wt + qsec, data = mtcars)
anova(lm.reduced, aic.best)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + qsec
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     29 195.46                              
## 2     28 169.29  1    26.178 4.3298 0.04672 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, including am is beneficial: it reduces the RSS from 195.46 to 169.29, and am is significant at the 5% level (p-value = 0.047).
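
The F statistic in the ANOVA table follows directly from the two residual sums of squares. As a sketch (with illustrative names and a local copy of the data), it can be reproduced as follows. Since only one coefficient is added, the F statistic also equals the square of the t value of amManual in the larger model.

```r
# Refit both models on a local copy (all names here are illustrative)
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
fit_reduced <- lm(mpg ~ wt + qsec, data = cars)
fit_best    <- lm(mpg ~ wt + qsec + am, data = cars)

# Residual sums of squares of the two nested models
rss_reduced <- sum(residuals(fit_reduced)^2)
rss_best    <- sum(residuals(fit_best)^2)

# Degrees of freedom: number of added terms, and residual df of the larger model
df_num <- df.residual(fit_reduced) - df.residual(fit_best)
df_den <- df.residual(fit_best)

# F = (drop in RSS per added term) / (residual variance of the larger model)
f_hand <- ((rss_reduced - rss_best) / df_num) / (rss_best / df_den)
p_hand <- pf(f_hand, df_num, df_den, lower.tail = FALSE)

# t value of amManual in the larger model
t_am <- summary(fit_best)$coefficients["amManual", "t value"]

round(c(F = f_hand, p = p_hand, t_squared = t_am^2), 4)
```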

The 95% confidence interval for the am coefficient does not include 0 either:

confint(aic.best, "amManual", level = 0.95)
##               2.5 %   97.5 %
## amManual 0.04573031 5.825944
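
The interval is the usual estimate plus or minus a t quantile times the standard error. A minimal sketch of that computation, refitting the model on a local copy of the data (names are illustrative):

```r
# Refit the selected model on a local copy (names are illustrative)
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
fit_best <- lm(mpg ~ wt + qsec + am, data = cars)

# Estimate and standard error of the amManual coefficient
coefs <- coef(summary(fit_best))
est <- coefs["amManual", "Estimate"]
se  <- coefs["amManual", "Std. Error"]

# Critical value of the t distribution with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit_best))

# 95% confidence interval: estimate -/+ critical value * standard error
ci_hand <- est + c(-1, 1) * t_crit * se
ci_hand
```

The lower bound is close to zero, which is consistent with a p-value only slightly below 0.05.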

The table below also shows no clear sign of collinearity in our model, as the variance inflation factors are reasonably low:

kable(x = round(vif(aic.best), 3), align = 'c')
wt 2.483
qsec 1.364
am 2.541

From the analysis of the residual plots (Fig. 3 in Appendix), we can verify the following assumptions:

  1. The Residuals vs. Fitted plot shows no consistent pattern, supporting the independence assumption.
  2. The Normal Q-Q plot indicates that the residuals are approximately normally distributed, because the points lie close to the line.
  3. The Scale-Location plot supports the constant variance assumption, as the points are randomly distributed.
  4. The Residuals vs. Leverage plot shows no particularly concerning outliers, as all values fall well within the 0.5 Cook's distance contours.
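
The visual checks above can be complemented with a few numbers. The sketch below is not part of the original analysis: it assumes a Shapiro-Wilk test as a numeric companion to the Q-Q plot and uses Cook's distances as a companion to the Residuals vs. Leverage plot. All names are illustrative.

```r
# Refit the selected model on a local copy (names are illustrative)
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
fit_best <- lm(mpg ~ wt + qsec + am, data = cars)

# Shapiro-Wilk test on the residuals: a small p-value would cast doubt on normality
res <- residuals(fit_best)
sw  <- shapiro.test(res)

# Cook's distance of each car: large values flag influential observations
cook <- cooks.distance(fit_best)

c(shapiro_p = sw$p.value, max_cook = max(cook))

# Car with the largest influence on the fit
names(which.max(cook))
```

With only 32 observations such tests have limited power, so they should be read together with the plots rather than instead of them.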

Conclusions

Through this project we have verified that cars with manual transmission are better than automatic cars in terms of fuel efficiency (MPG) [QUESTION 1]. In particular, our best model quantifies that a car with manual transmission has, on average, 2.936 more miles per gallon than a car with automatic transmission, holding weight and 1/4 mile time constant [QUESTION 2].

Appendix

Environment

  1. OS: Windows 10
  2. R Version: 3.2.2 (2015-08-14)
  3. Tool: RStudio Version 0.99.902
  4. R packages used: knitr, ggplot2, car
  5. Data: mtcars, base R dataset
  6. Publishing tool: RPubs

Dataset Description

[, 1]    mpg     Miles/(US) gallon
[, 2]    cyl     Number of cylinders
[, 3]    disp    Displacement (cu.in.)
[, 4]    hp      Gross horsepower
[, 5]    drat    Rear axle ratio
[, 6]    wt      Weight (lb/1000)
[, 7]    qsec    1/4 mile time
[, 8]    vs      V/S
[, 9]    am      Transmission (0 = automatic, 1 = manual)
[,10]    gear    Number of forward gears
[,11]    carb    Number of carburetors

Figures

Scatterplot Matrix

## put histograms on the diagonal
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}

## put (absolute) correlations on the upper panels,
## with size proportional to the correlations.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(mtcars, upper.panel=panel.cor,diag.panel=panel.hist, lower.panel = panel.smooth)

Fig. 1 - Scatterplot Matrix

Boxplot of MPG by am

library(ggplot2)
p <- ggplot(mtcars, aes(am, mpg))
p + geom_boxplot(aes(fill = am))

Fig. 2 - Boxplot of MPG by am

Residuals Diagnostics

par(mfrow = (c(2,2)))
plot(aic.best)

Fig. 3 - Analysis of Residuals