Executive Summary
Configuration and Required Packages
Data Processing
Exploratory Analysis
Model Selection
Model Examination, Statistical Inference, and Residuals
Conclusions
Appendix

Executive Summary

We study mtcars, a data set describing a collection of cars, and explore the relationship between a set of variables and miles per gallon (MPG). In particular, we want to answer the following two questions:

Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions

We find that cars with manual transmission are better than automatic cars in terms of fuel efficiency (MPG) [QUESTION 1]. In particular, our best model estimates that a car with manual transmission gets, on average, 2.936 more miles per gallon than a car with automatic transmission, holding weight and 1/4 mile time constant [QUESTION 2].

Configuration and Required Packages

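The global options for the document are set with echo = TRUE so that reviewers can see the code, cache = TRUE so that chunk results are stored and lazy-loaded from the cache instead of being recomputed when a chunk has not changed, warning = FALSE and message = FALSE so that warnings and messages are kept out of the output, and preset figure sizes.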

# set the global options
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, warning = FALSE, message = FALSE, fig.width=12, fig.height=8)

The analysis requires the R packages knitr, ggplot2, and car.

# load the required packages
library(knitr)
library(ggplot2)
library(car)

Data Processing

This analysis uses the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

The Appendix includes a brief description of the variables in the data set.

Each line of mtcars represents one model of car, which we can see in the row names. Each column is then one attribute of that car model.

To make sure that R interprets am as a categorical variable, we convert it to a factor. We also rename the levels 0 and 1 to Automatic and Manual, respectively.

mtcars$am <- factor(mtcars$am, 
                    levels = c(0,1), 
                    labels = c("Automatic", "Manual"))
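A quick sanity check of the conversion can be sketched as follows. The copy named cars and the object am_counts are illustrative names that do not appear in the original analysis; the copy is used so that the built-in dataset is left untouched.

```r
# Work on a copy (hypothetical name `cars`) so the built-in dataset is unchanged
cars <- datasets::mtcars
cars$am <- factor(cars$am,
                  levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# The factor should have exactly two levels, with Automatic listed first
levels(cars$am)

# Number of cars per transmission type
am_counts <- table(cars$am)
am_counts
```

Because Automatic is the first level, it acts as the reference category in lm(), which is why the fitted models report a coefficient named amManual.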

Exploratory Analysis

The first six records of the dataset are shown below:

##print first 6 rows of mtcars
kable(head(mtcars),align = 'c')

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	Manual	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	Manual	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	Manual	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	Automatic	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	Automatic	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	Automatic	3	1

To get an initial idea of the existing patterns between variables in the data set we computed the scatterplot matrix which can be found in Appendix (Fig.1).

As the boxplot of MPG by transmission type (Fig. 2 in the Appendix) shows, manual cars tend to have higher MPG values. However, the two distributions overlap to some extent, and the difference might also be caused by confounders. Further analysis is therefore needed to determine whether cars with manual transmission are better than automatic cars in terms of MPG.

To test the hypothesis that cars with an automatic transmission use more fuel than cars with a manual transmission, we use a two-sample t-test that does not assume equal variances (Welch's t-test).

test <- t.test(mpg ~ am, data= mtcars, var.equal = FALSE, paired=FALSE ,conf.level = .95)
result <- data.frame( "t-statistic"  = test$statistic, 
                       "df" = test$parameter,
                        "p-value"  = test$p.value,
                        "lower CL" = test$conf.int[1],
                        "upper CL" = test$conf.int[2],
                        "automatic mean" = test$estimate[1],
                        "manual mean" = test$estimate[2],
                        row.names = "")
kable(x = round(result,3),align = 'c')

	t.statistic	df	p.value	lower.CL	upper.CL	automatic.mean	manual.mean
	-3.767	18.332	0.001	-11.28	-3.21	17.147	24.392

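As the p-value is 0.001, we can reject the null hypothesis that cars with manual transmission have the same mean MPG as cars with automatic transmission. The 95% confidence interval for the difference in means (automatic minus manual) runs from -11.28 to -3.21 and does not include 0.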
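To make the test statistic less of a black box, the sketch below recomputes it by hand from the group means and variances, using the Welch-Satterthwaite approximation for the degrees of freedom. All object names are illustrative, and the block uses the raw 0/1 coding of am so that it runs on its own.

```r
cars   <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]   # automatic transmission
manual <- cars$mpg[cars$am == 1]   # manual transmission

# Squared standard error of each group mean
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t-statistic: difference in means over its standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 3)
```

These values should agree with the t-statistic and degrees of freedom reported by t.test() in the table above.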

Model Selection

We first consider two naive models, the model including all predictors (fit.full), and the model with the variable am as the only predictor (fit.am).

fit.full <- lm(mpg ~ ., data = mtcars)
##extract p-values from the full model
round(summary(fit.full)$coef[, 4][-1], 2)

##      cyl     disp       hp     drat       wt     qsec       vs amManual 
##     0.92     0.46     0.33     0.64     0.06     0.27     0.88     0.23 
##     gear     carb 
##     0.67     0.81

##extract adjusted r squared from the full model
round(summary(fit.full)$adj.r.squared, 2)

## [1] 0.81

Despite the model having a large adjusted \(R^2\) (0.81), the p-values above show that none of the coefficients in the full model is significant at the 5% level. Multicollinearity among the predictors and overfitting (10 predictors for only 32 observations) inflate the estimated standard errors of the coefficients.
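One way to see the multicollinearity is to look at variance inflation factors for the full set of predictors. The sketch below is not part of the original analysis: it computes them in base R as the diagonal of the inverse predictor correlation matrix, which is equivalent to \(1/(1 - R_j^2)\) from regressing each predictor on all the others. It keeps am in its raw 0/1 coding, and the names predictors and vif_full are illustrative.

```r
cars <- datasets::mtcars

# All columns except the response
predictors <- cars[, setdiff(names(cars), "mpg")]

# VIFs are the diagonal elements of the inverse correlation matrix
vif_full <- diag(solve(cor(predictors)))

round(sort(vif_full, decreasing = TRUE), 2)
```

A common rule of thumb treats values well above 5 or 10 as a sign that a coefficient's standard error is substantially inflated by correlation with the other predictors.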

fit.am <- lm(mpg ~ am, data = mtcars)
##extract estimated coefficient from the am only model
round(summary(fit.am)$coef[2, ], 3)

##   Estimate Std. Error    t value   Pr(>|t|) 
##      7.245      1.764      4.106      0.000

##extract adjusted r squared from the am only model
round(summary(fit.am)$adj.r.squared, 2)

## [1] 0.34

In the am-only model, the coefficient for am is significantly different from zero. Under this naive model, a car with manual transmission gets, on average, 7.245 more MPG than a car with automatic transmission. However, the small adjusted \(R^2\) (0.34) implies that the transmission variable alone explains only a small share of the variability of MPG.
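With a single two-level factor as predictor, the regression coefficient is simply the difference between the two group means, which is why 7.245 matches the gap between the means reported by the t-test. A minimal sketch (object names are illustrative):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Mean MPG per transmission type
group_means <- tapply(cars$mpg, cars$am, mean)

# Regression of MPG on transmission only
fit_am_only <- lm(mpg ~ am, data = cars)

mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])
slope     <- unname(coef(fit_am_only)["amManual"])

round(c(difference_in_means = mean_diff, am_coefficient = slope), 3)
```

The intercept of this model is the mean MPG of the automatic cars, the reference level.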

In addition to am, other variables can then be introduced to better explain the total variability of MPG. To find the best possible fit, we use stepwise selection with two criteria, AIC and BIC. AIC is defined as \(-2 \log{L(M)}+2k\), while BIC is \(-2 \log{L(M)}+k\log{(n)}\), where \(\log{L(M)}\) is the maximized log-likelihood of model \(M\), \(k\) is the number of parameters in \(M\), and \(n\) is the number of observations. Lower values indicate a better trade-off between fit and complexity.
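The two definitions can be checked numerically. The sketch below applies them to the am-only model and compares the result with R's AIC() and BIC(); the names ll, k, n, aic_manual and bic_manual are illustrative.

```r
cars <- datasets::mtcars
fit  <- lm(mpg ~ am, data = cars)

ll <- logLik(fit)       # maximized log-likelihood
k  <- attr(ll, "df")    # regression coefficients plus the error variance
n  <- nobs(fit)         # number of observations

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + k * log(n)

c(manual = aic_manual, builtin = AIC(fit))
c(manual = bic_manual, builtin = BIC(fit))
```

Note that step() relies on extractAIC(), which for linear models drops an additive constant from these values. The constant is the same for every candidate model fitted to the same data, so the ranking of models is not affected.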

# best model with AIC
aic.best <- step(lm(mpg ~ ., data = mtcars), direction = "both", 
                scope = formula(fit.full), k = 2, trace = 0) 
# best model with BIC
bic.best <- step(lm(mpg ~ ., data = mtcars), direction = "both", 
                scope = formula(fit.full), k = log(nrow(mtcars)), trace = 0)
#extract predictor names from best model according to AIC
row.names(summary(aic.best)$coef)

## [1] "(Intercept)" "wt"          "qsec"        "amManual"

#extract predictor names from best model according to BIC
row.names(summary(bic.best)$coef)

## [1] "(Intercept)" "wt"          "qsec"        "amManual"

Both the AIC and BIC criteria indicate that the best model to explain MPG includes wt and qsec in addition to am. The adjusted \(R^2\) for this model is 0.83. Under this model, a car with manual transmission gets, on average, 2.936 more miles per gallon than a car with automatic transmission, holding weight and 1/4 mile time constant:

#get r2 from best model
round(summary(aic.best)$adj.r.squared, 2)

## [1] 0.83

#get model coefficients
round(summary(aic.best)$coef, 3)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    9.618      6.960   1.382    0.178
## wt            -3.917      0.711  -5.507    0.000
## qsec           1.226      0.289   4.247    0.000
## amManual       2.936      1.411   2.081    0.047
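To illustrate what "holding weight and 1/4 mile time constant" means, the sketch below predicts MPG for two hypothetical cars that differ only in transmission. The values wt = 3 and qsec = 18 are arbitrary choices made for illustration and are not taken from the original analysis; the object names are illustrative too.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Two hypothetical cars: same weight and 1/4 mile time, different transmission
new_cars <- data.frame(
  wt   = c(3, 3),
  qsec = c(18, 18),
  am   = factor(c("Automatic", "Manual"), levels = levels(cars$am))
)

pred <- predict(fit, newdata = new_cars)
names(pred) <- as.character(new_cars$am)

# The gap between the two predictions equals the amManual coefficient
pred_gap <- unname(pred["Manual"] - pred["Automatic"])
round(c(pred, gap = pred_gap), 3)
```

Whatever values are chosen for wt and qsec, the gap stays the same, because the model contains no interaction terms.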

Model Examination, Statistical Inference, and Residuals

Now we want to test whether the inclusion of the am variable in our best model is significant. To do that we perform an ANOVA to compare our model against the model including only wt and qsec.

lm.reduced <- lm(mpg ~ wt + qsec, data = mtcars)
anova(lm.reduced, aic.best)

## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + qsec
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     29 195.46                              
## 2     28 169.29  1    26.178 4.3298 0.04672 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, including am reduces the RSS from 195.46 to 169.29, and the improvement is significant at the 5% level (p-value = 0.047).
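The F statistic in the table can be reproduced from the two residual sums of squares. The sketch below does so by hand; the names reduced, full, f_stat and t_am are illustrative. Since the two models differ by a single coefficient, the F statistic is also the square of the t value of amManual in the larger model.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

reduced <- lm(mpg ~ wt + qsec, data = cars)
full    <- lm(mpg ~ wt + qsec + am, data = cars)

rss_reduced <- deviance(reduced)   # residual sum of squares, smaller model
rss_full    <- deviance(full)      # residual sum of squares, larger model
df_num      <- df.residual(reduced) - df.residual(full)

# F = (drop in RSS per added parameter) / (residual variance of larger model)
f_stat  <- ((rss_reduced - rss_full) / df_num) / (rss_full / df.residual(full))
p_value <- pf(f_stat, df_num, df.residual(full), lower.tail = FALSE)

t_am <- summary(full)$coefficients["amManual", "t value"]
round(c(F = f_stat, p = p_value, t_squared = t_am^2), 4)
```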

The 95% confidence interval for the am coefficient does not include 0 either:

confint(aic.best, "amManual", level = 0.95)

##               2.5 %   97.5 %
## amManual 0.04573031 5.825944
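The interval is the estimate plus or minus a t quantile times the standard error. A minimal sketch of that calculation, with illustrative object names:

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt + qsec + am, data = cars)

coefs <- summary(fit)$coefficients
est   <- coefs["amManual", "Estimate"]
se    <- coefs["amManual", "Std. Error"]

# Critical value of the t distribution with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

ci <- est + c(-1, 1) * t_crit * se
round(ci, 3)
```

The lower bound is close to 0, which is consistent with a p-value only slightly below 0.05.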

The table below also shows no clear sign of collinearity in our model, as the variance inflation factors are reasonably low:

kable(x = round(vif(aic.best), 3), align = 'c')

wt	2.483
qsec	1.364
am	2.541

From the analysis of the residual plots (Fig. 3 in Appendix), we can verify the following assumptions:

The Residuals vs. Fitted plot shows no consistent pattern, which supports the linearity assumption and gives no sign of dependence between the residuals and the fitted values.
The Normal Q-Q plot indicates that the residuals are approximately normally distributed, because the points lie close to the line.
The Scale-Location plot supports the constant variance assumption, as the points are randomly scattered.
The Residuals vs. Leverage plot shows no particularly concerning outliers, as all points fall well within the 0.5 Cook's distance contours.
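The visual checks can be complemented with a few numerical summaries. The sketch below is an addition to the original analysis and uses only base R: a Shapiro-Wilk test on the residuals, Cook's distances, and hat values (leverages). The object names are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(residuals(fit))
sw$p.value

# Largest Cook's distances: values near 0.5 or above deserve a closer look
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)

# Largest leverages: the hat values sum to the number of coefficients
hv <- hatvalues(fit)
head(sort(hv, decreasing = TRUE), 3)
```

With only 32 observations these tests have limited power, so they are best read together with the plots rather than instead of them.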

Conclusions

In this project we found that cars with manual transmission are better than automatic cars in terms of fuel efficiency (MPG) [QUESTION 1]. In particular, our best model estimates that a car with manual transmission gets, on average, 2.936 more miles per gallon than a car with automatic transmission, holding weight and 1/4 mile time constant [QUESTION 2].

Appendix

Environment

OS: Windows 10
R Version: 3.2.2 (2015-08-14)
Tool: RStudio Version 0.99.902
R packages used: knitr, ggplot2, car
Data: mtcars, base R dataset
Publishing tool: RPubs

Dataset Description

[, 1]    mpg     Miles/(US) gallon
[, 2]    cyl     Number of cylinders
[, 3]    disp    Displacement (cu.in.)
[, 4]    hp      Gross horsepower
[, 5]    drat    Rear axle ratio
[, 6]    wt      Weight (1000 lbs)
[, 7]    qsec    1/4 mile time
[, 8]    vs      Engine shape (0 = V-shaped, 1 = straight)
[, 9]    am      Transmission (0 = automatic, 1 = manual)
[,10]    gear    Number of forward gears
[,11]    carb    Number of carburetors

Figures

Scatterplot Matrix

## put histograms on the diagonal
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}

## put (absolute) correlations on the upper panels,
## with size proportional to the correlations.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(mtcars, upper.panel=panel.cor,diag.panel=panel.hist, lower.panel = panel.smooth)

Fig. 1 - Scatterplot Matrix

Boxplot of MPG by am

library(ggplot2)
p <- ggplot(mtcars, aes(am, mpg))
p + geom_boxplot(aes(fill = am))

Fig. 2 - Boxplot of MPG by am

Residuals Diagnostics

par(mfrow = c(2, 2))
plot(aic.best)

Fig. 3 - Analysis of Residuals

Motor Trend - Analysis of the MPG difference between automatic and manual transmissions

Filippo Mingione

October, 2016