Executive Summary

This report for Motor Trend magazine uses the mtcars dataset to explore the relationship between 10 aspects of car design and performance (the predictor variables) and fuel consumption measured in miles per US gallon, MPG (the outcome variable). The dataset contains data for 32 cars. In particular, the report addresses whether an automatic or manual transmission is better for fuel consumption and quantifies the difference between the two using regression models and exploratory data analyses. The conclusion is that manual transmission is better for fuel consumption (+1.48 MPG once cylinders, horsepower and weight are accounted for) based on the available data. However, the selected model still leaves 15% of the variation in MPG to be explained by other factors not included in the data analysis.

Note that the echo = FALSE parameter has been added to the code chunks to prevent printing of the R code, all of which can be found in the appendix.

Summary of the data

# load the dataset and display the structure
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Exploration of the mtcars dataset reveals a data frame with 32 observations on 11 (numeric) variables as follows.

Outcome (dependent) Variable
  • mpg Miles/(US) gallon
Predictor (independent) variables
  • cyl Number of cylinders
  • disp Displacement (cu.in.)
  • hp Gross horsepower
  • drat Rear axle ratio
  • wt Weight (1000 lbs)
  • qsec 1/4 mile time
  • vs Engine (0 = V-shaped, 1 = straight)
  • am Transmission (0 = automatic, 1 = manual)
  • gear Number of forward gears
  • carb Number of carburetors
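
Because am is stored as a 0/1 number, it can help to relabel it before counting how many cars fall into each transmission group. The following is a minimal sketch and not part of the original analysis; the copy `cars`, the object `am_counts` and the label text are illustrative names.

```r
data(mtcars)

# Work on a copy so the original numeric coding stays available for modelling
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Number of cars per transmission type
am_counts <- table(cars$am)
am_counts
```

The group sizes are worth keeping in mind when reading the regression results, since every estimate below rests on only 32 cars split across the two groups.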

Exploratory data analyses

A plot of the mpg and transmission data provides a visual indication of the relationship between the two. The red and blue horizontal lines represent the mean MPG for automatic and manual transmission respectively and indicate that manual transmission does return a higher mean MPG.

Figure 1: A plot of mpg and transmission data where the red and blue horizontal lines represent the mean mpg for automatic and manual transmission respectively. The green line is the linear regression line

Simple Bivariate Regression

A simple linear regression analysis will provide further information on the strength of the relationship and the relative impact of transmission on MPG.

##              Estimate Std. Error   t value                Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 0.000000000000001133983
## factor(am)1  7.244939   1.764422  4.106127 0.000285020743935067769
## [1] 0.3597989

The intercept (17.15) is the mean MPG where am is equal to 0 (automatic), and the factor(am)1 coefficient (7.24) is the difference between that mean and the mean where am is equal to 1 (manual, 24.39). With only one categorical explanatory variable, the fitted values are simply the means of each category, which is why the green line on the plot in Figure 1 joins the two group means.
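
The link between the regression coefficients and the group means can be checked directly. This is a small sketch under the same model as above; the object names `fit0`, `b` and `group_means` are illustrative and do not appear in the report's own code.

```r
data(mtcars)

# Simple regression of mpg on transmission type
fit0 <- lm(mpg ~ factor(am), data = mtcars)
b <- unname(coef(fit0))

# Mean mpg per transmission group (names are "0" and "1")
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept versus the automatic mean
c(intercept = b[1], automatic_mean = group_means[["0"]])

# Intercept plus the am coefficient versus the manual mean
c(intercept_plus_coef = b[1] + b[2], manual_mean = group_means[["1"]])
```

Each pair printed by the last two lines should agree, which is what ties the coefficient table to the horizontal mean lines in Figure 1.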

Interpretation: The coefficients indicate that manual transmission improves fuel consumption by 7.24 MPG over automatic transmission. The low p-value (0.000285, well below 0.05) indicates that the difference is statistically significant. However, the r.squared value of 0.3597989 indicates that only 36% of the variation in fuel consumption can be explained by transmission type alone. This means we have to examine other variables to find and fit a better explanatory model.
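
To make the r.squared figure less of a black box, it can be reproduced in two other ways: as the squared correlation between mpg and am, and as one minus the ratio of residual to total sum of squares. This is an illustrative sketch; `fit0`, `r2_model`, `r2_cor` and `r2_manual` are names introduced here.

```r
data(mtcars)

fit0 <- lm(mpg ~ factor(am), data = mtcars)

# R squared as reported by summary()
r2_model <- summary(fit0)$r.squared

# With a single predictor, R squared is the squared correlation
r2_cor <- cor(mtcars$mpg, mtcars$am)^2

# R squared from its definition: 1 - RSS / TSS
rss <- sum(residuals(fit0)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - rss / tss

c(model = r2_model, squared_cor = r2_cor, manual = r2_manual)
```

All three values should match the 0.3597989 shown in the output above.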

Diagnostic - Correlation Matrix

To improve model selection we can check which other variables are related to MPG, and to each other, by generating a correlation matrix that visualises the correlation coefficients between all the variables.


Figure 2: A matrix of correlation coefficients for all variables, where the size of the text reflects the level of correlation.

Interpretation: The variables with the strongest relationship to mpg (those with coefficients closest to 1 or -1) are cyl, disp, hp and wt. cyl and disp are also highly correlated with each other because displacement is the total volume of all the cylinders in an engine, so we will exclude disp from the model selection. Horsepower also has a relationship to cyl and disp, albeit weaker, because generally speaking the more cylinders an engine has, the more horsepower and torque it makes. We will consider whether or not hp adds explanatory value by fitting multiple models.
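
The coefficients read off the matrix plot can also be printed as numbers for the variables discussed here. This is a minimal sketch; the selection `vars` and the object `cor_sub` are illustrative names, and the subset is chosen only to mirror the interpretation above.

```r
data(mtcars)

# Outcome plus the four candidate predictors named in the interpretation
vars <- c("mpg", "cyl", "disp", "hp", "wt")

# Pearson correlations, rounded for reading
cor_sub <- round(cor(mtcars[, vars]), 2)
cor_sub
```

The first row shows how strongly each candidate is related to mpg, and the cyl/disp cell shows the overlap that motivates dropping disp.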

Model Selection and Multivariate Regression

We can now perform a multivariate regression on the model selection that includes am + cyl + hp + wt and display the coefficients to see if we have improved the predictability of our original bivariate model.

##                Estimate Std. Error   t value             Pr(>|t|)
## (Intercept) 36.14653575 3.10478079 11.642218 0.000000000004944804
## factor(am)1  1.47804771 1.44114927  1.025603 0.314179886317531576
## cyl         -0.74515702 0.58278741 -1.278609 0.211916611111083397
## hp          -0.02495106 0.01364614 -1.828433 0.078553373699869630
## wt          -2.60648071 0.91983749 -2.833632 0.008603218128270956
## [1] 0.8490314

Interpretation: The r squared value has increased substantially, suggesting we now have a model that explains 85% of the variation in fuel consumption. Note that in this model the am coefficient (1.48) has a p-value of 0.31, so the transmission effect is no longer statistically significant at the 0.05 level once cyl, hp and wt are included. This model can be compared to an alternative that excludes hp for the reasons outlined previously. The results are shown in Appendix B and show a slight reduction in r squared, to 83%. Sticking with the model that includes hp, I have conducted some diagnostic tests. The residual plots in Appendix C do not show any unexpected patterns, and the checks for leverage and influence in Appendix D do not show any data values of concern.
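
Because r squared can only go up when a variable is added, a comparison of the two models can be made a little more formal. The sketch below is an addition, not part of the original report: it runs a nested-model F test with anova() and compares adjusted r squared. fit1 and fit2 are defined as in Appendix A; `cmp` and `adj_r2` are illustrative names.

```r
data(mtcars)

# Selected model and the alternative without hp (as in Appendix A)
fit1 <- lm(mpg ~ factor(am) + cyl + hp + wt, data = mtcars)
fit2 <- lm(mpg ~ factor(am) + cyl + wt, data = mtcars)

# F test for the single added term hp (smaller model first)
cmp <- anova(fit2, fit1)
cmp

# Adjusted R squared penalises the extra parameter
adj_r2 <- c(with_hp = summary(fit1)$adj.r.squared,
            without_hp = summary(fit2)$adj.r.squared)
round(adj_r2, 4)
```

Since only one term is added, the F test p-value is the same as the p-value for hp in the coefficient table above (about 0.079), so the evidence for keeping hp is suggestive rather than conclusive.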

Summary

Regression analysis has shown that manual transmission improves fuel consumption by 1.48 MPG over automatic transmission when cylinders, horsepower and weight are included in the model as confounding variables. This combination explains 85% of the variation in MPG. This seems reasonable considering that other significant factors that influence fuel efficiency and consumption, for example driving behaviour and tyres, are not included in the available data.
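
One way to express the uncertainty around the 1.48 MPG estimate is a confidence interval for the am coefficient. This is a hedged sketch rather than part of the original analysis; `ci_am` is an illustrative name and the 95% level is an assumption.

```r
data(mtcars)

# Selected model from the Model Selection section
fit1 <- lm(mpg ~ factor(am) + cyl + hp + wt, data = mtcars)

# 95% confidence interval for the manual-versus-automatic difference
ci_am <- confint(fit1, parm = "factor(am)1", level = 0.95)
ci_am
```

Given the p-value of 0.31 for this coefficient in the Model Selection output, the interval includes zero, so the 1.48 MPG figure is best read as a point estimate with considerable uncertainty.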

Appendix A: All R code for this report

# knitr options
options(scipen = 999)  # suppress scientific notation in printed output
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, fig.cap = TRUE, fig.align = "center",
                      fig.path = "figures/")
knitr::opts_current$get('label')
# use captioner to add figure number and caption
library(captioner)
fig_nums <- captioner()
fig_nums("figa", "A plot of mpg and transmission data where the red and blue horizontal lines represent the mean mpg for automatic and manual transmission respectively. The green line is the linear regression line")
fig_nums("figb", "A matrix of correlation coefficients for all variables, where the size of the text reflects the level of correlation.")
fig_nums("figc", "Residual plot for the model lm(mpg ~ factor(am) + cyl + hp + wt, mtcars)")
# load the dataset and display the structure
data(mtcars)
str(mtcars)
# create the plot for Figure 1 "Fuel Consumption by Transmission"
manu <- subset(mtcars, am == 1)  # manual cars
auto <- subset(mtcars, am == 0)  # automatic cars
plot(mtcars$am, mtcars$mpg,
     main = "Fuel Consumption by Transmission",
     xlab = "Transmission (0 = automatic, 1 = manual)",
     ylab = "Miles/(US) gallon",
     col = ifelse(mtcars$am == 1, "blue", "red"))
legend("topleft",
       pch = c(1, 1),
       legend = c("auto", "manu"),
       col = c("red", "blue"))
abline(lm(mpg ~ am, data = mtcars), col = "green")
abline(h = mean(auto$mpg), col = "red")
abline(h = mean(manu$mpg), col = "blue")
# label each mean line with the value computed from the data
text(0.5, 18, paste0("mean=", round(mean(auto$mpg), 2)))
text(0.5, 25, paste0("mean=", round(mean(manu$mpg), 2)))
# compute the simple bivariate regression and display the coefficients and r squared results
summary(lm(mpg ~ factor(am), data = mtcars))$coefficients
summary(lm(mpg ~ factor(am), data = mtcars))$r.squared
# Generate the correlation Matrix
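panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
        # switch to a unit coordinate system for this panel and restore it on exit
        usr <- par("usr")
        on.exit(par(usr = usr))
        par(usr = c(0, 1, 0, 1))
        Cor <- cor(x, y)
        txt <- paste0(prefix, format(c(Cor, 0.123456789), digits = digits)[1])
        if(missing(cex.cor)) {
                cex.cor <- 0.4 / strwidth(txt)
        }
        # scale the text by the magnitude of the correlation so that
        # negative coefficients are sized like positive ones
        text(0.5, 0.5, txt,
             cex = cex.cor * abs(Cor))
}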
# Plot the data on the correlation matrix
pairs(mtcars,
      upper.panel = panel.cor,
      lower.panel = panel.smooth,
      gap=0)
# Perform a multivariate regression on the model selection that includes am + cyl + hp + wt and display the coefficients
fit1 <- lm(mpg ~ factor(am) + cyl + hp + wt, mtcars)
summary(fit1)$coef
summary(fit1)$r.squared
# Perform a multivariate regression on the alternative model selection that excludes hp (am + cyl + wt) and display the coefficients
fit2 <- lm(mpg ~ factor(am) + cyl + wt, mtcars)
summary(fit2)$coef
summary(fit2)$r.squared
# plot residuals to look for patterns in the data
par(mfrow = c(2, 2))
plot(fit1)
# calculate dfbetas to check model data for influence and hatvalues to check model data for leverage
# The hat values are necessarily between 0 and 1 with larger values indicating greater (potential for) leverage.
# The dfbetas check for influence in the coefficients individually; column 3, selected below, is the cyl coefficient
infl <- cbind(round(dfbetas(fit1)[ ,3], 3), round(hatvalues(fit1), 3))
colnames(infl) <- c("dfbetas", "hatvalues")
infl

Appendix B: Alternative model excluding hp

##               Estimate Std. Error    t value                Pr(>|t|)
## (Intercept) 39.4179334  2.6414573 14.9227979 0.000000000000007424998
## factor(am)1  0.1764932  1.3044515  0.1353007 0.893342147923960827605
## cyl         -1.5102457  0.4222792 -3.5764148 0.001291604589147556711
## wt          -3.1251422  0.9108827 -3.4308942 0.001885894386856281756
## [1] 0.8303383

Appendix C: Residual Plots for selected model


Figure 3: Residual plot for the model lm(mpg ~ factor(am) + cyl + hp + wt, mtcars)

Appendix D: Diagnostic checks for leverage and influence in the selected model

##                     dfbetas hatvalues
## Mazda RX4            -0.324     0.162
## Mazda RX4 Wag        -0.223     0.181
## Datsun 710            0.174     0.101
## Hornet 4 Drive        0.008     0.076
## Hornet Sportabout     0.179     0.125
## Valiant              -0.018     0.077
## Duster 360            0.016     0.197
## Merc 240D            -0.134     0.189
## Merc 230             -0.041     0.228
## Merc 280              0.009     0.065
## Merc 280C             0.036     0.065
## Merc 450SE            0.080     0.079
## Merc 450SL            0.098     0.089
## Merc 450SLC          -0.046     0.086
## Cadillac Fleetwood    0.028     0.238
## Lincoln Continental   0.017     0.285
## Chrysler Imperial    -0.383     0.256
## Fiat 128              0.013     0.111
## Honda Civic           0.063     0.137
## Toyota Corolla        0.112     0.108
## Toyota Corona         0.516     0.277
## Dodge Challenger     -0.252     0.151
## AMC Javelin          -0.340     0.160
## Camaro Z28            0.033     0.153
## Pontiac Firebird      0.273     0.087
## Fiat X1-9            -0.008     0.105
## Porsche 914-2         0.030     0.094
## Lotus Europa         -0.146     0.173
## Ford Pantera L       -0.041     0.229
## Ferrari Dino          0.003     0.094
## Maserati Bora        -0.171     0.466
## Volvo 142E            0.334     0.156
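
The hat values above can be screened against a cutoff rather than read by eye. The sketch below uses the common rule of thumb of twice the average leverage, 2p/n, where p is the number of coefficients and n the number of cars. This cutoff is an assumption introduced here, not a criterion used in the report, and `h`, `cutoff` and `high_leverage` are illustrative names.

```r
data(mtcars)

# Selected model from Appendix A
fit1 <- lm(mpg ~ factor(am) + cyl + hp + wt, data = mtcars)

h <- hatvalues(fit1)
p <- length(coef(fit1))  # number of estimated coefficients
n <- nrow(mtcars)        # number of observations

# Rule-of-thumb cutoff: twice the average hat value
cutoff <- 2 * p / n

# Cars whose leverage exceeds the cutoff
high_leverage <- names(h)[h > cutoff]
cutoff
high_leverage
```

A car flagged here has an unusual combination of predictor values. High leverage on its own does not mean the car distorts the fit, so it should be read together with the dfbetas column in the table above.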