This report for Motor Trend magazine uses the mtcars dataset to explore the relationship between 10 aspects of car design and performance (the predictor variables) and fuel consumption measured in miles per US gallon, MPG (the outcome variable). The dataset contains data for 32 cars. In particular, the report asks whether an automatic or manual transmission is better for fuel consumption, and quantifies the difference between the two using exploratory data analysis and regression models. The conclusion, based on the available data, is that manual transmission is better for fuel consumption by about 1.48 MPG once cylinders, horsepower and weight are accounted for. However, the selected model leaves about 15% of the variation in MPG unexplained, which points to factors not included in the data.
Note that the echo = FALSE parameter has been added
to the code chunks to prevent printing of the R code, all of which can
be found in the appendix.
# load the dataset and display the structure
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Exploration of the mtcars dataset reveals a data frame with 32 observations on 11 (numeric) variables, as shown above.
Figure 1: A plot of mpg and transmission data where the red and blue horizontal lines represent the mean mpg for automatic and manual transmission respectively. The green line is the linear regression line
A simple linear regression analysis will provide further information on the strength of the relationship and the relative impact of transmission on MPG.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 0.000000000000001133983
## factor(am)1 7.244939 1.764422 4.106127 0.000285020743935067769
## [1] 0.3597989
The intercept (17.15) is the mean MPG where am is equal to 0 (automatic), and the factor(am)1 coefficient (7.24) is the difference between that and the mean MPG where am is equal to 1 (manual), which is 17.15 + 7.24 = 24.39. With only one categorical explanatory variable, the fitted regression line simply passes through the mean of each category, as shown by the green line on the plot in Figure 1.
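As a quick check of this reading, the group means can be compared with the fitted coefficients. This is a minimal sketch using base R only; the object names `fit_am`, `mean_auto` and `mean_manual` are illustrative and are not part of the original analysis.

```r
# Illustrative check: with one binary predictor, lm() reproduces the group means
data(mtcars)
fit_am <- lm(mpg ~ factor(am), data = mtcars)

mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])  # automatic transmission
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])  # manual transmission

# intercept: compare with mean_auto
unname(coef(fit_am)[1])
# factor(am)1 coefficient: compare with mean_manual - mean_auto
unname(coef(fit_am)[2])
```

The second coefficient is a difference between means rather than a mean in its own right, which is why it can be read directly as the manual versus automatic gap.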
Interpretation: The coefficients indicate that cars with manual transmission average 7.24 MPG more than cars with automatic transmission. The low p-value (0.000285, well below 0.05) indicates that this difference is statistically significant. However, the r squared value of 0.3597989 indicates that only 36% of the variation in fuel consumption can be explained by transmission type alone. This means we have to examine other variables to find and fit a better explanatory model.
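For a regression with a single predictor, r squared equals the squared correlation between the outcome and the predictor, which gives another way to arrive at the 36% figure. A small sketch, assuming only base R (the names `fit_am`, `r2_model` and `r2_cor` are illustrative):

```r
# Illustrative check: r squared of a one-predictor model equals cor(y, x)^2
data(mtcars)
fit_am <- lm(mpg ~ factor(am), data = mtcars)

r2_model <- summary(fit_am)$r.squared
r2_cor   <- cor(mtcars$mpg, mtcars$am)^2

round(c(model = r2_model, squared_correlation = r2_cor), 4)
```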
To guide model selection we can generate a correlation matrix to visualise the correlation coefficients between mpg and the other variables, and among those variables themselves.
Figure 2: A matrix of correlation coefficients for all variables, where the size of the text reflects the level of correlation
Interpretation: The variables with the strongest relationship to mpg (those with coefficients closest to 1 or -1) are cyl, disp, hp and wt. cyl and disp are also highly correlated with each other because displacement is the total volume of all the cylinders in an engine, so we will exclude disp from the model selection. Horsepower also has a relationship to cyl and disp, albeit weaker, through torque because, generally speaking, the more cylinders an engine has, the more horsepower and torque it makes. We will consider whether or not hp adds explanatory value by fitting multiple models.
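The ranking read off the correlation matrix can also be reproduced numerically. The following is a sketch under the assumption that plain Pearson correlations (the default of cor()) are what the figure displays; the names `cor_mpg`, `cor_mpg_sorted` and `cor_cyl_disp` are illustrative.

```r
# Illustrative sketch: rank the predictors by the strength of their correlation with mpg
data(mtcars)

# correlation of every other variable with mpg, ordered by absolute size
cor_mpg <- cor(mtcars)["mpg", colnames(mtcars) != "mpg"]
cor_mpg_sorted <- cor_mpg[order(abs(cor_mpg), decreasing = TRUE)]
round(cor_mpg_sorted, 2)

# correlation between the two engine-size variables discussed above
cor_cyl_disp <- cor(mtcars$cyl, mtcars$disp)
round(cor_cyl_disp, 2)
```

Sorting by absolute value keeps strong negative relationships, such as weight against mpg, at the top of the list.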
We can now fit a multiple regression for the selected model, mpg ~ am + cyl + hp + wt, and display the coefficients to see if we have improved on the explanatory power of our original single-predictor model.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14653575 3.10478079 11.642218 0.000000000004944804
## factor(am)1 1.47804771 1.44114927 1.025603 0.314179886317531576
## cyl -0.74515702 0.58278741 -1.278609 0.211916611111083397
## hp -0.02495106 0.01364614 -1.828433 0.078553373699869630
## wt -2.60648071 0.91983749 -2.833632 0.008603218128270956
## [1] 0.8490314
Interpretation: The r squared value has increased substantially, suggesting we now have a model that explains 85% of the variation in fuel consumption. This can be compared to an alternative model that excludes hp for the reasons outlined previously. The results are shown in Appendix B and show a slight reduction in r squared, to 83%. Sticking with the model that includes hp, I have conducted some diagnostic checks: the residual plots in Appendix C do not show any unexpected patterns, and the checks for leverage and influence in Appendix D do not show any data values of concern.
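Because adding a predictor can never lower r squared, the comparison between the two models can also be made with adjusted r squared and a nested-model F test. This is a sketch of one such check and is not part of the original analysis; `fit_with_hp` and `fit_without_hp` are illustrative names for the two models being compared.

```r
# Illustrative sketch: compare the model with hp against the model without it
data(mtcars)
fit_with_hp    <- lm(mpg ~ factor(am) + cyl + hp + wt, data = mtcars)
fit_without_hp <- lm(mpg ~ factor(am) + cyl + wt, data = mtcars)

# adjusted r squared penalises the extra predictor
c(with_hp    = summary(fit_with_hp)$adj.r.squared,
  without_hp = summary(fit_without_hp)$adj.r.squared)

# F test for the single added term hp; with one added term its p-value
# is the same as the p-value of the t test for hp in the larger model
cmp    <- anova(fit_without_hp, fit_with_hp)
p_hp_F <- cmp[["Pr(>F)"]][2]
p_hp_t <- summary(fit_with_hp)$coefficients["hp", "Pr(>|t|)"]
c(F_test = p_hp_F, t_test = p_hp_t)
```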
Regression analysis has shown that manual transmission improves fuel consumption by an estimated 1.48 MPG over automatic transmission when cylinders, horsepower and weight are held constant. Note that this estimate has a p-value of 0.31, so with only 32 cars it is not statistically significant at the 5% level. The model explains 85% of the variation in MPG. This seems reasonable considering that other significant factors that influence fuel efficiency and consumption, for example driving behaviour and tyres, are not included in the available data.
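The uncertainty around the 1.48 estimate can be summarised with a confidence interval for the transmission coefficient. A minimal sketch using base R's confint(); the names `fit_sel` and `ci_am` are illustrative and the 95% level is an assumption chosen to match the usual 0.05 threshold.

```r
# Illustrative sketch: 95% confidence interval for the manual-transmission effect
data(mtcars)
fit_sel <- lm(mpg ~ factor(am) + cyl + hp + wt, data = mtcars)

ci_am <- confint(fit_sel, "factor(am)1", level = 0.95)
round(ci_am, 2)
```

If the interval spans zero, the data are also compatible with no difference between transmission types once cylinders, horsepower and weight are accounted for.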
# knitr options
options(scipen = 999)  # avoid scientific notation in printed output
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, fig.cap = TRUE, fig.align = "center",
                      fig.path = "figures/")
knitr::opts_current$get('label')
# use captioner to add figure number and caption
library(captioner)
fig_nums <- captioner()
fig_nums("figa", "A plot of mpg and transmission data where the red and blue horizontal lines represent the mean mpg for automatic and manual transmission respectively. The green line is the linear regression line")
fig_nums("figb", "A matrix of correlation coefficients for all variables, where the size of the text reflects the level of correlation")
fig_nums("figc", "Residual plot for the model lm(mpg ~ factor(am) + cyl + hp + wt, mtcars)")
# load the dataset and display the structure
data(mtcars)
str(mtcars)
# create the plot for Figure 1 "Fuel Consumption by Transmission"
manu <- subset(mtcars, am == 1)
auto <- subset(mtcars, am == 0)
plot(mtcars$am, mtcars$mpg,
     main = "Fuel Consumption by Transmission",
     xlab = "Transmission (0 = automatic, 1 = manual)",
     ylab = "Miles/(US) gallon",
     col = ifelse(mtcars$am == 1, "blue", "red"))
legend("topleft",
       legend = c("auto", "manu"),
       pch = c(1, 1),
       col = c("red", "blue"))
# regression line plus the mean mpg of each transmission group
abline(lm(mpg ~ am, data = mtcars), col = "green")
abline(h = mean(auto$mpg), col = "red")
abline(h = mean(manu$mpg), col = "blue")
# label the group means using computed values rather than hard-coded ones
text(0.5, 18, paste0("mean=", round(mean(auto$mpg), 2)))
text(0.5, 25, paste0("mean=", round(mean(manu$mpg), 2)))
# compute the simple regression of mpg on transmission and display the coefficients and r squared results
fit0 <- lm(mpg ~ factor(am), data = mtcars)
summary(fit0)$coefficients
summary(fit0)$r.squared
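# Generate the correlation matrix
# panel function: print the correlation coefficient, with text size scaled by its magnitude
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))  # restore the user coordinates on exit
  par(usr = c(0, 1, 0, 1))
  Cor <- cor(x, y)
  txt <- paste0(prefix, format(c(Cor, 0.123456789), digits = digits)[1])
  if (missing(cex.cor)) {
    cex.cor <- 0.4 / strwidth(txt)
  }
  # use the absolute value so that negative correlations do not give a negative text size
  text(0.5, 0.5, txt,
       cex = cex.cor * abs(Cor))
}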
# Plot the data on the correlation matrix
pairs(mtcars,
upper.panel = panel.cor,
lower.panel = panel.smooth,
gap=0)
# Fit a multiple regression for the selected model that includes am + cyl + hp + wt and display the coefficients
fit1 <- lm(mpg ~ factor(am) + cyl + hp + wt, mtcars)
summary(fit1)$coef
summary(fit1)$r.squared
# Fit a multiple regression for the alternative model that includes am + cyl + wt (excluding hp) and display the coefficients
fit2 <- lm(mpg ~ factor(am) + cyl + wt, mtcars)
summary(fit2)$coef
summary(fit2)$r.squared
# plot residuals to look for patterns in the data
par(mfrow = c(2, 2))
plot(fit1)
# calculate dfbetas to check model data for influence and hatvalues to check model data for leverage
# The hat values are necessarily between 0 and 1, with larger values indicating greater (potential for) leverage.
# The dfbetas check for influence on each coefficient individually.
# Note: column 3 of dfbetas(fit1) is the cyl coefficient (column 2 would be factor(am)1)
infl <- cbind(round(dfbetas(fit1)[ ,3], 3), round(hatvalues(fit1), 3))
colnames(infl) <- c("dfbetas", "hatvalues")
infl
Appendix B: Coefficients and r squared for the alternative model lm(mpg ~ factor(am) + cyl + wt, mtcars)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.4179334 2.6414573 14.9227979 0.000000000000007424998
## factor(am)1 0.1764932 1.3044515 0.1353007 0.893342147923960827605
## cyl -1.5102457 0.4222792 -3.5764148 0.001291604589147556711
## wt -3.1251422 0.9108827 -3.4308942 0.001885894386856281756
## [1] 0.8303383
Figure 3: Residual plot for the model lm(mpg ~ factor(am) + cyl + hp + wt, mtcars)
Appendix D: dfbetas and hat values for the model lm(mpg ~ factor(am) + cyl + hp + wt, mtcars)
## dfbetas hatvalues
## Mazda RX4 -0.324 0.162
## Mazda RX4 Wag -0.223 0.181
## Datsun 710 0.174 0.101
## Hornet 4 Drive 0.008 0.076
## Hornet Sportabout 0.179 0.125
## Valiant -0.018 0.077
## Duster 360 0.016 0.197
## Merc 240D -0.134 0.189
## Merc 230 -0.041 0.228
## Merc 280 0.009 0.065
## Merc 280C 0.036 0.065
## Merc 450SE 0.080 0.079
## Merc 450SL 0.098 0.089
## Merc 450SLC -0.046 0.086
## Cadillac Fleetwood 0.028 0.238
## Lincoln Continental 0.017 0.285
## Chrysler Imperial -0.383 0.256
## Fiat 128 0.013 0.111
## Honda Civic 0.063 0.137
## Toyota Corolla 0.112 0.108
## Toyota Corona 0.516 0.277
## Dodge Challenger -0.252 0.151
## AMC Javelin -0.340 0.160
## Camaro Z28 0.033 0.153
## Pontiac Firebird 0.273 0.087
## Fiat X1-9 -0.008 0.105
## Porsche 914-2 0.030 0.094
## Lotus Europa -0.146 0.173
## Ford Pantera L -0.041 0.229
## Ferrari Dino 0.003 0.094
## Maserati Bora -0.171 0.466
## Volvo 142E 0.334 0.156
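One way to read the table above is to screen it against rule-of-thumb cutoffs. The thresholds used in this sketch (2p/n for hat values and 2/sqrt(n) for dfbetas) are common conventions that I am assuming here, not values from the original analysis, and a flagged car is a prompt for a closer look rather than proof of a problem. The names `fit_sel`, `high_leverage` and `high_influence_am` are illustrative.

```r
# Illustrative sketch: flag observations using assumed rule-of-thumb cutoffs
data(mtcars)
fit_sel <- lm(mpg ~ factor(am) + cyl + hp + wt, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit_sel))   # number of coefficients, including the intercept

# leverage: hat values above 2p/n (assumed cutoff)
hv <- hatvalues(fit_sel)
high_leverage <- names(hv)[hv > 2 * p / n]

# influence on the transmission coefficient: |dfbetas| above 2/sqrt(n) (assumed cutoff)
db_am <- dfbetas(fit_sel)[, "factor(am)1"]
high_influence_am <- names(db_am)[abs(db_am) > 2 / sqrt(n)]

high_leverage
high_influence_am
```

This sketch selects the dfbetas column by name so that the check targets the transmission coefficient, which is the quantity the report is about.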