This report investigates the relationship between an automobile's fuel economy and characteristics such as weight, horsepower, transmission type, and number of cylinders. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption (measured in miles per gallon, mpg) plus 10 additional measures of automobile design and performance for 32 cars (1973–74 models). The dataset ships with the base R installation and can be loaded as shown below, or you can download the data at this link.
Although this report initially set out to answer a simple question, namely which type of transmission is better for fuel economy, the answer turned out to be a bit more complex than “manual” or “automatic”. The findings are detailed below and can be summarized as follows: the fuel economy of a vehicle is most heavily influenced by its weight and ¼-mile time (¼-mile time is itself largely driven by number of cylinders and horsepower). As weight increases, fuel economy falls. As ¼-mile time increases (indicating a slower car), fuel economy rises. In the end, the report concludes that a manual transmission vehicle with the same weight and the same ¼-mile time as an automatic transmission vehicle is predicted to get more miles out of a single gallon of fuel.
First things first: what does the data 'look' like? What is the structure of the dataset, and are there any missing values? Are outliers present, and is the mpg data roughly normally distributed?
data(mtcars)
class(mtcars)
## [1] "data.frame"
sum(is.na(mtcars))
## [1] 0
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
What we find is a dataset with no missing values in which all variables are numeric. After converting transmission type ('am') to a factor (i.e. a categorical variable), a simple box plot and density chart (shown below) reveal no outliers and mpg values that are roughly normally distributed within each group. This matters because the t-test below relies on approximate normality, and multiple linear regression (MLR), which we'll run shortly, assumes approximately normal residuals. More on linear model assumptions here.
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
par(mfrow = c(1, 2))
boxplot(mpg ~ am, data = mtcars, xlab = "Transmission Type", ylab = "MPG", main = "Boxplot")
library(sm)
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
sm.density.compare(mtcars$mpg, mtcars$am, xlab = "MPG")
title(main = "Density Plot")
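The visual impression can be backed up by a quick numeric check. The sketch below is an addition to the original analysis and uses only base R. It works on a fresh copy of the data (the name `mt` is chosen here purely for illustration), applies the same 1.5 × IQR rule that `boxplot()` uses to flag outliers, and then runs a Shapiro-Wilk test on the mpg values.

```r
# Fresh copy of the data so this check does not depend on earlier chunks
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Points beyond the boxplot whiskers (1.5 x IQR rule), per transmission type
outliers <- lapply(split(mt$mpg, mt$am), function(x) boxplot.stats(x)$out)
print(outliers)

# Shapiro-Wilk test: a large p-value means no evidence against normality
normality <- shapiro.test(mt$mpg)
print(normality$p.value)
```

An empty vector for each group means no point falls outside the whiskers. A Shapiro-Wilk p-value above 0.05 does not prove normality; it only says the sample gives no strong evidence against it.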
The boxplot demonstrates that automatic transmission automobiles seem to have a lower average fuel economy than manual transmission vehicles. Here are the actual averages.
tapply(mtcars$mpg, mtcars$am, mean)
## Automatic Manual
## 17.15 24.39
This sounds in line with conventional wisdom, but perhaps this is only a statistical anomaly of our dataset. Let's use an independent t-test to find out.
t.test(mtcars$mpg ~ mtcars$am, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.28 -3.21
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.15 24.39
In this test the p-value is 0.001374. Translation for data people: assuming the null hypothesis is true (that is, there is no difference between the mean mpg of automatic and manual transmission cars), the probability of seeing a difference in means at least as large as the roughly 7 mpg we observed is only 0.1374% … very, very small. Hence we reject the null. Translation for everyone else: in this sample, manual transmission vehicles get better fuel economy than automatic transmission vehicles. But … should we stop there?
To quantify the magnitude of the difference in mpg between automatic and manual transmissions, we need to conduct an MLR linking a car's fuel economy with its characteristics. So we need to develop a statistical model. A reasonable starting guess is that all of the variables in the dataset have an impact on mpg. That model looks like this:
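To make the test less of a black box, the Welch statistic can be reproduced by hand. The sketch below is an illustration added to the original analysis; it uses base R only, a fresh copy of the data named `mt`, and the raw 0/1 coding of 'am' (0 = automatic, 1 = manual).

```r
mt <- datasets::mtcars
auto   <- mt$mpg[mt$am == 0]   # am == 0 is automatic in the raw data
manual <- mt$mpg[mt$am == 1]   # am == 1 is manual

# Per-group variance of the sample mean
va <- var(auto) / length(auto)
vm <- var(manual) / length(manual)

# Welch t statistic: difference in means divided by its standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(va + vm)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (va + vm)^2 / (va^2 / (length(auto) - 1) + vm^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)
print(c(t = t_stat, df = df, p = p_value))
```

The three numbers should match the `t`, `df` and `p-value` fields reported by `t.test()` above, up to rounding.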
mpg = intercept + weight + horsepower + ¼-mile time + # of cylinders + displacement + rear axle ratio + cylinder configuration + transmission type + # of forward gears + # of carburetors + some margin of error
And in R …
model <- lm(mpg ~ ., data = mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.45 -1.60 -0.12 1.22 4.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3034 18.7179 0.66 0.518
## cyl -0.1114 1.0450 -0.11 0.916
## disp 0.0133 0.0179 0.75 0.463
## hp -0.0215 0.0218 -0.99 0.335
## drat 0.7871 1.6354 0.48 0.635
## wt -3.7153 1.8944 -1.96 0.063 .
## qsec 0.8210 0.7308 1.12 0.274
## vs 0.3178 2.1045 0.15 0.881
## amManual 2.5202 2.0567 1.23 0.234
## gear 0.6554 1.4933 0.44 0.665
## carb -0.1994 0.8288 -0.24 0.812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.807
## F-statistic: 13.9 on 10 and 21 DF, p-value: 3.79e-07
Interestingly, no variable in this model is significant at the 0.05 level; weight comes closest (p = 0.063). Including every variable invites 1) overfitting, 2) a model that is too complex to explain (Occam's razor) and 3) poor results.
Multicollinearity creeps in because we see strong relationships between ¼-mile time and horsepower, as well as among other variables. If we include both of those in the model, in a sense we'd be double counting the effect, and our coefficient estimates change radically (hence poor results).
# cor() needs numeric input, so convert 'am' in a copy and keep mtcars$am as a factor
mtcars.num <- transform(mtcars, am = as.numeric(am))
cor(mtcars.num, use = "complete.obs", method = "kendall")
## mpg cyl disp hp drat wt qsec vs
## mpg 1.0000 -0.7953 -0.7681 -0.7428 0.46455 -0.7278 0.31537 0.5897
## cyl -0.7953 1.0000 0.8144 0.7852 -0.55132 0.7283 -0.44897 -0.7710
## disp -0.7681 0.8144 1.0000 0.6660 -0.49898 0.7434 -0.30082 -0.6033
## hp -0.7428 0.7852 0.6660 1.0000 -0.38263 0.6113 -0.47291 -0.6306
## drat 0.4645 -0.5513 -0.4990 -0.3826 1.00000 -0.5471 0.03272 0.3751
## wt -0.7278 0.7283 0.7434 0.6113 -0.54715 1.0000 -0.14199 -0.4885
## qsec 0.3154 -0.4490 -0.3008 -0.4729 0.03272 -0.1420 1.00000 0.6575
## vs 0.5897 -0.7710 -0.6033 -0.6306 0.37510 -0.4885 0.65754 1.0000
## am 0.4690 -0.4946 -0.5203 -0.3040 0.57555 -0.6139 -0.16890 0.1683
## gear 0.4332 -0.5125 -0.4760 -0.2794 0.58392 -0.5436 -0.09126 0.2697
## carb -0.5044 0.4654 0.4137 0.5960 -0.09535 0.3714 -0.50644 -0.5769
## am gear carb
## mpg 0.4690 0.43315 -0.50439
## cyl -0.4946 -0.51254 0.46543
## disp -0.5203 -0.47598 0.41374
## hp -0.3040 -0.27945 0.59598
## drat 0.5755 0.58392 -0.09535
## wt -0.6139 -0.54360 0.37137
## qsec -0.1689 -0.09126 -0.50644
## vs 0.1683 0.26975 -0.57693
## am 1.0000 0.77079 -0.05860
## gear 0.7708 1.00000 0.09801
## carb -0.0586 0.09801 1.00000
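Pairwise correlations only compare two variables at a time. A complementary check, added here as a sketch rather than taken from the original analysis, is the variance inflation factor (VIF), which measures how well each predictor is explained by all the others together. The code below computes it in base R from its definition; the names `mt`, `predictors` and `vif` are chosen for illustration.

```r
mt <- datasets::mtcars   # all columns numeric, 'am' coded 0/1
predictors <- setdiff(names(mt), "mpg")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mt))$r.squared
  1 / (1 - r2)
})
print(round(sort(vif, decreasing = TRUE), 1))
```

A common rule of thumb treats values above 5 or 10 as a warning sign, although the exact cutoff is a matter of convention.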
So, let's use stepwise selection to gain insight into the best-fitting model (that is, which variables to use) for our analysis.
step.model <- step(lm(data = mtcars, mpg ~ .), trace = 0, steps = 10000)
summary(step.model)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.481 -1.556 -0.726 1.411 4.661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.618 6.960 1.38 0.17792
## wt -3.917 0.711 -5.51 7e-06 ***
## qsec 1.226 0.289 4.25 0.00022 ***
## amManual 2.936 1.411 2.08 0.04672 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.834
## F-statistic: 52.7 on 3 and 28 DF, p-value: 1.21e-11
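The results tell us we have 3 variables in our model: 'wt' (weight), 'qsec' (¼-mile time), and 'am' (transmission type). With an adjusted R² of 0.83, this model explains about 83% of the variation in mpg between cars. The p-value of the overall F-test is tiny, far below our benchmark of 0.05, which tells us that the model as a whole is statistically significant. The coefficients can be interpreted as follows, each holding the other two variables constant: every additional 1,000 lbs of weight ('wt' is recorded in units of 1,000 lbs) lowers fuel economy by about 3.9 mpg; every additional second of ¼-mile time raises it by about 1.2 mpg; and a manual transmission adds about 2.9 mpg relative to an automatic.
To strengthen this model, we can let the effects of wt and qsec on mpg differ by transmission type, that is, fit separate slopes for automatic and manual transmission vehicles. This accounts for any interaction effect (aka moderation effect) taking place.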
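By default, `step()` selects terms by comparing AIC values. The sketch below, an addition for illustration, refits the full and the reduced model on a fresh copy of the data (`mt`) and compares them with `AIC()`. Note that `step()` relies on `extractAIC()`, which differs from `AIC()` by an additive constant for linear models fitted to the same data, so the ranking is the same even though the numbers differ.

```r
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

full    <- lm(mpg ~ ., data = mt)                 # all ten predictors
reduced <- lm(mpg ~ wt + qsec + am, data = mt)    # model chosen by step()

# Lower AIC is better: it rewards fit but penalizes extra parameters
aic <- c(full = AIC(full), reduced = AIC(reduced))
print(aic)
```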
step.model2 <- lm(mpg ~ factor(am):wt + factor(am):qsec, data = mtcars)
summary(step.model2)
##
## Call:
## lm(formula = mpg ~ factor(am):wt + factor(am):qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.936 -1.402 -0.155 1.269 3.886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.969 5.776 2.42 0.0226 *
## factor(am)Automatic:wt -3.176 0.636 -4.99 3.1e-05 ***
## factor(am)Manual:wt -6.099 0.969 -6.30 9.7e-07 ***
## factor(am)Automatic:qsec 0.834 0.260 3.20 0.0035 **
## factor(am)Manual:qsec 1.446 0.269 5.37 1.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.1 on 27 degrees of freedom
## Multiple R-squared: 0.895, Adjusted R-squared: 0.879
## F-statistic: 57.3 on 4 and 27 DF, p-value: 8.42e-13
Interpretation: for an automatic transmission car with wt equal to 2.32 and qsec equal to 18.61, we predict the MPG to be
MPG = 13.97 - 3.18 * 2.32 + 0.83 * 18.61 = 22.04.
For a manual transmission car with wt equal to 2.32 and qsec equal to 18.61, we predict the MPG to be
MPG = 13.97 - 6.10 * 2.32 + 1.45 * 18.61 = 26.80.
This model appears to agree with our prior notion (and the conventional wisdom) that manual cars get better gas mileage than automatic transmission cars.
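The hand calculation can also be delegated to `predict()`, which uses the unrounded coefficients and may therefore differ slightly from the figures above. This is a minimal sketch added for illustration: it refits a model with the same structure as `step.model2` on a fresh copy of the data (`mt`), and `new_cars` is a hypothetical data frame describing two cars that differ only in transmission type.

```r
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Same structure as step.model2: one intercept, separate slopes per transmission
fit <- lm(mpg ~ am:wt + am:qsec, data = mt)

# Two hypothetical cars with identical weight and quarter-mile time
new_cars <- data.frame(
  am   = factor(c("Automatic", "Manual"), levels = levels(mt$am)),
  wt   = 2.32,
  qsec = 18.61
)

pred <- predict(fit, newdata = new_cars)
names(pred) <- as.character(new_cars$am)
print(round(pred, 2))
```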
So, is this a decent model?
Analyzing the residuals (see directly below), there is no apparent trend in the residuals, the Q-Q plot appears linear and lies very close to the line y=x, and there is no apparent pattern in the scale-location graph. All indicate a good fit.
par(mfrow = c(2, 2))
plot(step.model)
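The diagnostic plots are judged by eye, so a couple of numeric checks can serve as a sanity check. The following is a sketch added to the original analysis, assuming the stepwise model `mpg ~ wt + qsec + am` refitted on a fresh copy of the data (`mt`): a Shapiro-Wilk test on the residuals and the three largest Cook's distances.

```r
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt + qsec + am, data = mt)

# Normality of residuals: a small p-value would cast doubt on the Q-Q plot reading
res_test <- shapiro.test(residuals(fit))
print(res_test)

# Cook's distance: the largest values point to the most influential cars
influence <- sort(cooks.distance(fit), decreasing = TRUE)
print(head(influence, 3))
```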
Further, plotting the predicted value for each vehicle according to this model, with a 95% confidence interval, against the actual value shows that the model comes pretty close to identifying the correct mpg for each automobile. Predicted in black, actual in blue.
library(Hmisc)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
p <- predict(step.model, interval = "confidence")
PredictedMPG <- p[, 1]
PredictedLower <- p[, 2]
PredictedHigher <- p[, 3]
FrankDF <- data.frame(Index = 1:32, Predicted = PredictedMPG, PredictedLower = PredictedLower,
    PredictedHigher = PredictedHigher)
par(las = 2, mar = c(8, 5, 4, 2))
plot(FrankDF$Index, mtcars$mpg, xaxt = "n", xlab = NA, ylim = c(0, 35), main = "Predicted vs. Actual MPGs",
    ylab = "MPGs", col = "blue")
errbar(FrankDF$Index, FrankDF$Predicted, FrankDF$PredictedLower, FrankDF$PredictedHigher,
    add = TRUE)
axis(1, at = 1:32, labels = row.names(mtcars), cex.axis = 0.8)
grid(nx = NA, ny = NULL)
legend("topleft", inset = 0.01, c("Predicted", "Actual"), fill = c("black",
    "blue"))
Not perfect, but no model is.
Actually, we are quite far from perfect. In ten out of 32 cases, our model failed to capture the actual mpg of the auto we were predicting for as indicated by the blue dots falling outside of our predicted value error bars. While being off by a mile per gallon may not be a life or death situation, this is definitely an area for further investigation.
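One factor worth checking is the type of interval. The error bars above are confidence intervals, which describe uncertainty about the mean mpg of cars with given characteristics. A prediction interval, which describes where an individual car is expected to fall, is always wider. The sketch below is an addition, assuming the same stepwise model refitted on a fresh copy of the data (`mt`); it counts how many actual values fall outside each kind of interval.

```r
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt + qsec + am, data = mt)

# Passing newdata explicitly avoids the warning predict() gives when
# prediction intervals are requested for the fitting data
conf <- predict(fit, newdata = mt, interval = "confidence", level = 0.95)
pred <- predict(fit, newdata = mt, interval = "prediction", level = 0.95)

# Number of cars whose actual mpg lies outside the interval
outside <- function(iv) sum(mt$mpg < iv[, "lwr"] | mt$mpg > iv[, "upr"])
misses <- c(confidence = outside(conf), prediction = outside(pred))
print(misses)
```

If most of the misses disappear under the prediction interval, the points outside the bars say more about the choice of interval than about a failure of the model.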