Regression Project

Executive Summary

The purpose of this analysis is to determine whether there is a substantive difference between “automatic” and “manual” transmission cars in fuel economy (miles per gallon, mpg), and to quantify this difference. A dataset provided by MotorTrend in 1974 will serve as the basis for our study. Our initial exploration will have the twofold goal of familiarizing us with the data and helping us determine appropriate methods for analysis. After our initial inspection, we’ll try fitting some models and comparing their performance and implications.

Analysis

Is there a difference in fuel economy for vehicles with automatic and manual transmissions? How do they compare? For this exploration, we’ll be working with the mtcars dataset provided in R (it’s extracted from a 1974 issue of MotorTrend magazine).

Let’s look at the structure of the data.

data(mtcars)
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

We’ll probably need to change a few things. Let’s make the transmission, cylinder, gear, carburetor, and vs (engine type: “vee” or “straight”) variables into factors; this is to help us handle these “discrete-valued” variables. We’ll also rename the “am” variable to “tran” to make its meaning more explicit.

names(mtcars)[names(mtcars) == "am"] <- "tran"
mtcars$tran <- factor(mtcars$tran, levels = c(0,1), labels = c("auto", "manual"))
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs, levels = c(0,1), labels = c("vee", "straight"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
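After a recoding like this, it can be worth confirming that the labels landed on the intended codes. The following is a minimal sketch of one way to check; it works on a separate copy (the name `cars` is an illustrative choice, not part of the analysis) so the objects above are left alone.

```r
# Fresh copy of the built-in data, so the recoded mtcars above is untouched.
cars <- datasets::mtcars
cars$tran <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Cross-tabulate the original 0/1 coding against the new labels.
# Every car should fall on the diagonal: 0 -> auto, 1 -> manual.
check <- table(am = cars$am, tran = cars$tran)
print(check)
```

If any count appeared off the diagonal, the `levels` and `labels` arguments would be mismatched.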

We’ll be interested in the tran variable. Let’s see if there’s an obvious difference between manual and automatic.

sapply(split(mtcars$mpg, mtcars$tran), summary)

##          auto manual
## Min.    10.40  15.00
## 1st Qu. 14.95  21.00
## Median  17.30  22.80
## Mean    17.15  24.39
## 3rd Qu. 19.20  30.40
## Max.    24.40  33.90

boxplot(mpg~tran, data = mtcars)

Right away, it appears that manual transmissions tend to get better gas mileage than automatics. While there does appear to be more variation for manual-transmission vehicles, the difference in the mean values looks pretty stark. Let’s look at the linear regression:

fit.tran <- lm(mpg ~ tran, data = mtcars)
summary(fit.tran)[c(1,4)]

## $call
## lm(formula = mpg ~ tran, data = mtcars)
## 
## $coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## tranmanual   7.244939   1.764422  4.106127 2.850207e-04

print(data.frame(summary(fit.tran)[c(6,8,9)]), row.names = F)

##     sigma r.squared adj.r.squared
##  4.902029 0.3597989     0.3384589
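With a single two-level factor as the only predictor, the fitted intercept and slope are simply the reference-group mean and the gap between the two group means. The sketch below checks this on a separate copy of the data (the names `cars`, `group_means`, `fit`, and `welch` are illustrative), and adds a Welch two-sample t-test as an alternative look at the same gap that does not assume equal variances.

```r
cars <- datasets::mtcars
cars$tran <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Mean mpg within each transmission group.
group_means <- tapply(cars$mpg, cars$tran, mean)

# One-factor regression, as in the text.
fit <- lm(mpg ~ tran, data = cars)

# Intercept vs. automatic mean, and slope vs. the difference in means.
print(all.equal(unname(coef(fit)["(Intercept)"]),
                unname(group_means["auto"])))
print(all.equal(unname(coef(fit)["tranmanual"]),
                unname(group_means["manual"] - group_means["auto"])))

# Welch t-test on the same two groups (unequal variances allowed).
welch <- t.test(mpg ~ tran, data = cars)
print(welch)
```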

The coefficient for the transmission factor variable gives us the same information as our summaries of the split data: a manual transmission is associated with an expected increase of about 7.24 mpg, which is exactly the difference between the two group means (24.39 - 17.15). Note that, because automatic transmissions are the reference level for this model, the intercept of about 17.15 is the expected mpg for a vehicle with an automatic transmission. These results seem clear-cut, but we get a much more intriguing outcome when we attempt to “compartmentalize” the effects of all the variables in our dataset.

fit.all <- lm(mpg ~ . , data = mtcars)
# multiple regression: mpg on every other variable
summary(fit.all)[c(1,4)]

## $call
## lm(formula = mpg ~ ., data = mtcars)
## 
## $coefficients
##                Estimate  Std. Error     t value   Pr(>|t|)
## (Intercept) 23.87913244 20.06582026  1.19004018 0.25252548
## cyl6        -2.64869528  3.04089041 -0.87102622 0.39746642
## cyl8        -0.33616298  7.15953951 -0.04695316 0.96317000
## disp         0.03554632  0.03189920  1.11433290 0.28267339
## hp          -0.07050683  0.03942556 -1.78835344 0.09393155
## drat         1.18283018  2.48348458  0.47627845 0.64073922
## wt          -4.52977584  2.53874584 -1.78425732 0.09461859
## qsec         0.36784482  0.93539569  0.39325050 0.69966720
## vsstraight   1.93085054  2.87125777  0.67247551 0.51150791
## tranmanual   1.21211570  3.21354514  0.37718957 0.71131573
## gear4        1.11435494  3.79951726  0.29328856 0.77332027
## gear5        2.52839599  3.73635801  0.67670068 0.50889747
## carb2       -0.97935432  2.31797446 -0.42250436 0.67865093
## carb3        2.99963875  4.29354611  0.69863900 0.49546781
## carb4        1.09142288  4.44961992  0.24528452 0.80956031
## carb6        4.47756921  6.38406242  0.70136677 0.49381268
## carb8        7.25041126  8.36056638  0.86721532 0.39948495

print(data.frame(summary(fit.all)[c(6,8,9)]), row.names = F)

##     sigma r.squared adj.r.squared
##  2.833169 0.8930749     0.7790215

Manual-transmission vehicles may still be more fuel-efficient, but the difference is far less pronounced now: the estimate drops from about 7.2 mpg to about 1.2 mpg. In fact, the t-test for our “tranmanual” coefficient (p ≈ 0.71) fails to reject the null hypothesis of no difference between automatic and manual transmissions when it comes to fuel economy.
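Another way to see the same thing is through a confidence interval for the transmission coefficient in the full model. The sketch below rebuilds that model on a separate copy of the data (the names `cars`, `fit_full`, and `ci` are illustrative) and uses `confint()` to get a 95% interval; an interval that contains zero corresponds to the non-significant t-test.

```r
cars <- datasets::mtcars
# Same recoding as in the text: tran replaces am, discrete variables become factors.
cars$tran <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
cars$am <- NULL
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])

# Full model: mpg regressed on every other column.
fit_full <- lm(mpg ~ ., data = cars)

# 95% confidence interval for the manual-vs-automatic coefficient.
ci <- confint(fit_full, "tranmanual", level = 0.95)
print(ci)
```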

Before we jump to conclusions, let’s examine the trustworthiness of these models via residual plots.

par(mfrow = c(2,1), mar = c(0.25, 4, 0.25, .5), oma = c(3, 0, 2, 0), mgp = c(2,1,0))
plot(resid(fit.tran),
     ylab = "simp. reg. residuals", xlab = "", xaxt = "n", cex.lab = 1.15)
abline(h = 0, col = "blue")
plot(resid(fit.all),
     ylab = "mult. reg. residuals", cex.lab = 1.15)
abline(h = 0, col = "blue")
title(main = "Residual Plots", xlab = "Index", outer = TRUE, cex.lab = 1.15)

Neither plot looks terrible, but the residuals from the transmission-only model may show a slight cyclical pattern. Together with the gap in adjusted R-squared between that model (0.34) and the model using all of the variables (0.78), this suggests that the multiple regression model describes the data better than the transmission-only model, and it lends credence to the notion that we lack solid evidence that manual transmissions, by themselves, get better fuel economy.
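Because the transmission-only model is nested inside the full model, the two can also be compared formally with an F-test, which complements the adjusted R-squared comparison. This is a sketch on a separate copy of the data (the names `cars`, `fit_small`, `fit_full`, and `cmp` are illustrative), not part of the original analysis.

```r
cars <- datasets::mtcars
cars$tran <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
cars$am <- NULL
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])

fit_small <- lm(mpg ~ tran, data = cars)  # transmission only
fit_full  <- lm(mpg ~ ., data = cars)     # all predictors

# F-test of the nested models: do the extra predictors reduce the
# residual sum of squares by more than chance would explain?
cmp <- anova(fit_small, fit_full)
print(cmp)
```

A small p-value in the second row indicates that the additional predictors, taken together, explain a meaningful share of the variation that transmission type alone leaves behind.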

Discussion

Why the disparity between the implications of the models? Well, it seems that there must be some collinearity between the variables in the dataset. This makes sense intuitively, as the nature of a vehicle may preclude the possibility of a transmission being changed while holding everything else constant. It seems that the apparent differences in fuel economy that we observed in our initial analysis were actually the result of broader changes in a vehicle that tend to accompany a change in the transmission.
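One rough way to see how transmission type travels together with other attributes is to compare group averages and simple correlations. The sketch below uses the original 0/1 `am` coding on a separate copy of the data (the names `cars`, `by_tran`, and `cors` are illustrative), and the three variables shown are just a convenient selection.

```r
cars <- datasets::mtcars

# Average weight, horsepower, and displacement by transmission (0 = auto, 1 = manual).
by_tran <- aggregate(cbind(wt, hp, disp) ~ am, data = cars, FUN = mean)
print(by_tran)

# Correlation of the 0/1 transmission indicator with each attribute.
cors <- sapply(cars[c("wt", "hp", "disp")], function(x) cor(cars$am, x))
print(round(cors, 2))
```

In this dataset the automatic cars are noticeably heavier on average, so a model that omits weight will attribute part of the weight effect to transmission type.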

Because the predictors are likely collinear, any attempt to quantify the difference in mpg between automatic and manual transmissions is bound to be untidy. Nevertheless, I would like to craft at least one more indicator of the impact of transmission type. Looking back at the full model, we observe that horsepower and weight had the smallest p-values (about 0.09 each), although neither is significant at the 0.05 level. Let’s use these two variables along with transmission type to craft a model.

summary(lm(mpg ~ hp + wt + tran, data = mtcars))[c(1,4,9)]

## $call
## lm(formula = mpg ~ hp + wt + tran, data = mtcars)
## 
## $coefficients
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 34.00287512 2.642659337 12.866916 2.824030e-13
## hp          -0.03747873 0.009605422 -3.901830 5.464023e-04
## wt          -2.87857541 0.904970538 -3.180850 3.574031e-03
## tranmanual   2.08371013 1.376420152  1.513862 1.412682e-01
## 
## $adj.r.squared
## [1] 0.8227357
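To make “holding horsepower and weight constant” concrete, the sketch below predicts mpg for two hypothetical cars that differ only in transmission type. The values hp = 110 and wt = 2.62 are an arbitrary illustrative choice taken from the first row of the data, and the names `cars`, `fit3`, `newcars`, `pred`, and `ci3` are likewise illustrative.

```r
cars <- datasets::mtcars
cars$tran <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

fit3 <- lm(mpg ~ hp + wt + tran, data = cars)

# Two hypothetical cars: identical hp and wt, different transmission.
newcars <- data.frame(
  hp   = 110,
  wt   = 2.62,
  tran = factor(c("auto", "manual"), levels = levels(cars$tran))
)
pred <- predict(fit3, newdata = newcars)
print(pred)

# The gap between the two predictions is the tranmanual coefficient.
print(unname(diff(pred)))

# 95% confidence interval for that coefficient.
ci3 <- confint(fit3, "tranmanual")
print(ci3)
```

Because the model has no interaction terms, the predicted gap is the same whatever values of hp and wt are plugged in.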

If horsepower and weight are held constant, manual transmissions (and their accompanying features) are associated with an increase of around 2 mpg, although with a p-value of about 0.14 this estimate is not statistically distinguishable from zero at the usual 0.05 level. The choice of predictors is admittedly somewhat arbitrary, but I think this model is probably more practical than the two I previously put forth. In any case, thank you for taking the time to read this paper. Have a great day!

Notes: You can often find the “best” model according to some designated criterion by trying stepwise algorithms. For example, in the situation above, you could call step(fit.all, direction = "both") in R, and it would use the AIC criterion to determine the “best” model. (In case you’re wondering, the code just mentioned would recommend a model using cylinder, horsepower, weight, and transmission type as explanatory variables. The resulting model would estimate that manual transmissions are associated with about a 1.8 mpg increase in fuel economy.) Bottom line: stepwise algorithms are pretty cool!

Methodology aside, it’s important to keep in mind that the data used for this study are relatively old. It seems likely that the relationships between fuel economy and vehicle attributes have shifted considerably in the interim.

Finally, please note that this project was carried out with the intent of conforming to the requirements of the “Regression Models” course provided by Johns Hopkins via Coursera. Still, I hope it was at least a little enjoyable for the reader. Take care! Have an awesome day! -wraphaeljr