I examined a data set of vehicle characteristics (“mtcars”) to explore what characteristics impacted fuel efficiency, as measured in miles per gallon (mpg). In particular, I will answer two questions:
Along the way, I’ll give an example of multiple regression and model selection using R.
The mtcars data set is found in the datasets package in R. The data itself dates from the 1974 Motor Trend magazine. It includes fuel economy (mpg) and ten characteristics of 32 models of automobiles for the 1973-1974 time period.
| Variable | Description |
|---|---|
| mpg | Miles per gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu. in.) |
| hp | Horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | Quarter mile time |
| vs | Engine (V vs straight) |
| am | Transmission |
| gear | Number of gears |
| carb | Number of carburetors |
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
summary(mtcars)
## mpg cyl disp hp drat
## Min. :10.40 4:11 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
## Median :19.20 8:14 Median :196.3 Median :123.0 Median :3.695
## Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.597
## 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :33.90 Max. :472.0 Max. :335.0 Max. :4.930
## wt qsec vs am gear carb
## Min. :1.513 Min. :14.50 0:18 0:19 3:15 1: 7
## 1st Qu.:2.581 1st Qu.:16.89 1:14 1:13 4:12 2:10
## Median :3.325 Median :17.71 5: 5 3: 3
## Mean :3.217 Mean :17.85 4:10
## 3rd Qu.:3.610 3rd Qu.:18.90 6: 1
## Max. :5.424 Max. :22.90 8: 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
The average car of 1973-1974 got 20.090625 miles per gallon with a standard deviation of 6.0269481. We’re most interested in the effect of transmission type (automatic = 0, manual = 1) on mpg so let’s look at that variable in more detail.
The mean fuel efficiency in 1973-1974 for cars with manual transmissions was higher than the mean for cars with automatic transmissions.
boxplot(mpg~am, data = mtcars)
The mean fuel economy for manual transmission vehicles was 24.3923077 versus 17.1473684 for automatics, 7.2449393 higher.
Shapiro-Wilk normality tests found that mpg, rear axle ratio (drat), weight (wt), and quarter-mile time (qsec) were normally distributed whereas displacement (disp) and horsepower (hp) were not. The remaining variables are factors.
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
shapiro.test(mtcars$disp)
##
## Shapiro-Wilk normality test
##
## data: mtcars$disp
## W = 0.92001, p-value = 0.02081
shapiro.test(mtcars$hp)
##
## Shapiro-Wilk normality test
##
## data: mtcars$hp
## W = 0.93342, p-value = 0.04881
shapiro.test(mtcars$drat)
##
## Shapiro-Wilk normality test
##
## data: mtcars$drat
## W = 0.94588, p-value = 0.1101
shapiro.test(mtcars$wt)
##
## Shapiro-Wilk normality test
##
## data: mtcars$wt
## W = 0.94326, p-value = 0.09265
shapiro.test(mtcars$qsec)
##
## Shapiro-Wilk normality test
##
## data: mtcars$qsec
## W = 0.97325, p-value = 0.5935
Accordingly, disp and hp will need to be transformed. I first try natural log transformations. and rerun the Shapiro Wilk normality test.
mtcars$ln_disp <- log(mtcars$disp)
mtcars$ln_hp <- log(mtcars$hp)
shapiro.test(mtcars$ln_disp)
##
## Shapiro-Wilk normality test
##
## data: mtcars$ln_disp
## W = 0.93477, p-value = 0.05327
shapiro.test(mtcars$ln_hp)
##
## Shapiro-Wilk normality test
##
## data: mtcars$ln_hp
## W = 0.97026, p-value = 0.5065
Why go through that trouble? One of the main assumptions of linear regression is that the variables are normally distributed. Now that all continuous variables are (just barely) normal, I check for cross-correlations between the regressors. One note: This can only be done on the entire data set if all the variables are numeric. So you’ll notice I reverted to the original data set before I set several as factors.
data("mtcars")
mtcars$ln_disp <- log(mtcars$disp)
mtcars$ln_hp <- log(mtcars$hp)
cor(mtcars)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## ln_disp -0.9071119 0.9318804 0.9725003 0.8020950 -0.75645083 0.8845389
## ln_hp -0.8487707 0.8806646 0.8278959 0.9687353 -0.56387082 0.7158277
## qsec vs am gear carb ln_disp
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507 -0.9071119
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829 0.9318804
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686 0.9725003
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247 0.8020950
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980 -0.7564508
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594 0.8845389
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923 -0.4479810
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714 -0.7281298
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435 -0.6438649
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284 -0.5465753
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000 0.4383136
## ln_disp -0.44798103 -0.7281298 -0.64386495 -0.5465753 0.43831361 1.0000000
## ln_hp -0.68495060 -0.7617319 -0.34579172 -0.2190341 0.69964733 0.8617723
## ln_hp
## mpg -0.8487707
## cyl 0.8806646
## disp 0.8278959
## hp 0.9687353
## drat -0.5638708
## wt 0.7158277
## qsec -0.6849506
## vs -0.7617319
## am -0.3457917
## gear -0.2190341
## carb 0.6996473
## ln_disp 0.8617723
## ln_hp 1.0000000
We can see that several variables are highly correlated with mpg, both negatively and positively. Unfortunately, they’re also correlated with each other. So we’re going to have to use multiple regression to tease out the intertwined effects.
Just for the fun of it, here’s the same data as a plot. There are too many variables for a good plot, honestly, but here’s an example of how not to do it:
plot(mtcars)
So, let’s start with the analysis.
First, let’s do this by hand. The first model fits all the variables.
# Change certain variables back to factors and drop disp and hp
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars <- subset(mtcars, select = -c(disp, hp))
# Model 1 with all the variables
fit1 <- lm(mpg~., data = mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4755 -1.4827 0.1112 0.9614 3.5887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.7330 32.4106 2.244 0.0403 *
## cyl6 1.8963 3.3757 0.562 0.5826
## cyl8 6.2922 5.4164 1.162 0.2635
## drat 0.1487 2.4406 0.061 0.9522
## wt -0.8663 2.2000 -0.394 0.6993
## qsec 0.3944 0.8963 0.440 0.6662
## vs1 0.6826 2.7491 0.248 0.8073
## am1 1.5233 3.0402 0.501 0.6236
## gear4 -0.4903 3.2872 -0.149 0.8834
## gear5 3.0209 3.6283 0.833 0.4181
## carb2 -0.1733 2.2343 -0.078 0.9392
## carb3 -1.8749 3.7813 -0.496 0.6272
## carb4 -1.0962 4.4207 -0.248 0.8075
## carb6 -2.9741 6.6300 -0.449 0.6601
## carb8 -2.7182 6.4682 -0.420 0.6803
## ln_disp -6.2489 6.2004 -1.008 0.3295
## ln_hp -5.7411 4.5838 -1.252 0.2296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.724 on 15 degrees of freedom
## Multiple R-squared: 0.9012, Adjusted R-squared: 0.7958
## F-statistic: 8.549 on 16 and 15 DF, p-value: 7.252e-05
Note how the overall model is statistically significant with a p = 0.00007252 and an R-squared of 0.9011787 even though none of the individual coefficients are significant. That indicates I have too many variables in the model. Accordingly, I dropped the three least significant variables and tried again with the remaining six regressors.
fit2 <- lm(mpg ~ cyl + wt + qsec + am + ln_disp + ln_hp, data = mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ cyl + wt + qsec + am + ln_disp + ln_hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2535 -1.1738 -0.3765 0.5230 3.7947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.8444 21.2712 2.672 0.0133 *
## cyl6 -0.8034 1.8028 -0.446 0.6598
## cyl8 1.5724 3.1459 0.500 0.6217
## wt -2.2185 1.1107 -1.997 0.0572 .
## qsec 0.3189 0.5607 0.569 0.5748
## am1 1.8312 1.7256 1.061 0.2992
## ln_disp -2.3961 3.0153 -0.795 0.4346
## ln_hp -4.8959 2.3797 -2.057 0.0507 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.302 on 24 degrees of freedom
## Multiple R-squared: 0.887, Adjusted R-squared: 0.8541
## F-statistic: 26.92 on 7 and 24 DF, p-value: 6.901e-10
Now weight and the natural log of horsepower are nearly significant but four regressors aren’t, indicating we still have too many variables. An ANOVA comparison finds no significant difference between the first and second models, indicating that the simpler model fits the data just as well.
anova(fit1, fit2)
Accordingly, I dropped three more variables that had the largest p-values to get my third model.
fit3 <- lm(mpg~wt + am + ln_hp, data = mtcars)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ wt + am + ln_hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8148 -1.5495 -0.4402 1.1327 4.7638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.9957 4.8970 12.047 1.36e-12 ***
## wt -2.4681 0.8215 -3.005 0.00555 **
## am1 1.7565 1.1987 1.465 0.15397
## ln_hp -6.4888 1.3004 -4.990 2.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.294 on 28 degrees of freedom
## Multiple R-squared: 0.8692, Adjusted R-squared: 0.8552
## F-statistic: 62.01 on 3 and 28 DF, p-value: 1.747e-12
CI <- confint(fit3)
As far as keeping transmission type in my model, this is probably as far as I can go. The next step would be to drop am (transmission type) as it’s still not significant and then compare the models to see if that really made a difference to model fit. As it is, the model indicates that manual transmissions increased mpg by a mean of 1.7565027 miles per gallon (95% confidence interval -0.6989562 < x < 4.2119616) over automatic transmissions holding all other variables constant.
The new model fits the data quite well, with an R-squared value of 0.8691763 and a p-value of 0.000000000001747. Additionally, ANOVA comparisons find no significant difference between models 1, 2, and 3, indicating that the simpler model with only three regressors is just as good as the more complex models.
anova(fit2, fit3)
anova(fit1, fit3)
Residual and diagnostic plots do not indicate any major problems with the final model. Cooks distance shows no outliers with an overly large influence on the final model.
fit3.resid <- resid(fit3)
plot(fit3.resid)
par(mfrow = c(2,2))
plot(fit3)
The regression model shows that vehicle weight is negatively correlated with fuel efficiency whereas quarter-mile time and manual transmissions are positively correlated. The final model is
Estimate mpg = 58.9956718 + -2.4681432weight + 1.7565027am + -6.4888385*ln_hp
Holding all other conditions constant, switching from an automatic transmission to a manual transmission adds an average of 1.7565027 mpg (95% confidence interval -0.6989562 to 4.2119616) to the fuel efficiency of any vehicle. This means manual transmissions are generally (but not always) better for fuel economy.
This being R, there’s a fast and easy way to pick a linear model: Use stepwise regression. Just create an object containing a regression model and pass that object to the step function.
fit1 <- lm(mpg~., data = mtcars)
step(fit1, direction = c("both"), scope = list(lower = ~am))
## Start: AIC=73.88
## mpg ~ cyl + drat + wt + qsec + vs + am + gear + carb + ln_disp +
## ln_hp
##
## Df Sum of Sq RSS AIC
## - carb 5 3.1740 114.45 64.781
## - drat 1 0.0275 111.31 71.889
## - vs 1 0.4574 111.73 72.013
## - wt 1 1.1502 112.43 72.210
## - qsec 1 1.4364 112.71 72.292
## - cyl 2 12.3515 123.63 73.250
## - gear 2 12.4040 123.68 73.263
## <none> 111.28 73.881
## - ln_disp 1 7.5351 118.81 73.978
## - ln_hp 1 11.6373 122.92 75.064
##
## Step: AIC=64.78
## mpg ~ cyl + drat + wt + qsec + vs + am + gear + ln_disp + ln_hp
##
## Df Sum of Sq RSS AIC
## - drat 1 0.106 114.56 62.811
## - qsec 1 1.867 116.32 63.299
## - vs 1 2.586 117.04 63.496
## - gear 2 11.425 125.88 63.826
## - ln_disp 1 5.646 120.10 64.322
## <none> 114.45 64.781
## - wt 1 8.165 122.62 64.986
## - cyl 2 18.004 132.46 65.456
## - ln_hp 1 34.357 148.81 71.182
##
## Step: AIC=62.81
## mpg ~ cyl + wt + qsec + vs + am + gear + ln_disp + ln_hp
##
## Df Sum of Sq RSS AIC
## - qsec 1 1.768 116.33 61.301
## - vs 1 2.550 117.11 61.515
## - gear 2 11.534 126.09 61.881
## - ln_disp 1 5.729 120.29 62.372
## <none> 114.56 62.811
## - wt 1 8.352 122.91 63.063
## - cyl 2 18.660 133.22 63.640
## - ln_hp 1 34.385 148.94 69.210
##
## Step: AIC=61.3
## mpg ~ cyl + wt + vs + am + gear + ln_disp + ln_hp
##
## Df Sum of Sq RSS AIC
## - gear 2 10.423 126.75 60.047
## - vs 1 4.869 121.19 60.613
## - ln_disp 1 6.533 122.86 61.049
## - wt 1 6.603 122.93 61.068
## <none> 116.33 61.301
## - cyl 2 18.115 134.44 61.932
## - ln_hp 1 42.934 159.26 69.353
##
## Step: AIC=60.05
## mpg ~ cyl + wt + vs + am + ln_disp + ln_hp
##
## Df Sum of Sq RSS AIC
## - vs 1 2.175 128.92 58.591
## - ln_disp 1 3.396 130.14 58.893
## - cyl 2 16.079 142.83 59.868
## <none> 126.75 60.047
## - wt 1 20.604 147.35 62.867
## - ln_hp 1 34.518 161.27 65.754
##
## Step: AIC=58.59
## mpg ~ cyl + wt + am + ln_disp + ln_hp
##
## Df Sum of Sq RSS AIC
## - ln_disp 1 4.424 133.35 57.671
## - cyl 2 14.222 143.15 57.940
## <none> 128.92 58.591
## - wt 1 20.349 149.27 61.281
## - ln_hp 1 33.500 162.42 63.983
##
## Step: AIC=57.67
## mpg ~ cyl + wt + am + ln_hp
##
## Df Sum of Sq RSS AIC
## - cyl 2 13.967 147.31 56.859
## <none> 133.35 57.671
## - wt 1 42.561 175.91 64.535
## - ln_hp 1 49.622 182.97 65.794
##
## Step: AIC=56.86
## mpg ~ wt + am + ln_hp
##
## Df Sum of Sq RSS AIC
## <none> 147.31 56.859
## - wt 1 47.496 194.81 63.801
## - ln_hp 1 131.006 278.32 75.217
##
## Call:
## lm(formula = mpg ~ wt + am + ln_hp, data = mtcars)
##
## Coefficients:
## (Intercept) wt am1 ln_hp
## 58.996 -2.468 1.757 -6.489
Note I specified “direction” as well as “scope.” “Both” gives R leeway to add and subtract various combinations of variables to find the optimal model. “Scope” forced R to keep the variable “am” in each model. Without direction, the default is for step to go backwards. If you don’t specify “scope,” R will add and subtract all variables equally.
Keeping “am” in the model yields the same model as the one I found above: MPG is negatively affected by both weight and horsepower and positively affected by manual transmissions.
mpg = 58.9956718 + -2.4681432wt + 1.7565027am + -6.4888385*ln(hp)
Now you just have to code the new model and run diagnotic plots on it as I did above. And there you have it. The same model as above but in two lines of code.