An example of multiple regression with R using mtcars

Summary

I examined a data set of vehicle characteristics (“mtcars”) to explore what characteristics impacted fuel efficiency, as measured in miles per gallon (mpg). In particular, I will answer two questions:

Which type of transmission—automatic or manual—is better for fuel efficiency?
How much difference is there between transmission types?

Along the way, I’ll give an example of multiple regression and model selection using R.

Dataset characteristics

The mtcars data set is found in the datasets package in R. The data itself dates from the 1974 Motor Trend magazine. It includes fuel economy (mpg) and ten characteristics of 32 models of automobiles for the 1973-1974 time period.

Variable	Description
mpg	Miles per gallon
cyl	Number of cylinders
disp	Displacement (cu. in.)
hp	Horsepower
drat	Rear axle ratio
wt	Weight (1000 lbs)
qsec	Quarter mile time
vs	Engine (V vs straight)
am	Transmission
gear	Number of gears
carb	Number of carburetors

Exploratory Data Analysis

mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
summary(mtcars)

##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec       vs     am     gear   carb  
##  Min.   :1.513   Min.   :14.50   0:18   0:19   3:15   1: 7  
##  1st Qu.:2.581   1st Qu.:16.89   1:14   1:13   4:12   2:10  
##  Median :3.325   Median :17.71                 5: 5   3: 3  
##  Mean   :3.217   Mean   :17.85                        4:10  
##  3rd Qu.:3.610   3rd Qu.:18.90                        6: 1  
##  Max.   :5.424   Max.   :22.90                        8: 1

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

The average car of 1973-1974 got 20.090625 miles per gallon with a standard deviation of 6.0269481. We’re most interested in the effect of transmission type (automatic = 0, manual = 1) on mpg so let’s look at that variable in more detail.

The mean fuel efficiency in 1973-1974 for cars with manual transmissions was higher than the mean for cars with automatic transmissions.

boxplot(mpg~am, data = mtcars)

The mean fuel economy for manual transmission vehicles was 24.3923077 versus 17.1473684 for automatics, 7.2449393 higher.

Shapiro-Wilk normality tests found that mpg, rear axle ratio (drat), weight (wt), and quarter-mile time (qsec) were normally distributed whereas displacement (disp) and horsepower (hp) were not. The remaining variables are factors.

shapiro.test(mtcars$mpg)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

shapiro.test(mtcars$disp)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$disp
## W = 0.92001, p-value = 0.02081

shapiro.test(mtcars$hp)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$hp
## W = 0.93342, p-value = 0.04881

shapiro.test(mtcars$drat)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$drat
## W = 0.94588, p-value = 0.1101

shapiro.test(mtcars$wt)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$wt
## W = 0.94326, p-value = 0.09265

shapiro.test(mtcars$qsec)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$qsec
## W = 0.97325, p-value = 0.5935

Accordingly, disp and hp will need to be transformed. I first try natural log transformations. and rerun the Shapiro Wilk normality test.

mtcars$ln_disp <- log(mtcars$disp)
mtcars$ln_hp <- log(mtcars$hp)
shapiro.test(mtcars$ln_disp)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$ln_disp
## W = 0.93477, p-value = 0.05327

shapiro.test(mtcars$ln_hp)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$ln_hp
## W = 0.97026, p-value = 0.5065

Why go through that trouble? One of the main assumptions of linear regression is that the variables are normally distributed. Now that all continuous variables are (just barely) normal, I check for cross-correlations between the regressors. One note: This can only be done on the entire data set if all the variables are numeric. So you’ll notice I reverted to the original data set before I set several as factors.

data("mtcars")
mtcars$ln_disp <- log(mtcars$disp)
mtcars$ln_hp <- log(mtcars$hp)
cor(mtcars)

##                mpg        cyl       disp         hp        drat         wt
## mpg      1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl     -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp    -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp      -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat     0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt      -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec     0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs       0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am       0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear     0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb    -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
## ln_disp -0.9071119  0.9318804  0.9725003  0.8020950 -0.75645083  0.8845389
## ln_hp   -0.8487707  0.8806646  0.8278959  0.9687353 -0.56387082  0.7158277
##                qsec         vs          am       gear        carb    ln_disp
## mpg      0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507 -0.9071119
## cyl     -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829  0.9318804
## disp    -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686  0.9725003
## hp      -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247  0.8020950
## drat     0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980 -0.7564508
## wt      -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594  0.8845389
## qsec     1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923 -0.4479810
## vs       0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714 -0.7281298
## am      -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435 -0.6438649
## gear    -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284 -0.5465753
## carb    -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000  0.4383136
## ln_disp -0.44798103 -0.7281298 -0.64386495 -0.5465753  0.43831361  1.0000000
## ln_hp   -0.68495060 -0.7617319 -0.34579172 -0.2190341  0.69964733  0.8617723
##              ln_hp
## mpg     -0.8487707
## cyl      0.8806646
## disp     0.8278959
## hp       0.9687353
## drat    -0.5638708
## wt       0.7158277
## qsec    -0.6849506
## vs      -0.7617319
## am      -0.3457917
## gear    -0.2190341
## carb     0.6996473
## ln_disp  0.8617723
## ln_hp    1.0000000

We can see that several variables are highly correlated with mpg, both negatively and positively. Unfortunately, they’re also correlated with each other. So we’re going to have to use multiple regression to tease out the intertwined effects.

Just for the fun of it, here’s the same data as a plot. There are too many variables for a good plot, honestly, but here’s an example of how not to do it:

plot(mtcars)

So, let’s start with the analysis.

Regression Analysis

First, let’s do this by hand. The first model fits all the variables.

# Change certain variables back to factors and drop disp and hp
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars <- subset(mtcars, select = -c(disp, hp))

# Model 1 with all the variables
fit1 <- lm(mpg~., data = mtcars)
summary(fit1)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4755 -1.4827  0.1112  0.9614  3.5887 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  72.7330    32.4106   2.244   0.0403 *
## cyl6          1.8963     3.3757   0.562   0.5826  
## cyl8          6.2922     5.4164   1.162   0.2635  
## drat          0.1487     2.4406   0.061   0.9522  
## wt           -0.8663     2.2000  -0.394   0.6993  
## qsec          0.3944     0.8963   0.440   0.6662  
## vs1           0.6826     2.7491   0.248   0.8073  
## am1           1.5233     3.0402   0.501   0.6236  
## gear4        -0.4903     3.2872  -0.149   0.8834  
## gear5         3.0209     3.6283   0.833   0.4181  
## carb2        -0.1733     2.2343  -0.078   0.9392  
## carb3        -1.8749     3.7813  -0.496   0.6272  
## carb4        -1.0962     4.4207  -0.248   0.8075  
## carb6        -2.9741     6.6300  -0.449   0.6601  
## carb8        -2.7182     6.4682  -0.420   0.6803  
## ln_disp      -6.2489     6.2004  -1.008   0.3295  
## ln_hp        -5.7411     4.5838  -1.252   0.2296  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.724 on 15 degrees of freedom
## Multiple R-squared:  0.9012, Adjusted R-squared:  0.7958 
## F-statistic: 8.549 on 16 and 15 DF,  p-value: 7.252e-05

Note how the overall model is statistically significant with a p = 0.00007252 and an R-squared of 0.9011787 even though none of the individual coefficients are significant. That indicates I have too many variables in the model. Accordingly, I dropped the three least significant variables and tried again with the remaining six regressors.

fit2 <- lm(mpg ~ cyl + wt + qsec + am + ln_disp + ln_hp, data = mtcars)
summary(fit2)

## 
## Call:
## lm(formula = mpg ~ cyl + wt + qsec + am + ln_disp + ln_hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2535 -1.1738 -0.3765  0.5230  3.7947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  56.8444    21.2712   2.672   0.0133 *
## cyl6         -0.8034     1.8028  -0.446   0.6598  
## cyl8          1.5724     3.1459   0.500   0.6217  
## wt           -2.2185     1.1107  -1.997   0.0572 .
## qsec          0.3189     0.5607   0.569   0.5748  
## am1           1.8312     1.7256   1.061   0.2992  
## ln_disp      -2.3961     3.0153  -0.795   0.4346  
## ln_hp        -4.8959     2.3797  -2.057   0.0507 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.302 on 24 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.8541 
## F-statistic: 26.92 on 7 and 24 DF,  p-value: 6.901e-10

Now weight and the natural log of horsepower are nearly significant but four regressors aren’t, indicating we still have too many variables. An ANOVA comparison finds no significant difference between the first and second models, indicating that the simpler model fits the data just as well.

anova(fit1, fit2)

Accordingly, I dropped three more variables that had the largest p-values to get my third model.

fit3 <- lm(mpg~wt + am + ln_hp, data = mtcars)
summary(fit3)

## 
## Call:
## lm(formula = mpg ~ wt + am + ln_hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8148 -1.5495 -0.4402  1.1327  4.7638 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  58.9957     4.8970  12.047 1.36e-12 ***
## wt           -2.4681     0.8215  -3.005  0.00555 ** 
## am1           1.7565     1.1987   1.465  0.15397    
## ln_hp        -6.4888     1.3004  -4.990 2.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.294 on 28 degrees of freedom
## Multiple R-squared:  0.8692, Adjusted R-squared:  0.8552 
## F-statistic: 62.01 on 3 and 28 DF,  p-value: 1.747e-12

CI <- confint(fit3)

As far as keeping transmission type in my model, this is probably as far as I can go. The next step would be to drop am (transmission type) as it’s still not significant and then compare the models to see if that really made a difference to model fit. As it is, the model indicates that manual transmissions increased mpg by a mean of 1.7565027 miles per gallon (95% confidence interval -0.6989562 < x < 4.2119616) over automatic transmissions holding all other variables constant.

The new model fits the data quite well, with an R-squared value of 0.8691763 and a p-value of 0.000000000001747. Additionally, ANOVA comparisons find no significant difference between models 1, 2, and 3, indicating that the simpler model with only three regressors is just as good as the more complex models.

anova(fit2, fit3)

anova(fit1, fit3)

Diagnostic plots

Residual and diagnostic plots do not indicate any major problems with the final model. Cooks distance shows no outliers with an overly large influence on the final model.

fit3.resid <- resid(fit3)
plot(fit3.resid)

par(mfrow = c(2,2))
plot(fit3)

Conclusions

The regression model shows that vehicle weight is negatively correlated with fuel efficiency whereas quarter-mile time and manual transmissions are positively correlated. The final model is

Estimate mpg = 58.9956718 + -2.4681432weight + 1.7565027am + -6.4888385*ln_hp

Holding all other conditions constant, switching from an automatic transmission to a manual transmission adds an average of 1.7565027 mpg (95% confidence interval -0.6989562 to 4.2119616) to the fuel efficiency of any vehicle. This means manual transmissions are generally (but not always) better for fuel economy.

Letting the computer pick the model

This being R, there’s a fast and easy way to pick a linear model: Use stepwise regression. Just create an object containing a regression model and pass that object to the step function.

fit1 <- lm(mpg~., data = mtcars)
step(fit1, direction = c("both"), scope = list(lower = ~am))

## Start:  AIC=73.88
## mpg ~ cyl + drat + wt + qsec + vs + am + gear + carb + ln_disp + 
##     ln_hp
## 
##           Df Sum of Sq    RSS    AIC
## - carb     5    3.1740 114.45 64.781
## - drat     1    0.0275 111.31 71.889
## - vs       1    0.4574 111.73 72.013
## - wt       1    1.1502 112.43 72.210
## - qsec     1    1.4364 112.71 72.292
## - cyl      2   12.3515 123.63 73.250
## - gear     2   12.4040 123.68 73.263
## <none>                 111.28 73.881
## - ln_disp  1    7.5351 118.81 73.978
## - ln_hp    1   11.6373 122.92 75.064
## 
## Step:  AIC=64.78
## mpg ~ cyl + drat + wt + qsec + vs + am + gear + ln_disp + ln_hp
## 
##           Df Sum of Sq    RSS    AIC
## - drat     1     0.106 114.56 62.811
## - qsec     1     1.867 116.32 63.299
## - vs       1     2.586 117.04 63.496
## - gear     2    11.425 125.88 63.826
## - ln_disp  1     5.646 120.10 64.322
## <none>                 114.45 64.781
## - wt       1     8.165 122.62 64.986
## - cyl      2    18.004 132.46 65.456
## - ln_hp    1    34.357 148.81 71.182
## 
## Step:  AIC=62.81
## mpg ~ cyl + wt + qsec + vs + am + gear + ln_disp + ln_hp
## 
##           Df Sum of Sq    RSS    AIC
## - qsec     1     1.768 116.33 61.301
## - vs       1     2.550 117.11 61.515
## - gear     2    11.534 126.09 61.881
## - ln_disp  1     5.729 120.29 62.372
## <none>                 114.56 62.811
## - wt       1     8.352 122.91 63.063
## - cyl      2    18.660 133.22 63.640
## - ln_hp    1    34.385 148.94 69.210
## 
## Step:  AIC=61.3
## mpg ~ cyl + wt + vs + am + gear + ln_disp + ln_hp
## 
##           Df Sum of Sq    RSS    AIC
## - gear     2    10.423 126.75 60.047
## - vs       1     4.869 121.19 60.613
## - ln_disp  1     6.533 122.86 61.049
## - wt       1     6.603 122.93 61.068
## <none>                 116.33 61.301
## - cyl      2    18.115 134.44 61.932
## - ln_hp    1    42.934 159.26 69.353
## 
## Step:  AIC=60.05
## mpg ~ cyl + wt + vs + am + ln_disp + ln_hp
## 
##           Df Sum of Sq    RSS    AIC
## - vs       1     2.175 128.92 58.591
## - ln_disp  1     3.396 130.14 58.893
## - cyl      2    16.079 142.83 59.868
## <none>                 126.75 60.047
## - wt       1    20.604 147.35 62.867
## - ln_hp    1    34.518 161.27 65.754
## 
## Step:  AIC=58.59
## mpg ~ cyl + wt + am + ln_disp + ln_hp
## 
##           Df Sum of Sq    RSS    AIC
## - ln_disp  1     4.424 133.35 57.671
## - cyl      2    14.222 143.15 57.940
## <none>                 128.92 58.591
## - wt       1    20.349 149.27 61.281
## - ln_hp    1    33.500 162.42 63.983
## 
## Step:  AIC=57.67
## mpg ~ cyl + wt + am + ln_hp
## 
##         Df Sum of Sq    RSS    AIC
## - cyl    2    13.967 147.31 56.859
## <none>               133.35 57.671
## - wt     1    42.561 175.91 64.535
## - ln_hp  1    49.622 182.97 65.794
## 
## Step:  AIC=56.86
## mpg ~ wt + am + ln_hp
## 
##         Df Sum of Sq    RSS    AIC
## <none>               147.31 56.859
## - wt     1    47.496 194.81 63.801
## - ln_hp  1   131.006 278.32 75.217

## 
## Call:
## lm(formula = mpg ~ wt + am + ln_hp, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt          am1        ln_hp  
##      58.996       -2.468        1.757       -6.489

Note I specified “direction” as well as “scope.” “Both” gives R leeway to add and subtract various combinations of variables to find the optimal model. “Scope” forced R to keep the variable “am” in each model. Without direction, the default is for step to go backwards. If you don’t specify “scope,” R will add and subtract all variables equally.

Keeping “am” in the model yields the same model as the one I found above: MPG is negatively affected by both weight and horsepower and positively affected by manual transmissions.

mpg = 58.9956718 + -2.4681432wt + 1.7565027am + -6.4888385*ln(hp)

Now you just have to code the new model and run diagnotic plots on it as I did above. And there you have it. The same model as above but in two lines of code.