Introduction

This report examines the relationship between fuel consumption in miles per gallon (mpg) and transmission type (am) in the mtcars data set using linear regression. The following two questions are addressed:

  1. “Is an automatic or manual transmission better for MPG?”

  2. “Quantify the MPG difference between automatic and manual transmissions”

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). Source: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391-411.

Exploratory data analysis

Let us first load the data into R and perform a preliminary exploration.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Let us also load the help for this data set to see the actual meaning of the variables. Typing ?mtcars we can see the following:

Format

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors

Since we are interested in the relationship between the consumption mpg and the transmission (automatic vs manual), am, let us transform the variable am into a factor and relabel the levels accordingly.

mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual")

The number of cars at each of these two levels is the following:

nrow(mtcars[mtcars$am == "automatic",])
## [1] 19
nrow(mtcars[mtcars$am == "manual",])
## [1] 13
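The same counts can also be obtained with a single call to table(). The sketch below is only an illustration: it works on a copy of the data (the name cars is introduced here and is not part of the report) so that the mtcars object used in the rest of the analysis is left untouched.

```r
# Work on a copy so the report's mtcars object is not modified.
cars <- datasets::mtcars

# Convert am to a factor and label the levels in one step
# (0 = automatic, 1 = manual, as documented in ?mtcars).
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Tabulate the number of cars per transmission type.
counts <- table(cars$am)
print(counts)
```

This should reproduce the 19 automatic and 13 manual cars reported above.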

Let us load the package ggplot2 to perform some data visualisation beforehand.

library(ggplot2)

Since we are interested in the formula mpg ~ am let us first draw a graph showing this relationship.

ggplot(data = mtcars, aes(x = am, y = mpg, colour = am)) + 
    xlab("Transmission") + ylab("Consumption (mpg)") +
    labs(colour = "Transmission") +
    ggtitle("Consumption by Transmission") +
    geom_boxplot() +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))

The graph suggests a marked difference in consumption depending on the type of transmission, which needs to be analysed further.
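A numeric companion to the boxplot can be useful, since the box and whiskers are drawn from the quartiles of each group. The following is a minimal sketch, assuming a copy of the data named cars (a name introduced here for illustration).

```r
# Work on a copy so the report's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Minimum, quartiles, mean and maximum of mpg for each transmission type.
mpg_summary <- tapply(cars$mpg, cars$am, summary)
print(mpg_summary)

# Group medians, i.e. the thick horizontal lines in the boxplot.
medians <- tapply(cars$mpg, cars$am, median)
print(medians)
```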

Statistical Inference

First of all, let us test whether the difference in mpg between automatic and manual transmission is statistically significant. Since the sample is relatively small, let us use the t-test (by default R applies Welch's version, which does not assume equal variances).

test <- t.test(mpg ~ am, mtcars)
test
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group automatic and group manual is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual 
##                17.14737                24.39231

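The t-test gives a p-value of 0.0014, so at a significance level of 0.05 we can reject the null hypothesis and infer that the mean mpg differs between automatic and manual cars. The 95% confidence interval for the difference in means (automatic minus manual) is (-11.28, -3.21) mpg.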
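The Welch statistic reported above can be reproduced by hand, which makes clear what the test measures: the difference in group means divided by its standard error, \(t = (\bar{x}_a - \bar{x}_m) / \sqrt{s_a^2/n_a + s_m^2/n_m}\), with degrees of freedom given by the Welch-Satterthwaite approximation. The sketch below is illustrative only; the variable names are introduced here and it uses a copy of the data with am left as 0/1.

```r
# Work on a copy so the report's mtcars object is not modified.
cars <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]  # automatic
manual <- cars$mpg[cars$am == 1]  # manual

# Variance of each group mean.
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over its standard error.
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```

The three values should agree with the t, df and p-value printed by t.test().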

Linear Models

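Let us now try to fit a linear model for the variable mpg. We shall consider the following sequence of increasingly rich models. Note that lm_4 enters cyl as a factor whereas lm_all enters it as a numeric variable, so the last step is not strictly nested: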

lm_1 <- lm(mpg ~ am, mtcars)
lm_2 <- lm(mpg ~ am + wt, mtcars)
lm_3 <- lm(mpg ~ am + wt + hp, mtcars)
lm_4 <- lm(mpg ~ am + wt + hp + factor(cyl), mtcars)
lm_all <- lm(mpg ~ ., mtcars)

In the simplest model, with outcome mpg and am as the only regressor, we have

summary(lm_1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

which shows an average mpg increase of 7.245 for manual cars relative to automatic cars.

Let us run an ANOVA on these models to see which variables contribute most significantly to explaining the variation in mpg.
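In a regression on a single two-level factor, the intercept is the mean of the reference group and the slope is the difference between the two group means. This can be checked directly; the sketch below uses a copy of the data, and the names cars, fit and diff_means are introduced here for illustration.

```r
# Work on a copy so the report's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Mean mpg per transmission type and their difference.
group_means <- tapply(cars$mpg, cars$am, mean)
diff_means  <- group_means[["manual"]] - group_means[["automatic"]]

# The same quantities read off the fitted model.
fit <- lm(mpg ~ am, data = cars)
print(group_means)
print(coef(fit))
```

The ammanual coefficient should equal diff_means, and the intercept should equal the mean mpg of the automatic cars, matching the sample estimates of the t-test.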

anova(lm_1, lm_2, lm_3, lm_4, lm_all)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
## Model 4: mpg ~ am + wt + hp + factor(cyl)
## Model 5: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 63.0133 9.325e-08 ***
## 3     28 180.29  1     98.03 13.9571  0.001219 ** 
## 4     26 151.03  2     29.27  2.0834  0.149491    
## 5     21 147.49  5      3.53  0.1006  0.990931    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the test it appears that the best model is the third, mpg ~ am + wt + hp: adding wt and then hp each gives a significant reduction in the residual sum of squares, whereas adding factor(cyl) (p = 0.15) and then the remaining variables (p = 0.99) does not. With the model lm_all, which uses all variables as regressors, we would inflate the variance of the estimates, whereas the first model, mpg ~ am, with an \(R^2\) coefficient of 0.36, would explain only around 36% of the variation in mpg. Instead, as shown below,

summary(lm_3)
## 
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## ammanual     2.083710   1.376420   1.514 0.141268    
## wt          -2.878575   0.904971  -3.181 0.003574 ** 
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

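the model mpg ~ am + wt + hp has an \(R^2\) coefficient of 0.84 and therefore explains around 84% of the variation in mpg. In this model the coefficient of ammanual is 2.084 with a p-value of 0.141, so it is no longer significant at the 0.05 level.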
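A confidence interval makes the uncertainty in the transmission coefficient of this model explicit. The following is a minimal sketch, assuming a copy of the data named cars and a refit named fit3 (both names are introduced here for illustration).

```r
# Work on a copy so the report's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Refit the third model and extract the 95% interval for the manual effect.
fit3  <- lm(mpg ~ am + wt + hp, data = cars)
ci_am <- confint(fit3, "ammanual", level = 0.95)
print(ci_am)
```

Because the p-value of ammanual in the summary is above 0.05, the interval is expected to contain zero.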

Conclusions

Is an automatic or manual transmission better for MPG?

Taken on their own, manual transmission cars appear to be better for MPG than automatic cars. However, once confounding variables such as horsepower and weight are included in the model, the difference is much smaller and no longer statistically significant at the 0.05 level (p = 0.14): a large part of the difference is explained by the other variables.
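One way to see why weight and horsepower act as confounders is to compare them across transmission types and to check how strongly weight is related to mpg. The sketch below is illustrative only and uses a copy of the data named cars (a name introduced here).

```r
# Work on a copy so the report's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Average weight (1000 lbs) and horsepower per transmission type.
wt_means <- tapply(cars$wt, cars$am, mean)
hp_means <- tapply(cars$hp, cars$am, mean)
print(rbind(wt = wt_means, hp = hp_means))

# Correlation between weight and mpg over all 32 cars.
cor_mpg_wt <- cor(cars$mpg, cars$wt)
print(cor_mpg_wt)
```

If the automatic cars are on average heavier, and heavier cars have lower mpg, then part of the raw difference between transmissions reflects weight rather than the transmission itself.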

Quantify the MPG difference between automatic and manual transmissions

The analysis shows that when transmission is the only regressor, manual cars have an average mpg increase of 7.245 over automatic cars. However, when the variables wt and hp are included, the manual car advantage drops to 2.084 mpg, with other variables contributing to the effect, sometimes more strongly (e.g. weight).

Appendix

Histogram of MPG in automatic and manual cars

ggplot(data = mtcars, aes(x = mpg, colour = am)) +
    geom_histogram(fill = "white", bins = 10) +
    labs(colour = "Transmission") +
    facet_grid(. ~ am) +
    ggtitle("Comparison between MPG in automatic and manual cars") +
    xlab("Consumption (mpg)") +
    ylab("Frequency") +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Plot of correlation for the variables considered.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(data = mtcars, aes(colour = factor(am)), columns = c(1, 2, 4, 6, 9))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.