The following report examines the relationship between transmission
type and fuel consumption in miles per gallon (mpg) in the data set
mtcars using linear regression. The following two questions
are addressed and answered:
“Is an automatic or manual transmission better for MPG?”
“Quantify the MPG difference between automatic and manual transmissions”
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). Source: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391-411.
Let us first load the data into R and perform a preliminary exploration.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Let us also load the help for this data set to see the actual meaning
of the variables. Typing ?mtcars we can see the
following:
Format
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
Since we are interested in the relationship between the consumption
mpg and the transmission (automatic vs manual),
am, let us transform the variable am into a
factor and relabel the levels accordingly.
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual")
The number of cars at each of these two levels is the following:
nrow(mtcars[mtcars$am == "automatic",])
## [1] 19
nrow(mtcars[mtcars$am == "manual",])
## [1] 13
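The same counts can be obtained in one call with table(). The sketch below is only an illustration: it works on a copy of the data under the hypothetical name mt, so that the mtcars object used in the rest of the report is left untouched.

```r
# Work on a copy so the report's own mtcars object is not modified
mt <- datasets::mtcars

# Convert am to a labelled factor in a single step (0 = automatic, 1 = manual)
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Count the cars in each transmission group
counts <- table(mt$am)
counts
```

Setting levels and labels together in factor() avoids relying on the implicit ordering of the levels.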
Let us load the package ggplot2 to perform some data
visualisation beforehand.
library(ggplot2)
Since we are interested in the formula mpg ~ am let us
first draw a graph showing this relationship.
ggplot(data = mtcars, aes(x = am, y = mpg, colour = am)) +
xlab("Transmission") + ylab("Consumption (mpg)") +
labs(colour = "Transmission") +
ggtitle("Consumption by Transmission") +
geom_boxplot() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
The graph suggests a substantial difference in the consumption depending on the type of transmission, which needs to be analysed further.
First of all, let us test whether the difference in the
mpg between automatic and manual transmission is
statistically significant. Since the sample is relatively small, let us use
the t-test.
test <- t.test(mpg ~ am, mtcars)
test
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group automatic and group manual is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group automatic mean in group manual
## 17.14737 24.39231
The t-test shows a p-value of 0.0014, so at a significance level of 0.05 we can reject the null hypothesis and infer that the mean mpg differs between automatic and manual cars.
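To make the Welch test less of a black box, the statistic, the degrees of freedom and the p-value can be recomputed by hand. This is a minimal sketch on a copy of the raw data; the names mt, auto, manual, t_stat, df_welch and p_value are illustrative only and are not part of the original analysis.

```r
# Copy of the raw data; am is still coded 0 = automatic, 1 = manual here
mt <- datasets::mtcars
auto   <- mt$mpg[mt$am == 0]
manual <- mt$mpg[mt$am == 1]

# Squared standard error of each group mean
se2_auto   <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

These values should agree with the t, df and p-value lines printed by t.test().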
Let us now try to fit a linear model for the variable
mpg. We shall consider the following nested models:
lm_1 <- lm(mpg ~ am, mtcars)
lm_2 <- lm(mpg ~ am + wt, mtcars)
lm_3 <- lm(mpg ~ am + wt + hp, mtcars)
lm_4 <- lm(mpg ~ am + wt + hp + factor(cyl), mtcars)
lm_all <- lm(mpg ~ ., mtcars)
In the simplest model, with outcome mpg and only
regressor am, we have
summary(lm_1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
which shows an mpg increase of 7.245 for manual cars relative to automatic cars.
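With a single two-level factor as regressor, the intercept is the mean of the reference group (automatic) and the ammanual coefficient is the difference between the two group means. The sketch below checks this on a copy of the data; mt, fit, group_means and mean_diff are hypothetical names used only for illustration.

```r
# Copy of the data with am as a labelled factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Mean mpg in each transmission group
group_means <- tapply(mt$mpg, mt$am, mean)

# Simple regression of mpg on transmission
fit <- lm(mpg ~ am, data = mt)

# The slope equals the difference in group means
mean_diff <- group_means[["manual"]] - group_means[["automatic"]]
coef(fit)
mean_diff
```

This is why the 7.245 estimate coincides with the gap between the two sample means reported by the t-test.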
Let us run an ANOVA on these nested models to see which
variables contribute most significantly to the variation of
mpg.
anova(lm_1, lm_2, lm_3, lm_4, lm_all)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
## Model 4: mpg ~ am + wt + hp + factor(cyl)
## Model 5: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 63.0133 9.325e-08 ***
## 3 28 180.29 1 98.03 13.9571 0.001219 **
## 4 26 151.03 2 29.27 2.0834 0.149491
## 5 21 147.49 5 3.53 0.1006 0.990931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
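The F column of this table can be reproduced by hand. When several models are compared at once, each F value is the drop in residual sum of squares per degree of freedom, divided by the residual mean square of the largest model. The sketch below refits the five models on a copy of the data; mt, fits, rss, res_df and f_stat are illustrative names, not part of the original analysis.

```r
# Copy of the data with am as a labelled factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

# The same sequence of models as in the report
fits <- list(
  lm(mpg ~ am, data = mt),
  lm(mpg ~ am + wt, data = mt),
  lm(mpg ~ am + wt + hp, data = mt),
  lm(mpg ~ am + wt + hp + factor(cyl), data = mt),
  lm(mpg ~ ., data = mt)
)

# Residual sum of squares and residual degrees of freedom of each model
rss    <- sapply(fits, deviance)
res_df <- sapply(fits, df.residual)

# Reduction in RSS and degrees of freedom used at each step
sum_sq  <- -diff(rss)
df_used <- -diff(res_df)

# Residual mean square of the largest model (smallest residual df)
big   <- which.min(res_df)
scale <- rss[big] / res_df[big]

# F statistics and p-values for each successive comparison
f_stat <- (sum_sq / df_used) / scale
p_val  <- pf(f_stat, df_used, res_df[big], lower.tail = FALSE)

round(cbind(df_used, sum_sq, f_stat, p_val), 4)
```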
From the test it appears that the best model is the third,
mpg ~ am + wt + hp: adding factor(cyl) (p = 0.149) or all
the remaining variables (p = 0.991) does not significantly reduce the
residual sum of squares. In fact, with the model
lm_all, where all variables are considered as regressors, we
would inflate the model’s variance, whereas the first model,
mpg ~ am, with an \(R^2\)
coefficient of 0.36, would only explain around 36% of the variation
in mpg. Instead, as shown below,
summary(lm_3)
##
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4221 -1.7924 -0.3788 1.2249 5.5317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.002875 2.642659 12.867 2.82e-13 ***
## ammanual 2.083710 1.376420 1.514 0.141268
## wt -2.878575 0.904971 -3.181 0.003574 **
## hp -0.037479 0.009605 -3.902 0.000546 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
## F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11
the model mpg ~ am + wt + hp has an \(R^2\) coefficient of 0.84 and therefore
explains around 84% of the variation in mpg.
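The uncertainty around the ammanual estimate in this model can be summarised with a 95% confidence interval, computed as the estimate plus or minus the t quantile times its standard error. This is a minimal sketch on a copy of the data; mt, fit, est, half_width and ci_manual are hypothetical names introduced for illustration.

```r
# Copy of the data with am as a labelled factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

# The third model of the report
fit <- lm(mpg ~ am + wt + hp, data = mt)

# Estimate and standard error of the transmission coefficient
est <- coef(summary(fit))["ammanual", ]

# 95% interval: estimate +/- t quantile (residual df) * standard error
half_width <- qt(0.975, df.residual(fit)) * est[["Std. Error"]]
ci_manual  <- est[["Estimate"]] + c(-1, 1) * half_width
ci_manual

# The built-in helper should give the same interval
confint(fit, "ammanual")
```

Because the p-value of ammanual is above 0.05, this interval contains zero, which is consistent with the coefficient not being significant once wt and hp are in the model.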
Is an automatic or manual transmission better for MPG?
It appears that manual transmission cars are better for MPG compared to automatic cars. However, when modelled together with confounding variables like HP and weight, the difference is much smaller than it seems at first and is no longer statistically significant at the 0.05 level (p = 0.141): a large part of the difference is explained by the other variables.
Quantify the MPG difference between automatic and manual transmissions
The analysis shows that when only transmission is used in the model, manual cars have an mpg increase of 7.245 over automatic cars. However, when the variables wt and hp are included, the manual car advantage drops to 2.084, with the other variables (weight in particular) accounting for much of the effect.
Histogram of MPG in automatic and manual cars
ggplot(data = mtcars, aes(x = mpg, colour = am)) +
geom_histogram(fill = "white", bins = 10) +
labs(colour = "Transmission") +
facet_grid(. ~ am) +
ggtitle("Comparison between MPG in automatic and manual cars") +
xlab("Consumption (mpg)") +
ylab("Frequency") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
Plot of correlation for the variables considered.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(data = mtcars, aes(colour = factor(am)), columns = c(1, 2, 4, 6, 9))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.