Many believe that a car’s horsepower (a measure of engine power, and a rough proxy for how fast it can go) directly impacts the car’s miles per gallon (fuel efficiency). However, I wasn’t sure what this relationship would actually look like, so I decided to investigate it myself with a dataset in R.
The dataset in question is the well-known mtcars data frame in the datasets R package. It contains a range of specifications for each car, including its fuel efficiency (mpg), horsepower (hp), and engine shape (vs). Only 32 cars are represented in this dataset, which isn’t a lot, but it at least gives us some data for looking at the real-world relationship between horsepower and miles per gallon.
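For a quick look at the data itself (a small aside, not part of the original write-up), the structure can be printed directly:
#### Preview Data ####
str(mtcars)    # 32 rows; columns include mpg, hp, and vs, all stored as numeric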
The two variables relevant to this question in the mtcars data are hp and mpg. I first visualized their relationship with a raw scatterplot.
#### Load Library ####
library(tidyverse)

#### Setup Plot Theme ####
theme_set(
  theme_bw()
)

#### Save Plot Object ####
raw <- mtcars %>%
  ggplot(
    aes(
      x = hp,
      y = mpg
    )
  ) +
  geom_point(
    size = 5,
    alpha = .5
  ) +
  labs(
    x = "HP",
    y = "MPG",
    title = "Association between HP and MPG"
  )

#### Print Plot ####
raw
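As a quick numeric aside (not part of the original analysis), the raw correlation between the two variables gives the direction and rough strength of the association:
#### Optional: Quick Correlation Check ####
cor(mtcars$hp, mtcars$mpg)    # about -0.78: a fairly strong negative association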
The relationship appeared to be negative but possibly nonlinear (MPG stops changing much as HP approaches its maximum). I first fit a standard ordinary least squares (OLS) regression model and assessed how well it performed.
To fit the model, I used the lm() function. After that,
I ran residual plots using ggfortify’s
autoplot() function on the saved object.
#### Load Library ####
library(ggfortify)

#### Fit Model ####
fit.1 <- lm(
  formula = mpg ~ hp,
  data = mtcars
)

#### Plot Residuals ####
autoplot(fit.1)
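Alongside the residual plots, a RESET test offers a more formal check for missed curvature. This is an optional aside, assuming the lmtest package is installed, and was not part of the original analysis:
#### Optional: RESET Test for Nonlinearity ####
library(lmtest)
resettest(fit.1)    # a small p-value indicates the linear specification misses curvature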
The first set of residual plots makes it clear that there are issues with nonlinearity, which confirms my earlier suspicion. Therefore, I refit the model with a quadratic polynomial and checked whether it provided a better fit.
#### Fit Model Again ####
fit.2 <- lm(
  formula = mpg ~ poly(hp, 2),
  data = mtcars
)

#### Plot Residuals Again ####
autoplot(fit.2)
Though the residuals are not perfect, they at least suggest a much better fit than the original model. To visualize the model, I took the scatterplot from before and added a regression line with geom_smooth().
raw +
  geom_smooth(
    color = "hotpink",
    method = "lm",
    formula = y ~ poly(x, 2)
  )
We can see from the plot that, as is typical of polynomials, the fit behaves oddly at the edge of the data (the fitted curve bends upward slightly at the maximum values of HP), so this fit isn’t perfect. However, the model remained easy to interpret, so I left it alone.
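To back up the visual comparison with numbers (an aside not in the original write-up), the two fits can also be compared directly:
#### Optional: Compare the Two Fits ####
anova(fit.1, fit.2)    # F-test: does the quadratic term significantly improve the fit?
AIC(fit.1, fit.2)      # the lower AIC value marks the better-fitting model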
To investigate the contribution of each term, I inspected the model summary.
summary(fit.2)
##
## Call:
## lm(formula = mpg ~ poly(hp, 2), data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5512 -1.6027 -0.6977  1.5509  8.7213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    20.091      0.544  36.931  < 2e-16 ***
## poly(hp, 2)1  -26.046      3.077  -8.464 2.51e-09 ***
## poly(hp, 2)2   13.155      3.077   4.275 0.000189 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.077 on 29 degrees of freedom
## Multiple R-squared: 0.7561, Adjusted R-squared: 0.7393
## F-statistic: 44.95 on 2 and 29 DF, p-value: 1.301e-09
Here we see that the linear component of horsepower is negative and the quadratic component is positive (poly() fits orthogonal polynomials by default, so these coefficients describe the direction and strength of each component rather than raw polynomial terms). This matches the scatterplot, where the association starts out negative and roughly linear and then curves upward. The \(R^2\) for the model is about \(.74\) (adjusted), indicating that the model accounts for much of the variation in the response; this is reflected in coefficient estimates that are large relative to their standard errors.
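For anyone who prefers pulling these fit statistics out programmatically, they are available in tidy form via the broom package (assumed installed here; it ships with the tidyverse installation):
#### Optional: Tidy Fit Statistics ####
library(broom)
glance(fit.2)    # one-row tibble with r.squared, adj.r.squared, sigma, AIC, etc.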
Though the effect looked clearly nonlinear, I was curious why. One possibility is that some categorical variable is influencing the relationship, so I looked at another factor in this association: the shape of the engine used in the car. Straight-engine cars typically burn less fuel, so I colored the earlier scatterplot by engine type to see whether this was shaping the relationship.
#### Save Plot Object ####
raw.color <- mtcars %>%
  ggplot(
    aes(
      x = hp,
      y = mpg,
      color = factor(vs)
    )
  ) +
  geom_point(
    size = 5,
    alpha = .5
  ) +
  scale_color_manual(
    values = c("darkred", "darkblue")
  ) +
  labs(
    x = "HP",
    y = "MPG",
    title = "Association between HP and MPG",
    color = "Engine Type (V-Shaped = 0, Straight = 1)"
  ) +
  theme(legend.position = "bottom")

#### Print Plot ####
raw.color
Now it seemed clear that the relationship depended on the engine type. To see the association for straight-engine cars more clearly, I filtered the scatterplot to include only the cars with that engine shape.
mtcars %>%
  filter(vs == 1) %>% # added this
  ggplot(
    aes(
      x = hp,
      y = mpg # and removed the color
    )
  ) +
  geom_point()
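As a small numeric aside (not in the original write-up), the hp–mpg correlation can also be computed separately for each engine type:
#### Optional: Correlation by Engine Type ####
mtcars %>%
  group_by(vs) %>%
  summarise(correlation = cor(hp, mpg))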
The association is much more linear when we look only at the cars with straight engines. Therefore, I modeled the interaction between engine type and horsepower, since the two seemed to jointly influence the variation in MPG.
To fit this model, I used the following syntax:
#### Fit Model ####
fit.final <- lm(
  formula = mpg ~ hp * factor(vs),
  data = mtcars
)

#### Summarise Model ####
summary(fit.final)
##
## Call:
## lm(formula = mpg ~ hp * factor(vs), data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5821 -1.7710 -0.3612  1.5969  9.2646 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    24.49637    2.73893   8.944 1.07e-09 ***
## hp             -0.04153    0.01379  -3.011  0.00547 ** 
## factor(vs)1    14.50418    4.58160   3.166  0.00371 ** 
## hp:factor(vs)1 -0.11657    0.04130  -2.822  0.00868 ** 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.428 on 28 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.6764
## F-statistic: 22.6 on 3 and 28 DF, p-value: 1.227e-07
The model tells us the following (in order of the coefficients): the intercept (about 24.50) is the expected MPG for a V-shaped engine car (vs = 0) with zero horsepower; the hp coefficient (about -0.04) is the expected drop in MPG per unit of horsepower for V-shaped engine cars; the factor(vs)1 coefficient (about 14.50) is the upward shift in the intercept for straight-engine cars; and the interaction term (about -0.12) is the additional drop in MPG per unit of horsepower for straight-engine cars compared to V-shaped ones.
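To make those interpretations concrete, here is a small sketch (not part of the original analysis) that converts the coefficients into the implied intercept and slope for each engine type:
#### Optional: Implied Line for Each Engine Type ####
b <- coef(fit.final)
# V-shaped engines (vs = 0): baseline intercept and slope
c(intercept = unname(b["(Intercept)"]), slope = unname(b["hp"]))
# Straight engines (vs = 1): shift both by the vs main effect and the interaction
c(intercept = unname(b["(Intercept)"] + b["factor(vs)1"]),
  slope     = unname(b["hp"] + b["hp:factor(vs)1"]))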
Judging by its adjusted \(R^2\) (\(R^2 = .68\)), this model explained somewhat less variation in the response than the polynomial fit. To get a sense of the range of plausible intercept and slope values, I also inspected the confidence intervals (CIs):
confint(fit.final)
##                       2.5 %      97.5 %
## (Intercept)    18.88592590 30.10681342
## hp             -0.06978835 -0.01327733
## factor(vs)1     5.11919236 23.88916067
## hp:factor(vs)1 -0.20117017 -0.03196062
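One caveat worth noting: the hp:factor(vs)1 row is the CI for the difference in slopes, not for the straight-engine slope itself. To get the latter directly, one option (a sketch, not part of the original analysis) is to refit with the factor releveled:
#### Optional: CI for the Straight-Engine Slope ####
fit.relevel <- lm(
  formula = mpg ~ hp * relevel(factor(vs), ref = "1"),
  data = mtcars
)
confint(fit.relevel)["hp", ]    # slope CI with straight engines (vs = 1) as the baseline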
The CI for the slope of the V-shaped engine cars appears to be much narrower than the one for the straight-engine cars. This is probably related to how tightly the points for each engine type cluster in the previous scatterplot, something the fitted-line plot later on makes easier to see. Before making that plot, I first assessed the model fit.
autoplot(fit.final)
The model residuals are not perfect, and some additional smoothing (e.g. LOESS or penalized splines) could be applied, but for now this model is directly interpretable and fits reasonably well. I visualized the model by taking the raw.color plot saved earlier and adding another geom_smooth().
raw.color +
  geom_smooth(
    method = "lm"
  )
Now we clearly see the difference in intercepts/slopes for each engine type.
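As a footnote on the smoothing alternatives mentioned earlier, swapping the method argument produces a LOESS version of the same plot (just a sketch for comparison, not a model I interpret further):
#### Optional: LOESS Smooth for Comparison ####
raw.color +
  geom_smooth(
    method = "loess"
  )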
So what does this tell us? At the very least, the general association between horsepower and miles per gallon is negative but nonlinear: as cars are given more power, they consume more fuel, though this effect levels off at higher horsepower.
That said, there is an important moderating effect of engine type. Cars with straight engines have higher fuel efficiency on average (shown by the blue line sitting above the red line), but horsepower takes a bigger toll on them, as the blue line’s sharper linear decrease in MPG shows.
Should we always buy cars with straight engines, then? There is an important contextual element to these data: they come from a 1974 issue of Motor Trend magazine, so the cars are quite old by modern standards, and fuel efficiency has likely improved considerably since the 1970s. The dataset is also small, so the limited sample doesn’t let us extrapolate much to the wider population of cars. A separate analysis on more modern and more robust data would be needed to say anything with confidence.