On average, manual transmission vehicles in the dataset travel about 7.2 more miles per gallon than automatic ones (24.4 vs. 17.1 MPG, roughly 42% more). However, the type of transmission alone explains only 36% of the variance of the response variable. To better explain and predict the variation in performance, other variables such as horsepower, number of cylinders, vehicle weight or engine displacement must be taken into account. The best model found explains about 78% of the MPG variability (adjusted \(R^2\) of 0.77) through the type of transmission and horsepower.
We analyze the “mtcars” example dataset that ships with R to find out how performance in miles per gallon (MPG) depends on the type of transmission of the vehicle. The two questions to which we will respond are:
For this, we need to load the libraries and the dataset. We also convert the type of transmission and the number of cylinders to factors, as this makes them easier to work with. No further cleaning is necessary since the dataset is already very clean.
library(tidyverse)
library(PerformanceAnalytics)
library(formattable)
cars <- as_tibble(mtcars)
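# Convert transmission and cylinders to factors; levels follow order of appearance
cars <- cars %>%
  mutate(fct_am = factor(am, levels = unique(am)),
         fct_cyl = factor(cyl, levels = unique(cyl)),
         # am is coded 0 = automatic, 1 = manual
         fct_am = fct_recode(fct_am,
                             "auto" = "0",
                             "manual" = "1"))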
Let's see how MPG differs by type of transmission by computing the mean for each of the two groups. Manual transmission vehicles average 24.4 mpg, while automatic transmission vehicles average only 17.1 mpg.
means_mpg <- cars %>%
  group_by(fct_am) %>%
  summarise(
    mean_mpg = mean(mpg)
  )
formattable(means_mpg)
fct_am | mean_mpg |
---|---|
manual | 24.39231 |
auto | 17.14737 |
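To put the gap between the two means in relative terms, the following base R sketch recomputes them directly from `mtcars`. The names `gap_abs`, `gap_vs_auto` and `gap_vs_manual` are illustrative and not part of the original analysis.

```r
# Mean MPG per transmission type; am is coded 0 = automatic, 1 = manual
means <- tapply(mtcars$mpg, mtcars$am, mean)

gap_abs <- means[["1"]] - means[["0"]]    # absolute gap in mpg
gap_vs_auto <- gap_abs / means[["0"]]     # manual relative to automatic
gap_vs_manual <- gap_abs / means[["1"]]   # automatic relative to manual

round(c(gap_abs = gap_abs,
        pct_more_manual = 100 * gap_vs_auto,
        pct_less_auto = 100 * gap_vs_manual), 1)
```

The gap is about 7.2 mpg: manual cars travel roughly 42% more miles per gallon than automatic ones or, equivalently, automatic cars travel about 30% fewer than manual ones.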
We fit a linear model with MPG as the response and the type of transmission (am) as the regressor, and obtain its \(R^2\):
modelo_mpg_am <- lm(mpg~fct_am, data = cars)
summary(modelo_mpg_am)$r.squared
## [1] 0.3597989
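With a single binary regressor, the fit has a direct reading: the slope equals the difference between the two group means, and \(R^2\) is the square of the Pearson correlation between mpg and am. A minimal base R check follows. It assumes the numeric `am` column of `mtcars` rather than the `fct_am` factor, so here the intercept is the automatic mean, and `fit_am` is an illustrative name.

```r
fit_am <- lm(mpg ~ am, data = mtcars)

# Intercept = mean MPG of automatic cars (am = 0);
# slope = mean(manual) - mean(automatic)
coef(fit_am)

# R^2 of a one-regressor model equals the squared correlation
r2 <- summary(fit_am)$r.squared
all.equal(r2, cor(mtcars$mpg, mtcars$am)^2)
```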
The result is that the type of transmission explains only 36% of the MPG variability. Therefore, we must look for other variables that explain a higher percentage of it. To do this, we plot other variables that, intuitively, we suspect may influence fuel consumption. These are hp, cyl, wt and disp (the plot shows the first three):
ggplot(cars, aes(fct_am, mpg, color = fct_am, size = wt, alpha = hp, shape = fct_cyl)) +
  geom_point(position = "jitter") +
  labs(y = "Miles per gallon",
       x = "Type of transmission",
       title = "Possible regressors of MPG",
       color = "Type of transmission",
       size = "Weight",
       alpha = "Horsepower",
       shape = "Cylinders") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "right")
The plot supports the hypothesis that MPG is inversely related to wt, hp and cyl.
Let’s analyze the correlation between MPG and the rest of the variables:
formattable(cor(mtcars))
## mpg cyl disp hp drat wt qsec vs
## mpg 1 -0.8522 -0.8476 -0.7762 0.6812 -0.8677 0.4187 0.664
## cyl -0.8522 1 0.902 0.8324 -0.6999 0.7825 -0.5912 -0.8108
## disp -0.8476 0.902 1 0.7909 -0.7102 0.888 -0.4337 -0.7104
## hp -0.7762 0.8324 0.7909 1 -0.4488 0.6587 -0.7082 -0.7231
## drat 0.6812 -0.6999 -0.7102 -0.4488 1 -0.7124 0.0912 0.4403
## wt -0.8677 0.7825 0.888 0.6587 -0.7124 1 -0.1747 -0.5549
## qsec 0.4187 -0.5912 -0.4337 -0.7082 0.0912 -0.1747 1 0.7445
## vs 0.664 -0.8108 -0.7104 -0.7231 0.4403 -0.5549 0.7445 1
## am 0.5998 -0.5226 -0.5912 -0.2432 0.7127 -0.6925 -0.2299 0.1683
## gear 0.4803 -0.4927 -0.5556 -0.1257 0.6996 -0.5833 -0.2127 0.206
## carb -0.5509 0.527 0.395 0.7498 -0.09079 0.4276 -0.6562 -0.5696
## am gear carb
## mpg 0.5998 0.4803 -0.5509
## cyl -0.5226 -0.4927 0.527
## disp -0.5912 -0.5556 0.395
## hp -0.2432 -0.1257 0.7498
## drat 0.7127 0.6996 -0.09079
## wt -0.6925 -0.5833 0.4276
## qsec -0.2299 -0.2127 -0.6562
## vs 0.1683 0.206 -0.5696
## am 1 0.7941 0.05753
## gear 0.7941 1 0.2741
## carb 0.05753 0.2741 1
We look at the first row and select the variables that have the highest correlation with the response variable, together with am:
tabla_correlacion <- cars %>%
  select(mpg, am, disp, cyl, hp, wt)
chart.Correlation(tabla_correlacion, histogram=FALSE, pch=19)
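The choice of the strongest correlates can also be made programmatically instead of reading the matrix by eye. This is a small base R sketch; `cor_mpg` and `ranked` are illustrative names.

```r
# Correlation of every other variable with mpg
cor_mpg <- cor(mtcars)["mpg", ]
cor_mpg <- cor_mpg[names(cor_mpg) != "mpg"]

# Sort by absolute value, strongest first
ranked <- cor_mpg[order(abs(cor_mpg), decreasing = TRUE)]
round(head(ranked, 4), 2)
```

The four strongest correlates are wt, cyl, disp and hp, all with negative coefficients.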
disp, cyl, hp and wt are highly correlated with each other (collinearity exists) and have similar correlation coefficients with mpg. Therefore, we should select only one of them, ideally one that is not also highly correlated with am (the regressor whose effect we want to isolate).
The best candidate in this sense is hp, whose correlation coefficient with am is -0.24 (and not statistically significant, since p > 0.05).
We can therefore try a model that explains MPG through the type of transmission and hp:
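The comparison of the candidates against am, and the significance of the hp correlation, can be checked with a short base R sketch. The Pearson test via `cor.test()` is an assumption here, since the text does not say which test produced the p-value.

```r
candidates <- c("disp", "cyl", "hp", "wt")

# Correlation of each candidate regressor with am
cor_am <- sapply(mtcars[candidates], cor, y = mtcars$am)
round(cor_am, 2)

# Pearson test of the hp-am correlation (assumed test, see note above)
test_hp_am <- cor.test(mtcars$hp, mtcars$am)
test_hp_am$p.value
```

Of the four candidates, hp has the smallest absolute correlation with am, and its p-value is above 0.05.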
fit_am_hp <- lm(mpg ~ fct_am + hp, data = cars)
summary(fit_am_hp)
##
## Call:
## lm(formula = mpg ~ fct_am + hp, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3843 -2.2642 0.1366 1.6968 5.8657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.861999 1.282279 24.848 < 2e-16 ***
## fct_amauto -5.277085 1.079541 -4.888 3.46e-05 ***
## hp -0.058888 0.007857 -7.495 2.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.909 on 29 degrees of freedom
## Multiple R-squared: 0.782, Adjusted R-squared: 0.767
## F-statistic: 52.02 on 2 and 29 DF, p-value: 2.55e-10
According to this model, a manual car would travel 31.9 miles per gallon at a hypothetical 0 hp (the intercept), while an automatic one would reach about 26.6. Only the intercept changes between transmission types, so at any given horsepower an automatic car is expected to travel about 5.3 mpg less than a manual one. From these figures we subtract about 0.059 mpg for each additional unit of horsepower.
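Because the intercept refers to a car with 0 hp, the coefficients are easier to read through predictions at a realistic horsepower. The sketch below refits the same model in base R. The data frame `cars_df`, the object `new_cars` and the value of 110 hp are illustrative choices, not part of the original analysis.

```r
# Rebuild the factor as in the analysis: "manual" is the reference level
cars_df <- transform(
  mtcars,
  fct_am = factor(am, levels = c(1, 0), labels = c("manual", "auto"))
)
fit <- lm(mpg ~ fct_am + hp, data = cars_df)

# Two hypothetical cars with the same horsepower (110 hp is an arbitrary choice)
new_cars <- data.frame(
  fct_am = factor(c("manual", "auto"), levels = c("manual", "auto")),
  hp = 110
)
pred <- predict(fit, newdata = new_cars)
round(pred, 1)
```

At 110 hp the model predicts about 25.4 mpg for the manual car and about 20.1 mpg for the automatic one. The difference of roughly 5.3 mpg is the same at any horsepower, because the model has no interaction term.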
This model is statistically significant (F-test p-value of 2.55e-10) and explains 78% of the variability of MPG (adjusted \(R^2\) of 0.767), in addition to being simple and parsimonious.
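One way to check that adding hp is worthwhile is to compare the two nested models with an F-test. The original analysis does not run this comparison; the sketch below is an optional check in base R, with `fit_small`, `fit_large` and `cmp` as illustrative names.

```r
cars_df <- transform(
  mtcars,
  fct_am = factor(am, levels = c(1, 0), labels = c("manual", "auto"))
)
fit_small <- lm(mpg ~ fct_am, data = cars_df)
fit_large <- lm(mpg ~ fct_am + hp, data = cars_df)

# F-test for the nested models: does adding hp reduce the residual variance?
cmp <- anova(fit_small, fit_large)
cmp

# Adjusted R^2 penalizes the extra parameter
c(small = summary(fit_small)$adj.r.squared,
  large = summary(fit_large)$adj.r.squared)
```

With a single added term, this F-test is equivalent to the t-test on the hp coefficient reported in the model summary.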
par(mfrow=c(2,2))
plot(fit_am_hp)
From the analysis of the residuals we conclude that the model reasonably fulfills the assumptions of a linear model, but a few observations worsen the fit. These are observations 8, 18, 20 and especially 31 (which has an outstanding 335 hp with 8 cylinders).
cars_outliers <- cars %>%
  filter(row_number() %in% c(8, 18, 20, 31))
formattable(cars_outliers)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | fct_am | fct_cyl |
---|---|---|---|---|---|---|---|---|---|---|---|---|
24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 | auto | 4 |
32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 | manual | 4 |
33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 | manual | 4 |
15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 | manual | 8 |
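The observations flagged in the residual plots can also be inspected numerically. The sketch below is base R and not part of the original analysis. It recovers the car models, which `as_tibble()` dropped along with the row names, and computes two standard influence measures. The cutoff of 2p/n for leverage is a common rule of thumb, used here as an assumption.

```r
cars_df <- transform(
  mtcars,
  fct_am = factor(am, levels = c(1, 0), labels = c("manual", "auto"))
)
fit <- lm(mpg ~ fct_am + hp, data = cars_df)

# Car models behind rows 8, 18, 20 and 31
rownames(mtcars)[c(8, 18, 20, 31)]

# Leverage (hat values); rule of thumb: flag values above 2 * p / n
lev <- hatvalues(fit)
threshold <- 2 * length(coef(fit)) / nrow(cars_df)
lev[lev > threshold]

# Cook's distance combines leverage and residual size
head(sort(cooks.distance(fit), decreasing = TRUE), 4)
```

Row 31 is the Maserati Bora, which has the highest leverage in this model because its horsepower is far from that of the other manual cars.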