Executive summary

On average, manual-transmission vehicles achieve about 7.2 MPG more than automatic-transmission vehicles (24.4 vs. 17.1 MPG); put differently, automatics average about 30% fewer miles per gallon. However, transmission type alone explains only 36% of the variance of the response variable. To better explain and predict the variation in performance, other variables such as horsepower, number of cylinders, vehicle weight or engine displacement must be taken into account. The best model found explains about 78% of the MPG variability (adjusted R-squared of 0.77) using transmission type and horsepower.

Exploratory data analysis

We analyze the “mtcars” example dataset that ships with R to find out how fuel performance, in miles per gallon (MPG), depends on the type of transmission of the vehicle. The two questions we address are:

  • Is an automatic or manual transmission better for MPG?
  • Quantify the MPG difference between automatic and manual transmissions

To do this, we load the dataset and the required libraries. We also convert the transmission type and the number of cylinders to factors, which makes them easier to work with. No further cleaning is needed, since the dataset is already very clean.

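library(tidyverse)             # dplyr, ggplot2, forcats
library(PerformanceAnalytics)  # chart.Correlation()
library(formattable)           # formattable()

cars <- as_tibble(mtcars)
# levels = unique(...) keeps the order of first appearance, so "manual" (am = 1)
# becomes the reference level of fct_am
cars <- cars %>%
  mutate(fct_am = factor(am, levels = unique(am)),
         fct_cyl = factor(cyl, levels = unique(cyl)),
         fct_am = fct_recode(fct_am,
                         "auto" = "0",
                         "manual" = "1"))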

Let's see how MPG varies with the type of transmission by computing the average MPG for each of the two transmission types. Manual-transmission vehicles average about 24 MPG, while automatic-transmission vehicles average only about 17 MPG.

means_mpg <- cars %>%
  group_by(fct_am) %>%
  summarise(
    mean_mpg = mean(mpg)
  )
formattable(means_mpg)
fct_am   mean_mpg
manual   24.39231
auto     17.14737
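
To quantify the gap, the two means can be turned into an absolute and a relative difference. The following is a minimal sketch in base R on the built-in mtcars data; the variable names (mean_auto, pct_manual_higher, and so on) are ours and only illustrative.

```r
# Mean MPG per transmission type (am: 0 = automatic, 1 = manual)
mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

# Absolute difference in MPG
diff_mpg <- mean_manual - mean_auto

# The relative difference depends on which group is used as the baseline
pct_manual_higher <- 100 * diff_mpg / mean_auto    # manual relative to automatic
pct_auto_lower    <- 100 * diff_mpg / mean_manual  # automatic relative to manual

round(c(diff_mpg = diff_mpg,
        pct_manual_higher = pct_manual_higher,
        pct_auto_lower = pct_auto_lower), 1)
```

With the means shown above, this works out to a gap of about 7.2 MPG: manual cars average roughly 42% more MPG than automatic ones, or equivalently automatic cars average roughly 30% less than manual ones.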

We fit a linear model with MPG as the response and AM as the regressor, and obtain its \(R^2\):

modelo_mpg_am <- lm(mpg~fct_am, data = cars)
summary(modelo_mpg_am)$r.squared 
## [1] 0.3597989
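
As a cross-check, with a single two-level regressor the fitted slope is simply the difference between the two group means, and \(R^2\) equals the squared correlation between mpg and am. A small sketch on the raw mtcars data (here am is used as a 0/1 numeric variable, so automatic is the baseline; object names are illustrative):

```r
# Simple regression of mpg on the 0/1 transmission indicator
fit_am <- lm(mpg ~ am, data = mtcars)

# The slope is the difference in group means (manual minus automatic)
slope      <- coef(fit_am)[["am"]]
diff_means <- mean(mtcars$mpg[mtcars$am == 1]) - mean(mtcars$mpg[mtcars$am == 0])

# R-squared of a one-regressor model equals the squared Pearson correlation
r2_model    <- summary(fit_am)$r.squared
r2_from_cor <- cor(mtcars$mpg, mtcars$am)^2

isTRUE(all.equal(slope, diff_means))
isTRUE(all.equal(r2_model, r2_from_cor))
```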

The result is that the type of transmission explains only 36% of the MPG variability. We must therefore look for other variables that explain a larger share of it. To do this, we plot other variables that we intuitively suspect may influence fuel consumption. These are hp, cyl, wt and disp:

# Refer to columns by name inside aes(); using cars$... there is discouraged
ggplot(cars, aes(fct_am, mpg, color = fct_am, size = wt, alpha = hp, shape = fct_cyl)) +
  geom_point(position = "jitter") +
  labs(y = "Miles per gallon",
       x = "Type of transmission",
       title = "Possible regressors of MPG",
       color = "Type of transmission",
       size = "Weight",
       alpha = "Horse Power",
       shape = "Cylinders") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "right")

The plot supports the hypothesis that MPG is inversely related to wt, hp and cyl.

Fitting multiple models, detailing strategy for model selection and interpretation of coefficients

Let’s analyze the correlation between MPG and the rest of the variables:

formattable(cor(mtcars))
##      mpg     cyl     disp    hp      drat     wt      qsec    vs     
## mpg  1       -0.8522 -0.8476 -0.7762 0.6812   -0.8677 0.4187  0.664  
## cyl  -0.8522 1       0.902   0.8324  -0.6999  0.7825  -0.5912 -0.8108
## disp -0.8476 0.902   1       0.7909  -0.7102  0.888   -0.4337 -0.7104
## hp   -0.7762 0.8324  0.7909  1       -0.4488  0.6587  -0.7082 -0.7231
## drat 0.6812  -0.6999 -0.7102 -0.4488 1        -0.7124 0.0912  0.4403 
## wt   -0.8677 0.7825  0.888   0.6587  -0.7124  1       -0.1747 -0.5549
## qsec 0.4187  -0.5912 -0.4337 -0.7082 0.0912   -0.1747 1       0.7445 
## vs   0.664   -0.8108 -0.7104 -0.7231 0.4403   -0.5549 0.7445  1      
## am   0.5998  -0.5226 -0.5912 -0.2432 0.7127   -0.6925 -0.2299 0.1683 
## gear 0.4803  -0.4927 -0.5556 -0.1257 0.6996   -0.5833 -0.2127 0.206  
## carb -0.5509 0.527   0.395   0.7498  -0.09079 0.4276  -0.6562 -0.5696
##      am      gear    carb    
## mpg  0.5998  0.4803  -0.5509 
## cyl  -0.5226 -0.4927 0.527   
## disp -0.5912 -0.5556 0.395   
## hp   -0.2432 -0.1257 0.7498  
## drat 0.7127  0.6996  -0.09079
## wt   -0.6925 -0.5833 0.4276  
## qsec -0.2299 -0.2127 -0.6562 
## vs   0.1683  0.206   -0.5696 
## am   1       0.7941  0.05753 
## gear 0.7941  1       0.2741  
## carb 0.05753 0.2741  1

We look at the first row and select the variables most strongly correlated with the response variable:

tabla_correlacion <- cars %>%
  select(mpg, am, disp, cyl, hp, wt)
chart.Correlation(tabla_correlacion, histogram=FALSE, pch=19)

Disp, cyl, hp and wt are highly correlated with each other (collinearity exists) and have similar correlation coefficients with mpg. We should therefore select only one of them, preferably one that is not highly correlated with AM (the regressor whose effect we want to isolate).
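
One common way to put a number on this collinearity is the variance inflation factor (VIF). This is not part of the analysis above; the sketch below computes it by hand in base R as 1 / (1 - R_j^2), where R_j^2 comes from regressing each candidate on the other three. The names predictors and vif are illustrative.

```r
# Candidate regressors that are strongly correlated with each other
predictors <- c("disp", "cyl", "hp", "wt")

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})

round(vif, 2)
```

Values well above 1 indicate that a predictor is largely explained by the others; thresholds such as 5 or 10 are only conventions.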

The best candidate in this sense is hp, whose correlation coefficient with am is -0.24; this correlation is not statistically significant (p > 0.05).
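
The correlation of each candidate with am, and the significance of the hp-am correlation, can be checked directly. A brief sketch in base R on mtcars (object names are illustrative):

```r
# Correlation of each candidate regressor with the transmission indicator
candidates <- c("disp", "cyl", "hp", "wt")
cor_with_am <- sapply(candidates, function(v) cor(mtcars[[v]], mtcars$am))
round(cor_with_am, 4)

# Candidate with the weakest (absolute) correlation with am
names(which.min(abs(cor_with_am)))

# Pearson correlation test for hp vs am (null hypothesis: zero correlation)
ct <- cor.test(mtcars$hp, mtcars$am)
ct$estimate
ct$p.value
```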

We therefore fit a model that explains MPG through the type of transmission and hp:

fit_am_hp <- lm(mpg ~ fct_am + hp, data = cars)
summary(fit_am_hp)
## 
## Call:
## lm(formula = mpg ~ fct_am + hp, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3843 -2.2642  0.1366  1.6968  5.8657 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.861999   1.282279  24.848  < 2e-16 ***
## fct_amauto  -5.277085   1.079541  -4.888 3.46e-05 ***
## hp          -0.058888   0.007857  -7.495 2.92e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.909 on 29 degrees of freedom
## Multiple R-squared:  0.782,  Adjusted R-squared:  0.767 
## F-statistic: 52.02 on 2 and 29 DF,  p-value: 2.55e-10

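Under this model, the intercept is 31.9 MPG for a manual car and about 26.6 MPG for an automatic one (31.9 - 5.3). These are the predicted values at hp = 0, so they serve as baselines rather than realistic averages; only the intercept changes between manual and automatic. From each baseline, predicted MPG falls by about 0.059 for every additional unit of horsepower. At equal horsepower, a manual car is therefore expected to achieve about 5.3 MPG more than an automatic one.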
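
To make the interpretation concrete, the model can be used to predict MPG for a manual and an automatic car with the same horsepower. The sketch below rebuilds the model in base R so that it is self-contained; cars_df, new_cars and the value hp = 110 are illustrative choices, not part of the analysis above.

```r
# Rebuild the factor with "manual" as the reference level, as in the fitted model
cars_df <- transform(mtcars,
                     fct_am = factor(am, levels = c(1, 0),
                                     labels = c("manual", "auto")))
fit_am_hp <- lm(mpg ~ fct_am + hp, data = cars_df)

# Two hypothetical cars with identical horsepower but different transmissions
new_cars <- data.frame(
  fct_am = factor(c("manual", "auto"), levels = c("manual", "auto")),
  hp     = c(110, 110)
)
pred <- predict(fit_am_hp, newdata = new_cars)
pred

# At equal hp, the gap between the two predictions is the fct_amauto coefficient
unname(pred[1] - pred[2])
```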

This model is statistically significant (F-test p-value = 2.55e-10) and explains about 78% of the variability of MPG (adjusted \(R^2\) = 0.77), while remaining simple and parsimonious.
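
One way to support the choice of the larger model is a nested-model F-test comparing it with the transmission-only model. This is a sketch of that comparison in base R, not a step taken above; cars_df, fit_am, cmp and adj_r2 are illustrative names.

```r
cars_df <- transform(mtcars,
                     fct_am = factor(am, levels = c(1, 0),
                                     labels = c("manual", "auto")))

# Nested models: transmission only vs transmission plus horsepower
fit_am    <- lm(mpg ~ fct_am, data = cars_df)
fit_am_hp <- lm(mpg ~ fct_am + hp, data = cars_df)

# F-test for the improvement obtained by adding hp
cmp <- anova(fit_am, fit_am_hp)
cmp

# Adjusted R-squared of both models, side by side
adj_r2 <- c(am    = summary(fit_am)$adj.r.squared,
            am_hp = summary(fit_am_hp)$adj.r.squared)
round(adj_r2, 3)

# 95% confidence intervals for the coefficients of the selected model
confint(fit_am_hp)
```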

Residual plot

par(mfrow=c(2,2))
plot(fit_am_hp)

  • Residuals vs. fitted: shows whether there are non-linear patterns. In this case, the residuals are spread roughly evenly around a horizontal line.
  • Normal Q-Q plot: shows whether the residuals follow a normal distribution.
  • Scale-location: checks for homoscedasticity (constant residual variance).
  • Residuals vs. leverage: detects influential cases.
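
The visual checks listed above can be complemented with numeric diagnostics. The following is a rough sketch in base R; it is fitted on mtcars directly so that observations are labelled by car name rather than by row number, and the cut-offs 2p/n and 4/n are common rules of thumb, not fixed standards.

```r
fit <- lm(mpg ~ factor(am) + hp, data = mtcars)

# Normality of residuals (Shapiro-Wilk; null hypothesis: residuals are normal)
sw <- shapiro.test(residuals(fit))
sw$p.value

# Leverage: flag observations above the rule-of-thumb cut-off 2p/n
lev <- hatvalues(fit)
p <- length(coef(fit))
n <- nrow(mtcars)
names(lev)[lev > 2 * p / n]
names(which.max(lev))

# Influence: Cook's distance, largest values first, with the 4/n rule of thumb
cook <- cooks.distance(fit)
head(sort(cook, decreasing = TRUE), 4)
names(cook)[cook > 4 / n]
```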

From the analysis of the residuals we conclude that the model meets the assumptions required for a linear model, although a few observations worsen the fit. These are observations 8, 18, 20 and especially 31 (which has an exceptional 335 hp with 8 cylinders).

cars_outliers <- cars %>%
  filter(row_number() %in% c(8, 18, 20, 31))
formattable(cars_outliers)
mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb  fct_am  fct_cyl
24.4  4    146.7  62   3.69  3.190  20.00  1   0   4     2     auto    4
32.4  4    78.7   66   4.08  2.200  19.47  1   1   4     1     manual  4
33.9  4    71.1   65   4.22  1.835  19.90  1   1   4     1     manual  4
15.0  8    301.0  335  3.54  3.570  14.60  0   1   5     8     manual  8
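
As a rough sensitivity check, which is not part of the analysis above, the model can be refitted without the four flagged observations to see how much the coefficients move. Dropping observations should be justified on substantive grounds; the sketch only illustrates the comparison, and flagged, fit_all and fit_trim are illustrative names.

```r
cars_df <- transform(mtcars,
                     fct_am = factor(am, levels = c(1, 0),
                                     labels = c("manual", "auto")))

# Row numbers of the observations flagged in the residual plots
flagged <- c(8, 18, 20, 31)

fit_all  <- lm(mpg ~ fct_am + hp, data = cars_df)
fit_trim <- lm(mpg ~ fct_am + hp, data = cars_df[-flagged, ])

# Coefficients with and without the flagged observations
round(cbind(all = coef(fit_all), trimmed = coef(fit_trim)), 3)

# Adjusted R-squared of both fits
c(all     = summary(fit_all)$adj.r.squared,
  trimmed = summary(fit_trim)$adj.r.squared)
```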