The purpose of this analysis is to investigate, on behalf of “Motor Trend”, the impact of a set of car design characteristics on “Miles Per Gallon” (MPG). “Motor Trend” is particularly interested in the following two questions:

- Is an automatic or manual transmission better for MPG?
- How can the MPG difference between automatic and manual transmissions be quantified?
An overview of our study is given in the “EXECUTIVE SUMMARY” section below.
Our study proceeds as follows:

- EXPLORATORY ANALYSIS of the mtcars dataset (structure and principal component analysis);
- MODEL SELECTION among candidate linear models for MPG;
- RESIDUAL DIAGNOSTICS on the selected model;
- COEFFICIENT INTERPRETATION and quantification of the transmission effect;
- CONCLUSION.
install.packages("FactoMineR", repos = "http://cran.us.r-project.org")
## package 'FactoMineR' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mekie_000\AppData\Local\Temp\Rtmpqw8Fum\downloaded_packages
library(FactoMineR)
install.packages("factoextra", repos = "http://cran.us.r-project.org")
## package 'factoextra' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mekie_000\AppData\Local\Temp\Rtmpqw8Fum\downloaded_packages
library(factoextra)
install.packages("corrplot", repos = "http://cran.us.r-project.org")
## package 'corrplot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mekie_000\AppData\Local\Temp\Rtmpqw8Fum\downloaded_packages
library(corrplot)
library(ggplot2)
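As a side note, the installation calls above re-download the packages on every knit; a minimal install-if-missing guard (an optional tweak, not part of the original workflow) would avoid that:

# Install a package only when it is not already available
for (pkg in c("FactoMineR", "factoextra", "corrplot", "ggplot2")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg, repos = "http://cran.us.r-project.org")
    }
}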
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# dim(mtcars)
writeLines(paste("\n", "The mtcars dataframe is", dim(mtcars)[1],
"lines(types of cars) and", dim(mtcars)[2], "columns(each being our cars characteristics).",
"\n"))
##
## The mtcars dataframe is 32 lines(types of cars) and 11 columns(each being our cars characteristics).
One will notice that all the variables are stored as numeric:
[, 1] mpg   Miles/(US) gallon
[, 2] cyl   Number of cylinders
[, 3] disp  Displacement (cu.in.)
[, 4] hp    Gross horsepower
[, 5] drat  Rear axle ratio
[, 6] wt    Weight (1000 lbs)
[, 7] qsec  1/4 mile time
[, 8] vs    Engine (0 = V-shaped, 1 = straight)
[, 9] am    Transmission (0 = automatic, 1 = manual)
[,10] gear  Number of forward gears
[,11] carb  Number of carburetors
Note, however, that variables 2, 8, 9, 10 and 11 (cyl, vs, am, gear and carb) are actually categorical (factor) variables.
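For illustration, here is one way to encode them as factors on a copy of the data (mtcars_fct is a hypothetical name; the models below instead keep mtcars numeric and wrap each variable in factor() inside the formulas):

# Convert the categorical columns to factors on a copy of the data
mtcars_fct <- mtcars
cat_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcars_fct[cat_cols] <- lapply(mtcars_fct[cat_cols], factor)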
# Principal Component Analysis of the dataset
var_cars <- PCA(mtcars, graph = FALSE)
fviz_eig(var_cars, addlabels = TRUE, ylim = c(0, 70))
One can observe that the first two principal components explain about 84% of the overall variance.
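To read the exact percentages instead of eyeballing the scree plot, one could inspect the eigenvalue table via factoextra's accessor (a quick check; output omitted here):

# Variance explained by each principal component; the first two rows
# of variance.percent should sum to roughly 84%
get_eigenvalue(var_cars)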
# Identify the most meaningful variables via their quality of
# representation (cos2) on the principal components
corrplot(var_cars$var$cos2, is.corr = FALSE)
Based on the cos2 values, the five most meaningful variables are disp, cyl, mpg, wt and vs.
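As a complementary view (optional), factoextra can also rank the variables by their total cos2 on the first two components:

# Bar chart of each variable's total cos2 on components 1 and 2
fviz_cos2(var_cars, choice = "var", axes = 1:2)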
In this section, we choose the linear model that best fits the MPG variability:
# Let us define our models. From the exploratory analysis, ordering
# the variables by how much of the variability they carry gives:
#   disp > cyl > mpg > wt > hp > vs > drat > am > gear > carb > qsec
# Since mpg is the response and we want am in every model, we remove
# those two from the list and build nested combinations from the rest:
model1 <- lm(mpg ~ factor(am), data = mtcars)
model2 <- lm(mpg ~ factor(am) + disp, data = mtcars)
model3 <- lm(mpg ~ factor(am) + disp + factor(cyl), data = mtcars)
model4 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt, data = mtcars)
model5 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp,
data = mtcars)
model6 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp +
factor(vs), data = mtcars)
model7 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear), data = mtcars)
model8 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear) + factor(carb), data = mtcars)
model9 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear) + factor(carb) + qsec, data = mtcars)
modelf2 <- lm(mpg ~ factor(am) * disp, data = mtcars)
modelf3 <- lm(mpg ~ factor(am) * (disp + factor(cyl)), data = mtcars)
modelf4 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt), data = mtcars)
modelf5 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp),
data = mtcars)
modelf6 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp +
factor(vs)), data = mtcars)
modelf7 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear)), data = mtcars)
modelf8 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear) + factor(carb)), data = mtcars)
modelf9 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear) + factor(carb) + qsec), data = mtcars)
anova(model1, model2, model3, model4, model5, model6, model7,
model8, model9, modelf2, modelf3, modelf4, modelf5, modelf6,
modelf7, modelf8, modelf9)
The most statistically significant improvement appears with “model2” (adding disp to the transmission-only model), with a p-value of 4.326e-08; the subsequent additions do not yield comparable gains.
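As a sanity check on this manual selection (not part of the original procedure), one could let backward stepwise AIC selection prune the largest additive model and compare its choice with ours:

# Backward stepwise selection by AIC, starting from the full additive
# model (model9); trace = 0 suppresses the step-by-step log
step(model9, direction = "backward", trace = 0)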
Let us look at the unexplained part of our model, the residuals, to ensure no pattern has been left out:
# Plot the residuals against the two selected variables
selected_model_init <- lm(mpg ~ factor(am) + disp, data = mtcars)
par(mfrow = c(1, 2))
plot(mtcars$disp, resid(selected_model_init), pch = 21, bg = "grey",
main = "Residuals VS Displacement", xlab = "Displacement",
ylab = "Residuals")
plot(mtcars$am, resid(selected_model_init), pch = 21, bg = "lightblue",
main = "Residuals VS Transmission", xlab = "Transmission",
ylab = "Residuals")
par(mfrow = c(1, 1))
Fortunately, there is no apparent unexplained pattern :).
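For completeness, R's built-in diagnostic plots on the same model (residuals vs fitted, Q-Q, scale-location, leverage) would tell the same story:

# Standard lm diagnostic plots for the selected model
par(mfrow = c(2, 2))
plot(selected_model_init)
par(mfrow = c(1, 1))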
Let us use our previously selected model, lm(mpg ~ factor(am) + disp, data = mtcars):
# Summarize the selected model. We remove the intercept so that both
# transmission levels get their own coefficient on the same scale.
selected_model <- lm(mpg ~ factor(am) + disp - 1, data = mtcars)
summary(selected_model)
##
## Call:
## lm(formula = mpg ~ factor(am) + disp - 1, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6382 -2.4751 -0.5631 2.2333 6.8386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## factor(am)0 27.848081 1.834071 15.184 2.45e-15 ***
## factor(am)1 29.681539 1.218689 24.355 < 2e-16 ***
## disp -0.036851 0.005782 -6.373 5.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.218 on 29 degrees of freedom
## Multiple R-squared: 0.9786, Adjusted R-squared: 0.9764
## F-statistic: 442.4 on 3 and 29 DF, p-value: < 2.2e-16
One may notice that the model coefficients are all highly significant.
# Let us look carefully at the factor coefficients
coef(selected_model)
## factor(am)0 factor(am)1 disp
## 27.84808111 29.68153936 -0.03685086
As we may notice, factor(am)1 (manual) appears better for MPG than factor(am)0 (automatic), with a higher estimated coefficient.
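To make this concrete, one could predict MPG for both transmission types at the same displacement, say the sample mean (the newdata object here is purely illustrative):

# Predicted MPG at the average displacement for each transmission;
# the gap between the two predictions equals the coefficient spread
newdata <- data.frame(am = c(0, 1), disp = mean(mtcars$disp))
predict(selected_model, newdata = newdata)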
In addition, let us look at the 95% confidence intervals:
# Look at the 95% confidence intervals of the factor coefficients
round(confint(selected_model), 2)
## 2.5 % 97.5 %
## factor(am)0 24.10 31.60
## factor(am)1 27.19 32.17
## disp -0.05 -0.03
In fact, the two transmission intervals overlap substantially, which already hints that their difference may not be statistically significant. Note also that the 95% confidence interval for factor(am)1 (manual) is narrower than the automatic one.
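One quick way to verify the widths (a small check, not in the original analysis):

# Width (upper - lower) of each 95% interval; the factor(am)1 row
# comes out narrower than the factor(am)0 row
ci <- confint(selected_model)
ci[, 2] - ci[, 1]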
# Summarize the model again with the intercept put back, so that the
# factor(am)1 coefficient directly measures the manual - automatic
# difference.
selected_model_init <- lm(mpg ~ factor(am) + disp, data = mtcars)
summary(selected_model_init)
##
## Call:
## lm(formula = mpg ~ factor(am) + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6382 -2.4751 -0.5631 2.2333 6.8386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.848081 1.834071 15.184 2.45e-15 ***
## factor(am)1 1.833458 1.436100 1.277 0.212
## disp -0.036851 0.005782 -6.373 5.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.218 on 29 degrees of freedom
## Multiple R-squared: 0.7333, Adjusted R-squared: 0.7149
## F-statistic: 39.87 on 2 and 29 DF, p-value: 4.749e-09
As we can see, the fit itself is unchanged (same residuals and disp slope); however, the difference coefficient factor(am)1 is not statistically significant (p-value = 0.212).
# Report the observed manual - automatic difference with its 95%
# confidence interval
coefs <- round(coef(selected_model_init), 2)
cnfint <- round(confint(selected_model_init), 2)
writeLines(paste("\n", "The non-significant observed difference in MPG expressed as (manual - automatic) is",
    coefs["factor(am)1"], ";", "\n with 95% confidence interval equal to [",
    cnfint["factor(am)1", 1], ";", cnfint["factor(am)1", 2], "].", "\n"))
##
## The non-significant observed difference in MPG expressed as (manual - automatic) is 1.83 ;
## with 95% confidence interval equal to [ -1.1 ; 4.77 ].
We can then conclude that manual transmission appears better for MPG than automatic; however, the MPG difference between the two is not statistically significant (p-value = 0.212).