The purpose of this analysis is to investigate, on behalf of “Motor Trend”, the impact of a set of car design characteristics on “Miles Per Gallon” (MPG). “Motor Trend” is particularly interested in the following two questions:

- Is an automatic or manual transmission better for MPG?
- How can the MPG difference between automatic and manual transmissions be quantified?
An overview of our study is given in the “EXECUTIVE SUMMARY” section below.
Our study proceeds as follows:

- EXPLORATORY ANALYSIS of the mtcars dataset (structure and principal component analysis);
- MODEL SELECTION among candidate linear models for MPG;
- RESIDUAL DIAGNOSTICS on the selected model;
- COEFFICIENT INTERPRETATION and quantification of the transmission effect;
- CONCLUSION.
install.packages("FactoMineR", repos = "http://cran.us.r-project.org")
## package 'FactoMineR' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mekie_000\AppData\Local\Temp\Rtmpqw8Fum\downloaded_packages
library(FactoMineR)
install.packages("factoextra", repos = "http://cran.us.r-project.org")
## package 'factoextra' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mekie_000\AppData\Local\Temp\Rtmpqw8Fum\downloaded_packages
library(factoextra)
install.packages("corrplot", repos = "http://cran.us.r-project.org")
## package 'corrplot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mekie_000\AppData\Local\Temp\Rtmpqw8Fum\downloaded_packages
library(corrplot)
library(ggplot2)
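As a side note, the installation calls above re-download the packages on every knit; a minimal install-if-missing guard (an optional tweak, not part of the original workflow) would avoid that:

# Install a package only when it is not already available
for (pkg in c("FactoMineR", "factoextra", "corrplot", "ggplot2")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg, repos = "http://cran.us.r-project.org")
    }
}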
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# dim(mtcars)
writeLines(paste("\n", "The mtcars dataframe is", dim(mtcars)[1],
"lines(types of cars) and", dim(mtcars)[2], "columns(each being our cars characteristics).",
"\n"))
##
## The mtcars dataframe is 32 lines(types of cars) and 11 columns(each being our cars characteristics).
One will notice that all the variables are stored as numeric:
[, 1] mpg   Miles/(US) gallon
[, 2] cyl   Number of cylinders
[, 3] disp  Displacement (cu.in.)
[, 4] hp    Gross horsepower
[, 5] drat  Rear axle ratio
[, 6] wt    Weight (1000 lbs)
[, 7] qsec  1/4 mile time
[, 8] vs    Engine (0 = V-shaped, 1 = straight)
[, 9] am    Transmission (0 = automatic, 1 = manual)
[,10] gear  Number of forward gears
[,11] carb  Number of carburetors
Note, however, that variables 2, 8, 9, 10 and 11 (cyl, vs, am, gear and carb) are actually categorical (factor) variables.
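For illustration, here is one way to encode them as factors on a copy of the data (mtcars_fct is a hypothetical name; the models below instead keep mtcars numeric and wrap each variable in factor() inside the formulas):

# Convert the categorical columns to factors on a copy of the data
mtcars_fct <- mtcars
cat_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcars_fct[cat_cols] <- lapply(mtcars_fct[cat_cols], factor)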
# Principal Component Analysis of the dataset
var_cars <- PCA(mtcars, graph = FALSE)
fviz_eig(var_cars, addlabels = TRUE, ylim = c(0, 70))
One can observe that the first two principal components explain about 84% of the overall variance.
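To read the exact percentages instead of eyeballing the scree plot, one could inspect the eigenvalue table via factoextra's accessor (a quick check; output omitted here):

# Variance explained by each principal component; the first two rows
# of variance.percent should sum to roughly 84%
get_eigenvalue(var_cars)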
# Identify the most meaningful variables via their quality of
# representation (cos2) on the principal components
corrplot(var_cars$var$cos2, is.corr = FALSE)
Based on the cos2 values, the five most meaningful variables are disp, cyl, mpg, wt and vs.
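As a complementary view (optional), factoextra can also rank the variables by their total cos2 on the first two components:

# Bar chart of each variable's total cos2 on components 1 and 2
fviz_cos2(var_cars, choice = "var", axes = 1:2)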
In this section, we choose the linear model that best fits the MPG variability:
# Let us define our models. From the exploratory analysis, ordering
# the variables by how much of the variability they carry gives:
#   disp > cyl > mpg > wt > hp > vs > drat > am > gear > carb > qsec
# Since mpg is the response and we want am in every model, we remove
# those two from the list and build nested combinations from the rest:
model1 <- lm(mpg ~ factor(am), data = mtcars)
model2 <- lm(mpg ~ factor(am) + disp, data = mtcars)
model3 <- lm(mpg ~ factor(am) + disp + factor(cyl), data = mtcars)
model4 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt, data = mtcars)
model5 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp,
data = mtcars)
model6 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp +
factor(vs), data = mtcars)
model7 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear), data = mtcars)
model8 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear) + factor(carb), data = mtcars)
model9 <- lm(mpg ~ factor(am) + disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear) + factor(carb) + qsec, data = mtcars)
modelf2 <- lm(mpg ~ factor(am) * disp, data = mtcars)
modelf3 <- lm(mpg ~ factor(am) * (disp + factor(cyl)), data = mtcars)
modelf4 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt), data = mtcars)
modelf5 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp),
data = mtcars)
modelf6 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp +
factor(vs)), data = mtcars)
modelf7 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear)), data = mtcars)
modelf8 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear) + factor(carb)), data = mtcars)
modelf9 <- lm(mpg ~ factor(am) * (disp + factor(cyl) + wt + hp +
factor(vs) + factor(gear) + factor(carb) + qsec), data = mtcars)
anova(model1, model2, model3, model4, model5, model6, model7,
model8, model9, modelf2, modelf3, modelf4, modelf5, modelf6,
modelf7, modelf8, modelf9)
The most statistically significant improvement appears with “model2” (adding disp to the transmission-only model), with a p-value of 4.326e-08; the subsequent additions do not yield comparable gains.
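As a sanity check on this manual selection (not part of the original procedure), one could let backward stepwise AIC selection prune the largest additive model and compare its choice with ours:

# Backward stepwise selection by AIC, starting from the full additive
# model (model9); trace = 0 suppresses the step-by-step log
step(model9, direction = "backward", trace = 0)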
Let us look at the unexplained part of our model, the residuals, to ensure no pattern has been left out:
# Plot the residuals against the two selected variables
selected_model_init <- lm(mpg ~ factor(am) + disp, data = mtcars)
par(mfrow = c(1, 2))
plot(mtcars$disp, resid(selected_model_init), pch = 21, bg = "grey",
main = "Residuals VS Displacement", xlab = "Displacement",
ylab = "Residuals")
plot(mtcars$am, resid(selected_model_init), pch = 21, bg = "lightblue",
main = "Residuals VS Transmission", xlab = "Transmission",
ylab = "Residuals")
par(mfrow = c(1, 1))
Fortunately, there is no apparent unexplained pattern :).
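For completeness, R's built-in diagnostic plots on the same model (residuals vs fitted, Q-Q, scale-location, leverage) would tell the same story:

# Standard lm diagnostic plots for the selected model
par(mfrow = c(2, 2))
plot(selected_model_init)
par(mfrow = c(1, 1))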
Let us use our previously selected model, lm(mpg ~ factor(am) + disp, data = mtcars):
# Summarize the selected model. We remove the intercept so that both
# transmission levels get their own coefficient on the same scale.
selected_model <- lm(mpg ~ factor(am) + disp - 1, data = mtcars)
summary(selected_model)
##
## Call:
## lm(formula = mpg ~ factor(am) + disp - 1, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6382 -2.4751 -0.5631 2.2333 6.8386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## factor(am)0 27.848081 1.834071 15.184 2.45e-15 ***
## factor(am)1 29.681539 1.218689 24.355 < 2e-16 ***
## disp -0.036851 0.005782 -6.373 5.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.218 on 29 degrees of freedom
## Multiple R-squared: 0.9786, Adjusted R-squared: 0.9764
## F-statistic: 442.4 on 3 and 29 DF, p-value: < 2.2e-16
One may notice that the model coefficients are all highly significant.
# Let us look carefully at the factor coefficients
coef(selected_model)
## factor(am)0 factor(am)1 disp
## 27.84808111 29.68153936 -0.03685086
As we may notice, factor(am)1 (manual) appears better for MPG than factor(am)0 (automatic), with a higher estimated coefficient.
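To make this concrete, one could predict MPG for both transmission types at the same displacement, say the sample mean (the newdata object here is purely illustrative):

# Predicted MPG at the average displacement for each transmission;
# the gap between the two predictions equals the coefficient spread
newdata <- data.frame(am = c(0, 1), disp = mean(mtcars$disp))
predict(selected_model, newdata = newdata)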
In addition, let us look at the 95% confidence intervals:
# Look at the 95% confidence intervals of the factor coefficients
round(confint(selected_model), 2)
## 2.5 % 97.5 %
## factor(am)0 24.10 31.60
## factor(am)1 27.19 32.17
## disp -0.05 -0.03
In fact, the two transmission intervals overlap substantially, which already hints that their difference may not be statistically significant. Note also that the 95% confidence interval for factor(am)1 (manual) is narrower than the automatic one.
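One quick way to verify the widths (a small check, not in the original analysis):

# Width (upper - lower) of each 95% interval; the factor(am)1 row
# comes out narrower than the factor(am)0 row
ci <- confint(selected_model)
ci[, 2] - ci[, 1]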
# Summarize the model again with the intercept put back, so that the
# factor(am)1 coefficient directly measures the manual - automatic
# difference.
selected_model_init <- lm(mpg ~ factor(am) + disp, data = mtcars)
summary(selected_model_init)
##
## Call:
## lm(formula = mpg ~ factor(am) + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6382 -2.4751 -0.5631 2.2333 6.8386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.848081 1.834071 15.184 2.45e-15 ***
## factor(am)1 1.833458 1.436100 1.277 0.212
## disp -0.036851 0.005782 -6.373 5.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.218 on 29 degrees of freedom
## Multiple R-squared: 0.7333, Adjusted R-squared: 0.7149
## F-statistic: 39.87 on 2 and 29 DF, p-value: 4.749e-09
As we can see, the fit itself is unchanged (same residuals and disp slope); however, the difference coefficient factor(am)1 is not statistically significant (p-value = 0.212).
# Report the observed manual - automatic difference with its 95%
# confidence interval
coefs <- round(coef(selected_model_init), 2)
cnfint <- round(confint(selected_model_init), 2)
writeLines(paste("\n", "The non-significant observed difference in MPG expressed as (manual - automatic) is",
    coefs["factor(am)1"], ";", "\n with 95% confidence interval equal to [",
    cnfint["factor(am)1", 1], ";", cnfint["factor(am)1", 2], "].", "\n"))
##
## The non-significant observed difference in MPG expressed as (manual - automatic) is 1.83 ;
## with 95% confidence interval equal to [ -1.1 ; 4.77 ].
We can then conclude that manual transmission appears better for MPG than automatic; however, the MPG difference between the two is not statistically significant (p-value = 0.212).