This document analyzes the influence of a vehicle's transmission mode (automatic vs. manual) on the miles per gallon (mpg) it achieves. The goal is to determine which transmission mode is better for mpg and to quantify the difference between the two modes. To do so, we use the mtcars dataset and fit several linear regression models with mpg as the outcome and a subset of the remaining variables as regressors. We use nested model testing to select the best-fitting model. The selected model includes, in addition to the “am” variable (transmission mode), the “cyl”, “wt” and “hp” variables, which refer respectively to the number of cylinders, the weight and the horsepower of a vehicle. Under this model, a manual transmission is associated with an average gain of about 1.478 mpg over an automatic one, with the other regressors held fixed. The 95% confidence interval of this estimate (-1.479 to 4.435) includes zero, however, so the advantage of the manual transmission is not statistically significant at the 5% level.
We load the mtcars dataset. The Appendix contains a box plot of mpg by transmission mode (manual vs. automatic). It shows that manual vehicles tend to achieve higher mpg than automatic ones.
data(mtcars)
We fit a linear model with “mpg” as the outcome and “am” (transmission mode) as the only regressor, and compute its coefficients.
fit <- lm(mpg~am, data=mtcars)
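Because “am” is coded 0 (automatic) and 1 (manual), the two coefficients of this model have a direct reading: the intercept is the mean mpg of automatic vehicles and the slope is the difference between the manual and automatic means. The following is a minimal sketch of that check; the names mean_auto, mean_manual and slope are introduced here for illustration only.

```r
# Sketch: the simple model mpg ~ am reproduces the two group means.
data(mtcars)
fit <- lm(mpg ~ am, data = mtcars)

# Mean mpg per transmission mode (am = 0: automatic, am = 1: manual)
mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

# Intercept should equal the automatic mean,
# slope should equal the manual mean minus the automatic mean
intercept <- as.numeric(coef(fit)["(Intercept)"])
slope     <- as.numeric(coef(fit)["am"])

print(c(intercept = intercept, mean_auto = mean_auto))
print(c(slope = slope, mean_diff = mean_manual - mean_auto))
```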
We plot the observed data together with the fitted regression line, and draw a residual plot highlighting the gap between the two.
par(mfrow=c(1,2))
plot(mtcars$am,mtcars$mpg,col="blue",xlab = "Transmission Mode",
ylab = "miles per gallon (mpg)",main ="Fitted Vs Observed Values")
abline(fit,lwd=2)
plot(predict(fit),resid(fit),col="red",xlab="mpg Fitted Values",
ylab = "Residuals",main="Residuals Overview")
abline(h=0,lwd=2)
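The spread visible in the residual plot can also be summarized numerically. The sketch below refits the simple model and reports the range of the residuals and the share of variance explained; max_abs_res and r2 are illustrative names, not part of the original analysis.

```r
# Sketch: numeric summary of the residuals of the model mpg ~ am.
data(mtcars)
fit <- lm(mpg ~ am, data = mtcars)

res <- resid(fit)
print(range(res))               # smallest and largest residual, in mpg

max_abs_res <- max(abs(res))    # largest gap between observed and fitted mpg
r2 <- summary(fit)$r.squared    # share of mpg variance explained by am alone

print(c(max_abs_res = max_abs_res, r2 = r2))
```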
We notice that the gap between the observed and the fitted values can be large, with some residuals approaching 10 mpg. This indicates that a regression model with the transmission mode as its only regressor underfits the data and that more regressors are needed. Next, we add regressors to the model one at a time and use nested model testing to assess the relevance of each.
fit1 <- lm(mpg ~ am + cyl, data = mtcars)
fit2 <- lm(mpg ~ am + cyl + disp, data = mtcars)
fit3 <- lm(mpg ~ am + cyl + disp + hp, data = mtcars)
fit4 <- lm(mpg ~ am + cyl + disp + hp + drat, data = mtcars)
fit5 <- lm(mpg ~ am + cyl + disp + hp + drat + wt, data = mtcars)
fit6 <- lm(mpg ~ am + cyl + disp + hp + drat + wt + qsec, data = mtcars)
fit7 <- lm(mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs, data = mtcars)
fit8 <- lm(mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear, data = mtcars)
fit9 <- lm(mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb, data = mtcars)
anova(fit,fit1,fit2,fit3,fit4,fit5,fit6,fit7,fit8,fit9)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + disp
## Model 4: mpg ~ am + cyl + disp + hp
## Model 5: mpg ~ am + cyl + disp + hp + drat
## Model 6: mpg ~ am + cyl + disp + hp + drat + wt
## Model 7: mpg ~ am + cyl + disp + hp + drat + wt + qsec
## Model 8: mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs
## Model 9: mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear
## Model 10: mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb
##    Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1      30 720.90
## 2      29 271.36  1    449.53 64.0039 8.231e-08 ***
## 3      28 252.08  1     19.28  2.7452   0.11241
## 4      27 216.37  1     35.71  5.0849   0.03493 *
## 5      26 214.50  1      1.87  0.2663   0.61121
## 6      25 162.43  1     52.06  7.4127   0.01275 *
## 7      24 149.09  1     13.34  1.8999   0.18260
## 8      23 148.87  1      0.22  0.0309   0.86214
## 9      22 147.90  1      0.97  0.1384   0.71365
## 10     21 147.49  1      0.41  0.0579   0.81218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
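To make the table easier to read, the sketch below recomputes one of its F statistics by hand. It assumes, as we understand the behaviour of anova() on a sequence of lm fits, that each row's drop in RSS is divided by the residual mean square of the largest model in the list. The helper rss() and the names f_manual and p_manual are introduced here for illustration.

```r
# Sketch: reproduce the F statistic for adding "cyl" to the model mpg ~ am.
data(mtcars)
fit  <- lm(mpg ~ am, data = mtcars)
fit1 <- lm(mpg ~ am + cyl, data = mtcars)
fit9 <- lm(mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb,
           data = mtcars)

# Residual sum of squares of a fitted model (illustrative helper)
rss <- function(m) sum(resid(m)^2)

# Numerator: drop in RSS when "cyl" is added (1 degree of freedom).
# Denominator: residual mean square of the largest model.
f_manual <- (rss(fit) - rss(fit1)) / (rss(fit9) / df.residual(fit9))
p_manual <- pf(f_manual, 1, df.residual(fit9), lower.tail = FALSE)

# The same comparison through anova(), with the largest model included
tab <- anova(fit, fit1, fit9)
print(c(f_manual = f_manual, f_anova = tab$F[2], p_manual = p_manual))
```

If the assumption holds, f_manual agrees with the value of about 64.0 reported in row 2 of the table above.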
Looking at the F-tests, each of which compares a model with the one before it, only the additions of “cyl” (number of cylinders), “hp” (horsepower) and “wt” (weight) are significant at the 5% level, so we add these three variables to the regression model alongside the transmission mode. The Appendix contains plots comparing the residuals of this new model with those of the model using only the transmission mode as a regressor. They show a marked decrease in the residuals, which indicates that the new model fits the data better than the old one. We then extract the “am” coefficient of the new model and its 95% confidence interval. The coefficient can be interpreted as the average change in mpg when moving from an automatic to a manual transmission, with the number of cylinders, weight and horsepower held fixed.
fitnew <- lm(mpg ~ am+cyl+wt+hp,data=mtcars)
coef(fitnew)[2]
## am
## 1.478048
confint(fitnew)[2,]
##     2.5 %    97.5 %
## -1.478946  4.435042
The “am” coefficient is 1.478. As it is positive, the model estimates that a manual transmission is on average better for mpg than an automatic one: a gain of about 1.478 mpg is expected when moving from an automatic transmission to a manual one, all else being equal. However, the 95% confidence interval of this coefficient runs from -1.479 to 4.435 and therefore includes zero and negative values. The data are thus also compatible with no difference, or even a small mpg loss, so the gain is not statistically significant at the 5% level.
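The confidence interval can be read alongside the t-test that summary() reports for the same coefficient. The sketch below extracts both for the selected model; am_row, am_ci and p_am are illustrative names.

```r
# Sketch: estimate, confidence interval and p-value of "am" in the selected model.
data(mtcars)
fitnew <- lm(mpg ~ am + cyl + wt + hp, data = mtcars)

# Row of the coefficient table for "am": estimate, std. error, t value, p-value
am_row <- summary(fitnew)$coefficients["am", ]
am_ci  <- confint(fitnew)["am", ]   # 95% interval by default
p_am   <- as.numeric(am_row["Pr(>|t|)"])

print(am_row)
print(am_ci)

# A 95% interval that contains zero goes with a p-value above 0.05
includes_zero <- am_ci[1] < 0 && am_ci[2] > 0
print(includes_zero)
```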
data(mtcars)
library(ggplot2)
modes <- c("automatic","manual")   # am = 0: automatic, am = 1: manual
mtcars$mode <- modes[mtcars$am + 1]
g <- ggplot(mtcars,aes(x=mode,y=mpg)) + geom_boxplot(aes(fill=mode)) +
xlab("Transmission Mode: Automatic vs. Manual") +
ylab("miles per gallon (mpg)") + ggtitle("mpg by transmission mode")
g
Comparison of the transmission (“am”) coefficient and its confidence interval between the selected model and the initial model, which uses only the transmission mode as a regressor: