In this project, using the mtcars dataset, I will try to answer the main question of how an automatic transmission compares to a manual transmission in terms of fuel consumption (mpg).
In order to do that, linear regression and multivariable linear regression have been used to select a model.
The data, included in R's datasets package, is called mtcars, and its documentation says:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
First, a quick look at the data: its dimensions and what the observations look like (types included):
data("mtcars")
dim(mtcars)
## [1] 32 11
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
So we have 32 observations of 11 variables. From the same documentation, I have extracted the following variable descriptions:
| Variables | Description |
|---|---|
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | 1/4 mile time |
| vs | Engine (0 = V-shaped, 1 = straight) |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
Some of the variables are discrete (cyl, vs, am, gear and carb), so I convert them to factors:
mtcars$vs <- factor(mtcars$vs,labels=c('V','S'))
mtcars$am <- factor(mtcars$am,labels=c('A','M'))
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
Let's see how cyl, vs, gear and carb are distributed across am (which, along with mpg, is the variable of main interest):
par(mfrow=c(2,2))
display_barplot <- function(table, title, xlab) {
barplot(table, main=title, xlab=xlab, col=c("darkblue","red"), beside=TRUE)
legend("topright", title = "AM", rownames(table), fill = 1:2, ncol = 1, cex = 0.5)
}
display_barplot(table(mtcars$am, mtcars$cyl) ,"AM vs. Cylinders", "Number of Cylinders")
display_barplot(table(mtcars$am, mtcars$vs) ,"AM vs. VS", "V/Straight Engine")
display_barplot(table(mtcars$am, mtcars$gear) ,"AM vs. Gears", "Number of Gears")
display_barplot(table(mtcars$am, mtcars$carb) ,"AM vs. Carb", "Number of Carburetors")
Automatic cars appear to lack a fifth gear. Similarly, the number of carburetors looks more limited for automatic cars.
Now, for better insight into the data, let's look at two scatterplot matrices of mpg, hp and wt, separated by am and by cyl:
library(car)
scatterplotMatrix(~mpg+hp+wt|am, data=mtcars, main="MPG - HP - WT vs AM")
In the first panel we can see higher values of mpg for the manual transmission. This trend is also visible in the plots of mpg against hp and wt.
scatterplotMatrix(~mpg+hp+wt|cyl, data=mtcars, main="MPG - HP - WT vs CYL")
Here we have three main categories of cars based on the number of cylinders: 4, 6 and 8. It is easy to see that mpg drops as the number of cylinders increases, reflecting higher fuel consumption. Likewise, higher values of hp and wt correspond to lower mpg.
Appendix A displays the correlations across the whole set of variables in the dataset. Looking at the values for mpg, some of those correlations are negative. However, these correlations are computed for single predictors taken in isolation; with multiple regressors in the model, their impact on the outcome might change. The correlations also highlight another main point: mpg is strongly correlated with am, but also with cyl, disp, hp, drat, wt and vs, and what makes things even more complicated is the mutual correlation among these variables. Variance Inflation Factors (VIF) help to quantify how strongly each regressor is related to the others. For example, this is how the value for disp changes with and without wt in the model:
sort(sqrt(vif(lm(mpg~., mtcars))), decreasing = TRUE)
## [1] 22.432384 11.319053 7.769536 7.131081 5.312210 4.881683 3.284842
## [8] 3.151269 2.843970 2.787389 2.609533 2.304823 2.236068 2.209453
## [15] 1.834225 1.812413 1.775181 1.686407 1.634138 1.615405 1.414214
## [22] 1.414214 1.364858 1.000000 1.000000 1.000000 1.000000 1.000000
## [29] 1.000000 1.000000
sort(sqrt(vif(lm(mpg~. -wt, mtcars))), decreasing = TRUE)
## [1] 16.056162 10.255633 6.811936 5.272347 5.038395 3.176320 3.060302
## [8] 2.835853 2.498479 2.296159 2.244637 2.236068 1.789537 1.782223
## [15] 1.749372 1.683999 1.615540 1.580658 1.414214 1.414214 1.319970
## [22] 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
disp has the highest value, but it drops after removing wt, which suggests a strong relationship between the two.
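Since sorting the square root of the whole VIF matrix discards the term names, a small sketch that keeps them could look like this (assuming car::vif returns the usual GVIF matrix, with a "GVIF^(1/(2*Df))" column, for models containing factor terms):
# Sketch: named, Df-adjusted VIF values (assumes the GVIF matrix layout of car::vif)
gvif <- vif(lm(mpg ~ ., data = mtcars))
sort(gvif[, "GVIF^(1/(2*Df))"], decreasing = TRUE)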
To select the model, I will use mpg ~ am as the base model for the outcome mpg and incrementally add new regressors, ordered by their absolute correlation with mpg as shown in Appendix A, thus creating a set of nested models. With the anova function I will then select the best model based on the p-values:
fit1 <- lm(mpg ~ am , mtcars)
fit2 <- update(fit1, mpg ~ am + wt)
fit3 <- update(fit1, mpg ~ am + wt + cyl)
fit4 <- update(fit1, mpg ~ am + wt + cyl + disp)
fit5 <- update(fit1, mpg ~ am + wt + cyl + disp + hp)
fit6 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat)
fit7 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat + carb)
fit8 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat + carb + gear)
fit9 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat + carb + gear + qsec)
fit10 <- update(fit1, mpg ~ am + wt + cyl + disp + hp + drat + carb + gear + qsec + vs)
anova(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8, fit9, fit10)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + cyl
## Model 4: mpg ~ am + wt + cyl + disp
## Model 5: mpg ~ am + wt + cyl + disp + hp
## Model 6: mpg ~ am + wt + cyl + disp + hp + drat
## Model 7: mpg ~ am + wt + cyl + disp + hp + drat + carb
## Model 8: mpg ~ am + wt + cyl + disp + hp + drat + carb + gear
## Model 9: mpg ~ am + wt + cyl + disp + hp + drat + carb + gear + qsec
## Model 10: mpg ~ am + wt + cyl + disp + hp + drat + carb + gear + qsec +
## vs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 55.1371 2.129e-06 ***
## 3 27 182.97 2 95.35 5.9395 0.01259 *
## 4 26 182.87 1 0.10 0.0123 0.91304
## 5 25 150.41 1 32.46 4.0440 0.06266 .
## 6 24 150.10 1 0.31 0.0384 0.84726
## 7 19 132.74 5 17.36 0.4326 0.81894
## 8 17 125.78 2 6.96 0.4335 0.65609
## 9 16 124.03 1 1.75 0.2176 0.64756
## 10 15 120.40 1 3.63 0.4522 0.51151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model based on am + wt looks the best. The model am + wt + cyl is still significant, although its p-value is higher than that of am + wt. Let us see if it can be improved by including the interaction wt * cyl:
fit3Interaction <- update(fit1, mpg ~ am + wt + cyl + wt * cyl)
anova(fit1, fit2, fit3, fit3Interaction)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + cyl
## Model 4: mpg ~ am + wt + cyl + wt:cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 71.9823 7.916e-09 ***
## 3 27 182.97 2 95.35 7.7541 0.002399 **
## 4 25 153.71 2 29.26 2.3793 0.113263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is no real advantage with the interaction, so I will stick with the model mpg ~ am + wt.
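As a cross-check on the nested-models comparison (not part of the selection procedure above), an AIC-based stepwise search starting from the full model could be sketched as follows; it simply reports which model the automated search prefers:
# Cross-check sketch: AIC-based stepwise selection from the full model (stats::step)
step_fit <- step(lm(mpg ~ ., data = mtcars), trace = 0)
formula(step_fit)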
Comparing the coefficients of the models mpg ~ am and mpg ~ am + wt:
summary(fit1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## amM 7.244939 1.764422 4.106127 2.850207e-04
summary(fit2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.32155131 3.0546385 12.21799285 5.843477e-13
## amM -0.02361522 1.5456453 -0.01527855 9.879146e-01
## wt -5.35281145 0.7882438 -6.79080719 1.867415e-07
It is possible to see that in the first model the manual transmission looks better than the automatic one, with an estimated increase of +7.245 mpg. In the second model the two transmissions are practically the same. This means that the wt regressor strongly confounds the effect of the transmission, and we cannot easily state that one transmission is better than the other.
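To make this more concrete, the confidence intervals for the manual-transmission coefficient (named amM because of the factor labels used above) in the two models can be compared; a minimal sketch:
# Sketch: 95% confidence intervals for the amM coefficient
confint(fit1)["amM", ] # mpg ~ am: interval well above zero
confint(fit2)["amM", ] # mpg ~ am + wt: interval containing zero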
Appendix B displays the residual plots for the models fit2 and fit10. In both cases the models seem to fit the data reasonably well. However, a few observations are highlighted: the Normal Q-Q plot clearly shows that they deviate from the fitted linear model (potentially they have some leverage), while the Residuals vs Fitted plot otherwise covers the distribution of the data well.
The suspicious observations correspond to the cars Chrysler Imperial, Toyota Corolla and Fiat 128. Let's assess their influence using hatvalues to quantify their leverage, and dfbetas to see how much their inclusion affects the coefficients:
require(ggplot2)
require(reshape2)
model_under_investigation <- lm(mpg ~ am + wt - 1, mtcars)
htv <- melt(as.matrix(round(hatvalues(model_under_investigation), 3)))
ggplot(data=htv, aes(x=Var1, y=value, group=Var2)) + geom_line() + theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) + ylab('hatvalues') + xlab('cars') + ggtitle("Hatvalues: mpg ~ am + wt")
dfb <- melt(round(dfbetas(model_under_investigation), 3))
ggplot(data=dfb, aes(x=Var1, y=value, group=Var2)) + geom_line(aes(colour = Var2))+ theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) + ylab('dfbetas') + xlab('cars') + ggtitle("Dfbetas: mpg ~ am + wt")
The hatvalue peaks are not negligible, but they are not large enough to meaningfully affect the model. In the dfbetas plot the three cars show high peaks (both positive and negative). For the Chrysler Imperial the peaks for both am coefficients are negative while the one for wt is positive, which would explain why the Chrysler Imperial appears in the upper-left part of the Residuals vs Fitted plot. We can conclude that these observations have some influence, but they do not substantially affect the model.
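One way to double-check this conclusion (a sketch, not part of the original analysis) is to refit mpg ~ am + wt without the three cars and compare the coefficients:
# Sketch: refit without the three potentially influential cars and compare coefficients
influential <- c("Chrysler Imperial", "Toyota Corolla", "Fiat 128")
fit2_reduced <- update(fit2, data = mtcars[!rownames(mtcars) %in% influential, ])
rbind(full = coef(fit2), reduced = coef(fit2_reduced))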
It is also interesting to compare the models mpg ~ am, mpg ~ am + wt and mpg ~ . graphically:
plot(predict(fit2))
lines(predict(fit2), col = "black")
lines(predict(fit1), col = "blue")
lines(predict(lm(mpg ~ ., mtcars)), col = "red")
Overall we can say that the predictions of fit2 are closer to those of the full model fit10 than the predictions of fit1 are, indicating that am and wt are quite important regressors.
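A quick numeric counterpart of this visual comparison (a small sketch) is the adjusted R-squared of the three models:
# Sketch: adjusted R-squared of the three models being compared
c(fit1 = summary(fit1)$adj.r.squared,
  fit2 = summary(fit2)$adj.r.squared,
  fit10 = summary(fit10)$adj.r.squared)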
Looking at the model I have used (mpg ~ am + wt), there is no real difference between automatic and manual transmission once wt is taken into account. Residual analysis showed that this model fits the data about as well as the model including the full variable set. However, the VIF values and the correlation graph also show that some of the variables are mutually correlated, so adding or removing some of them could lead to over- or under-fitting the model.
As for the main question, Is an automatic or manual transmission better for MPG?, there is no clear answer, because it is not really possible to establish which regressors are more important than others. This is due to the fact that the variables are mutually correlated, and a linear model is not accurate enough to describe those interactions, so exchanging regressors can easily lead to different results.
More advanced models and more insightful data are required to give a more accurate answer.
Appendix A: Correlation matrices
require(GGally)
require(ggplot2)
lowerFn <- function(data, mapping, ...) {
p <- ggplot(data = data, mapping = mapping) +
geom_point(color = 'blue', alpha=0.3, size=0.5) +
geom_smooth(color = 'black', method='lm', size=0.7,...)
p
}
printCorrelations <- function(modelData, graphTitle) {
g <- ggpairs(
data = modelData,
title = graphTitle,
lower = list(
continuous = wrap(lowerFn)
),
upper = list(continuous = wrap("cor", size = 2))
)
g <- g + theme(
axis.text = element_text(size = 4),
axis.title = element_text(size = 4),
legend.background = element_rect(fill = "white"),
panel.grid.major = element_line(colour = NA),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey95")
)
suppressMessages(print(g))
}
printCorrelations( subset(mtcars, select = c(mpg,am)), "Correlation Matrix for MPG/AM" )
printCorrelations( mtcars, "Correlation Matrix for all variables" )
Appendix B: Residual plots for fit2 and fit10
par(mfrow=c(2,2))
plot(fit2)
par(mfrow=c(2,2))
plot(fit10)
## Warning: not plotting observations with leverage one:
## 30, 31
## Warning: not plotting observations with leverage one:
## 30, 31