This report presents a quantitative analysis of how miles per gallon (MPG) depends on the other variables in the mtcars dataset, a collection of cars published in the 1974 US magazine Motor Trend. The question addressed here is whether an automatic or a manual transmission is better for MPG.
Using multivariate linear regression and hypothesis testing, we identify the variables that have a significant impact on MPG. The resulting regression model is then validated by ANOVA, and its residuals are analyzed.
The mtcars data analyzed here originate from the 1974 Motor Trend magazine. The dataset contains information about 32 cars (observations) described by 11 variables: miles per (US) gallon (mpg), number of cylinders (cyl), displacement in cubic inches (disp), gross horsepower (hp), rear axle ratio (drat), weight in 1000 lbs (wt), quarter-mile time (qsec), engine type (0 = V-shaped, 1 = straight) (vs), transmission (0 = automatic, 1 = manual) (am), number of forward gears (gear), and number of carburetors (carb).
Let us first look at the data by sorting the cars by mpg in descending order. Among the 7 cars with the best MPG, only one, the Merc 240D, has an automatic transmission (am=0) (see Table 1).
require(pacman); p_load(pander,knitr); attach(mtcars)
cars_by_mpg <- mtcars[order(-mpg),] # sort by mpg
pander(head(cars_by_mpg, 7), caption = "A table of the first 7 cars with best MPG.")
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.9 | 1 | 1 | 4 | 1 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.2 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |
Fiat X1-9 | 27.3 | 4 | 79 | 66 | 4.08 | 1.935 | 18.9 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26 | 4 | 120.3 | 91 | 4.43 | 2.14 | 16.7 | 0 | 1 | 5 | 2 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20 | 1 | 0 | 4 | 2 |
detach(mtcars)
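The count behind this observation can be checked directly. The following is a minimal sketch that sorts the data without attach(); the names top7, n_auto and n_manual are introduced here purely for illustration.

```r
# Sort by mpg (descending) without attaching the data frame
cars_by_mpg <- mtcars[order(-mtcars$mpg), ]
top7 <- head(cars_by_mpg, 7)   # illustrative helper name

# Count automatic (am == 0) and manual (am == 1) cars among the top 7
n_auto   <- sum(top7$am == 0)
n_manual <- sum(top7$am == 1)

print(c(automatic = n_auto, manual = n_manual))
print(rownames(top7)[top7$am == 0])   # which of the top 7 are automatic
```

Indexing with mtcars$mpg keeps the search path clean, so no matching detach() call is needed.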
To test whether MPG differs between automatic and manual transmissions, we run a Welch two-sample t-test (see Table 2). The p-value of 0.001374 is less than \(\alpha\) = 0.05, so we reject the null hypothesis that there is no difference in mean MPG between automatic and manual transmissions. The t-test indicates that cars with manual transmission achieve higher MPG than cars with automatic transmission, and the 95% confidence interval shows that the true difference between manual and automatic lies between 3.21 MPG and 11.28 MPG. The difference between the two group means is about 7 MPG. See the violin plot of MPG vs am in Figure 1 in the Appendix.
ttest <- t.test(mpg ~ am, data= mtcars)
tt <- data.frame( "t-statistic"=ttest$statistic, "df" = ttest$parameter,
"p-value" = ttest$p.value, "cl-min" = ttest$conf.int[1], "cl-max" = ttest$conf.int[2],
"autom.mean" = ttest$estimate[1],"manual.mean" = ttest$estimate[2],row.names = "")
pander(tt, caption = "Two Sample t-test: automatic vs manual transmission")
t.statistic | df | p.value | cl.min | cl.max | autom.mean | manual.mean |
---|---|---|---|---|---|---|
-3.767 | 18.33 | 0.001374 | -11.28 | -3.21 | 17.15 | 24.39 |
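To make the numbers in this table less of a black box, the Welch statistic, its Welch-Satterthwaite degrees of freedom and the two-sided p-value can be recomputed by hand. This is only a sketch using base R; the variable names (auto, manual, t_welch, df_welch, p_welch) are illustrative and not part of the original analysis.

```r
# Split mpg by transmission type (0 = automatic, 1 = manual)
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Squared standard errors of the two group means
se2_auto   <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)

# Welch t statistic (automatic minus manual, matching t.test's group order)
t_welch <- (mean(auto) - mean(manual)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_welch <- 2 * pt(-abs(t_welch), df_welch)

print(c(t = t_welch, df = df_welch, p = p_welch))
```

The negative sign of the statistic only reflects the order of subtraction (automatic minus manual); it corresponds to the higher mean MPG of the manual group.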
We start from the simplest linear regression model, with transmission type as the only predictor and MPG as the outcome (see Table 3). This analysis confirms the result of the previous section: the am coefficient, 7.245, equals the difference between the two group means (24.39 - 17.15). From Table 4 we see that the adjusted R-squared value is 0.3385, so this model is poor and explains only about 34 % of the variation in MPG. We therefore need to consider other variables as predictors as well.
model1 <- lm(mpg~am,data = mtcars); su1 <- summary(model1)
df <- data.frame("R-squared" = su1$r.squared, "Adjusted R-squared" = su1$adj.r.squared)
pander(model1); pander(df, caption = "R-squared values for single model")
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
am | 7.245 | 1.764 | 4.106 | 0.000285 |
(Intercept) | 17.15 | 1.125 | 15.25 | 1.134e-15 |
R.squared | Adjusted.R.squared |
---|---|
0.3598 | 0.3385 |
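With a single binary predictor, the regression is just a restatement of the two group means: the intercept is the mean of the reference group (automatic) and the slope is the difference between the means. A short sketch, with illustrative variable names, shows the correspondence:

```r
# Simple regression of mpg on the 0/1 transmission indicator
fit <- lm(mpg ~ am, data = mtcars)

# Group means of mpg; tapply names them "0" (automatic) and "1" (manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
mean_diff   <- unname(group_means["1"] - group_means["0"])

print(coef(fit))
print(c(auto_mean = unname(group_means["0"]), mean_diff = mean_diff))
```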
To choose the variables with the greatest impact on MPG, we use the MASS library and run a step-wise model selection. The results are shown in Table 5 and Table 6. In addition to transmission, the procedure selects the car's weight (wt) and quarter-mile time (qsec, an inverse measure of acceleration) as important variables. The adjusted R-squared value is now 0.8336, a much better fit: the model explains about 83 % of the variation in MPG. Another automatic way to choose variables is the best subset method. Here we run it with a maximum of 4 variables; it returns a model with an even higher adjusted R-squared than the previous one, but we do not use it, because horsepower (hp) is highly correlated with weight (wt) and acceleration, and including it risks overfitting (see Appendix, Figure 2).
require(pacman); p_load(leaps,pander,knitr, MASS)
model2 <- stepAIC(lm(mpg ~ . ,data=mtcars), trace = FALSE); su2 <- summary(model2)
df1 <- data.frame("R-squared" = su2$r.squared, "Adjusted R-squared" =su2$adj.r.squared)
pander(model2); pander(df1, caption = "R-squared values for step-wise model")
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
wt | -3.917 | 0.7112 | -5.507 | 6.953e-06 |
qsec | 1.226 | 0.2887 | 4.247 | 0.0002162 |
am | 2.936 | 1.411 | 2.081 | 0.04672 |
(Intercept) | 9.618 | 6.96 | 1.382 | 0.1779 |
R.squared | Adjusted.R.squared |
---|---|
0.8497 | 0.8336 |
regmodel1 <- regsubsets(mpg ~ ., data = mtcars, nvmax = 4)
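The collinearity concern about hp can be examined with plain pairwise correlations. The sketch below uses only base R; the choice of the three variables follows the text, while the object names are illustrative.

```r
# Pairwise Pearson correlations between horsepower, weight and quarter-mile time
vars    <- c("hp", "wt", "qsec")
cor_mat <- cor(mtcars[, vars])

print(round(cor_mat, 2))
```

A strong positive hp-wt correlation and a strong negative hp-qsec correlation would support leaving hp out of a model that already contains wt and qsec.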
Here we investigate whether there are interactions between the variables included in our model. Including interactions may further improve the regression model. We introduce an interaction term into the previous model to capture the different slopes and intercepts for automatic and manual transmissions. The most significant interaction term originates from the fact that cars with manual transmission weigh less than cars with automatic transmission. See Figure 3 in the Appendix, where MPG versus car weight is depicted for the different transmission types.
To select the final model, we run an ANOVA test.
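As a sketch of what this comparison looks like, the three nested models can be passed to anova() in order of increasing complexity. The step-wise model is written out explicitly here as mpg ~ wt + qsec + am, which is an assumption based on the terms reported for it; the object names are illustrative.

```r
# Three nested models, from simplest to most complex
m_am   <- lm(mpg ~ am, data = mtcars)
m_step <- lm(mpg ~ wt + qsec + am, data = mtcars)          # assumed step-wise result
m_int  <- lm(mpg ~ wt + qsec + am + wt:am, data = mtcars)  # adds the interaction

# Sequential F-tests: each row compares a model with the one before it
aov_tab  <- anova(m_am, m_step, m_int)
print(aov_tab)

# p-values of the two comparisons (the first row has no comparison, hence NA)
p_values <- aov_tab[["Pr(>F)"]]
print(p_values)
```

A small p-value in a row indicates that the terms added in that step reduce the residual sum of squares by more than chance alone would explain.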
Our final model includes the interaction term wt:am between weight and the binary transmission variable. This substantially improves on the previous model: the adjusted R-squared value is now 0.88. The results in Table 7 give \(mpg = 9.723 + (14.079 - 4.141\,wt)\,am + 1.017\,qsec - 2.937\,wt\), meaning that changing from automatic to manual transmission changes MPG by \(14.079 - 4.141\,wt\) on average. This is a gain for cars with wt below 14.079/4.141 = 3.4, that is, for cars weighing less than about 3400 lb.
model_final<-lm(mpg ~ wt + qsec + am + wt:am, data=mtcars); su <- summary(model_final)
ant <- anova(model1,model2,model_final)
dff <- data.frame("R-squared" = su$r.squared, "Adjusted R-squared" = su$adj.r.squared)
pander(model_final); pander(dff, caption = "R-squared values for final model")
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
wt | -2.937 | 0.666 | -4.409 | 0.0001489 |
qsec | 1.017 | 0.252 | 4.035 | 0.000403 |
am | 14.08 | 3.435 | 4.099 | 0.0003409 |
wt:am | -4.141 | 1.197 | -3.46 | 0.001809 |
(Intercept) | 9.723 | 5.899 | 1.648 | 0.1109 |
R.squared | Adjusted.R.squared |
---|---|
0.8959 | 0.8804 |
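Because of the interaction, the effect of a manual transmission is not a single number but a linear function of weight. The following sketch evaluates that function and the weight at which it changes sign; am_effect and breakeven_wt are hypothetical helper names introduced for illustration.

```r
# Refit the final model so the block is self-contained
fit <- lm(mpg ~ wt + qsec + am + wt:am, data = mtcars)
b   <- coef(fit)

# Expected MPG change when switching automatic -> manual at a given weight
# (wt is in units of 1000 lbs, as in mtcars)
am_effect <- function(wt) unname(b["am"] + b["wt:am"] * wt)

# Weight at which the manual-transmission effect is zero
breakeven_wt <- unname(-b["am"] / b["wt:am"])

print(breakeven_wt)
print(am_effect(c(2, 3, 4)))   # effect for cars of 2000, 3000 and 4000 lbs
```

Below the break-even weight the estimated effect of a manual transmission is positive, and above it the estimate turns negative.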
See Appendix Figure 4. Looking at the residual plots, we can state the following:
DFBETAS measure how much the coefficients change when a point is deleted. We do not find any influential points, and our model meets the requirements of linear regression.
c(max(dfbetas(model_final)), max(hatvalues(model_final)))
## [1] 0.9769875 0.3742640
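One caveat about taking max() of the raw DFBETAS: the values are signed, so a large negative change would be missed. A hedged variant looks at absolute values and also lists the cars with the highest leverage. The object names below are illustrative, and no particular cutoff is implied.

```r
# Refit the final model so the block is self-contained
fit <- lm(mpg ~ wt + qsec + am + wt:am, data = mtcars)

# DFBETAS: scaled change in each coefficient when one observation is dropped
dfb        <- dfbetas(fit)
max_signed <- max(dfb)        # what the signed maximum reports
max_abs    <- max(abs(dfb))   # largest change in either direction

# Leverage (diagonal of the hat matrix), three largest values with car names
hat          <- hatvalues(fit)
top_leverage <- head(sort(hat, decreasing = TRUE), 3)

print(c(max_signed = max_signed, max_abs = max_abs))
print(top_leverage)
```

The hat values always sum to the number of estimated coefficients (here 5), so their average is 5/32; observations far above that average deserve a closer look.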
library(pacman); p_load(ggplot2, GGally, scales, psych)
ggplot(mtcars, aes(y=mpg, x=factor(am, labels = c("Automatic","Manual")),
fill=factor(am)))+geom_violin(size=0.5,trim=FALSE)+scale_fill_brewer(palette="Accent")+
xlab("Transmission Type") + ylab("Miles per Gallon (MPG)") +
geom_boxplot(width=0.15, color="red", size=0.5) +
geom_jitter(shape=16, position=position_jitter(0.4)) + theme(legend.position="none")
Violin plot of MPG vs transmission type.
my_fn <- function(data, mapping, method="lm", ...){
p <- ggplot(data = data, mapping = mapping) + geom_point() +
geom_smooth(method=method, ...)
p}
pairs.panels(mtcars[, c(1,4,6,7,9)], ellipses = FALSE, breaks = 8)
Pair Correlations and Histograms Panel Graph
# Fit mpg ~ wt separately for automatic (am == 0) and manual (am == 1) cars
auto.fit <- lm(mpg ~ wt, data = mtcars[mtcars$am == 0,])
man.fit <- lm(mpg ~ wt, data = mtcars[mtcars$am == 1,])
# Points: color 4 (blue) for automatic, color 2 (red) for manual, matching the lines
plot(mpg ~ wt, data = mtcars, pch = 19, col = (am == 0)*2 + 2, xlab = "Weight (1000 lbs)",
ylab = "Miles per Gallon (MPG)")
abline(auto.fit, lwd = 3, col = "blue")
abline(man.fit, lwd = 3, col = "red")
legend(4.4,34, c("Automatic","Manual"), lty=c(1,1), lwd=c(3,3),col=c("blue","red"))
MPG versus Weight by Transmission Type
par(mfrow = c(2, 2)); plot(model_final)
Residuals and Diagnostics Plot