This month we review the effect of transmission type, automatic or manual (am), on fuel consumption, measured in miles per gallon (mpg), across several car models. For more information about this data, please visit the dataset web page.
Here we show that hypothesis testing and linear regression on those two variables alone do not give an accurate description of how transmission affects fuel consumption, because they rest on strong assumptions. Multivariate regression analysis lets us account for other factors that could affect the relationship between mpg and am.
A first look at the data suggests that the type of transmission installed can affect fuel consumption:
There appears to be a relationship between the kind of transmission and fuel consumption. To examine this further, we perform a short inferential analysis using a t-test (Supplementary Information, 2):
##        Transmission Type
## auto            17.14737
## manual          24.39231
Is this difference significant? Let's look at the p-value:
## [1] 0.0006868192
The p-value is far below the 0.05 significance level (the accepted type I error rate), so we can reject the null hypothesis and state that manual transmission is associated with more economical fuel consumption (higher miles per gallon). BUT this interpretation assumes that the cars are otherwise the same, and we know that cars are more than their gearboxes. So we have to consider other variables.
For starters we have to analyze the possible correlations between mpg and the other 10 variables:
##         wt        cyl       disp         hp       carb       qsec
## -0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251  0.4186840
##       gear         am         vs       drat        mpg
##  0.4802848  0.5998324  0.6640389  0.6811719  1.0000000
So other variables also seem to have a possible effect on mpg. We can see this as well if we plot some of these variables together with mpg and am (see Supplementary Information, 3).
Our first linear model (fit) is the simplest linear regression model for mpg, with am as the only predictor:
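The correlations listed above can be reproduced in a single call. This is a minimal sketch, assuming the built-in mtcars data frame, in which every column is numeric; the name mpg_cor is ours and does not come from the original analysis.

```r
# Pearson correlation of every variable with mpg, sorted from the most
# negative to the most positive (mpg correlates perfectly with itself)
mpg_cor <- sort(cor(mtcars)[, "mpg"])
print(mpg_cor)
```

Sorting makes it easy to see which variables move against mpg (wt, cyl, disp, hp) and which move with it (drat, vs, am).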
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
## [1] 0.3597989
As we can see, the intercept is the mean mpg for cars with automatic transmission, while the am coefficient is the estimated increase in mean mpg for cars with manual transmission. The effect is significant, with a very low p-value, BUT R squared is also low, only 0.36, which means that this model explains only 36% of the variance. If we look at the Residuals vs Fitted plot for our first model (Supplementary Information, 4), there is a clear pattern, indicating that other variables also affect mpg (as we saw in our correlation analysis).
The next question is: can we do better? Is there any combination of predictors (including am) that gives us a better model? To address this we use the step() function, which performs stepwise model selection by AIC, to find the best model (fit3):
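A minimal sketch of how this first model can be fitted and how the reading of its coefficients can be checked, assuming the built-in mtcars data. The name fit follows the text; mean_auto and mean_manual are illustrative names of ours.

```r
# Simple linear regression of mpg on transmission type (0 = automatic, 1 = manual)
fit <- lm(mpg ~ am, data = mtcars)
print(summary(fit)$coef)
print(summary(fit)$r.squared)

# With a single 0/1 predictor, the intercept is the automatic-group mean
# and the am coefficient is the difference between the two group means
mean_auto <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])
print(all.equal(unname(coef(fit)["(Intercept)"]), mean_auto))
print(all.equal(unname(coef(fit)["am"]), mean_manual - mean_auto))
```

In other words, this model carries the same information as the two group means from the t-test, which is why it inherits the same limitation.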
fit3 <- step(lm(mpg ~ ., mtcars), trace = 0)
summary(fit3)$coef; summary(fit3)$r.squared
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## am 2.935837 1.4109045 2.080819 4.671551e-02
## [1] 0.8496636
This shows that the weight of the vehicle and its quarter-mile time (qsec, a measure of acceleration) explain most of the variation in mpg. So we have a model with our predictor of interest (am) and 2 confounders (wt and qsec) that explains about 85% of the variance. Including all 9 other variables (fit2 model, Supplementary Information, 5) gives a model that explains 87% of the variance, but none of its coefficients is significant at the 0.05 level, which indicates that it includes unnecessary variables.
Finally we perform an analysis of variance (ANOVA) comparing the simple linear model (fit) with the best-fitting model (fit3):
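One way to make the comparison between the full model and the reduced model concrete is to look at measures that penalize extra predictors. The sketch below assumes the built-in mtcars data and uses adjusted R squared and AIC as the comparison criteria; the name adj_r2 is ours.

```r
# Full model with every variable, and the model selected by step()
fit2 <- lm(mpg ~ ., data = mtcars)
fit3 <- step(fit2, trace = 0)

# Adjusted R squared penalizes predictors that add little explanatory power
adj_r2 <- c(fit2 = summary(fit2)$adj.r.squared,
            fit3 = summary(fit3)$adj.r.squared)
print(adj_r2)

# AIC is the criterion step() minimizes; lower is better
print(AIC(fit2, fit3))
```

Plain R squared can only go up as predictors are added, so these penalized measures are a fairer basis for saying that the smaller model is preferable.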
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)
## 1     30 720.90
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
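The table above is the result of an F-test on two nested models. A minimal sketch of how it can be produced, assuming the built-in mtcars data; here fit3 is written out with the formula shown as Model 2 rather than obtained through step(), and the name nested is ours.

```r
# Simple model and the three-predictor model from the table above
fit <- lm(mpg ~ am, data = mtcars)
fit3 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# F-test for nested models: do wt and qsec significantly reduce the
# residual sum of squares compared with using am alone?
nested <- anova(fit, fit3)
print(nested)
```

The very small p-value indicates that adding wt and qsec improves the fit far more than would be expected by chance.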
In summary, we cannot answer the question "Is an automatic or manual transmission better for MPG?" using mpg and am alone; we also have to take the weight and the acceleration into account. We conclude that in our model “fit3”, where wt and qsec are included as confounders, cars with manual transmission get an estimated 2.94 more miles per gallon (and are thus more economical) than cars with automatic transmission of the same weight and quarter-mile time.
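The 2.94 figure is a point estimate, so it helps to look at its uncertainty as well. The sketch below assumes the built-in mtcars data and the fit3 formula reported earlier; the 95% level and the name ci_am are our choices.

```r
# Refit the selected model with its formula written out
fit3 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# 95% confidence interval for the am coefficient: the plausible range for
# the manual vs automatic difference at fixed weight and quarter-mile time
ci_am <- confint(fit3, "am", level = 0.95)
print(ci_am)
```

Because the p-value for am in fit3 is only just below 0.05, the lower end of this interval lies close to zero, so the size of the advantage should be read with caution.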
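library(ggplot2)    # qplot()
library(gridExtra)  # grid.arrange()
# Scatter plot: mpg on the x-axis, transmission code on the y-axis
g1 <- qplot(mpg, am, data = mtcars, main = "Mileage by transmission",
            xlab = "Miles per Gallon", ylab = "0 = automatic, 1 = manual")
# Box plot with jittered points: transmission on the x-axis so the labels match the axes
g4 <- qplot(factor(am), mpg, data = mtcars, geom = c("boxplot", "jitter"),
            fill = factor(am), main = "Mileage by Transmission",
            xlab = "0 = automatic, 1 = manual", ylab = "Miles per Gallon")
grid.arrange(g4, g1, ncol = 2, nrow = 1)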
# Group mpg by transmission type and run a one-sided Welch t-test
manual <- mtcars$mpg[mtcars$am == 1]
auto <- mtcars$mpg[mtcars$am == 0]
tab <- matrix(c(mean(auto), mean(manual)))
rownames(tab) <- c("auto", "manual")
colnames(tab) <- "Transmission Type"
print(tab)
# Alternative hypothesis: mean mpg is greater for manual than for automatic cars
t.test(manual, auto, alternative = "greater")$p.value
# Transmission on the x-axis and mpg on the y-axis, so the labels match the axes;
# point size encodes a third variable (cyl, carb, or wt)
g3bis <- qplot(am, mpg, data = mtcars, size = cyl, main = "Cylinders effect on mpg",
               xlab = "0 = automatic, 1 = manual", ylab = "Miles per Gallon")
g3bis2 <- qplot(am, mpg, data = mtcars, size = carb, main = "Carburetors effect on mpg",
                xlab = "0 = automatic, 1 = manual", ylab = "Miles per Gallon")
g3bis3 <- qplot(am, mpg, data = mtcars, size = wt, main = "Weight effect on mpg",
                xlab = "0 = automatic, 1 = manual", ylab = "Miles per Gallon")
grid.arrange(g3bis, g3bis2, g3bis3, ncol = 3, nrow = 1)
The data show different trends depending on the variables we choose to include in our analysis, supporting the idea that we cannot rely on a t-test that considers only mpg and am.
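# Residuals vs Fitted plots for the three models (fit3 comes from the step() call above)
fit <- lm(mpg ~ am, data = mtcars)   # simple model, Model 1 in the ANOVA table
fit2 <- lm(mpg ~ ., data = mtcars)   # every other variable as a predictor
par(mfrow = c(1, 3))
plot(fit, which = 1)
title(main = "Linear regression fit model", cex = 1.5)
plot(fit2, which = 1)
title(main = "Multivariate analysis fit2 model", cex = 1.5)
plot(fit3, which = 1)
title(main = "Multivariate analysis fit3 model", cex = 1.5)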
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07