Best transmission choice for higher MPG: manual or automatic

Executive summary

This report analyses the impact of different variables on fuel economy, measured in miles per gallon (MPG), using the mtcars data set. Specifically, we compare which type of transmission is more beneficial for MPG, automatic or manual, and quantify the MPG difference between the two.
We try both simple linear regression and multivariable linear regression. In both cases the results show that manual transmission is more beneficial for MPG. For otherwise identical cars, manual transmission adds 2.9358 to MPG on average.

Exploratory data analysis

library(ggplot2)
data(mtcars)
mtcars$am = as.factor(mtcars$am)
g = ggplot(aes(x=am, y = mpg), data = mtcars) +        
        geom_boxplot(aes(fill=am)) +
        geom_point(aes(color = am)) +        
        scale_colour_manual(breaks = c("0", "1"),
                      labels = c("automatic", "manual"),
                      values = c("#0072B2", "#D55E00")) +
        scale_fill_manual(breaks = c("0", "1"),
                      labels = c("automatic", "manual"),
                      values = c("#0072B2", "#D55E00"))+
        xlab("Transmission type") +
        ylab("Miles per gallon")

g

The box plot suggests that mpg is indeed associated with the transmission type: cars with manual transmission tend to have higher mpg. Now let’s look for other dependencies by plotting mpg against the remaining variables to see the broad picture:

pairs(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + gear + carb, data = mtcars, upper.panel = panel.smooth, cex = 1.5, pch=21, bg = "thistle3")

From the plots we can see that there are some correlations between mpg and disp (-), mpg and hp (-), mpg and drat (+), and mpg and wt (-).
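The signs read off the scatterplot matrix can be checked numerically. The following is a minimal sketch, not part of the original analysis; the names `mt`, `vars` and `mpg_cor` are illustrative, and a fresh copy of the data is used so that the `mtcars` object modified earlier is left untouched.

```r
# Work on a private copy so the session's mtcars (with am as a factor) is not changed.
mt <- datasets::mtcars

# Pearson correlation of mpg with each variable mentioned in the text.
vars <- c("disp", "hp", "drat", "wt")
mpg_cor <- sapply(vars, function(v) cor(mt$mpg, mt[[v]]))

# Negative values mean mpg falls as the variable grows; positive values mean it rises.
round(mpg_cor, 2)
```

A correlation only describes a pairwise linear association; it does not say which of these variables matter once the others are accounted for, which is what the regression models are for.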

Statistical inference

Let’s test the null hypothesis that the mean mpg is the same for both transmission types by running a t-test:

t.test(mpg ~ am, data = mtcars)$conf.int

## [1] -11.280194  -3.209684
## attr(,"conf.level")
## [1] 0.95

t.test(mpg ~ am, data = mtcars)$p.value

## [1] 0.001373638

t.test(mpg ~ am, data = mtcars)$estimate

## mean in group 0 mean in group 1 
##        17.14737        24.39231

# Run the test once and take manual mean minus automatic mean
tt = t.test(mpg ~ am, data = mtcars)
dif = tt$estimate[2] - tt$estimate[1]

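The p-value is very small (about 0.0014), so we can reject the null hypothesis and say that cars with manual transmission do have a higher mean mpg than cars with automatic transmission. The difference in means is approximately 7.2449393. Note that this comparison does not take any other variable into account.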
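The reported pieces of the test can all be taken from one fitted object. This is a small sketch under the assumption that R's default Welch two-sample t-test is the intended test; `mt`, `tt` and `dif` are illustrative names.

```r
# Private copy of the data; am stays numeric (0 = automatic, 1 = manual) here.
mt <- datasets::mtcars

# Welch two-sample t-test (the default, var.equal = FALSE), run once.
tt <- t.test(mpg ~ am, data = mt)

# Difference in group means: manual (group 1) minus automatic (group 0).
dif <- unname(tt$estimate[2] - tt$estimate[1])
round(dif, 2)

# The interval is for (automatic - manual), so both limits are negative.
tt$conf.int
tt$p.value
```

Because the confidence interval is computed for the automatic mean minus the manual mean, an interval that lies entirely below zero corresponds to a higher mean mpg for manual cars.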

Regression analysis

We’ll start with the simplest model: mpg ~ am.

model1 = lm(mpg ~ am, data=mtcars)
summary(model1)
rsq = summary(model1)$r.squared

The am variable has a good significance level, but the R-squared doesn’t look very impressive: the model explains only about 36% (0.3597989) of the variance in mpg. It looks like we need to include other variables in our model. To do that, let’s first build a regression model with all the available independent variables:

model2 = lm(mpg ~ ., data=mtcars)
summary(model2)
rsq = summary(model2)$r.squared

The R-squared is 0.8690158, which looks really good: the model explains about 87% of the variance in the dependent variable. However, there is a problem with the significance of the variables: none of the individual coefficients is significant at the 0.05 level. Let’s use the stepwise algorithm to find a better model:
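Since `summary()` reports both a multiple R-squared and an adjusted R-squared, it can help to print the two side by side. The sketch below refits the two models on a private copy of the data; the names `mt`, `fits` and `fit_stats` are illustrative and not part of the original analysis.

```r
# Private copy with am as a factor, mirroring the setup used in this report.
mt <- datasets::mtcars
mt$am <- as.factor(mt$am)

# The single-predictor model and the model with every available predictor.
fits <- list(
  am_only = lm(mpg ~ am, data = mt),
  full    = lm(mpg ~ ., data = mt)
)

# One row per model: multiple R-squared and adjusted R-squared.
fit_stats <- t(sapply(fits, function(m) {
  s <- summary(m)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
}))
round(fit_stats, 4)
```

The adjusted value penalises the number of predictors, so it is lower than the multiple R-squared for each of these fits, and the gap is larger for the model with more terms.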

step(model2)

The model with the smallest AIC value (61.31) is mpg ~ wt + qsec + am. Let’s try it:
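The selected formula and its AIC can also be retrieved without reading the full trace. This is a minimal sketch, assuming the default backward search that `step()` performs when no `scope` is given; `mt`, `full` and `best` are illustrative names.

```r
# Private copy with am as a factor, mirroring the setup used in this report.
mt <- datasets::mtcars
mt$am <- as.factor(mt$am)

full <- lm(mpg ~ ., data = mt)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step printout.
best <- step(full, trace = 0)

# Formula of the selected model.
formula(best)

# Equivalent degrees of freedom and AIC on the scale that step() reports.
extractAIC(best)
```

Note that `extractAIC()` and `AIC()` differ by an additive constant for linear models, so only the value from `extractAIC()` is directly comparable with the numbers in the `step()` trace.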

model3 = lm(mpg ~ wt + qsec + am, data = mtcars)
summary(model3)
rsq = summary(model3)$r.squared

This time the R-squared still looks good (0.8496636), and the model itself is much more reasonable: all the variables are significant at the 0.05 level. So this last model looks like the best choice, and we can now quantify the difference: for otherwise identical cars (same wt and qsec), manual transmission adds 2.9358 to MPG on average.
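A point estimate is easier to interpret together with its uncertainty. The sketch below pulls the transmission coefficient and a 95% confidence interval from the final model; it refits the model on a private copy of the data, and the names `mt`, `fit`, `am_row` and `am_ci` are illustrative.

```r
# Private copy with am as a factor, so the coefficient is named "am1" (manual vs automatic).
mt <- datasets::mtcars
mt$am <- as.factor(mt$am)

fit <- lm(mpg ~ wt + qsec + am, data = mt)

# Estimate, standard error, t value and p-value for the transmission effect.
am_row <- coef(summary(fit))["am1", ]
round(am_row, 4)

# 95% confidence interval for the same coefficient.
am_ci <- confint(fit, "am1", level = 0.95)
am_ci
```

If the interval is wide, the advantage of manual transmission, holding wt and qsec fixed, is estimated with considerable uncertainty even though its direction is clear.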

Residuals analysis

This part looks at the diagnostic plots of the final model (model3).

par(mfrow=c(2,2))
plot(model3)

When doing the analysis we had to make several assumptions. Let’s now check them:

The Residuals vs Fitted plot shows no obvious pattern, which is consistent with the observations being independent.
The Scale-Location plot shows randomly scattered points, so the variance of the residuals appears to be constant.
The Q-Q plot shows that the residuals roughly follow the normal distribution.
The Residuals vs Leverage plot shows no influential outliers: all the points lie within the Cook’s distance contours.
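The visual checks above can be complemented with a few numbers. This is a hedged sketch using base R only: the Shapiro-Wilk test is one possible normality check and was not part of the original analysis, and the names `mt`, `fit`, `res`, `sw` and `cd` are illustrative.

```r
# Private copy with am as a factor, mirroring the setup used in this report.
mt <- datasets::mtcars
mt$am <- as.factor(mt$am)

fit <- lm(mpg ~ wt + qsec + am, data = mt)
res <- resid(fit)

# Shapiro-Wilk test: a small p-value would cast doubt on normality of the residuals.
sw <- shapiro.test(res)
sw$p.value

# Cook's distance for every car; the largest values point at the most influential observations.
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)
```

With only 32 observations these checks have limited power, so they are best read alongside the plots rather than instead of them.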