This study investigates the fuel efficiency of motor vehicles and how it relates to their transmission type, as part of the course project for Coursera’s ‘Regression Models’ course. The measure of efficiency is MPG (miles per gallon), and the transmission types we will address are manual and automatic. We will investigate 2 main questions: (1) whether the type of transmission is relevant to fuel efficiency, and (2) how large the MPG difference between manual and automatic transmissions is.
The study will use exploratory analysis to obtain a first answer to the above, then use regression to assess the certainty of that answer.
The dataset is mtcars, which we load together with the libraries we’ll need for the study:
options(scipen=999)
library(datasets)
library(dplyr)
library(ggplot2)
data(mtcars)
mtcars comprises 32 observations on 11 variables.
We will transform vs, am, gear and carb to factors because they represent categories. The number of cylinders (cyl), while discrete, will not be treated as categorical because it represents a count. Although a Poisson regression would be more suitable for counts, this tool is out of our scope for this work.
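# Recode the categorical variables as factors, renaming vs and am for readability
mtcars <- mtcars %>% mutate(straight=as.factor(vs),
                            manualtrans=as.factor(am),
                            gear=as.factor(gear),
                            carb=as.factor(carb))
# Drop the original vs and am columns now that they have factor counterparts
mtcars <- mtcars %>% dplyr::select(mpg,cyl,disp,hp,drat,wt,qsec,
                                   straight,manualtrans,gear,carb)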
We’ll approach the 1st question with a boxplot to get a first hint at the answer.
g <- ggplot(data=mtcars, aes(x=manualtrans, y=mpg, fill=manualtrans)) +
    guides(fill="none") + geom_boxplot() +
    # Mark each group's mean with a cross; fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
    stat_summary(fun=mean, geom="point", shape=4, size=4) +
    xlab('Manual Transmission (0 = Automatic, 1 = Manual)') +
    ylab('Fuel Efficiency (mpg)') +
    ggtitle('Fuel efficiency by transmission type')
g
The mean mpg of manual transmission vehicles is 24.392, while for automatic transmission cars it is 17.147. With this we address part of question 2: manual transmissions are 1.42 times as efficient as automatic ones (a 42% higher mean mpg), but how significant is this difference?
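The two means and their ratio can be recomputed directly. This is a minimal sketch that works from the untouched `datasets::mtcars` (where `am` is still coded 0 = automatic, 1 = manual); the names `raw`, `means` and `ratio` are ours, chosen for illustration.

```r
# Illustrative names (raw, means, ratio); not part of the study's own code.
raw <- datasets::mtcars

# Mean mpg per transmission type; the vector is named "0" and "1"
means <- tapply(raw$mpg, raw$am, mean)

# How many times the manual mean is of the automatic mean
ratio <- means[["1"]] / means[["0"]]

round(means, 3)
round(ratio, 2)
```

The ratio is only a description of the sample; it says nothing yet about how certain the difference is.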
We’ll test the significance of the relationship between mpg and transmission type with a linear model and take a look at the coefficients.
fitmanual <- lm(mpg ~ manualtrans, mtcars)
summary(fitmanual)$coef
##               Estimate Std. Error   t value                Pr(>|t|)
## (Intercept)  17.147368   1.124603 15.247492 0.000000000000001133983
## manualtrans1  7.244939   1.764422  4.106127 0.000285020743935067769
confint(fitmanual)
##                 2.5 %   97.5 %
## (Intercept)  14.85062 19.44411
## manualtrans1  3.64151 10.84837
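As a cross-check on the interval for manualtrans1, a regression on a single two-level factor is equivalent to a two-sample t-test with pooled variance. The sketch below assumes the untouched `datasets::mtcars` and uses illustrative names (`raw`, `fit`, `ci_lm`, `tt`, `ci_tt`) that are not part of the study's code.

```r
# Illustrative sketch: compare the lm() interval with a pooled-variance t-test.
raw <- datasets::mtcars

fit <- lm(mpg ~ factor(am), data = raw)
ci_lm <- confint(fit)[2, ]   # CI for the manual-minus-automatic difference

# t.test() with a formula reports automatic minus manual,
# so we flip the sign and the order of the bounds before comparing.
tt <- t.test(mpg ~ am, data = raw, var.equal = TRUE)
ci_tt <- rev(-tt$conf.int)

all.equal(unname(ci_lm), as.numeric(ci_tt))

# For comparison: the default Welch test does not pool the group variances
t.test(mpg ~ am, data = raw)$conf.int
```

The pooled test uses n - 2 = 30 degrees of freedom, the same as the residual degrees of freedom of the regression.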
The coefficients show that automatic transmissions achieve 17.147 mpg on average, and manual transmissions manage 17.147 + 7.245 = 24.392 mpg, just as our exploratory analysis showed. The slope coefficient is highly significant, with a p-value of 0.00029 and a 2-sided 95% CI of (3.642, 10.848) mpg for the manual-minus-automatic difference. Note that this CI uses the pooled residual variance with n-2 = 30 DF, so it can differ from the interval reported by t.test, which by default does not pool the group variances. This addresses question 2 fully. However, this model only explains 35.98% of the variance, so we need to reason further about what really affects mileage. Since our key measure is fuel consumption, we should consider variables that are related to it. We have decided on weight (wt) and the number of cylinders (cyl).
We’ll perform a simple variable selection by regressing mpg on manualtrans, cyl, wt and several of their combinations, and selecting the model that explains the most variance (R2).
# Individual models
c(summary(lm(mpg ~ manualtrans, mtcars))$r.squared,
summary(lm(mpg ~ cyl, mtcars))$r.squared,
summary(lm(mpg ~ wt, mtcars))$r.squared)
## [1] 0.3597989 0.7261800 0.7528328
# Paired models
c(summary(lm(mpg ~ manualtrans + cyl, mtcars))$r.squared,
summary(lm(mpg ~ manualtrans + wt, mtcars))$r.squared)
## [1] 0.7590135 0.7528348
# Complete model
fitmanualcylwt <- lm(mpg ~ manualtrans + cyl + wt, mtcars)
summary(fitmanualcylwt)$r.squared
## [1] 0.8303383
From the above we see that the complete model explains the most variance, at 83.03%. With that, we need to delve deeper into this model to settle on the set of independent variables that do affect mpg.
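Because R2 never decreases when a predictor is added, the complete model will always score at least as well as its sub-models on this criterion. A complementary check, sketched here on the assumption that adjusted R2 is an acceptable penalized criterion for this exercise, ranks all seven combinations, including the cyl + wt pair. The names `raw`, `candidates` and `adj_r2` are illustrative.

```r
# Illustrative sketch: rank every combination of the three predictors
# by adjusted R-squared, which penalizes extra terms.
raw <- transform(datasets::mtcars, manualtrans = factor(am))

candidates <- c("manualtrans", "cyl", "wt",
                "manualtrans + cyl", "manualtrans + wt", "cyl + wt",
                "manualtrans + cyl + wt")

adj_r2 <- sapply(candidates, function(rhs) {
  fit <- lm(as.formula(paste("mpg ~", rhs)), data = raw)
  summary(fit)$adj.r.squared
})

# Highest adjusted R-squared first
round(sort(adj_r2, decreasing = TRUE), 4)
```

Adjusted R2 is one of several possible criteria; AIC or a nested F-test would be reasonable alternatives.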
summary(fitmanualcylwt)$coef
##                Estimate Std. Error    t value                Pr(>|t|)
## (Intercept)  39.4179334  2.6414573 14.9227979 0.000000000000007424998
## manualtrans1  0.1764932  1.3044515  0.1353007 0.893342147923960827605
## cyl          -1.5102457  0.4222792 -3.5764148 0.001291604589147556711
## wt           -3.1251422  0.9108827 -3.4308942 0.001885894386856281756
We can see from the coefficients that the transmission type of the vehicle is not significant, with a very high p-value of 0.89, so for this exercise, and under our set of assumptions, we are sticking with wt and cyl as explanatory variables. Removing the transmission type, we get a new set of coefficients:
fitfinal <- lm(mpg ~ wt + cyl, mtcars)
summary(fitfinal)$coef
##              Estimate Std. Error   t value                     Pr(>|t|)
## (Intercept) 39.686261  1.7149840 23.140893 0.00000000000000000003043182
## wt          -3.190972  0.7569065 -4.215808 0.00022202004951620555369546
## cyl         -1.507795  0.4146883 -3.635972 0.00106428178479492895058822
summary(fitfinal)$r.squared
## [1] 0.8302274
The coefficients and R2 above tell us that the model explains 83.02% of the variance. Note that removing transmission type barely changes our R2 (83.03% vs 83.02%), which tells us that the variable added almost nothing to the model. This addresses question 1 fully: once weight and number of cylinders are accounted for, the type of transmission is not relevant to fuel efficiency.
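The comparison of the two R2 values can be made formal with a nested-model F-test. The sketch below assumes the untouched `datasets::mtcars`; `raw`, `reduced`, `full`, `cmp` and `p_value` are illustrative names.

```r
# Illustrative sketch: does adding transmission type to wt + cyl
# reduce the residual sum of squares by more than chance would?
raw <- transform(datasets::mtcars, manualtrans = factor(am))

reduced <- lm(mpg ~ wt + cyl, data = raw)
full    <- lm(mpg ~ wt + cyl + manualtrans, data = raw)

cmp <- anova(reduced, full)
cmp

# With a single added term, this p-value matches the t-test
# p-value of that term in the full model's coefficient table.
p_value <- cmp[["Pr(>F)"]][2]
round(p_value, 4)
```

A large p-value here means the data give no reason to prefer the larger model.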
Our final interpretation, then, is that for every 1000 lb increase in vehicle weight we’ll observe a decrease in fuel efficiency of 3.19 mpg, holding the number of cylinders fixed, and that for every extra cylinder in the engine we’ll observe a decrease of 1.51 mpg, holding weight fixed.
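To make the interpretation concrete, we can predict mpg for two hypothetical cars that differ only in weight. This is a sketch: the cars in `newcars` are invented for illustration, and `raw`, `fit` and `pred` are illustrative names.

```r
# Illustrative sketch: two hypothetical 6-cylinder cars, 1000 lb apart.
raw <- datasets::mtcars
fit <- lm(mpg ~ wt + cyl, data = raw)

# wt is measured in units of 1000 lb, so 3 and 4 mean 3000 lb and 4000 lb
newcars <- data.frame(wt = c(3, 4), cyl = c(6, 6))
pred <- predict(fit, newdata = newcars)
round(pred, 2)

# The gap between the two predictions is exactly the wt coefficient
diff(pred)
coef(fit)["wt"]
```

By the coefficients above, the lighter car comes out at 39.686 - 3 × 3.191 - 6 × 1.508 ≈ 21.07 mpg.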
Finally, we’ll diagnose our chosen model with a residual vs fitted plot.
plot(fitfinal, which=1)
We see a very slight quadratic pattern, driven by 3 outliers labeled 18, 20 and 21. These points have the following standardized residuals (rstandard):
round(rstandard(fitfinal)[c(18,20,21)],3)
## 18 20 21
## 2.341 2.500 -1.744
This means they sit roughly 2 standard deviations above and below the mean of the standardized residuals, which is far out for a sample of 32. Their leverage, however, is unremarkable: the hat values shown next are close to the average of p/n = 3/32 ≈ 0.094 (high leverage would mean a high hat value, not a low one). This hints at a model that may be further improved, either by considering log(mpg) or sqrt(mpg) to reduce the pull of these outliers, or perhaps by modeling cyl as counts.
round(hatvalues(fitfinal)[c(18,20,21)],3)
## 18 20 21
## 0.080 0.097 0.083
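To put these hat values in context, they can be compared with the average leverage p/n and with the common rule-of-thumb cutoff of 2p/n. The cutoff is a convention we assume here, not something fixed by the study; `raw`, `fit`, `h`, `p`, `n` and `cutoff` are illustrative names.

```r
# Illustrative sketch: compare the three hat values with the average leverage.
raw <- datasets::mtcars
fit <- lm(mpg ~ wt + cyl, data = raw)

h <- hatvalues(fit)
p <- length(coef(fit))   # number of estimated coefficients (3)
n <- nrow(raw)           # number of observations (32)

mean(h)                  # hat values always average to p / n
cutoff <- 2 * p / n      # rule-of-thumb threshold for "high" leverage (assumption)

round(h[c(18, 20, 21)], 3)
names(h)[h > cutoff]     # cars that would be flagged under this rule, if any
```

A point needs both a large residual and high leverage to pull the fit strongly; measures such as Cook's distance combine the two.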
From our analysis we conclude that the type of transmission is not relevant to fuel efficiency once weight and number of cylinders are taken into account, and that the shift in means between manual and automatic shown by our initial plot is due to those other variables, which act as confounders of the transmission effect. Mileage varies strongly with weight and the number of cylinders, as these are the 2 variables most directly related to fuel consumption, at least according to our assumptions. The model might improve if we transformed mpg to log(mpg) or sqrt(mpg) to reduce the influence of the 3 identified outliers. There’s also the issue of treating cylinders as a continuous variable rather than as a count. Despite all these areas of opportunity for our model, we believe it offers a ‘good enough’ explanation, considering that ‘all models are wrong, some models are useful’.
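The log transformation suggested above could be tried along the following lines. This is only a sketch under our assumptions: it does not claim that the transformed model is better, it just puts the standardized residuals of the three flagged cars side by side. The names `raw`, `fit_lin`, `fit_log` and `cmp` are illustrative.

```r
# Illustrative sketch: compare the flagged cars under mpg and log(mpg).
raw <- datasets::mtcars

fit_lin <- lm(mpg ~ wt + cyl, data = raw)
fit_log <- lm(log(mpg) ~ wt + cyl, data = raw)

# Standardized residuals of observations 18, 20 and 21 under each model
cmp <- cbind(linear = rstandard(fit_lin),
             log    = rstandard(fit_log))[c(18, 20, 21), ]
round(cmp, 3)
```

Note that R2 values of the two models are not directly comparable, because the response is on a different scale in each; a residual plot such as `plot(fit_log, which=1)` is a more appropriate check.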