Executive Summary

This study investigates how the fuel efficiency of motor vehicles relates to their transmission type, as part of the course project for Coursera’s ‘Regression Models’ course. Efficiency is measured in MPG (miles per gallon), and the transmission types considered are manual and automatic. We will investigate 2 main questions:

  1. Is an automatic or manual transmission better for MPG?
  2. How large is the MPG difference between automatic and manual transmissions?

The study first uses exploratory analysis to obtain a preliminary answer to both questions, then uses regression to assess how certain that answer is.

Data Processing and Exploratory Analysis

The dataset is mtcars, which we load together with other libraries we’ll need for the study:

options(scipen=999)
library(datasets)
library(dplyr)
library(ggplot2)
data(mtcars)

mtcars comprises 32 observations on 11 variables:

  1. mpg: Miles/(US) gallon
  2. cyl: Number of cylinders
  3. disp: Displacement (cu.in.)
  4. hp: Gross horsepower
  5. drat: Rear axle ratio
  6. wt: Weight (lb/1000)
  7. qsec: 1/4 mile time
  8. vs: Engine shape (0 = V-shaped, 1 = straight)
  9. am: Transmission (0 = automatic, 1 = manual)
  10. gear: Number of forward gears
  11. carb: Number of carburetors

Data Processing

We will transform vs, am, gear and carb to factors because they represent categories. The number of cylinders (cyl), while discrete, will not be considered categorical because it represents a count, so it stays numeric. Although count data are often handled with Poisson regression, that technique models a count response rather than a count predictor, and it is outside the scope of this work.

mtcars <- mtcars %>% mutate(straight=as.factor(vs),
                            manualtrans=as.factor(am),
                            gear=as.factor(gear),
                            carb=as.factor(carb))

mtcars <- mtcars %>% dplyr::select(mpg,cyl,disp,hp,drat,wt,qsec,
                                   straight,manualtrans,gear,carb)

Exploratory Analysis

We’ll approach the 1st question with a boxplot to obtain a hint to the answer.

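# Boxplot of mpg by transmission type; the cross (shape 4) marks each group mean.
# fun= and guides(fill="none") replace the deprecated fun.y= and fill=FALSE
# (this assumes ggplot2 3.3.0 or later).
g <- ggplot(data=mtcars, aes(x=manualtrans, y=mpg, fill=manualtrans)) +
    guides(fill="none") + geom_boxplot() +
    stat_summary(fun=mean, geom="point", shape=4, size=4) +
    xlab('Manual Transmission (0 = Automatic, 1 = Manual)') +
    ylab('Fuel Efficiency (mpg)') +
    ggtitle('Fuel efficiency by transmission type')
g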

The mean mpg of manual transmission vehicles is 24.392, while that of automatic transmission vehicles is 17.147. This gives a first answer to question 2: on average, manual cars travel 7.245 more miles per gallon, about 1.42 times the figure for automatics. But how significant is this difference?
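The group means and their ratio quoted above can be recomputed directly. The following is a minimal sketch that works on a fresh copy of the dataset so it runs on its own; the names `d`, `group_means`, `diff_mpg` and `ratio_mpg` are introduced here for illustration and are not part of the original analysis.

```r
# Fresh copy of the built-in data (am: 0 = automatic, 1 = manual)
d <- datasets::mtcars

# Mean mpg per transmission type; the result is named "0" and "1"
group_means <- tapply(d$mpg, d$am, mean)

# Absolute and relative difference, manual versus automatic
diff_mpg  <- unname(group_means["1"] - group_means["0"])
ratio_mpg <- unname(group_means["1"] / group_means["0"])

round(c(automatic  = unname(group_means["0"]),
        manual     = unname(group_means["1"]),
        difference = diff_mpg,
        ratio      = ratio_mpg), 3)
```

The difference computed this way should match the slope of a regression of mpg on transmission type alone, since that model only estimates the two group means.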

Linear Regression Analysis

We’ll test the significance of the relationship between mpg and transmission type with a linear model and take a look at the coefficients.

fitmanual <- lm(mpg ~ manualtrans, mtcars)
summary(fitmanual)$coef
##               Estimate Std. Error   t value                Pr(>|t|)
## (Intercept)  17.147368   1.124603 15.247492 0.000000000000001133983
## manualtrans1  7.244939   1.764422  4.106127 0.000285020743935067769
confint(fitmanual)
##                 2.5 %   97.5 %
## (Intercept)  14.85062 19.44411
## manualtrans1  3.64151 10.84837
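One way to cross-check the slope and its interval is an equal-variance two-sample t-test, which for a single binary regressor should reproduce the lm() result. The sketch below assumes a fresh copy of the data (`d`) and hypothetical object names (`fit`, `tt_pooled`, `tt_welch`); it also runs R's default Welch test for contrast.

```r
# Fresh copy so the snippet is self-contained
d <- datasets::mtcars
d$manualtrans <- factor(d$am)

fit <- lm(mpg ~ manualtrans, data = d)
ci_lm <- unname(confint(fit)["manualtrans1", ])

# Manual minus automatic, pooled variance (n - 2 = 30 degrees of freedom)
tt_pooled <- t.test(d$mpg[d$am == 1], d$mpg[d$am == 0], var.equal = TRUE)
ci_pooled <- as.numeric(tt_pooled$conf.int)

# R's default is the Welch test: unpooled variances, adjusted degrees of freedom
tt_welch <- t.test(d$mpg[d$am == 1], d$mpg[d$am == 0])
ci_welch <- as.numeric(tt_welch$conf.int)

# TRUE if the regression interval matches the pooled t-test interval
all.equal(ci_lm, ci_pooled)

rbind(lm = ci_lm, pooled = ci_pooled, welch = ci_welch)
```

The Welch interval is not expected to match, because it does not assume equal variances in the two groups.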

The coefficients show that automatic transmissions achieve 17.147 mpg on average, and manual transmissions 17.147 + 7.245 = 24.392 mpg, just as our exploratory analysis showed. The slope coefficient is highly significant, with a p-value of 0.00029 and a 2-sided 95% CI of (3.642, 10.848) mpg for the manual-minus-automatic difference. Note that this CI differs from the one returned by a default t.test, which uses Welch’s unpooled variances and adjusted degrees of freedom, whereas the regression pools the variance over n - 2 = 30 degrees of freedom. This addresses question 2 for the unadjusted comparison. However, this model only explains 35.98% of the variance, so we need to reason further about what really affects mileage. Since our key measure is fuel consumption, we should consider variables that are related to it. We have decided on the following:

  1. manualtrans: our original regressor, whose influence on mpg we are trying to establish.
  2. cyl(inder): stands in for engine shape (straight or V-shaped), disp and hp, since the more cylinders an engine has, the larger its displacement tends to be, the more power (hp) it offers, and the more fuel it consumes.
  3. w(eigh)t: can also stand in for disp, since heavier cars tend to have larger engines.
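The proxy reasoning in the list above can be sanity-checked with a correlation matrix. This is only a rough sketch: pairwise correlation is a crude measure of how well one variable stands in for another, and the names `d` and `proxy_cor` are introduced here for illustration.

```r
# Fresh copy with the original numeric coding (vs: 0 = V-shaped, 1 = straight)
d <- datasets::mtcars

# Pairwise correlations between the chosen regressors and the variables
# they are assumed to stand in for
proxy_cor <- cor(d[, c("cyl", "wt", "disp", "hp", "vs")])
round(proxy_cor, 2)
```

Large absolute values in the cyl and wt rows would support dropping disp, hp and vs in favour of those two regressors.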

Variable selection

We’ll perform a step-wise variable selection process by regressing mpg on manualtrans, cyl, wt and their combinations and by selecting the model that explains the most variance (R2).

R2 comparison

# Individual models 
c(summary(lm(mpg ~ manualtrans, mtcars))$r.squared,
  summary(lm(mpg ~ cyl, mtcars))$r.squared,
  summary(lm(mpg ~ wt, mtcars))$r.squared)
## [1] 0.3597989 0.7261800 0.7528328
# Paired models
c(summary(lm(mpg ~ manualtrans + cyl, mtcars))$r.squared, 
  summary(lm(mpg ~ manualtrans + wt, mtcars))$r.squared)
## [1] 0.7590135 0.7528348
# Complete model
fitmanualcylwt <- lm(mpg ~ manualtrans + cyl + wt, mtcars)
summary(fitmanualcylwt)$r.squared
## [1] 0.8303383

From the above we see that the complete model explains the most variance, at 83.03%. We now look more closely at this model to decide which of its independent variables actually affect mpg.
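Because R2 can never decrease when a regressor is added, the largest model always wins a plain R2 comparison. A hedged alternative is to look at adjusted R2, which penalizes extra terms. The sketch below refits the models on a fresh copy of the data and also includes the wt + cyl pair, which the comparison above does not cover; `d`, `formulas` and `fit_stats` are names introduced here.

```r
# Fresh copy so the snippet runs on its own
d <- datasets::mtcars
d$manualtrans <- factor(d$am)

# Candidate models: single regressors, pairs, and the complete model
formulas <- list(mpg ~ manualtrans,
                 mpg ~ cyl,
                 mpg ~ wt,
                 mpg ~ manualtrans + cyl,
                 mpg ~ manualtrans + wt,
                 mpg ~ wt + cyl,
                 mpg ~ manualtrans + cyl + wt)

# One row per model with plain and adjusted R-squared
fit_stats <- t(sapply(formulas, function(f) {
  s <- summary(lm(f, data = d))
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared)
}))
rownames(fit_stats) <- sapply(formulas, function(f) deparse(f))

round(fit_stats, 4)
```

If a smaller model has a higher adjusted R2 than the complete one, that is a hint that the extra regressor is not pulling its weight.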

summary(fitmanualcylwt)$coef
##                Estimate Std. Error    t value                Pr(>|t|)
## (Intercept)  39.4179334  2.6414573 14.9227979 0.000000000000007424998
## manualtrans1  0.1764932  1.3044515  0.1353007 0.893342147923960827605
## cyl          -1.5102457  0.4222792 -3.5764148 0.001291604589147556711
## wt           -3.1251422  0.9108827 -3.4308942 0.001885894386856281756

We can see from the coefficients that the transmission type is not significant once cyl and wt are in the model, with a very high p-value of 0.89. For this exercise, and under our set of assumptions, we therefore keep wt and cyl as explanatory variables. Removing the transmission type gives a new set of coefficients:

fitfinal <- lm(mpg ~ wt + cyl, mtcars)
summary(fitfinal)$coef
##              Estimate Std. Error   t value                     Pr(>|t|)
## (Intercept) 39.686261  1.7149840 23.140893 0.00000000000000000003043182
## wt          -3.190972  0.7569065 -4.215808 0.00022202004951620555369546
## cyl         -1.507795  0.4146883 -3.635972 0.00106428178479492895058822
summary(fitfinal)$r.squared
## [1] 0.8302274

The coefficients and R2 above tell us that the model explains 83.02% of the variance. Note that removing transmission type barely changes R2 (83.03% to 83.02%), which tells us that the variable added almost nothing to the model. This addresses question 1: once weight and number of cylinders are accounted for, the type of transmission has no significant effect on fuel efficiency in this data.
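The comparison between the models with and without transmission type can also be made formally with a nested-model F-test. The following is a minimal sketch using anova() on refitted models; `d`, `fit_reduced`, `fit_full` and `cmp` are names introduced here, not objects from the analysis above.

```r
# Fresh copy so the snippet runs on its own
d <- datasets::mtcars
d$manualtrans <- factor(d$am)

fit_reduced <- lm(mpg ~ wt + cyl, data = d)
fit_full    <- lm(mpg ~ manualtrans + cyl + wt, data = d)

# F-test of the null hypothesis that the extra term (manualtrans) adds nothing
cmp <- anova(fit_reduced, fit_full)
cmp

# With a single added coefficient, the F-test p-value should equal the
# t-test p-value for that coefficient in the full model
p_f <- cmp[2, "Pr(>F)"]
p_t <- summary(fit_full)$coef["manualtrans1", "Pr(>|t|)"]
c(F_test = p_f, t_test = p_t)
```

A large p-value here means the data give no evidence that the full model fits better than the reduced one.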

Our final interpretation, then, is that for every additional 1000 lb of vehicle weight we expect fuel efficiency to decrease by 3.19 mpg, holding the number of cylinders constant, and that for every extra cylinder in the engine we expect a decrease of 1.51 mpg, holding weight constant.
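The interpretation of the two slopes can be illustrated with predictions for a few made-up cars. This is a sketch only: the cars in `new_cars` are hypothetical and chosen so that exactly one regressor changes at a time, and `d`, `fit` and `pred` are names introduced here.

```r
# Fresh copy and the same final model formula
d <- datasets::mtcars
fit <- lm(mpg ~ wt + cyl, data = d)

# Hypothetical cars: a 3000 lb 4-cylinder baseline, then +1000 lb, then +2 cylinders
new_cars <- data.frame(wt = c(3, 4, 3), cyl = c(4, 4, 6))
pred <- predict(fit, newdata = new_cars)

# Effect of 1000 lb more weight at fixed cyl: equals the wt coefficient
unname(pred[2] - pred[1])

# Effect of 2 more cylinders at fixed weight: equals twice the cyl coefficient
unname(pred[3] - pred[1])

round(coef(fit), 3)
```

Because the model is linear and has no interaction term, these differences are the same whichever baseline car is chosen.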

Diagnosis

Finally, we’ll diagnose our chosen model with a residual vs fitted plot.

plot(fitfinal, which=1)

We see a very slight quadratic pattern due to 3 outliers labeled 18, 20, 21. These observations have the following standardized residuals (rstandard):

round(rstandard(fitfinal)[c(18,20,21)],3)
##     18     20     21 
##  2.341  2.500 -1.744

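This means they lie roughly 2 standard deviations above (18, 20) and below (21) the mean of the standardized residuals, which is fairly far out. Their leverage, however, is not high: the hat values shown below are close to the average of p/n = 3/32 ≈ 0.094, and high leverage would correspond to large hat values, not small ones. This hints at a model that may be further improved, either by considering log(mpg) or sqrt(mpg) to reduce the pull of these outliers, or perhaps by modeling cyl as counts.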

round(hatvalues(fitfinal)[c(18,20,21)],3)
##    18    20    21 
## 0.080 0.097 0.083
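To put these hat values in context, they can be compared with the average leverage p/n, where p is the number of estimated coefficients and n the number of observations. The sketch below uses twice the average as a cutoff, which is a common rule of thumb rather than a formal test; `d`, `fit`, `h`, `avg_h` and `flagged` are names introduced here.

```r
# Fresh copy and the same final model formula
d <- datasets::mtcars
fit <- lm(mpg ~ wt + cyl, data = d)

h <- hatvalues(fit)

# Hat values sum to the number of coefficients, so their mean is p / n = 3 / 32
avg_h <- length(coef(fit)) / nrow(d)

# Observations 18, 20 and 21 (by row position) against the 2 * p / n rule of thumb
round(h[c(18, 20, 21)], 3)
h[c(18, 20, 21)] > 2 * avg_h

# Any observations that do exceed the cutoff
flagged <- h[h > 2 * avg_h]
round(flagged, 3)
```

Cook's distance (cooks.distance(fit)) combines residual size and leverage and could be inspected in the same way.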

Conclusion

From our analysis we conclude that transmission type is not relevant to fuel efficiency once weight and number of cylinders are taken into account. The shift in means between manual and automatic cars shown by our initial plot is due to these other variables, which confound the relationship between transmission type and mpg. Mileage varies strongly with weight and the number of cylinders, as these are the 2 variables most directly related to fuel consumption, at least according to our assumptions. The model might improve if we transformed mpg to log(mpg) or sqrt(mpg) to reduce the influence of the 3 identified outliers. There’s also the issue of treating cylinders as a continuous variable rather than as a count. Despite all these areas of opportunity for our model, we believe it offers a ‘good enough’ explanation, considering that ‘all models are wrong, some models are useful’.
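The log transformation mentioned above was not fitted in this study. A minimal sketch of how it could be tried follows, assuming the same regressors; `d`, `fit_lin`, `fit_log` and `pct_change` are names introduced here. Note that R2 values are not directly comparable between the two fits because the response is on a different scale.

```r
# Fresh copy so the snippet runs on its own
d <- datasets::mtcars

fit_lin <- lm(mpg ~ wt + cyl, data = d)        # model used in this study
fit_log <- lm(log(mpg) ~ wt + cyl, data = d)   # candidate alternative

# Standardized residuals of the 3 flagged observations under each model
round(rbind(linear = rstandard(fit_lin)[c(18, 20, 21)],
            log    = rstandard(fit_log)[c(18, 20, 21)]), 3)

# On the log scale a slope b means an approximate 100 * (exp(b) - 1) percent
# change in mpg per unit increase of the regressor
pct_change <- 100 * (exp(coef(fit_log)[c("wt", "cyl")]) - 1)
round(pct_change, 1)

# Residuals vs fitted for both models, side by side
op <- par(mfrow = c(1, 2))
plot(fit_lin, which = 1)
plot(fit_log, which = 1)
par(op)
```

Whether the transformed model is preferable would depend on how the residual plots compare, which this sketch leaves to inspection.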