Executive Summary

In this paper, we explored the relationship between a set of variables, such as number of cylinders and transmission type, and the fuel consumption (miles per gallon) of automobiles. For this, we used the “mtcars” dataset, extracted from the 1974 Motor Trend US magazine. The paper addresses two questions in particular:

  1. Is an automatic or manual transmission better for MPG?

  2. Quantify the MPG difference between automatic and manual transmissions.

Based on the analysis, we conclude that manual transmission is in general better for mileage. However, the overall impact of transmission type is limited, and other variables affect mileage more significantly.

Dataset: “mtcars”

To start with, let’s explore the dataset “mtcars”:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The dataset has 32 observations of 11 variables. Of these, “mpg” (miles per US gallon) is our outcome, and the remaining variables are candidate predictors, used one at a time or in combination.
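Since “am” is stored as a number, it can help to give it readable labels before modelling. The following is a minimal sketch, not part of the original analysis: the name `mtcars_labeled` is hypothetical, and the coding (0 = automatic, 1 = manual) is taken from the dataset's help page.

```r
# Work on a copy so the original mtcars data frame stays untouched
mtcars_labeled <- mtcars

# am is coded 0 = automatic, 1 = manual (see ?mtcars)
mtcars_labeled$am <- factor(mtcars_labeled$am,
                            levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Number of cars in each transmission group
table(mtcars_labeled$am)
```

With labelled levels, model output and plots show the group names instead of 0 and 1.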

Linear Regression Analysis (with “am” as predictor)

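# am: 0 = automatic, 1 = manual
mtcarsAuto   <- mtcars[mtcars$am == 0, ]
mtcarsManual <- mtcars[mtcars$am == 1, ]
meanAuto     <- mean(mtcarsAuto$mpg)
meanManual   <- mean(mtcarsManual$mpg)
# Print both group means for a first comparison
c(automatic = meanAuto, manual = meanManual)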

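Comparing the two group means directly does not tell us whether the difference is statistically significant. A more rigorous approach is a t-test, which we run under the assumption that “mpg” is approximately normally distributed within each group.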

t.test(mpg ~ am, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231
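The test output reports the confidence interval for automatic minus manual, which is why both bounds are negative. As a small sketch (the variable names are illustrative, not from the original analysis), the same result can be restated as manual minus automatic by pulling the pieces out of the object that `t.test` returns:

```r
# Welch two-sample t-test of mpg by transmission (am: 0 = automatic, 1 = manual)
tt <- t.test(mpg ~ am, data = mtcars)

# Group means, in the order automatic, manual
group_means <- unname(tt$estimate)

# Manual minus automatic: the size of the difference in miles per gallon
mpg_diff <- group_means[2] - group_means[1]

# The interval is reported for (automatic - manual), so flip its sign and order
ci_manual_minus_auto <- rev(-as.numeric(tt$conf.int))

round(mpg_diff, 3)
round(ci_manual_minus_auto, 3)
```

Read this way, the interval gives the plausible range of the manual advantage in miles per gallon at the 95% level.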

The test rejects the hypothesis of equal means (p-value = 0.0014): manual cars average 24.39 mpg against 17.15 mpg for automatic cars. Let’s quantify the difference further with a simple linear regression, with “am” as the predictor and “mpg” as the outcome.

summary(lm(mpg ~ as.factor(am), mtcars))
## 
## Call:
## lm(formula = mpg ~ as.factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      17.147      1.125  15.247 1.13e-15 ***
## as.factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The coefficient as.factor(am)1 shows that, on average, manual cars get 7.245 more miles per gallon than automatic cars, while the intercept (17.147) is the mean mileage of automatic cars. Note, however, that the R-squared value of 0.3598 means that transmission type alone explains only about 36% of the variation in “mpg”.
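To make this reading of the coefficients concrete, the sketch below checks that the intercept and slope of the single-predictor model are just the group means and their difference. The names `fit_am` and `group_means` are illustrative. Note that `confint` here relies on the pooled-variance assumption of ordinary least squares, so its interval need not match the Welch interval from the t-test.

```r
fit_am <- lm(mpg ~ as.factor(am), data = mtcars)

# Mean mpg per transmission group (names "0" = automatic, "1" = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept = mean mpg of the reference group (automatic)
intercept_ok <- all.equal(unname(coef(fit_am)[1]), unname(group_means["0"]))

# Slope = mean mpg of manual cars minus mean mpg of automatic cars
slope_ok <- all.equal(unname(coef(fit_am)[2]),
                      unname(group_means["1"] - group_means["0"]))

c(intercept_ok = isTRUE(intercept_ok), slope_ok = isTRUE(slope_ok))

# 95% confidence interval for the manual-vs-automatic difference
confint(fit_am)["as.factor(am)1", ]
```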

Although we already have answers to our primary questions, let’s explore the data further.

Exploratory Data Analysis

Based on the above observations, it’s clear that there are other predictors which significantly affect the mileage. As there is a large number of predictors in this case (10), the choice of the right model is not straightforward. One option is to compare nested models with “ANOVA”; a more automated alternative is stepwise selection based on AIC. For this, we use the “step” function:

fit.all      <- lm(mpg ~ ., data = mtcars)
backward.aic <- step(fit.all, direction = "backward", k = 2, trace = 0)
backward.aic$call
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
summary(backward.aic)$r.squared
## [1] 0.8496636

The above model, which keeps weight (“wt”), quarter-mile time (“qsec”) and transmission type (“am”), is a far better fit for this dataset: it explains about 85% of the variance in “mpg”, compared with 36% for transmission type alone. We could try to do better by running further “ANOVA” on different interaction models among these variables, but that is out of scope for this paper.
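As a hedged sketch of how the selected model can be compared with the transmission-only model, the code below refits both and runs a nested-model F-test. It is an illustration under the assumption that the stepwise search returns `mpg ~ wt + qsec + am` as shown above. The names `fit_am` and `fit_step` are hypothetical.

```r
# Transmission-only model and the model chosen by backward AIC selection
fit_am   <- lm(mpg ~ am, data = mtcars)
fit_step <- lm(mpg ~ wt + qsec + am, data = mtcars)

# F-test: do wt and qsec add explanatory power beyond am alone?
model_cmp <- anova(fit_am, fit_step)
model_cmp

# Effect of transmission once weight and quarter-mile time are held fixed
coef(summary(fit_step))["am", ]

# R-squared and adjusted R-squared of the selected model
c(r2     = summary(fit_step)$r.squared,
  adj_r2 = summary(fit_step)$adj.r.squared)
```

The “am” row of the coefficient table is the adjusted manual-versus-automatic difference, which is the relevant number when judging how much transmission type matters after accounting for the other predictors.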

Residual Plots

Let’s create the diagnostic plots for the above model.

par(mfrow=c(2,2))
plot(backward.aic)

Based on the above plots, the following observations can be made: