Motor Trend: Searching for a relationship between MPG and transmission.

Synopsis
- Executive summary
Exploratory data analysis
Correlation study
Regression model comparison
Residual analysis

Synopsis

For an article in Motor Trend, we are interested in the mileage per gallon (MPG) of cars and whether the type of transmission (automatic vs manual) has a significant effect on this. To conclude on the relation between MPG and transmission type, we study the mtcars dataset from R’s datasets library. We use different linear regression models to fit the relation, and discuss which model is most appropriate.

Executive summary

Inspect dataset for missing values
Perform quick exploratory data analysis and inference on the relation between mpg and am
Inspect dataset for correlation of variables with outcome mpg and with each other.
Regression modeling with a comparison between models with increasing complexity.
Residual plotting and diagnostics.

Exploratory data analysis

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

any(is.na(mtcars))

## [1] FALSE

mtcars has no missing values and contains numeric data on mpg and 10 other variables, among which the transmission type am. Before doing in-depth regression modeling, we perform a quick analysis on the relation between transmission type and MPG.

gg <- ggplot(data = mtcars, aes(x = am, y = mpg))
gg <- gg + geom_boxplot(aes(fill = factor(am)), alpha = 0.5)
gg <- gg + xlab("") + ylab("Miles per gallon") + scale_x_continuous(breaks = NULL)
gg <- gg + scale_fill_discrete(name="Transmission", labels = c("manual","automatic"))
gg

In the boxplot above, we show the distribution of mpg, for manual and automatic transmission. From this plot, it seems that cars with automatic transmission perform better on MPG than cars with manual transmission.

man <- mtcars[mtcars$am == 0,]
aut <- mtcars[mtcars$am == 1,]
t.test(man$mpg,aut$mpg,paired = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  man$mpg and aut$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

A T-test on this difference agrees with our observation and shows a very significant dependence of mpg on am. Intuitively, it seems strange that the type of transmission has such a strong effect on MPG. There might be confounding variables that, when taken into account, change the relation between mpg and transmission type. Let’s dive a bit deeper into the data!

Correlation study

In order to determine which variables to use in a regression model, we first check which variables are correlated:

ggpairs(mtcars)

The upper-right half of the table above shows the correlations between variables. We observe that cyl,disp, hp and wt are most correlated with mpg. However, these variables also show mutual correlation. Since including correlated regressors in a regression model can distort the fit, we pick the two variables with the lowest mutual correlation: weight and horse power.

Regression model comparison

In this section, we compare the fit coefficients and p-values found in three regression models:

mpg ~ am
mpg~ am + wt + hp
mpg ~ am + wt + hp + cyl + disp

mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am)
fit1 <- lm(mpg ~ factor(am) , data = mtcars)
fit2 <- lm(mpg ~ factor(am) + wt + hp , data = mtcars)
fit3 <- lm(mpg ~ factor(am) + wt + hp + disp + cyl, data = mtcars)
summary(fit1)$coef; summary(fit2)$coef; summary(fit3)$coef;

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## factor(am)1  7.244939   1.764422  4.106127 2.850207e-04

##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 34.00287512 2.642659337 12.866916 2.824030e-13
## factor(am)1  2.08371013 1.376420152  1.513862 1.412682e-01
## wt          -2.87857541 0.904970538 -3.180850 3.574031e-03
## hp          -0.03747873 0.009605422 -3.901830 5.464023e-04

##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 38.20279869 3.66909647 10.412045 9.084987e-11
## factor(am)1  1.55649163 1.44053603  1.080495 2.898430e-01
## wt          -3.30262301 1.13364263 -2.913284 7.256888e-03
## hp          -0.02796002 0.01392172 -2.008374 5.509659e-02
## disp         0.01225708 0.01170645  1.047036 3.047194e-01
## cyl         -1.10637984 0.67635506 -1.635797 1.139322e-01

The first model shows a very significant dependence of mpg on am, with a two-sided p-value of around 0.0003. When we include wt and hp as regressors, the change of MPG with transmission type has diminished from 7.2 to 2.1 and the new p-value of 0.14 indicates that when corrected for horse power and weight of cars, the type of transmission has no significant effect on MPG anymore. Including additional variables `disp and cyl does not change this story much.

anova(fit1,fit2,fit3)

## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt + hp
## Model 3: mpg ~ factor(am) + wt + hp + disp + cyl
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 180.29  2    540.61 43.0841 5.576e-09 ***
## 3     26 163.12  2     17.17  1.3685    0.2722    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Performing an anova test on the nested model confirms that including wt and hp as regressors in model 2 improved the linear model enormously, reducing the Residual Sum Squared by a factor 4. Further including disp and cyl in model 3 only gave a very minor, non-significant (p = 0.2722) improvement.

Residual analysis

summary(fit1)$r.squared; summary(fit2)$r.squared; summary(fit3)$r.squared;

## [1] 0.3597989

## [1] 0.8398903

## [1] 0.8551394

Model1, model2 and model3 respectively explain 36, 84 and 86% of the variance in the data.

e1 <- resid(fit1)
e2 <- resid(fit2)
e3 <- resid(fit3)
x1 <- rep(1, each=length(e1))
x2 <- rep(2, each=length(e1))
x3 <- rep(3, each=length(e1))
e1 <- cbind(x1,e1)
e2 <- cbind(x2,e2)
e3 <- cbind(x3,e3)
n <- 1:length(e3[,1])
e <- data.frame(rbind(e1,e2,e3))

## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96
## --> row.names NOT used

e <- cbind(rep(n,3), e)
colnames(e) <- c("n","model","resid")

gg2 <- ggplot(data = e, aes(n, resid)) + geom_point(aes(shape = factor(model), colour = factor(model)))
gg2 <- gg2 + xlab("") + ylab("Residual")
gg2

A plot of the residuals shows that the residuals are homoskedastic for all three models, indicating that our linear model is not obviously missing a hidden trend in the data.