Analysis of MPG Using Regression

Setting Up

We begin by loading the required packages and the mtcars data set. We compute the correlation matrix first, while all variables are still numeric, and then convert the categorical variables to factors to prepare the data for further analysis.

library(ggplot2);library(corrplot)
data(mtcars)
set.seed(123)

# Create the correlation matrix before converting to factors
corrMat <- cor(mtcars)

mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am, labels = c("Auto", "Manual"))

Overview

dim(mtcars)

## [1] 32 11

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs     am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0 Manual    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0 Manual    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1 Manual    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1   Auto    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0   Auto    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1   Auto    3    1

The data set consists of 11 variables for 32 different car models. Of these, 19 have automatic transmission and 13 have manual.
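The split by transmission type can be checked directly with a frequency table. This is a minimal sketch, assuming the coding documented for mtcars (am = 0 is automatic, am = 1 is manual); it works on a local copy named `cars`, an illustrative name, so the `mtcars` object in the session is left untouched.

```r
# Local copy so the factor-converted mtcars in the session is not modified
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Number of car models per transmission type
amCounts <- table(cars$am)
amCounts
```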

Exploratory Analysis

We can already see that there may be many confounding variables that affect MPG aside from transmission. Let’s look at a correlation matrix:

corrplot(corrMat, method = "number", order = "FPC", type = "lower", tl.cex = 0.6, tl.col = rgb(0, 0, 0), outline = FALSE)

[Figure: correlation matrix of the mtcars variables]

Weight (wt), hp, disp and cyl are all strongly correlated with MPG (absolute correlation > 0.75), and all four correlations are negative.
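The same information can be pulled out of the correlation matrix numerically rather than read off the plot. The sketch below recomputes the matrix from an untouched numeric copy of the data (`raw`, `mpgCor` and `strong` are illustrative names) and keeps the variables whose absolute correlation with MPG exceeds 0.75.

```r
# Numeric copy of the data; cor() needs the variables before factor conversion
raw <- datasets::mtcars

# Correlation of every variable with mpg, dropping mpg itself
mpgCor <- cor(raw)["mpg", ]
mpgCor <- mpgCor[names(mpgCor) != "mpg"]

# Keep only the strong correlations, judged by absolute value
strong <- mpgCor[abs(mpgCor) > 0.75]
round(sort(strong), 2)
```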

[Figure: MPG plotted against weight in panels, with point shape indicating transmission type]

When we visualize the interaction between these variables, some clear patterns emerge: dark, small triangles in the top left and big, brightly colored circles in the bottom right, meaning light vehicles with low horsepower, low displacement and manual transmission tend to have higher MPG. In particular, looking along the y-axis for MPG, it’s easy to draw a line (seen in red) that separates most of the circles (Automatic) from the triangles (Manual). This suggests, at least visually, that there is a difference in MPG between transmission types.
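The visual pattern can be cross-checked with group averages. The following sketch, which uses a local copy named `cars` (an illustrative name), computes the mean MPG, weight, horsepower and displacement for each transmission type.

```r
# Local copy with a labelled transmission factor (0 = Auto, 1 = Manual)
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Mean of each variable within each transmission group
groupMeans <- aggregate(cbind(mpg, wt, hp, disp) ~ am, data = cars, FUN = mean)
groupMeans
```

If the description above holds, the manual group should show the higher mean MPG together with lower mean weight, horsepower and displacement.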

t.test(mpg ~ am, mtcars)

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.28  -3.21
## sample estimates:
##   mean in group Auto mean in group Manual 
##                17.15                24.39

The t-test shows a statistically significant difference between transmission types (p = 0.0014): mean MPG is 17.2 for automatic versus 24.4 for manual, and the 95% confidence interval for the difference (automatic minus manual) runs from -11.28 to -3.21.
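The individual pieces of the test result can also be read from the object that `t.test` returns instead of from the printed summary. A small sketch, again on a local copy (`cars` and `tt` are illustrative names):

```r
# Local copy with a labelled transmission factor (0 = Auto, 1 = Manual)
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Welch two-sample t-test of mpg by transmission type
tt <- t.test(mpg ~ am, data = cars)

tt$estimate   # group means
tt$conf.int   # 95% confidence interval for mean(Auto) - mean(Manual)
tt$p.value    # two-sided p-value
```

Because the interval is for automatic minus manual, an interval that lies entirely below zero means the manual group has the higher mean.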

Fitting a regression model

fit <- lm(mpg ~ am, mtcars)
summary(fit)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.392 -3.092 -0.297  3.244  9.508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.15       1.12   15.25  1.1e-15 ***
## amManual        7.24       1.76    4.11  0.00029 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared:  0.36,   Adjusted R-squared:  0.338 
## F-statistic: 16.9 on 1 and 30 DF,  p-value: 0.000285

R-squared is only 0.36 (adjusted 0.34), suggesting that transmission type alone explains only about a third of the variance in MPG. In the previous plot, each panel also shows a separation in transmission type along the x-axis: triangles to the left and circles to the right, suggesting that weight is likely also a factor in MPG (as makes sense).

Referring back to the correlation matrix, it’s clear that all four of these variables are correlated with MPG, but disp and cyl are also cross-correlated with all variables under consideration, while weight and horsepower are mostly correlated with MPG. This suggests that weight and horsepower may be stronger predictors of MPG.
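The multiple and adjusted R-squared are both stored in the model summary, which makes it easy to report them side by side. A minimal sketch for the transmission-only model, using illustrative names (`cars`, `fitAm`, `s`):

```r
# Local copy with a labelled transmission factor (0 = Auto, 1 = Manual)
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Transmission-only model, as in the chunk above
fitAm <- lm(mpg ~ am, data = cars)
s <- summary(fitAm)

# Multiple R-squared and its degrees-of-freedom-adjusted counterpart
c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
```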

Let’s investigate a model using all variables:

fitAll <- lm(mpg ~ ., mtcars)
summary(fitAll)$coef

##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept) 23.87913   20.06582  1.19004  0.25253
## cyl6        -2.64870    3.04089 -0.87103  0.39747
## cyl8        -0.33616    7.15954 -0.04695  0.96317
## disp         0.03555    0.03190  1.11433  0.28267
## hp          -0.07051    0.03943 -1.78835  0.09393
## drat         1.18283    2.48348  0.47628  0.64074
## wt          -4.52978    2.53875 -1.78426  0.09462
## qsec         0.36784    0.93540  0.39325  0.69967
## vs1          1.93085    2.87126  0.67248  0.51151
## amManual     1.21212    3.21355  0.37719  0.71132
## gear4        1.11435    3.79952  0.29329  0.77332
## gear5        2.52840    3.73636  0.67670  0.50890
## carb2       -0.97935    2.31797 -0.42250  0.67865
## carb3        2.99964    4.29355  0.69864  0.49547
## carb4        1.09142    4.44962  0.24528  0.80956
## carb6        4.47757    6.38406  0.70137  0.49381
## carb8        7.25041    8.36057  0.86722  0.39948

With an adjusted R-squared of 0.779 this is a definite improvement. Looking at the t-statistics, hp and wt have the largest absolute values, although neither is significant at the 5% level (p ≈ 0.09) in this crowded model. This is something we suspected from the correlation matrix. We now have enough context to create some additional models to compare.
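Rather than scanning the coefficient table by eye, the terms can be ranked by the absolute value of their t-statistic. The sketch below rebuilds the full model on a local copy with the same factor conversions as in the setup (`cars`, `fullFit`, `coefTab` and `ranked` are illustrative names).

```r
# Local copy with the same factor conversions as in the setup chunk
cars <- datasets::mtcars
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Model with every variable as a predictor
fullFit <- lm(mpg ~ ., data = cars)

# Coefficient table without the intercept, ordered by |t value|
coefTab <- summary(fullFit)$coef
coefTab <- coefTab[rownames(coefTab) != "(Intercept)", ]
ranked <- coefTab[order(abs(coefTab[, "t value"]), decreasing = TRUE), ]
head(ranked, 3)

# Adjusted R-squared of the full model
fullAdjR2 <- summary(fullFit)$adj.r.squared
fullAdjR2
```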

fit1 <- lm(mpg ~ wt + hp, mtcars)
fit2 <- lm(mpg ~ wt + hp + cyl, mtcars)
fit3 <- lm(mpg ~ wt + hp + cyl + disp, mtcars)

Analysis of variance

anova(fit, fit1, fit2, fit3, fitAll)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + hp
## Model 3: mpg ~ wt + hp + cyl
## Model 4: mpg ~ wt + hp + cyl + disp
## Model 5: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df RSS Df Sum of Sq     F  Pr(>F)    
## 1     30 721                               
## 2     29 195  1       526 65.51 7.5e-07 ***
## 3     27 161  2        34  2.13    0.15    
## 4     26 160  1         1  0.08    0.78    
## 5     15 120 11        40  0.45    0.91    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

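Using anova to compare the models, we see that while each successive model lowers the RSS, only the step to fit1 is significant. Note that fit (mpg ~ am) is not nested in fit1, so that first F-test is best read as a rough indication rather than a formal test. With an adjusted R-squared of 0.81, fit1 is a clear improvement over the model taking all variables into consideration (adjusted R-squared 0.779).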
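Since the candidate models differ in their number of parameters, comparing their adjusted R-squared values side by side is a useful complement to the anova table. A sketch under the same factor conversions as the setup, with illustrative names (`cars`, `models`, `adjR2`):

```r
# Local copy with the same factor conversions as in the setup chunk
cars <- datasets::mtcars
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# The five models compared in the anova table
models <- list(
  am             = lm(mpg ~ am, data = cars),
  wt_hp          = lm(mpg ~ wt + hp, data = cars),
  wt_hp_cyl      = lm(mpg ~ wt + hp + cyl, data = cars),
  wt_hp_cyl_disp = lm(mpg ~ wt + hp + cyl + disp, data = cars),
  all            = lm(mpg ~ ., data = cars)
)

# Adjusted R-squared penalizes extra parameters, so it is comparable across models
adjR2 <- sapply(models, function(m) summary(m)$adj.r.squared)
round(adjR2, 3)
```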

Let’s continue with this model and look at the interaction term:

fit2.x <- lm(mpg ~ wt + hp + wt*hp, mtcars)
anova(fit2, fit2.x)

## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + hp + cyl
## Model 2: mpg ~ wt + hp + wt * hp
##   Res.Df RSS Df Sum of Sq F Pr(>F)
## 1     27 161                      
## 2     28 130 -1        31

summary(fit2.x)$r.squared

## [1] 0.8848

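Here the R-squared measure is even higher (0.88). Note that the anova call above compares fit2.x against fit2, which is not nested in it, so the table reports no F statistic or p-value; it only shows that the interaction model reaches a lower RSS (130 versus 161) with one fewer parameter. A formal test of the interaction term requires comparing fit2.x against fit1 (mpg ~ wt + hp).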
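A nested comparison gives the interaction term a proper F-test. The sketch below compares the weight-plus-horsepower model with the same model plus the interaction; `mainFit`, `interFit` and `nested` are illustrative names, and `wt * hp` is R shorthand for `wt + hp + wt:hp`.

```r
# Only numeric predictors are used here, so no factor conversion is needed
cars <- datasets::mtcars

# Main-effects model and the same model with the wt:hp interaction added
mainFit  <- lm(mpg ~ wt + hp, data = cars)
interFit <- lm(mpg ~ wt * hp, data = cars)

# F-test for the single extra term; the models are nested, so the test is valid
nested <- anova(mainFit, interFit)
nested

# The t-test on the interaction coefficient addresses the same question
summary(interFit)$coef["wt:hp", ]
```

With one added term, the F-test in the anova table and the t-test on the `wt:hp` coefficient are equivalent, so either can be reported.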

shapiro.test(fit2.x$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit2.x$residuals
## W = 0.9545, p-value = 0.1928

With a Shapiro-Wilk p-value of 0.19 we do not reject the hypothesis that the residuals are normally distributed, making us more confident that we are on the right track.

Looking at residuals

[Figure: residual diagnostics, including residuals against fitted values and a normal QQ plot]

There doesn’t seem to be any pattern in the residuals against the fitted values. The QQ plot agrees with the Shapiro-Wilk test that the residuals are close to normally distributed.
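The code behind the diagnostic plots is not shown, so the following is only one way to draw comparable ones with base graphics, not necessarily how the original figure was made. It refits the interaction model on a local copy (`cars`, `interFit` and `res` are illustrative names).

```r
# Refit the interaction model on a local copy of the data
cars <- datasets::mtcars
interFit <- lm(mpg ~ wt * hp, data = cars)
res <- residuals(interFit)

# Two panels side by side; remember the old settings so they can be restored
op <- par(mfrow = c(1, 2))

# Residuals against fitted values: look for curvature or a funnel shape
plot(fitted(interFit), res,
     xlab = "Fitted MPG", ylab = "Residual", main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal QQ plot: points close to the line indicate roughly normal residuals
qqnorm(res)
qqline(res)

par(op)
```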

Conclusions

Manual is better for MPG than automatic in this data set, with a 95% confidence interval whose lower bound is a difference of about 3.2 in mean MPG.
In general, going from automatic to manual increases MPG by an average of 7.24, but given the many other variables at play this may not give an accurate picture.
Holding weight and horsepower constant, we see an average increase of about 2.1 in MPG for manual transmission compared to automatic.
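The model behind the last figure is not shown in the analysis above. Assuming it is a linear model of MPG on transmission, weight and horsepower, the adjusted transmission effect can be read from the `amManual` coefficient, which measures manual relative to automatic. In this sketch `cars`, `adjFit` and `amEffect` are illustrative names.

```r
# Local copy with a labelled transmission factor (0 = Auto, 1 = Manual)
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Transmission effect adjusted for weight and horsepower (assumed model)
adjFit <- lm(mpg ~ am + wt + hp, data = cars)

# Estimated MPG difference, manual minus automatic, at fixed wt and hp
amEffect <- coef(adjFit)[["amManual"]]
amEffect

# 95% confidence interval for that adjusted difference
confint(adjFit, "amManual")
```

The confidence interval is worth reporting alongside the point estimate, since it shows how precisely the adjusted difference is estimated from 32 cars.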