Regression Modeling of Automobile Features

by MAS16 Feb 2019

Executive Summary

This report uses the mtcars dataset to answer two questions: (1) Is an automatic or manual transmission better for mpg? (2) What is the quantified difference between automatic and manual transmissions? Regression modeling and statistical inference suggest that manual transmissions are better than automatic transmissions for mpg by approximately 1.8 mpg after adjusting for other car features (cyl, disp, hp and wt), although this adjusted difference is not statistically significant at the 0.05 level (p = 0.215).

Exploratory Data Analysis

First, let’s explore mtcars using the str and head functions (results in Appendix A.1). The data consist of 32 cars described by 11 numeric variables (mpg plus 10 other features). The am feature designates automatic (am=0) and manual (am=1) transmission. The boxplot in Appendix A.2 suggests a difference in mean mpg between automatic and manual cars. To test the null hypothesis that there is no difference in mean mpg, we use a Welch two-sample t-test (R code in Appendix A.3).

## 
##  Welch Two Sample t-test
## 
## data:  man_mpg and auto_mpg
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.209684 11.280194
## sample estimates:
## mean of x mean of y 
##  24.39231  17.14737

The p-value is 0.0014, which is below 0.05, and the 95% confidence interval for the difference in means (3.21 to 11.28 mpg) does not include 0. We therefore reject the null hypothesis of no difference in mean mpg.
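To make the test output less of a black box, the sketch below recomputes the Welch statistic, the Welch-Satterthwaite degrees of freedom and the confidence interval by hand. The variable names (v_man, v_auto, t_stat, df_welch, ci) are illustrative and are not part of the original analysis.

```r
# Sketch: reproduce the Welch two-sample t-test quantities by hand.
auto_mpg <- mtcars$mpg[mtcars$am == 0]
man_mpg  <- mtcars$mpg[mtcars$am == 1]

# Variance of each sample mean (sample variance divided by group size)
v_man  <- var(man_mpg)  / length(man_mpg)
v_auto <- var(auto_mpg) / length(auto_mpg)

# Welch t statistic: difference in means over its standard error
diff_means <- mean(man_mpg) - mean(auto_mpg)
se_diff    <- sqrt(v_man + v_auto)
t_stat     <- diff_means / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_man + v_auto)^2 /
  (v_man^2 / (length(man_mpg) - 1) + v_auto^2 / (length(auto_mpg) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_val <- 2 * pt(-abs(t_stat), df_welch)
ci    <- diff_means + c(-1, 1) * qt(0.975, df_welch) * se_diff

c(t = t_stat, df = df_welch, p = p_val)
ci
```

These values should agree with the t.test output above. The interval is for the difference in means, not for the t statistic itself.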

Regression Model

Next, construct a linear model relating mpg to am:

# Construct linear model
mdl1 <- lm(mpg ~ factor(am), data=mtcars)
# Get coefficients
summary(mdl1)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## factor(am)1  7.244939   1.764422  4.106127 2.850207e-04
# Get r-squared
summary(mdl1)$r.squared
## [1] 0.3597989

The intercept (17.147) is the mean mpg observed for automatic transmissions, and the slope coefficient of 7.245 is the difference in mean mpg between manual and automatic cars, so manual cars average 7.245 mpg more. The p-values for both coefficients are well below 0.05. However, the r-squared is 0.36, meaning only 36% of the variability in mpg is explained by this model.
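With a single two-level factor as the predictor, the fitted coefficients are just group means in disguise. The following sketch checks this directly; mean_auto and mean_man are illustrative names introduced here.

```r
# Sketch: with one binary factor, lm() reproduces the group means.
mdl1 <- lm(mpg ~ factor(am), data = mtcars)

mean_auto <- mean(mtcars$mpg[mtcars$am == 0])  # reference level (am = 0)
mean_man  <- mean(mtcars$mpg[mtcars$am == 1])

# Intercept equals the automatic-group mean
intercept_ok <- all.equal(unname(coef(mdl1)["(Intercept)"]), mean_auto)

# Slope equals the manual-minus-automatic difference in means
slope_ok <- all.equal(unname(coef(mdl1)["factor(am)1"]), mean_man - mean_auto)

c(intercept_ok = isTRUE(intercept_ok), slope_ok = isTRUE(slope_ok))
```

This is also why the unadjusted slope matches the difference between the two sample means reported by the t-test.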

Model Fitting and Selection

To construct better models, let’s look at the correlation between mpg and all the other car features (results in Appendix A.4). We see strong correlations (absolute value > 0.75) with cyl, disp, hp and wt, all of them negative. One strategy for model selection is to construct nested linear models that add these features one at a time and compare successive models using ANOVA (R code in Appendix A.5).

## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + factor(cyl)
## Model 3: mpg ~ factor(am) + factor(cyl) + disp
## Model 4: mpg ~ factor(am) + factor(cyl) + disp + hp
## Model 5: mpg ~ factor(am) + factor(cyl) + disp + hp + wt
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 264.50  2    456.40 37.9300 2.678e-08 ***
## 3     27 230.46  1     34.04  5.6572  0.025339 *  
## 4     26 183.04  1     47.42  7.8820  0.009541 ** 
## 5     25 150.41  1     32.63  5.4236  0.028246 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

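Each row of the ANOVA table compares a model with the one before it, and every added feature gives a statistically significant reduction in the residual sum of squares. In particular, model 5 is a statistically significant (p-value = 0.028) improvement over model 4. The coefficients for model 5 are shown below: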

##                  Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  33.864276061 2.69541569 12.5636562 2.668321e-12
## factor(am)1   1.806099494 1.42107933  1.2709350 2.154510e-01
## factor(cyl)6 -3.136066556 1.46909031 -2.1346996 4.277253e-02
## factor(cyl)8 -2.717781289 2.89814941 -0.9377644 3.573375e-01
## disp          0.004087893 0.01276729  0.3201848 7.514890e-01
## hp           -0.032480178 0.01398322 -2.3227963 2.862128e-02
## wt           -2.738694608 1.17597755 -2.3288664 2.824553e-02

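Using the coefficients for model 5, the adjusted difference in mean mpg between manual and automatic transmissions is 1.81 mpg in favor of manual, holding cyl, disp, hp and wt fixed. This is much smaller than the unadjusted difference of 7.245, and with a p-value of 0.215 it is not statistically significant at the 0.05 level.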
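A confidence interval gives a sense of how precisely the adjusted transmission effect is estimated. The sketch below refits model 5 so that it is self-contained and uses the standard confint function; ci_am is an illustrative name.

```r
# Sketch: 95% confidence interval for the adjusted transmission effect.
mdl5 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp + wt, data = mtcars)

# Restrict confint() to the manual-transmission coefficient
ci_am <- confint(mdl5, parm = "factor(am)1", level = 0.95)
ci_am

# TRUE when the interval contains zero, i.e. no significant effect at 0.05
spans_zero <- ci_am[1, 1] < 0 && ci_am[1, 2] > 0
spans_zero
```

Because the coefficient's p-value in the table above is 0.215, the interval is expected to contain zero.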

# Model 5 r-squared
summary(mdl5)$r.squared
## [1] 0.8664276

Additionally, the r-squared value has increased to 0.866, indicating 86.6% of the variability in mpg can now be explained. To verify the model, diagnostic plots for model 5 are shown in Appendix A.6; they show no obvious pattern in the residuals, and the Q-Q plot shows the residuals are approximately normally distributed.
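Since r-squared can never decrease when predictors are added, comparing models on it alone can flatter the larger model. As a hedged supplement, the sketch below tabulates both r-squared and adjusted r-squared for models 1 and 5. The fit_summary data frame is an illustrative construct and is not part of the original report.

```r
# Sketch: compare plain and adjusted r-squared for the smallest and largest models.
mdl1 <- lm(mpg ~ factor(am), data = mtcars)
mdl5 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp + wt, data = mtcars)

fit_summary <- data.frame(
  model         = c("mdl1", "mdl5"),
  r_squared     = c(summary(mdl1)$r.squared, summary(mdl5)$r.squared),
  adj_r_squared = c(summary(mdl1)$adj.r.squared, summary(mdl5)$adj.r.squared)
)
fit_summary
```

Adjusted r-squared penalizes the extra coefficients, so it is always somewhat lower than the plain value.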

Conclusions

  1. Manual transmission is associated with better mpg than automatic transmission.
  2. Manual transmissions increase mpg by an estimated 1.81 over automatic transmissions after adjusting for cyl, disp, hp and wt, although this adjusted difference is not statistically significant (p = 0.215). The best model explains 86.6% of the variability in mpg. The remaining 13.4% is unexplained, so there is still some uncertainty in the best model.

Appendix

A.1 Structure and Head of mtcars Data Set

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

A.2 Box Plot Comparing Automatic and Manual Transmission MPG

library(ggplot2)

g <- ggplot(mtcars, aes(x=factor(am), y=mpg)) + 
        geom_boxplot(aes(group=factor(am))) + geom_jitter(width=0.1) +
        xlab("Transmission Type") + ylab("MPG") +
        ggtitle("MPG as Function of Transmission Type") +
        scale_x_discrete(labels = c("Automatic","Manual")) +
        theme(plot.title = element_text(face="bold", hjust=0.5, size=12))  

g

A.3 R Code for Welch Two-Sample t-Test

auto_mpg <- mtcars[mtcars$am==0, ]$mpg
man_mpg <- mtcars[mtcars$am==1, ]$mpg
t.test(man_mpg, auto_mpg)

A.4 Correlation between MPG and All Other Features

# Get correlations between mpg and every feature (first row of the correlation matrix)
cor(mtcars)[1,]
##        mpg        cyl       disp         hp       drat         wt 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594 
##       qsec         vs         am       gear       carb 
##  0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251
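To pick out the strongly correlated features programmatically rather than by eye, one option is to filter on the absolute correlation. This is a minimal sketch; the 0.75 cutoff comes from the main text, while the names cors and strong are illustrative.

```r
# Sketch: select features whose absolute correlation with mpg exceeds 0.75.
cors <- cor(mtcars)["mpg", ]
cors <- cors[names(cors) != "mpg"]   # drop the trivial self-correlation

strong <- cors[abs(cors) > 0.75]

# Rank the selected features from strongest to weakest association
strong[order(abs(strong), decreasing = TRUE)]
```

Filtering on the absolute value matters here because all four strong correlations are negative.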

A.5 R Code for Fitting Multiple Models

# Fit multiple models, nested; varying by one additional feature
mdl2 <- lm(mpg ~ factor(am) + factor(cyl), data=mtcars)
mdl3 <- lm(mpg ~ factor(am) + factor(cyl) + disp, data=mtcars)
mdl4 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp, data=mtcars)
mdl5 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp + wt, data=mtcars)
# Compare models using ANOVA
anova(mdl1, mdl2, mdl3, mdl4, mdl5)
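The final row of the ANOVA table can be checked by hand from the residual sums of squares of models 4 and 5. The sketch below refits both models so that it runs on its own; rss4, rss5, f_stat and p_val are illustrative names.

```r
# Sketch: reproduce the model 4 vs model 5 F test from residual sums of squares.
mdl4 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp, data = mtcars)
mdl5 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp + wt, data = mtcars)

rss4 <- sum(resid(mdl4)^2)
rss5 <- sum(resid(mdl5)^2)

# Numerator df: number of extra parameters; denominator df: model 5 residual df
df_num <- df.residual(mdl4) - df.residual(mdl5)
df_den <- df.residual(mdl5)

# F = (drop in RSS per added parameter) / (residual mean square of the larger model)
f_stat <- ((rss4 - rss5) / df_num) / (rss5 / df_den)
p_val  <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

With a single added term, this F statistic is the square of the t value for wt in the model 5 coefficient table.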

A.6 Residuals Diagnostic Plots for Model 5

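# Arrange four diagnostic plots in a 2x2 grid with tighter margins
par(mar=c(4,4,2,2))
par(mfrow=c(2,2))
plot(mdl5, which=1)  # residuals vs fitted
plot(mdl5, which=3)  # scale-location
plot(mdl5, which=2)  # normal Q-Q
plot(mdl5, which=5)  # residuals vs leverage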