Regression Course Project

Executive Summary

This report analyzes the 1974 dataset mtcars with respect to the following key questions/tasks:

Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions

The findings suggest a statistically significant difference between manual and automatic transmissions: when only the transmission type is taken into account and no other factors, manual cars are more efficient by an average of 7.2 mpg.

A more accurate model for the efficiency (mpg) includes other factors such as the weight or acceleration (1/4 mile time) of the car. Many variables in the dataset are correlated with each other, so only the most significant were selected for a predictive model. After accounting for weight and acceleration in the regression model, the transmission type still has a significant impact on gas mileage, which can be quantified as an increase of 2.9 mpg for a manual vs. an automatic car.

Hence it can be concluded that even after looking at secondary parameters, cars with an automatic transmission were less efficient in 1974 (see below for remark on this topic).

The models and results were checked against the regression assumptions (e.g. homoscedasticity and normality of the residuals), and the complete analysis is presented below. It should be noted that the wide confidence intervals are partly due to the low number of data points (among other reasons).

Exploratory Analysis

First we will need to find out the basics about the dataset.

library(ggplot2)
# Load data (mtcars ships with R's datasets package, so no working directory needs to be set)
data(mtcars)
head(mtcars,3)

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Since the transmission type (am) is a categorical variable coded as 0/1, it should be defined as a factor. A pairs plot is produced for a first look at the data. This helps to identify which variables mpg is correlated with. (The plot is suppressed here and can be found in the appendix.)

# Make am factor
mtcars$am <- factor(mtcars$am, levels = c(0,1), labels = c("Automatic", "Manual"))

# Plot all against all
pairs(mtcars, panel = panel.smooth)

It appears that many of the variables are correlated with mpg, but are also strongly correlated with each other. This is logically explainable: for example, engines with a larger number of cylinders tend to have a larger displacement and also a higher power output (horsepower), which in turn results in greater torque and acceleration (qsec).

It also appears that heavier cars (usually SUVs and high-end sedans, etc.) are more expensive and thus more likely to be equipped with an automatic transmission rather than a manual one. Lastly, automatic transmissions are historically heavier than their manual counterparts (note that this data was released in 1974; more modern (>2000) automatic transmissions are actually lighter and more efficient than manual transmissions).
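To put numbers on the visual impression from the pairs plot, the correlations with mpg can be computed directly. This is a minimal sketch; it works on a separate copy of the data (the names raw and mpg_cor are introduced here for illustration only) so that the factor conversion of am above is left untouched.

```r
# Work on an untouched copy of the dataset, where am is still numeric (0/1)
raw <- datasets::mtcars

# Correlation of every variable with mpg, sorted from most negative to most positive
# (mpg itself appears last with a correlation of 1)
mpg_cor <- sort(cor(raw)[, "mpg"])
round(mpg_cor, 2)
```

The off-diagonal entries of cor(raw) can be inspected in the same way to see how strongly the predictors are correlated with each other.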
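The claim that automatic cars in this dataset tend to be heavier can be checked with a simple group mean. This is a small sketch on a local copy of the data; the names d and wt_by_am are illustrative only.

```r
# Local copy with am as a labelled factor
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean weight (in 1,000 lbs) per transmission type
wt_by_am <- tapply(d$wt, d$am, mean)
round(wt_by_am, 2)
```

A clear difference in mean weight between the two groups is a hint that weight may confound the raw comparison of mpg by transmission type.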

Regression Model Finding

Basic Model

First, a basic model only considering mpg and transmission type was tested:

m1 <- lm(mpg ~ am, data = mtcars)
summary(m1)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

These results can be interpreted as follows: an automatic transmission reduces the fuel efficiency by an average of 7.2 mpg compared to a manual one. In addition, the transmission type alone explains about 36% of the variance of the mpg (multiple R-squared: 0.36, adjusted R-squared: 0.34).
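With a single two-level factor as the only predictor, the regression coefficients are just a re-expression of the group means. The sketch below illustrates this on a local copy of the data (d, fit_am and group_means are names chosen here for illustration).

```r
# Local copy with am as a labelled factor
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_am <- lm(mpg ~ am, data = d)
group_means <- tapply(d$mpg, d$am, mean)

# Intercept = mean mpg of the reference level (Automatic)
c(intercept = coef(fit_am)[["(Intercept)"]], mean_automatic = group_means[["Automatic"]])

# Intercept + amManual coefficient = mean mpg of the Manual group
c(intercept_plus_slope = sum(coef(fit_am)), mean_manual = group_means[["Manual"]])
```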

Full Model

Next, a full model including all variables in the dataset was tested:

m2 <- lm(mpg ~ ., data = mtcars)
summary(m2)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## amManual     2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

This model explains 87% of the variance (adjusted R-squared: 0.81) and is significant overall (F-statistic), but none of the individual variables is significant at the 5% level, because the variables are correlated with each other (see the t-values).
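One common way to quantify this multicollinearity is the variance inflation factor (VIF). The sketch below does not use an add-on package; it relies on the identity that the VIFs are the diagonal elements of the inverse of the predictors' correlation matrix. The names X and vif are illustrative, and am is kept in its numeric 0/1 coding here.

```r
# All predictors of the full model (everything except mpg), am as numeric 0/1
raw <- datasets::mtcars
X <- raw[, setdiff(names(raw), "mpg")]

# VIF_j = j-th diagonal element of the inverse correlation matrix of the predictors
vif <- diag(solve(cor(X)))
names(vif) <- colnames(X)
round(sort(vif, decreasing = TRUE), 1)
```

A VIF of 1 means a predictor is uncorrelated with the others; values above roughly 5 to 10 are often read as a sign of problematic collinearity (a rule of thumb, not a strict threshold).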

Reduction of model to best predictors

In order to find the best model, we can use stepwise regression to find the model with the lowest AIC (Akaike Information Criterion):

m3 <- step(m2, trace = FALSE)
summary(m3)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The best-fitting model for mpg appears to include weight, 1/4 mile time and transmission type as predictors.
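The AIC values of the three models considered so far can also be compared side by side. This is a sketch on a local copy of the data with illustrative names (d, fits, aic). Note that step() internally uses extractAIC(), which for linear models fitted to the same data differs from AIC() only by an additive constant, so the ranking of the models is the same.

```r
# Local copy with am as a labelled factor
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# The three candidate models from the text
fits <- list(
  basic   = lm(mpg ~ am, data = d),
  full    = lm(mpg ~ ., data = d),
  reduced = lm(mpg ~ wt + qsec + am, data = d)
)

# Lower AIC indicates a better trade-off between fit and model complexity
aic <- sapply(fits, AIC)
round(aic, 1)
```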

A further model using interaction terms was tested in the appendix.

Interpretation of the regression coefficients

The coefficients can be interpreted as follows:

Weight: for every additional 1,000 lbs of weight, the mpg drops by 3.9, holding the other variables constant.
1/4 mile time: for every additional second of 1/4 mile time, the mpg increases by 1.2, holding the other variables constant.
Transmission: if the transmission is manual, the mpg increases by 2.9, holding the other variables constant (this is smaller than in the basic model because the additional variables (wt, qsec) are now accounted for).
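The interpretation above can be illustrated with a prediction for a hypothetical car. The weight of 3,000 lbs and the 1/4 mile time of 18 seconds are arbitrary example values, not taken from the analysis; d, fit and new_cars are illustrative names.

```r
# Local copy with am as a labelled factor
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = d)

# Hypothetical car (3,000 lbs, 18 s quarter mile), once per transmission type
new_cars <- data.frame(
  wt   = 3,
  qsec = 18,
  am   = factor(c("Automatic", "Manual"), levels = levels(d$am))
)

pred <- predict(fit, newdata = new_cars)
pred

# The difference between the two predictions is exactly the amManual coefficient
diff(pred)
```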

Inference and Residual Diagnostics

Regressor significance

Based on the t-values found for the regressors (see above), we can conclude that all three (wt, qsec and am) are significant at the 5% level, although am only narrowly (p = 0.047). The intercept is not significant.

Significance of the model found

Running an ANOVA that compares the model found with the first, simple model shows that adding wt and qsec improves the fit significantly:

anova(m1,m3)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
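The F statistic in this table can be reproduced by hand from the residual sums of squares of the two nested models. The following is a minimal sketch on a local copy of the data; all object names are illustrative.

```r
# Local copy with am as a labelled factor
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_small <- lm(mpg ~ am, data = d)
fit_large <- lm(mpg ~ wt + qsec + am, data = d)

# Residual sums of squares and the number of additional parameters
rss_small <- deviance(fit_small)
rss_large <- deviance(fit_large)
df_extra  <- df.residual(fit_small) - df.residual(fit_large)

# F = (reduction in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_large) / df_extra) / (rss_large / df.residual(fit_large))
p_val  <- pf(f_stat, df_extra, df.residual(fit_large), lower.tail = FALSE)
c(F = f_stat, p = p_val)
```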

Confidence intervals

confint(m3)

##                   2.5 %    97.5 %
## (Intercept) -4.63829946 23.873860
## wt          -5.37333423 -2.459673
## qsec         0.63457320  1.817199
## amManual     0.04573031  5.825944

The 95% confidence intervals are rather wide (especially for the intercept) due to the limited number of observations and shortcomings of the model. More data points would be needed to obtain more accurate results for the general population.
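For a linear model, these intervals are the estimates plus or minus a t quantile times the standard error. The sketch below rebuilds them by hand on a local copy of the data; the object names are illustrative.

```r
# Local copy with am as a labelled factor
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = d)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Two-sided 95% interval: critical value from the t distribution
# with the residual degrees of freedom of the model
t_crit <- qt(0.975, df = df.residual(fit))
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

With only 32 observations the standard errors are comparatively large, which is what makes the intervals wide.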

Homoscedasticity and normality of residuals

The following shows that the residuals:

are evenly scattered across the fitted mpg values
approximately follow a normal distribution (QQ plot; the Shapiro-Wilk test has a p-value of 0.08, so normality is not rejected at the 5% level, see below)

This supports the assumption that the model is a good fit and that no trends were omitted.

par(mfrow = c(2,2))
plot(m3)

# Shapiro-Wilk test for normality
shapiro.test(m3$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  m3$residuals
## W = 0.9411, p-value = 0.08043

Appendix

Pairs plot from exploratory analysis

pairs(mtcars, panel = panel.smooth)

Further optimization (interaction terms)

We should not forget that weight, transmission type and probably also 1/4 mile time depend on each other. Hence, let’s refine the model further by including the interaction terms and running another stepwise regression:

m4 <- step(lm(mpg ~ am*qsec*wt, data = mtcars), trace = FALSE)
summary(m4)

## 
## Call:
## lm(formula = mpg ~ am + qsec + wt + am:wt + qsec:wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6264 -1.4660 -0.3559  1.1520  3.9559 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.1094    23.5809  -0.853 0.401568    
## amManual     14.0026     3.3918   4.128 0.000334 ***
## qsec          2.6831     1.3002   2.064 0.049171 *  
## wt            6.6931     7.4051   0.904 0.374379    
## amManual:wt  -4.1411     1.1815  -3.505 0.001675 ** 
## qsec:wt      -0.5401     0.4137  -1.306 0.203141    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.057 on 26 degrees of freedom
## Multiple R-squared:  0.9023, Adjusted R-squared:  0.8835 
## F-statistic:    48 on 5 and 26 DF,  p-value: 2.606e-12

The model found now explains 90% of the variance (adjusted R-squared: 0.88). However, with the interaction terms the interpretation of the coefficients becomes more complicated.
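To illustrate why the interpretation becomes harder: with the am:wt interaction, the estimated effect of a manual transmission is no longer a single number but depends on the weight of the car. The sketch below fits the selected formula directly on a local copy of the data; manual_effect and break_even are hypothetical helper names introduced here.

```r
# Local copy with am as a labelled factor
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Same formula as selected by the stepwise procedure above
fit_int <- lm(mpg ~ am + qsec + wt + am:wt + qsec:wt, data = d)
b <- coef(fit_int)

# Estimated mpg difference (manual minus automatic) at a given weight (in 1,000 lbs)
manual_effect <- function(wt) b[["amManual"]] + b[["amManual:wt"]] * wt
manual_effect(c(2, 3, 4))

# Weight at which the estimated advantage of a manual transmission vanishes
break_even <- -b[["amManual"]] / b[["amManual:wt"]]
break_even
```

Under this model the manual transmission is estimated to be beneficial for light cars, while the estimated advantage shrinks and eventually reverses for heavy cars.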

Mean difference between automatic and manual transmission

Below is the t-test showing that there is a significant difference between the mpg of automatic and manual cars, together with a comparative plot (violin and boxplots).

t.test(mpg ~ am, data = mtcars)

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

# am is the transmission type (Automatic/Manual)
ggplot(mtcars, aes(am, mpg)) + 
  geom_violin(fill = "lightskyblue1") +
  geom_boxplot(width = .25, fill = "salmon2") +
  xlab("Automatic/Manual") +
  ggtitle("Violin/Boxplot of Automatic vs. Manual") +
  geom_point()

Based on the t-test (and visually on the plot) we reject the null hypothesis of equal means in favor of the alternative that the transmission type has a significant influence on the mpg. The mean difference in mpg between automatic and manual transmissions is (logically the same as in the simple regression model):

abs(mean(mtcars$mpg[mtcars$am == "Automatic"]) - mean(mtcars$mpg[mtcars$am == "Manual"]))

## [1] 7.244939
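The link to the simple regression goes beyond the point estimate. The Welch test above does not assume equal variances, which is why its t statistic differs from the one in the basic model. A pooled-variance t-test (var.equal = TRUE) reproduces the regression t-value up to its sign. A minimal sketch on a local copy of the data, with illustrative names:

```r
# Local copy with am as a labelled factor
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Pooled-variance two-sample t-test (assumes equal variances in both groups)
tt_pooled <- t.test(mpg ~ am, data = d, var.equal = TRUE)

# t-value of the amManual coefficient in the simple regression
t_reg <- unname(coef(summary(lm(mpg ~ am, data = d)))["amManual", "t value"])

# Same magnitude; the sign differs because t.test() takes Automatic minus Manual
c(pooled_t = unname(tt_pooled$statistic), regression_t = t_reg)
```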

Session Info for Reproducibility

sessionInfo()

## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3      digest_0.6.9     plyr_1.8.3       grid_3.2.3      
##  [5] gtable_0.1.2     formatR_1.2.1    magrittr_1.5     evaluate_0.8    
##  [9] scales_0.3.0     stringi_1.0-1    rmarkdown_0.9.2  labeling_0.3    
## [13] tools_3.2.3      stringr_1.0.0    munsell_0.4.3    yaml_2.1.13     
## [17] colorspace_1.2-6 htmltools_0.3    knitr_1.12.3