Regression Model Course Project

Summary

The purpose of this analysis report is to

  1. study whether an automatic or manual transmission is better for MPG, and
  2. quantify the MPG difference between automatic and manual transmissions

using mtcars data set. First, the data is explored using bivariate correlations. Linear regression models are used to study if MPG differs by transmission. Finally, diagnostic analyses are conducted for the adequency of the final model. All figures are shown in the Appendix section. All statistical tests are two-tailed and considered statistically signicant at alpha level 0.05.

Data exploration

First, we examine bivariate correlations between some continuous variables using scatter plots. The scatterplot matrix (Figure 1.) shows that mpg correlates quite a bit with all continous variables, although not all correlations seem to be linear. We can also look at the correlations numerically.

Among all the variables, disp, hp, and wt have the strongest and the most negative correlations with mpg (|r| > 0.89, Figure 2.). They are also highly correlated with each other (|r| > 0.77). Wt is also highly correlated with am among all the continuous variables (|r| > 0.74). Here, Spearman correlation coefficients are used to remove normality assumption for a small dataset like this and binary variable am.

The correlation plot helps identify confounding effects to be adjusted for in a regression model. A variable that causes confounding has to be correlated with both the outcome and the predictor. Wt is the best candidate here because of its strong correlations with both mpg and am. Including only wt but not disp and hp also avoids colinearity.

Next, we examine the distributions of mpg and wt with transmission type.

#Mean mpg and wt
x <- lapply(mtcars[c('mpg', 'wt')], function(x) aggregate(x ~ factor(am, labels = c('Auto', 'Manual')), data = mtcars, mean))

x <- Reduce(cbind, x)[c(1,2,4)]
colnames(x) <- c('Transmission', 'mpg', 'wt')
x
##   Transmission      mpg       wt
## 1         Auto 17.14737 3.768895
## 2       Manual 24.39231 2.411000

Manual transmission cars have higher mean mpg (24.392 vs 17.147) and lower mean weight (2411 vs 3769 lbs) than automatic transmission cars. Figure 3. displays the boxplots that contrast the non-overlapping interquartile ranges per tranmission type.

Regression models

We regress mpg on am to test if mean mpg is significanly higher for one type of tranmission than the other. We also include wt as the adjustment and explore its interaction effect with am.

#Simple regression
fit1 <- lm(mpg ~ factor(am, labels = c('Auto', 'Manual')), data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ factor(am, labels = c("Auto", "Manual")), 
##     data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                      17.147      1.125  15.247
## factor(am, labels = c("Auto", "Manual"))Manual    7.245      1.764   4.106
##                                                Pr(>|t|)    
## (Intercept)                                    1.13e-15 ***
## factor(am, labels = c("Auto", "Manual"))Manual 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
#Multiple regression
fit2 <- lm(mpg ~ factor(am, labels = c('Auto', 'Manual')) + wt, data = mtcars)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ factor(am, labels = c("Auto", "Manual")) + 
##     wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5295 -2.3619 -0.1317  1.4025  6.8782 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                    37.32155    3.05464  12.218
## factor(am, labels = c("Auto", "Manual"))Manual -0.02362    1.54565  -0.015
## wt                                             -5.35281    0.78824  -6.791
##                                                Pr(>|t|)    
## (Intercept)                                    5.84e-13 ***
## factor(am, labels = c("Auto", "Manual"))Manual    0.988    
## wt                                             1.87e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.098 on 29 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7358 
## F-statistic: 44.17 on 2 and 29 DF,  p-value: 1.579e-09

In unadjusted regression, manual transmission cars on average run 7.245 miles farther per gallon than automatic transmission cars (p < 0.001). The intercept 17.147 is the mean mpg for automatic transmission cars. After adjusting for weight, the mean miles per gallon for manual transmission cars is only a needle lower than that for automatic tranmission cars. The difference is not statistically significant (p = 0.988).

#Interaction model
fit3 <- lm(mpg ~ factor(am)*wt, data = mtcars)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ factor(am) * wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6004 -1.5446 -0.5325  0.9012  6.0909 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     31.4161     3.0201  10.402 4.00e-11 ***
## factor(am)1     14.8784     4.2640   3.489  0.00162 ** 
## wt              -3.7859     0.7856  -4.819 4.55e-05 ***
## factor(am)1:wt  -5.2984     1.4447  -3.667  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8151 
## F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11
coef(fit3)
##    (Intercept)    factor(am)1             wt factor(am)1:wt 
##      31.416055      14.878423      -3.785908      -5.298360

The interaction term is statistically significant at alpha level = 0.05. Hence, we include the interaction term am*wt in the model. The regression equations for the two transmissions are:

Auto: y_hat = 31.416 + 3.786 * wt

Manual: y_hat = 46.294 + 9.084 * wt

The slopes are interpreted as the degree at which mean MPG decreases as weight increases by 1000 lbs. For every 1000 lbs increase in weight, automatic transmission on average reduces 3.786 MPG while manual transmission on average reduces 9.084 MPG. The difference in slopes is depicted in Figure 4. The intercepts are unrealistic; they are the mean MPG at zero weight.

Caution: The scatter plot in Figure 4. suggests mean MPG may not be comparable at weight <2500 lbs or >3500 lbs. More data is needed to support the validity of the extrapolation.

Finally, we use likelihood ratio test to find out that successively more complex models significantly improve model fitting. The improvements per adjusted model and interaction model are significant with p < 0.001.

#Likelihood ratio test
#install.packages('lmtest')
require(lmtest)
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
lrtest(fit1, fit2, fit3)
## Likelihood ratio test
## 
## Model 1: mpg ~ factor(am, labels = c("Auto", "Manual"))
## Model 2: mpg ~ factor(am, labels = c("Auto", "Manual")) + wt
## Model 3: mpg ~ factor(am) * wt
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   3 -95.242                         
## 2   4 -80.015  1 30.455  3.417e-08 ***
## 3   5 -73.738  1 12.553  0.0003955 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Diagnostic

Now, we want to evaluate the final interaction model using diagnostic plots.

par(mfrow = c(2,2), mar= c(4,4,2,3))
plot(fit3)

The diagnostic plots are interpretd as follows:

  1. Residuals vs Fitted plot: There is no specific pattern in residuals distribution across predicted mpg meaning that heteroskedasticity assumption holds.

  2. Normal Q-Q plot: There is slight deviation of observations from the expected quantiles as delineated in the dotted line meaning some violation in normality assumption.

  3. Scake-Location plot: Similar to 1) but with the y-axis plotting square root of absolute values of residuals. Again, there is no specific patterns in residuals per fitted values.

  4. Residuals vs Leverage: There is no high leverage measure, i.e. outlying weight value, that influences the model fit.

Specifically, we can look at the range of diagnostic measures individually.

summary(hatvalues(fit3))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05264 0.06228 0.08658 0.12500 0.16290 0.37100
summary(dfbetas(fit3))
##   (Intercept)         factor(am)1               wt           
##  Min.   :-0.824005   Min.   :-0.3715329   Min.   :-0.398043  
##  1st Qu.:-0.017053   1st Qu.:-0.0931240   1st Qu.:-0.005044  
##  Median : 0.000000   Median :-0.0055104   Median : 0.000000  
##  Mean   :-0.003194   Mean   : 0.0003825   Mean   : 0.003962  
##  3rd Qu.: 0.004534   3rd Qu.: 0.0527351   3rd Qu.: 0.009767  
##  Max.   : 0.492665   Max.   : 0.5836215   Max.   : 0.930095  
##  factor(am)1:wt      
##  Min.   :-0.5057988  
##  1st Qu.:-0.0497294  
##  Median : 0.0040761  
##  Mean   : 0.0007475  
##  3rd Qu.: 0.0742876  
##  Max.   : 0.3913157
summary(cooks.distance(fit3))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 6.000e-08 3.824e-03 1.441e-02 3.385e-02 3.144e-02 2.507e-01

The first one is hat values; high values mean some weight values are influential to the model fit. The second one is dfbetas; it means how much each regession coefficient is changed due to deletion of ith observation. The third one is Cook’s distance; similar to dfbetas but it measures the overall change in regression coefficients. All the distributions seem to be consistent; data does not seem to have influential observations.

Conclusion

Automatic transmission seems to have better MPG for heavier cars while manual transmission has better MPG for lighter cars. For cars that weigh around 2500 - 3000 lbs, transmission does not have much to do with MPG.

Appendix

data(mtcars)
#[, 1]   mpg     Miles/(US) gallon
#[, 2]   cyl     Number of cylinders
#[, 3]   disp    Displacement (cu.in.)
#[, 4]   hp  Gross horsepower
#[, 5]   drat    Rear axle ratio
#[, 6]   wt  Weight (lb/1000)
#[, 7]   qsec    1/4 mile time
#[, 8]   vs  V/S
#[, 9]   am  Transmission (0 = automatic, 1 = manual)
#[,10]   gear    Number of forward gears
#[,11]   carb    Number of carburetors

#Pairwise scatter plots
require(graphics)
pairs(~mpg+disp+hp+drat+wt+qsec+carb, data = mtcars, panel = panel.smooth, main = "Figure 1. Scatterplot matrix")

#Look at correlations
#install.packages('corrplot')
corr <- round(cor(mtcars[c(1,3,4,5,6,7,9,11)], method = 'spearman'), 4)
require(corrplot)
## Loading required package: corrplot
corrplot(corr, method = "number", mar = c(0,0,1,0),
         title = "Figure 2. Correlation plot")

#Panel boxplots
par(mfcol = c(1,2), mar = c(4,4,1,1), oma = c(0,0,3,0))
boxplot(mpg ~ ordered(mtcars$am, levels = c(0, 1),
              labels = c("Auto", "Manual")),
        data = mtcars, 
        #xlab = 'Transmission',
        ylab = 'MPG')

boxplot(wt ~ ordered(mtcars$am, levels = c(0, 1),
              labels = c("Auto", "Manual")),
        data = mtcars, 
        #xlab = 'Transmission',
        ylab = 'Weight lb/1000')
mtext("Figure 3. Boxplots per transmission", line = 1, side = 3, outer = T)

#Interaction model
#install.packages('ggplot2')
library(ggplot2)
g <- ggplot(aes(x = wt, y = mpg, col = factor(am)), data = mtcars) +
     geom_point(size = 4) +
     #xlim(0, 5.5) + ylim(10, 50) +
     xlab('Weight lb/1000') + ylab('MPG') +
     ggtitle('Figure 4. MPG vs weight')

g +  geom_abline(intercept = coef(fit3)[1], slope = coef(fit3)[3], 
                 col = 'coral1', size = 1) +
     geom_abline(intercept = coef(fit3)[1] + coef(fit3)[2], 
                 slope = coef(fit3)[3] + coef(fit3)[4], 
                 col = 'cyan3', size = 1) +
    scale_colour_discrete(name = 'Transmission',
                          labels = c('Auto', 'Manual'))