Executive Summary

This project uses the mtcars data set to explore the relationship between miles per gallon (mpg; outcome) and transmission type (am; predictor). The analysis addresses two questions: whether transmission type is associated with a difference in MPG, and how large that difference is once other variables are taken into account.

Exploratory analysis showed that the average MPG for automatic cars is about 17.15 mpg and for manual cars about 24.39 mpg. However, after including disp (displacement), cyl (number of cylinders), and wt (weight) in the model, transmission became much less significant, accounting for only a 0.14 mpg difference in means between manual and automatic cars, holding all other variables constant.

Therefore, once other variables are accounted for, the difference in MPG between transmission types is not significant.

For documentation on the mtcars dataset and its variables, see ?mtcars in R.

Exploratory Analysis

data(mtcars) #load the data set
str(mtcars) #MPG = mpg, Transmission = am
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mtcars[!complete.cases(mtcars),] #Check for NAs in dataset
##  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
## <0 rows> (or 0-length row.names)

There are no missing values and all variables are numerical. We will transform appropriate variables into factors to help with analysis. The resulting dataset mtcars2 will be used from now on.

mtcars2 <- within(mtcars, {
        vs <- factor(vs, labels = c("V", "S"))
        am <- factor(am, labels = c("automatic", "manual"))
        cyl  <- ordered(cyl)
        gear <- ordered(gear)
        carb <- ordered(carb)
        })
#factor conversions adapted from the mtcars documentation example
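
A quick re-inspection of the structure (output omitted here) can confirm that the conversions took effect:

str(mtcars2) #vs and am should now be factors; cyl, gear, carb ordered factors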

Plot of MPG by Transmission Type

boxplot(mpg ~ am, data = mtcars2, col=c("pink", "green"), xlab = "Transmission Type", ylab = "Miles per Gallon")

aggregate(mpg ~ am, mean, data = mtcars2)
##          am      mpg
## 1 automatic 17.14737
## 2    manual 24.39231

From the plot, there appears to be a clear difference in mpg between the two transmission types. In fact, the mean mpg for manual cars is about 7.24 mpg higher than for automatic cars.
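
This difference can also be tested formally. As a quick check (not part of the original output), a Welch two-sample t-test compares the two group means:

t.test(mpg ~ am, data = mtcars2) #tests the null hypothesis that mean mpg is equal for automatic and manual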

Possible Confounding Variables

pairs(mtcars2, panel = panel.smooth) #pairs plot of mpg against the other variables to check for possible correlation

Looking at the first row of the panel of plots, there may be correlations between mpg and several of the other variables. Therefore, a regression model that includes more than just am (transmission) may be a better fit.
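
To quantify these relationships, the correlations between mpg and the other variables can be computed; a small sketch using the original numeric mtcars (mtcars2 contains factors, which cor() does not accept):

round(sort(abs(cor(mtcars)[1, -1]), decreasing = TRUE), 2) #absolute correlation of each variable with mpg, largest first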

Regression Analysis

First, a simple linear model with transmission as the only predictor will be fit.

fit_simple <- lm(mpg ~ am, data = mtcars2)
summary(fit_simple)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The intercept (about 17.15) is the mean mpg for automatic transmissions, and the slope (about 7.24) is the difference in mean mpg between the two groups. In other words, cars with a manual transmission average about 7.24 mpg more than those with an automatic transmission. The p-value for the slope (0.000285) also indicates that this difference is significant.

The R-squared value for this model is only about 0.36, meaning the model explains only about 36% of the total variability in mpg. Therefore, other regressors may need to be included.
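
For completeness, a 95% confidence interval for the transmission effect in this simple model shows the uncertainty around the 7.24 mpg estimate:

confint(fit_simple) #95% confidence intervals for the intercept and the manual-vs-automatic difference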

Multivariable Model

We shall begin by regressing mpg against all variables. Then we will conduct an analysis of variance with the anova() function to see which regressors explain a significant amount of additional variance when added sequentially to the model.

Note: ANOVA assumes that the model residuals are normally distributed, so we will also run a Shapiro-Wilk test (shapiro.test), whose null hypothesis is that the residuals are normal.

fit_all <- lm(mpg ~ ., data = mtcars2)
summary(fit_all)$coef
##                Estimate  Std. Error     t value   Pr(>|t|)
## (Intercept) 26.57171130 19.56615969  1.35804428 0.19452722
## cyl.L       -0.23770312  5.06255894 -0.04695316 0.96317000
## cyl.Q        2.02541267  2.14952059  0.94226251 0.36098761
## disp         0.03554632  0.03189920  1.11433290 0.28267339
## hp          -0.07050683  0.03942556 -1.78835344 0.09393155
## drat         1.18283018  2.48348458  0.47627845 0.64073922
## wt          -4.52977584  2.53874584 -1.78425732 0.09461859
## qsec         0.36784482  0.93539569  0.39325050 0.69966720
## vsS          1.93085054  2.87125777  0.67247551 0.51150791
## ammanual     1.21211570  3.21354514  0.37718957 0.71131573
## gear.L       1.78784595  2.64200408  0.67670068 0.50889747
## gear.Q       0.12234634  2.40896047  0.05078802 0.96016458
## carb.L       6.06155540  6.72821581  0.90091572 0.38187054
## carb.Q       1.78825141  2.80043297  0.63856247 0.53273703
## carb.C       0.42384333  2.57388972  0.16467035 0.87140202
## carb^4       0.93317347  2.45041298  0.38082294 0.70867391
## carb^5      -2.46409938  2.90450253 -0.84837226 0.40956528
#only hp and wt have a p value under 0.1
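
As a cross-check on this manual screening (a sketch only, not used to choose the final model below), an automated stepwise search by AIC could also be run:

fit_step <- step(fit_all, direction = "both", trace = 0) #stepwise AIC selection starting from the full model
summary(fit_step)$coef #coefficients of the AIC-selected model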

shapiro.test(fit_all$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit_all$residuals
## W = 0.96146, p-value = 0.301
#p value is large so do not reject the null that residuals are approximately normal

anova(fit_all)
## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## cyl        2 824.78  412.39 51.3766 1.943e-07 ***
## disp       1  57.64   57.64  7.1813   0.01714 *  
## hp         1  18.50   18.50  2.3050   0.14975    
## drat       1  11.91   11.91  1.4843   0.24191    
## wt         1  55.79   55.79  6.9500   0.01870 *  
## qsec       1   1.52    1.52  0.1899   0.66918    
## vs         1   0.30    0.30  0.0376   0.84878    
## am         1  16.57   16.57  2.0639   0.17135    
## gear       2   5.02    2.51  0.3128   0.73606    
## carb       5  13.60    2.72  0.3388   0.88144    
## Residuals 15 120.40    8.03                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table shows that disp and wt have p-values under 0.05, and cyl has a p-value well under 0.001. Therefore we will create a final model including these three variables. Note that am did not have a significant impact at all, but we will include it in the model to see its effect once the other regressors are accounted for.

Final Model

fit_final <- lm(mpg ~ disp + wt + cyl + am, data = mtcars2)
summary(fit_final)
## 
## Call:
## lm(formula = mpg ~ disp + wt + cyl + am, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5029 -1.2829 -0.4825  1.4954  5.7889 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.275005   3.290562   9.201 1.17e-09 ***
## disp         0.001632   0.013757   0.119   0.9065    
## wt          -3.249176   1.249098  -2.601   0.0151 *  
## cyl.L       -4.467788   1.872177  -2.386   0.0246 *  
## cyl.Q        0.935362   1.052358   0.889   0.3822    
## ammanual     0.141212   1.326751   0.106   0.9161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.652 on 26 degrees of freedom
## Multiple R-squared:  0.8376, Adjusted R-squared:  0.8064 
## F-statistic: 26.82 on 5 and 26 DF,  p-value: 1.73e-09
#check residual plots
par(mfrow = c(2, 2))
plot(fit_final)

With the other regressors included, am becomes much less significant, accounting for only a 0.14 mpg difference in means between manual and automatic transmissions, holding all other variables constant.
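
The near-zero effect of am can also be tested directly by comparing the final model with and without it; a sketch of a nested-model F-test (fit_no_am is an illustrative name, not part of the original code):

fit_no_am <- update(fit_final, . ~ . - am) #same model without transmission
anova(fit_no_am, fit_final) #F-test for whether am adds explanatory power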

The residual plots do not reveal any clear abnormality; the residuals appear approximately normal (Q-Q plot). The Toyota Corolla may be a high-leverage point that is influential on the model.
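
Leverage and influence can also be checked numerically; a quick sketch using base diagnostics:

sort(hatvalues(fit_final), decreasing = TRUE)[1:3] #cars with the highest leverage
sort(cooks.distance(fit_final), decreasing = TRUE)[1:3] #most influential cars by Cook's distance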

Limitations

library(car)
vif(fit_final)
##           GVIF Df GVIF^(1/(2*Df))
## disp 12.813023  1        3.579528
## wt    6.583720  1        2.565876
## cyl   6.971234  2        1.624903
## am    1.931783  1        1.389886

disp has a (generalized) VIF of about 12.81, which is considered large. This means the variance of its coefficient estimate is almost 13 times larger than it would be if disp were uncorrelated with the other regressors (equivalently, its standard error is inflated by a factor of about 3.6). Further analysis should be conducted to check for problematic collinearity.
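
One simple follow-up (a sketch, not part of the original analysis) is to refit without disp, which was also non-significant in the final model, and check whether the remaining VIFs drop to acceptable levels:

fit_reduced <- update(fit_final, . ~ . - disp) #refit the final model without disp
vif(fit_reduced) #recheck collinearity among wt, cyl, and am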