Executive Summary

In this report, the mtcars dataset from the 1974 Motor Trend US magazine was analyzed to explore whether transmission type (automatic or manual) affects gas mileage and, if so, by how much. A simple linear model shows that manual transmission gives 7.2 mpg more than automatic, but this estimate appears to be biased because other variables also affect mpg. Multivariate regression was therefore used, with the covariates chosen using correlation and ANOVA. In our multivariate model, manual transmission is estimated to improve mpg by 2.08, but the effect is not statistically significant at the p<0.05 level once the other confounding factors (hp and wt) are considered.

Exploratory Analysis

The dataset mtcars was used in this study. The str command is used first to look in detail at the variables present in the dataset. All the variables are stored as numeric, though some, such as am and vs, take only a few distinct values.

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Since we need to study the impact of transmission type on mpg, we first draw a box plot of mpg against the am variable.

library(ggplot2)  # provides ggplot(), aes(), geom_boxplot()
ggplot(mtcars, aes(x=as.factor(am),y=mpg))+
  geom_boxplot(aes(fill=factor(am))) + xlab("Transmission Types") +
  ylab("MPG") + ggtitle("MPG by Transmission Type")

The boxplot suggests that manual transmission provides a higher MPG on average. However, regression analysis is performed next to see whether this still holds once the other factors are considered.
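The visual impression from the boxplot can be backed by the group means. The following is a minimal sketch; the names `cars`, `mpg_means` and `mpg_counts` are illustrative and not part of the original analysis, and a local copy of the data is used so that the report's own mtcars object is left untouched.

```r
# Local copy of the built-in data; am is still numeric here (0 = automatic, 1 = manual).
cars <- datasets::mtcars

# Mean mpg and number of cars for each transmission type.
mpg_means  <- tapply(cars$mpg, cars$am, mean)
mpg_counts <- table(cars$am)

print(round(mpg_means, 2))
print(mpg_counts)

# Raw difference in mean mpg, manual minus automatic.
print(round(mpg_means[["1"]] - mpg_means[["0"]], 2))
```

This raw difference is what a one-predictor regression of mpg on am estimates as its slope.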

Correlation among the variables

First we look at the correlation matrix of the variables in the mtcars dataset to choose the appropriate covariates. mpg is the outcome and am is the predictor of interest, but we also need to look at other factors that may come into play.

library(corrplot)  # provides corrplot.mixed()
corrplot.mixed(cor(mtcars))

The correlation plot shows that the outcome mpg is highly correlated with the cyl, disp, hp and wt variables. We also find that cyl and disp are highly correlated with each other, that cyl is correlated with hp, and that disp is correlated with wt. Collinear variables are usually avoided in regression models because they inflate the standard errors of the coefficient estimates.
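The same information can be read off numerically rather than from the plot. This is a small sketch under the assumption that absolute Pearson correlation is the ranking criterion; `cars`, `cor_mat`, `mpg_cor` and `candidates` are illustrative names, not objects from the original report.

```r
# Local copy with all columns numeric, as cor() requires.
cars <- datasets::mtcars
cor_mat <- cor(cars)

# Absolute correlation of each variable with mpg, strongest first.
mpg_cor <- sort(abs(cor_mat["mpg", colnames(cor_mat) != "mpg"]), decreasing = TRUE)
print(round(mpg_cor, 2))

# Pairwise correlations among the four strongest candidates,
# to see how collinear they are with one another.
candidates <- c("cyl", "disp", "hp", "wt")
print(round(cor_mat[candidates, candidates], 2))
```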

Simple Linear Regression

We first convert the numeric variable am to a factor variable with the labels “A” and “M” for automatic and manual respectively. This makes the model output easier to interpret.

mtcars$am<-as.factor(mtcars$am)
levels(mtcars$am)<-c("A","M")

At first, the outcome mpg is fitted to the single predictor variable am. The null hypothesis being tested is that mean MPG does not differ between automatic and manual transmission.

fit<-lm(mpg~am,data=mtcars)
summary(fit)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amM          7.244939   1.764422  4.106127 2.850207e-04
summary(fit)$r.squared
## [1] 0.3597989

The model shows that automatic transmission cars average 17.1 mpg (the intercept), while manual transmission provides an increase of 7.2 mpg, as given by the amM coefficient. The p-value (0.000285) is well below 0.05, so we can reject the null hypothesis. However, the R^2 value shows that only about 36% of the variance is explained, so there could be other significant predictors of MPG.
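A regression on a single two-level factor is equivalent to a pooled-variance two-sample t-test, which gives a quick cross-check of the coefficient table. The sketch below assumes equal variances in the two groups (which is what lm assumes); `cars` and `tt` are illustrative names not used in the original report.

```r
# Local copy so the report's mtcars object is not modified.
cars <- datasets::mtcars

# Pooled-variance (Student) t-test of mpg between the two transmission groups.
tt <- t.test(mpg ~ am, data = cars, var.equal = TRUE)

# The magnitude of the t statistic and the p-value should match the amM row of
# the lm coefficient table; the sign is flipped because t.test() takes
# group 0 minus group 1.
print(unname(tt$statistic))
print(tt$p.value)

# Difference in group means, manual minus automatic.
print(unname(diff(tt$estimate)))
```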

So next we fit a new model with all the variables.

fit1<-lm(mpg~., data=mtcars)
summary(fit1)$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## amM          2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871
summary(fit1)$r.squared
## [1] 0.8690158

We get a much higher R^2 value, but none of the predictors is statistically significant at the p<0.05 level. So we need to choose a smaller, more appropriate set of variables for the model.
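One way to see why no coefficient is significant in the full model is to compute variance inflation factors (VIF), which measure how much collinearity inflates each coefficient's variance. The sketch below computes them by hand in base R as 1 / (1 - R^2), where R^2 comes from regressing each predictor on all the others. VIF is not used in the original analysis; it is brought in here only as an illustration, and `cars`, `predictors` and `vif` are hypothetical names.

```r
# Local copy; am is left numeric, which does not change the VIF of a 0/1 variable.
cars <- datasets::mtcars
predictors <- setdiff(names(cars), "mpg")

# For each predictor, regress it on the remaining predictors and
# convert the resulting R^2 into a variance inflation factor.
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = cars))$r.squared
  1 / (1 - r2)
})

print(round(sort(vif, decreasing = TRUE), 1))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, though the threshold is a convention rather than a strict test.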

Multivariate Regression

From the correlation results above, we identified a few variables that seem to have an impact on mpg. We perform ANOVA next on a sequence of nested models to see which variables are important for the model fit. The four correlated variables identified earlier and the variable am are included in the different models. The null hypothesis for each comparison is that the added variables do not improve the fit.

# Note: fit1 is reassigned here to the simple model and replaces the all-variables fit above
fit1 <- lm(mpg~am, data=mtcars)
fit2<-lm(mpg~am+hp+wt, data=mtcars)
fit3<-lm(mpg~am+hp+wt+cyl+disp, data=mtcars)
anova(fit1, fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + hp + wt
## Model 3: mpg ~ am + hp + wt + cyl + disp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 180.29  2    540.61 43.0841 5.576e-09 ***
## 3     26 163.12  2     17.17  1.3685    0.2722    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA shows that adding cyl and disp in fit3 does not improve significantly on fit2 (p=0.27), whereas the multivariate model fit2 is a significant improvement over the simple model fit1 (p<0.001). The variables hp and wt are therefore kept in the model, and cyl and disp are left out.
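The F statistics in the table can be reproduced from the residual sums of squares. The sketch below is a worked computation, not part of the original analysis; the names `cars`, `m1`, `m2`, `m3`, `rss`, `df_res`, `scale_ms`, `f_12` and `f_23` are illustrative. It relies on the fact that, when several models are compared at once, anova() uses the residual mean square of the largest model as the denominator for every row.

```r
# Local copy with am recoded as in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("A", "M"))

m1 <- lm(mpg ~ am, data = cars)
m2 <- lm(mpg ~ am + hp + wt, data = cars)
m3 <- lm(mpg ~ am + hp + wt + cyl + disp, data = cars)

# Residual sum of squares and residual degrees of freedom for each model.
rss    <- c(deviance(m1), deviance(m2), deviance(m3))
df_res <- c(df.residual(m1), df.residual(m2), df.residual(m3))

# Denominator: residual mean square of the largest model (m3).
scale_ms <- rss[3] / df_res[3]

# F = (drop in RSS / number of added terms) / residual mean square.
f_12 <- ((rss[1] - rss[2]) / (df_res[1] - df_res[2])) / scale_ms
f_23 <- ((rss[2] - rss[3]) / (df_res[2] - df_res[3])) / scale_ms

# Upper-tail p-values from the F distribution.
p_12 <- pf(f_12, df_res[1] - df_res[2], df_res[3], lower.tail = FALSE)
p_23 <- pf(f_23, df_res[2] - df_res[3], df_res[3], lower.tail = FALSE)

print(c(F_model2 = f_12, F_model3 = f_23))
print(c(p_model2 = p_12, p_model3 = p_23))
```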

summary(fit2)$coef
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 34.00287512 2.642659337 12.866916 2.824030e-13
## amM          2.08371013 1.376420152  1.513862 1.412682e-01
## hp          -0.03747873 0.009605422 -3.901830 5.464023e-04
## wt          -2.87857541 0.904970538 -3.180850 3.574031e-03
summary(fit2)$r.squared
## [1] 0.8398903

This shows that even though manual transmission is estimated to improve average MPG by 2.08 when hp and wt are held constant, the effect is not statistically significant at the p<0.05 level (p=0.141), whereas hp and wt both change MPG significantly.

Confidence Interval

Checking the 95% confidence intervals of the coefficients shows that the interval for am includes 0, which is consistent with its effect not being statistically significant at the 5% level.

confint(fit2)
##                   2.5 %      97.5 %
## (Intercept) 28.58963286 39.41611738
## amM         -0.73575874  4.90317900
## hp          -0.05715454 -0.01780291
## wt          -4.73232353 -1.02482730
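These intervals follow the usual formula estimate ± t × standard error, with the t quantile taken at the model's residual degrees of freedom. The sketch below reproduces them by hand; `cars`, `m2`, `est`, `t_crit` and `ci_manual` are illustrative names that do not appear in the original report.

```r
# Local copy with am recoded as in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("A", "M"))
m2 <- lm(mpg ~ am + hp + wt, data = cars)

# Coefficient table: estimates and standard errors.
est <- coef(summary(m2))

# Two-sided 95% critical value of the t distribution.
t_crit <- qt(0.975, df = df.residual(m2))

# Manual confidence limits, one row per coefficient.
ci_manual <- cbind(
  lower = est[, "Estimate"] - t_crit * est[, "Std. Error"],
  upper = est[, "Estimate"] + t_crit * est[, "Std. Error"]
)
print(ci_manual)
```

An interval that straddles 0, as the amM row does, corresponds to a two-sided p-value above 0.05 for that coefficient.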

Residual Diagnostics

Finally we look at the residual diagnostics of the above model.

par(mfrow = c(2,2))
plot(fit2)

The residuals appear approximately normally distributed and do not exhibit obvious heteroskedasticity. The residuals vs. leverage plot also does not show any strongly influential outliers.
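The visual diagnostics can be complemented with a few numerical checks. The sketch below is an optional addition, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and flags points using two common rules of thumb (leverage above twice the mean hat value, Cook's distance above 4/n). These cut-offs are conventions assumed here for illustration, and `cars`, `m2`, `sw`, `h`, `cd`, `high_leverage` and `influential` are hypothetical names.

```r
# Local copy with am recoded as in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("A", "M"))
m2 <- lm(mpg ~ am + hp + wt, data = cars)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals.
sw <- shapiro.test(residuals(m2))
print(sw)

# Leverage (hat values) and influence (Cook's distance) for each car.
h  <- hatvalues(m2)
cd <- cooks.distance(m2)

# Rule-of-thumb flags (assumed thresholds, not formal tests).
high_leverage <- names(h)[h > 2 * mean(h)]
influential   <- names(cd)[cd > 4 / length(cd)]

print(high_leverage)
print(influential)
```

Any cars flagged this way are worth inspecting individually before concluding that the fit is free of influential points.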