Executive Summary

Manual cars have a better mean mpg than automatic cars. In the best model found below, which also accounts for weight (wt) and quarter-mile time (qsec), the mean mpg for manual cars is about 2.9 higher than the mean mpg for automatic cars.

Data Analysis

Exploring the mtcars data set

First we load the data set and look at its structure:

library(datasets)
data("mtcars")
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

As you can see, the data set has 11 variables and 32 observations. The description of the variables can be found in the appendix.

The effect of the am variable (automatic|manual) on mpg

First we look at a regression model in which the “am” variable is the only predictor:

fit1 <- lm(mpg ~ as.factor(am), data = mtcars)
summary(fit1)$coef
##                 Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)    17.147368   1.124603 15.247492 1.133983e-15
## as.factor(am)1  7.244939   1.764422  4.106127 2.850207e-04

According to the variable descriptions, am=0 represents automatic cars and am=1 represents manual cars. From the regression model above, the mean mpg for automatic cars is 17.15 (the intercept), and the mean mpg for manual cars is 7.24 higher. This shows that, if we ignore the other variables, manual cars have better mpg than automatic cars. Figure one in the appendix is the box plot of “mpg” vs “am”, and it also shows that manual cars have better mpg than automatic cars.
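
With a single two-level factor as the only predictor, the intercept is the mean of the reference group and the slope is the difference between the two group means. The following is a small sketch of that check; the names `transmission`, `group_means`, `mean_diff` and `slope` are introduced here for illustration and are not part of the original analysis.

```r
library(datasets)
data("mtcars")

# Label the transmission type: 0 = automatic, 1 = manual
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Mean mpg within each transmission group
group_means <- tapply(mtcars$mpg, transmission, mean)
print(group_means)

# Difference of the two group means
mean_diff <- unname(group_means["manual"] - group_means["automatic"])

# Slope of the one-predictor regression used in the text
fit1 <- lm(mpg ~ as.factor(am), data = mtcars)
slope <- unname(coef(fit1)["as.factor(am)1"])

# The two numbers should agree
print(c(mean_diff = mean_diff, slope = slope))
```

The automatic group mean should match the intercept reported above, and the difference should match the as.factor(am)1 estimate.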

Regression model with all variables

Now let’s fit a regression model that uses all variables as predictors (the summary of this regression is in the appendix):

fit.all <- lm(mpg ~ ., data = mtcars)

As the summary of fit.all in the appendix shows, if we consider all the variables then the mean mpg for manual cars is 2.5 higher than for automatic cars, although this coefficient is not statistically significant (p = 0.234). Figure two in the appendix shows no specific pattern in the residual plot.
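
To see how uncertain the “am” estimate is in the full model, one option is to look at its confidence interval. This is only a sketch; `am_est` and `am_ci` are names chosen here, and the 95% level is an assumption rather than something fixed by the analysis above.

```r
library(datasets)
data("mtcars")

# Full model with every other variable as a predictor
fit.all <- lm(mpg ~ ., data = mtcars)

# Point estimate for the am coefficient
am_est <- unname(coef(fit.all)["am"])

# 95% confidence interval for the same coefficient
am_ci <- confint(fit.all, "am", level = 0.95)

print(am_est)
print(am_ci)
```

An interval that contains zero is consistent with the large p-value for “am” in the summary of fit.all.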

Best regression model

Now we want to optimize our regression model by eliminating some variables from fit.all. First let’s look at the variance inflation factors:

library(car)
## Warning: package 'car' was built under R version 3.2.2
sqrt(vif(fit.all))
##      cyl     disp       hp     drat       wt     qsec       vs       am 
## 3.920948 4.649757 3.135608 1.837014 3.894212 2.743712 2.228424 2.156035 
##     gear     carb 
## 2.314617 2.812249
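
The variance inflation factor of a predictor can also be computed from its definition, 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below does this without the car package; `predictors`, `vif_manual` and `sqrt_vif` are illustrative names, and the result is expected to agree with the output above only because every term in this model is a single numeric column.

```r
library(datasets)
data("mtcars")

# All columns except the response are predictors in fit.all
predictors <- setdiff(names(mtcars), "mpg")

# For each predictor, regress it on the remaining predictors
# and convert the resulting R^2 into a VIF
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux_fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux_fit)$r.squared)
})

# Square roots, to match the scale of the output above
sqrt_vif <- sqrt(vif_manual)
print(sqrt_vif)
```

The square root of a VIF indicates roughly how much wider the standard error of that coefficient is compared with the case of uncorrelated predictors.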

As you can see, “cyl”, “hp”, “disp” and “wt” have high VIF values (the output above shows the square roots of the VIFs), so we may have to eliminate some of them from our model. We can use the “step” function to build the optimized model (the summary can be found in the appendix):

fit.best <- step(fit.all, trace=0)

As the summary of fit.best in the appendix shows, the best model is “mpg ~ wt + qsec + am”. This is plausibly because “cyl”, “hp” and “disp” have high VIF values, and “drat”, “gear” and “vs” are correlated with “am” or “wt” (appendix, figure three). According to the summary of “fit.best”, the mean mpg for manual cars is 2.9358 higher than for automatic cars when “wt” and “qsec” are held fixed. We can now select our optimized model by comparing “fit1”, “fit.all” and “fit.best”. The result of applying the anova function to these regression models can be found in the appendix. Based on the p-values in that table, “fit.best” fits significantly better than “fit1” (p = 8.025e-08), while “fit.all” does not improve significantly on “fit.best” (p = 0.8636), so “fit.best” is the best model for our analysis.
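
By default the “step” function compares candidate models by AIC, so another way to look at the three models side by side is to tabulate their AIC values. This is a minimal sketch that refits the models so that it runs on its own; `aic_table` and `selected_terms` are names introduced here.

```r
library(datasets)
data("mtcars")

# Refit the three models discussed in the text
fit1 <- lm(mpg ~ as.factor(am), data = mtcars)
fit.all <- lm(mpg ~ ., data = mtcars)
fit.best <- step(fit.all, trace = 0)

# Predictors kept by the stepwise search
selected_terms <- attr(terms(fit.best), "term.labels")
print(selected_terms)

# AIC of each model; lower values are preferred
aic_table <- AIC(fit1, fit.all, fit.best)
print(aic_table)
```

A lower AIC for fit.best than for the other two models would agree with the conclusion drawn from the anova table.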

Appendix

with(mtcars, {
plot(as.factor(am), mpg, main="Figure one", xlab="am", ylab="mpg")
})

par(mfrow=c(2,2))
plot(fit.all,main="Figure two")

Figure three (correlation matrix):

cor(mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

Summary of regression fit.all:

summary(fit.all)$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am           2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

Summary of regression fit.best:

summary(fit.best)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Comparing regression models (ANOVA)

anova(fit1,fit.best,fit.all)
## Analysis of Variance Table
## 
## Model 1: mpg ~ as.factor(am)
## Model 2: mpg ~ wt + qsec + am
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 169.29  2    551.61 39.2687 8.025e-08 ***
## 3     21 147.49  7     21.79  0.4432    0.8636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
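
The F statistic in the last row of the table can be reproduced from the residual sums of squares and residual degrees of freedom of the two nested models. The sketch below is an illustration under that assumption; `rss_best`, `rss_all`, `f_stat` and `p_val` are names chosen here, and fit.best is written out with the formula shown in its summary.

```r
library(datasets)
data("mtcars")

# The reduced model and the full model being compared
fit.best <- lm(mpg ~ wt + qsec + am, data = mtcars)
fit.all <- lm(mpg ~ ., data = mtcars)

# Residual sums of squares and residual degrees of freedom
rss_best <- deviance(fit.best)
rss_all <- deviance(fit.all)
df_best <- df.residual(fit.best)
df_all <- df.residual(fit.all)

# F = (drop in RSS per extra parameter) / (residual mean square of full model)
f_stat <- ((rss_best - rss_all) / (df_best - df_all)) / (rss_all / df_all)

# Upper-tail probability of the F distribution
p_val <- pf(f_stat, df_best - df_all, df_all, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

The printed values should match the row for Model 3 in the table above (F = 0.4432, Pr(>F) = 0.8636).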