Manual cars have a higher mean mpg than automatic cars. In our final model, which adjusts for weight (“wt”) and quarter-mile time (“qsec”), the mean mpg for manual cars is about 2.9 higher than the mean mpg for automatic cars.
First we load the data set and look at its structure:
library(datasets)
data("mtcars")
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
As you can see, the data set has 32 observations of 11 variables. The description of the variables can be found in the appendix.
We start with a regression model in which “am” is the only predictor:
fit1=lm(mpg~as.factor(am),data=mtcars)
summary(fit1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## as.factor(am)1 7.244939 1.764422 4.106127 2.850207e-04
According to the variable descriptions, am=0 represents automatic cars and am=1 represents manual cars. From the regression model above, the mean mpg for automatic cars is 17.15 (the intercept), and the mean mpg for manual cars is 7.24 higher. This shows that, if we ignore the other variables, manual cars have better mpg than automatic cars, and the difference is statistically significant (p = 0.000285). Figure one in the appendix shows the box plot of “mpg” vs “am”, where manual cars again have a better mpg than automatic cars.
Now let’s fit a regression model in which all the other variables are used as predictors (the summary of this regression is in the appendix):
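Because “am” is the only predictor in fit1, its coefficients are just group means. The following is a minimal sketch of that check; the names group_means, mean_diff and tt are ours and not part of the original analysis, and the Welch t-test is an extra check we add here, not something the report relies on.

```r
data(mtcars)

# Mean mpg by transmission type (0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
print(group_means)

# Difference in group means (manual minus automatic)
mean_diff <- unname(group_means["1"] - group_means["0"])

# Same single-predictor model as in the text
fit1 <- lm(mpg ~ as.factor(am), data = mtcars)
print(coef(fit1))

# The intercept is the automatic mean and the slope is the difference
print(all.equal(unname(coef(fit1)[1]), unname(group_means["0"])))
print(all.equal(unname(coef(fit1)[2]), mean_diff))

# Extra check (our addition): Welch two-sample t-test of the same difference
tt <- t.test(mpg ~ am, data = mtcars)
print(tt$p.value)
```

The two all.equal calls should both print TRUE, which confirms that fit1 says nothing more than "compare the two group averages".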
fit.all=lm(mpg~.,data=mtcars)
As the summary of fit.all in the appendix shows, if we consider all the variables then the mean mpg for manual cars is 2.5 higher than for automatic cars, although this coefficient is not statistically significant in the full model (p = 0.23). Figure two in the appendix shows no specific pattern in the residual plot.
Now we want to simplify our regression model by eliminating some variables from fit.all. First let’s look at the square roots of the variance inflation factors (VIF):
library(car)
vif(fit.all)^(1/2)
## cyl disp hp drat wt qsec vs am
## 3.920948 4.649757 3.135608 1.837014 3.894212 2.743712 2.228424 2.156035
## gear carb
## 2.314617 2.812249
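The values above are square roots of the VIFs. As a rough sketch of what they measure, we can compute the VIF of each predictor by hand as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The helper vif_by_hand is hypothetical (our own name, not a function from the car package) and uses base R only.

```r
data(mtcars)

# Hypothetical helper (not from the car package): VIF of one predictor,
# from the R-squared of regressing it on all the other predictors.
vif_by_hand <- function(var, data, response = "mpg") {
  others <- setdiff(names(data), c(var, response))
  aux_fit <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(aux_fit)$r.squared)
}

predictors <- setdiff(names(mtcars), "mpg")
vifs <- sapply(predictors, vif_by_hand, data = mtcars)

# Square roots, to compare with the vif(fit.all)^(1/2) output above
print(round(sqrt(vifs), 6))
```

A square root of about 4.65 for “disp” means that the standard error of its coefficient is roughly 4.65 times larger than it would be if “disp” were uncorrelated with the other predictors.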
In this output, “cyl”, “disp”, “hp” and “wt” have high VIF values (square roots above 3), so we may have to eliminate some of them from our model. We can use the “step” function to build the optimized model (the summary can be found in the appendix):
fit.best <- step(fit.all, trace=0)
As the summary of fit.best in the appendix shows, the selected model is “mpg ~ wt + qsec + am”. The “step” function chooses it by AIC, and the choice is consistent with “cyl”, “hp” and “disp” having high VIF values and with “drat”, “gear” and “vs” being correlated with “am” or “wt” (appendix, figure three). According to the summary of “fit.best”, the mean mpg for manual cars is 2.9358 higher than for automatic cars when “wt” and “qsec” are held fixed (p = 0.047). Now we can select our model by comparing “fit1”, “fit.all” and “fit.best”. The result of applying the anova function to these regression models can be found in the appendix. Adding “wt” and “qsec” to “fit1” improves the fit significantly (p = 8.0e-08), while adding the remaining seven variables does not (p = 0.86), so “fit.best” is the best model for our analysis.
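As a sketch of how this comparison can be checked, the code below refits the three models, compares them by AIC, and prints a 95% confidence interval for the “am” coefficient of fit.best. The names aic_table and ci_am are ours. Note that “step” works with extractAIC, whose values differ from AIC() by a constant for a fixed data set, so the ranking of the models is the same.

```r
data(mtcars)

# Refit the three models from the text
fit1 <- lm(mpg ~ as.factor(am), data = mtcars)
fit.all <- lm(mpg ~ ., data = mtcars)
fit.best <- step(fit.all, trace = 0)  # backward elimination by AIC

# Formula selected by step
print(formula(fit.best))

# Lower AIC is better
aic_table <- AIC(fit1, fit.best, fit.all)
print(aic_table)

# 95% confidence interval for the manual vs automatic difference,
# holding wt and qsec fixed
ci_am <- confint(fit.best)["am", ]
print(ci_am)
```

The interval for “am” lies above zero but its lower end is close to zero, so the adjusted advantage of manual cars is much less certain than the unadjusted 7.24 difference suggests.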
with(mtcars, {
plot(as.factor(am), mpg, main="Figure one", xlab="am", ylab="mpg")
})
par(mfrow=c(2,2))
plot(fit.all,main="Figure two")
# Figure three: correlation matrix of all variables
cor(mtcars)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
summary(fit.all)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## am 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
summary(fit.best)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
anova(fit1,fit.best,fit.all)
## Analysis of Variance Table
##
## Model 1: mpg ~ as.factor(am)
## Model 2: mpg ~ wt + qsec + am
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 39.2687 8.025e-08 ***
## 3 21 147.49 7 21.79 0.4432 0.8636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
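To make the F values in this table less of a black box, here is a minimal sketch that rebuilds them from the residual sums of squares. When several nested models are passed to anova, each F statistic divides the drop in RSS per added parameter by the residual mean square of the largest model (fit.all here). The names rss, res_df, scale_ms, F_12 and F_23 are ours.

```r
data(mtcars)

# Refit the three nested models from the text
fit1 <- lm(mpg ~ as.factor(am), data = mtcars)
fit.best <- lm(mpg ~ wt + qsec + am, data = mtcars)
fit.all <- lm(mpg ~ ., data = mtcars)

# Residual sums of squares and residual degrees of freedom
rss <- c(deviance(fit1), deviance(fit.best), deviance(fit.all))
res_df <- c(df.residual(fit1), df.residual(fit.best), df.residual(fit.all))

# Residual mean square of the largest model, used as the common denominator
scale_ms <- rss[3] / res_df[3]

# F for fit1 -> fit.best (adds wt and qsec)
F_12 <- ((rss[1] - rss[2]) / (res_df[1] - res_df[2])) / scale_ms
# F for fit.best -> fit.all (adds the remaining seven variables)
F_23 <- ((rss[2] - rss[3]) / (res_df[2] - res_df[3])) / scale_ms

print(c(F_12 = F_12, F_23 = F_23))

# Compare with the table produced by anova
print(anova(fit1, fit.best, fit.all)$F)
```

With the rounded numbers from the table, the first value is (551.61 / 2) / (147.49 / 21), which is about 39.27, and the second is (21.79 / 7) / (147.49 / 21), which is about 0.44.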