This project uses the mtcars data set to explore the relationship between miles per gallon (MPG; outcome) and transmission (am; predictor). The following questions will be explored:
Exploratory analysis showed that the average MPG for automatic cars is about 17.15 mpg and for manual cars about 24.39 mpg. However, after including disp(displacment), cyl(no. of cylinders), and wt (weight), transmission became much less significant, accounting for only a 0.14 difference in mean mpg between manual and atomatic, holding all other variables constant.
Therefore, the difference in transmission, accounting for other variables, is not very significant.
For documentation on the mtcars dataset and its variables click here.
data(mtcars) #load the data set
str(mtcars) #MPG = mpg, Transmission = am
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
mtcars[!complete.cases(mtcars),] #Check for NAs in dataset
## [1] mpg cyl disp hp drat wt qsec vs am gear carb
## <0 rows> (or 0-length row.names)
There are no missing values and all variables are numerical. We will transform appropriate variables into factors to help with analysis. The resulting dataset mtcars2 will be used from now on.
mtcars2 <- within(mtcars, {
vs <- factor(vs, labels = c("V", "S"))
am <- factor(am, labels = c("automatic", "manual"))
cyl <- ordered(cyl)
gear <- ordered(gear)
carb <- ordered(carb)
})
#used from mtcars documentation
boxplot(mpg ~ am, data = mtcars2, col=c("pink", "green"), xlab = "Transmission Type", ylab = "Miles per Gallon")
aggregate(mpg ~ am, mean, data = mtcars2)
## am mpg
## 1 automatic 17.14737
## 2 manual 24.39231
From the plot, there seems to be a clear distinction between transmission type and mpg. In fact, we see a mean difference of about 7.25 mpg between automatic and manual transmissions.
pairs(mtcars2, panel = panel.smooth)
#pairs plot between mpg and other variables for possible correlation
Looking at the first row of the panel of plots, there may be a possible correlation between mpg several other variables. Therefore, a regression model including more than am (transmission) may be a better fit.
First a simple linear model will be analyzed.
fit_simple <- lm(mpg ~ am, data = mtcars2)
summary(fit_simple)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The intercept 17.1473684 is in fact the mean mpg for automatic transmission, and the slope 7.2449393 is the change in mean between automatic and manual transmissions. Therefore there is a 7.2449393 increase in mean mpg comparing cars with a manual transmission than to those with automatic transmission. The pvalue for the slope also indicates that the difference in mean mpg for automatic vs manual transmissions is significant.
R-squared value for this model is only about 0.36, which means that only about 36% of the total variability is explained by the model. Therefore, there may be other regressors.
We shall begin by regressing mpg against all variables. Then we will conduct an analysis of variance with the anova() function and see which added regressors are significant, holding all others constant.
Note: ANOVA assumes that the fit’s residuals are normal, so we will conduct a shapiro.test, where the null hypothesis assumes residual normality.
fit_all <- lm(mpg ~ ., data = mtcars2)
summary(fit_all)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.57171130 19.56615969 1.35804428 0.19452722
## cyl.L -0.23770312 5.06255894 -0.04695316 0.96317000
## cyl.Q 2.02541267 2.14952059 0.94226251 0.36098761
## disp 0.03554632 0.03189920 1.11433290 0.28267339
## hp -0.07050683 0.03942556 -1.78835344 0.09393155
## drat 1.18283018 2.48348458 0.47627845 0.64073922
## wt -4.52977584 2.53874584 -1.78425732 0.09461859
## qsec 0.36784482 0.93539569 0.39325050 0.69966720
## vsS 1.93085054 2.87125777 0.67247551 0.51150791
## ammanual 1.21211570 3.21354514 0.37718957 0.71131573
## gear.L 1.78784595 2.64200408 0.67670068 0.50889747
## gear.Q 0.12234634 2.40896047 0.05078802 0.96016458
## carb.L 6.06155540 6.72821581 0.90091572 0.38187054
## carb.Q 1.78825141 2.80043297 0.63856247 0.53273703
## carb.C 0.42384333 2.57388972 0.16467035 0.87140202
## carb^4 0.93317347 2.45041298 0.38082294 0.70867391
## carb^5 -2.46409938 2.90450253 -0.84837226 0.40956528
#only hp and wt have a p value under 0.1
shapiro.test(fit_all$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit_all$residuals
## W = 0.96146, p-value = 0.301
#p value is large so do not reject the null that residuals are approximately normal
anova(fit_all)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 2 824.78 412.39 51.3766 1.943e-07 ***
## disp 1 57.64 57.64 7.1813 0.01714 *
## hp 1 18.50 18.50 2.3050 0.14975
## drat 1 11.91 11.91 1.4843 0.24191
## wt 1 55.79 55.79 6.9500 0.01870 *
## qsec 1 1.52 1.52 0.1899 0.66918
## vs 1 0.30 0.30 0.0376 0.84878
## am 1 16.57 16.57 2.0639 0.17135
## gear 2 5.02 2.51 0.3128 0.73606
## carb 5 13.60 2.72 0.3388 0.88144
## Residuals 15 120.40 8.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table shows that disp & wt have a p-value under 0.5, and cyl has a p-value under 0.001. Therefore we will create a final model inluding these three variables. Note that am did not have a significant impact at all, yet we will include it in the model to see it’s effect on the other regressors.
fit_final <- lm(mpg ~ disp + wt + cyl + am, data = mtcars2)
summary(fit_final)
##
## Call:
## lm(formula = mpg ~ disp + wt + cyl + am, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5029 -1.2829 -0.4825 1.4954 5.7889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.275005 3.290562 9.201 1.17e-09 ***
## disp 0.001632 0.013757 0.119 0.9065
## wt -3.249176 1.249098 -2.601 0.0151 *
## cyl.L -4.467788 1.872177 -2.386 0.0246 *
## cyl.Q 0.935362 1.052358 0.889 0.3822
## ammanual 0.141212 1.326751 0.106 0.9161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.652 on 26 degrees of freedom
## Multiple R-squared: 0.8376, Adjusted R-squared: 0.8064
## F-statistic: 26.82 on 5 and 26 DF, p-value: 1.73e-09
#check residual plots
par(mfrow = c(2, 2))
plot(fit_final)
Including the other regressors, am became much less significant; accounting for only a 0.14 difference in mean mpg between manual and atomatic, holding all other variables constant.
The residual plots do not point out any abnormality in the residuals. They seem to be approximately normal (Q-Q Plot). The Toyota Corolla may be a high leverage point and influential on the model.
library(car)
vif(fit_final)
## GVIF Df GVIF^(1/(2*Df))
## disp 12.813023 1 3.579528
## wt 6.583720 1 2.565876
## cyl 6.971234 2 1.624903
## am 1.931783 1 1.389886
disp has a VIF of 12.8130229 which is considered large. This means the standard error is almost 13 times larger than if disp were uncorrelated to the other regressors. Further analysis must be conducted to check for possible problematic collinearity.