This model has been constructed to answer the two key questions using regression models and exploratory data analyses: Is an automatic or manual transmission better for MPG? Quantify the MPG difference between automatic and manual transmissions?
For the purpose of this analysis the data set mtcars has been used with the variable MPG as a desired outcome and the key am variable as a regressor,with additional regressors added to the multivariable regression analysis.
The findings suggest that manual transmission tends to result in more mileage (as a result of the single variable t-test), further analysis based on quantifying the difference between transmission with additional regressors, leads to the conclusion that other factors affecting mileage along with the type of the trasmission are weight(wt), gross horsepowers(hp), and number of cylinders (cyl).
head(mtcars, 8)
As it can be observed from the quick overview of the data, the following variables can be used as factors (categorical data): cyl(4,6,8), gear(3,4,5), carb(1,2,3,4,6,8), vs (0,1) and am (0,1).
#convert continuous variables into categoricals in data set mtcarsf
mtcarsf<-mtcars
cols <- c("cyl", "vs", "am", "gear",'carb')
mtcarsf[cols] <- lapply(mtcarsf[cols], factor)
The simple T-test is to be used to asses the influence of the transmission on mileage. The variable that represents transmission is am (0 = automatic, 1 = manual)
Without taking into account other variables the T-test fails to support the null hypothesis that there is no significant difference between Automatic or Manual transmission effect on mpg.
t.test(mtcars$mpg~mtcars$am)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
The outcome suggests that manual transmission results in higher mileage. (p =0.0014, automatic mean = 17.1 miles, manual mean = 24.4 miles). The boxplot can be found in Appendix section.
Brief correlation analysis shows that mpg has higher level of correlation with wt, cyl, disp, and hp. However wt and disp is highly correlated between each other therefore one should be omitted to avoid
round(cor(mtcars, use='everything', method='pearson'),2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
## cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
## disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
## hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
## drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
## wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
## qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66
## vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57
## am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06
## gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27
## carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
library(car)
## Loading required package: carData
fit1<-lm(mpg~am+wt+hp+cyl+disp, mtcars)
fit2<-lm(mpg~am+wt+hp+cyl, mtcars)
fit3<-lm(mpg~am+wt+hp, mtcars)
fit4<-lm(mpg~am+wt, mtcars)
fit5<-lm(mpg~am, mtcars)
round(rbind(vif(fit1), c(vif(fit2),0), c(vif(fit3),0,0), c(vif(fit4),0,0,0)),3)
## am wt hp cyl disp
## [1,] 2.553 6.079 4.502 7.209 10.401
## [2,] 2.546 3.988 4.310 5.334 0.000
## [3,] 2.271 3.775 2.088 0.000 0.000
## [4,] 1.921 1.921 0.000 0.000 0.000
VIF analysis shows that disposition has a very high correlation rate, as soon as the variable is omitted from the analysis, the relative importance of correlated variables is excluded as well and the VIF rates are moderate.
Therefore it is worth to consider two models: - first the one variable model that includes only am - multivariable model that includes: am, wt, hp and cyl
summary(fit5)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The cars with the automatic transmission has on average 17.15mpg, and the cars with the manual transmission result in 17.15+7.24= 24.49mpg, the p-value <0.05, the adj.r.squared is 33.9% thus covers almost 34% of variance.
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am + wt + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4765 -1.8471 -0.5544 1.2758 5.6608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14654 3.10478 11.642 4.94e-12 ***
## am 1.47805 1.44115 1.026 0.3142
## wt -2.60648 0.91984 -2.834 0.0086 **
## hp -0.02495 0.01365 -1.828 0.0786 .
## cyl -0.74516 0.58279 -1.279 0.2119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8267
## F-statistic: 37.96 on 4 and 27 DF, p-value: 1.025e-10
Multivariable model explains almost 93% of variance, which is significant improvement from the 1st model.
anova(fit5, fit2)
The result of the anove test confirms significance of the multivariance model and reject the uni-variable model where only am is considered and rest of the variables are omitted.
plot(mtcarsf$mpg~mtcarsf$am, main='MPG vs Transmission (0= manual, 1=automatic)', xlab='Transmission', ylab='mpg')
par(mfrow = c(2,2))
plot(fit2)