By analyzing the mtcars dataset, this study explores the relationship between miles per gallon (MPG) and a set of other car features. We are particularly interested in whether an automatic or a manual transmission is better for MPG. The study identifies the preferred transmission type and quantifies the difference.
The study employs several statistical techniques to reach a robust conclusion. In summary, manual-transmission cars achieve higher MPG than automatic ones. In addition, MPG has a statistically significant relationship with car weight and quarter-mile time (acceleration).
The mtcars dataset was extracted from the 1974 Motor Trend US magazine. It comprises fuel consumption and 10 other automobile features for 32 automobiles. It ships with the R datasets package.
library(datasets)
data("mtcars")
The dataset consists of 32 observations, one per automobile. See Appendix: Fig.1 Data description for the variables and their descriptions. The Appendix (Fig.2 A snapshot of data observations) shows the first six rows of the data.
The boxplot and the pair-panel plot are two EDA tools used to explore data properties and find possible patterns or correlations. A boxplot is used to see the variation of MPG across the two transmission types. On average, manual cars yield higher MPG than automatic cars. (See Appendix: Fig.3 Boxplot). The pair-panel plot gives information about the correlation between pairs of variables. For example, the resulting plot shows a negative correlation between mpg and weight. (See Appendix: Fig.4 Pair Panel Plot)
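The boxplot code is not listed in the Appendix, so the following is only a minimal sketch of how such a plot could be drawn. The data frame name `cars` and the "Auto"/"Manual" labels are assumptions made for illustration.

```r
# Assumption: 'cars' is an illustrative copy of mtcars with labelled transmission
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Auto", "Manual"))

# Boxplot of MPG split by transmission type
boxplot(mpg ~ am, data = cars,
        xlab = "Transmission", ylab = "Miles per gallon",
        main = "MPG by transmission type")

# Group means that accompany the plot
group_means <- tapply(cars$mpg, cars$am, mean)
print(round(group_means, 3))
```

The group means printed at the end are the same quantities compared in the hypothesis test.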
Statistical evidence is still required to verify the difference between the mean MPG of manual and automatic cars. The null hypothesis assumes no difference in means, whereas the alternative hypothesis assumes a difference.
| mean.Auto | mean.Manual | t.statistic | p.value | LCL | UCL |
|---|---|---|---|---|---|
| 17.147 | 24.392 | -3.767 | 0.001 | -11.28 | -3.21 |
Using t.test, the p-value is 0.001 (below the 0.05 \(\alpha\) error rate), which indicates a statistically significant difference in means: the manual mean MPG (24.39) is higher than the automatic mean (17.15).
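The call that produced the table is not shown. A sketch that appears consistent with the reported values, assuming the default Welch (unequal variance) form of `t.test` applied directly to `mtcars`, could look like this:

```r
# Welch two-sample t-test (default of t.test: equal variances are not assumed)
tt <- t.test(mpg ~ am, data = mtcars)   # am: 0 = automatic, 1 = manual

# Collect the quantities reported in the table (names mirror its headers)
result <- c(mean.Auto   = unname(tt$estimate[1]),
            mean.Manual = unname(tt$estimate[2]),
            t.statistic = unname(tt$statistic),
            p.value     = tt$p.value,
            LCL         = tt$conf.int[1],
            UCL         = tt$conf.int[2])
print(round(result, 3))
```

The confidence limits refer to the automatic minus manual difference, which is why both are negative.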
MPG may be affected by other regressors (variables). Hence, the relationship between MPG and transmission type should be tested while adjusting for other variables in the model. We fit multiple models and select the best one. Our model selection strategy proceeds in the following steps:
f0<-lm(mpg~factor(am)+factor(cyl)+disp+hp+drat+wt+qsec+factor(vs)+gear+carb,data=pmtcars)
Stepwise regression reduces the set of input variables to the most informative ones using an automated search. We use the backward approach, which starts with all variables, evaluates the effect of deleting each variable, and then deletes the variable whose removal improves the model the most. This process is repeated until no further improvement is possible. The stepwise method is used only for initial screening of variables.
After running the stepwise method, three variables (transmission, wt, and qsec) appear to have a significant effect on MPG (p-values of 4.67e-02, 6.95e-06, and 2.16e-04, respectively, all below the 0.05 error rate). (See Appendix: Fig.5 Stepwise Regression results)
So far we have a preliminary model of MPG versus transmission type, wt, and qsec. This model needs to be tested against other models by adjusting for other variables. Ten models are created, each adding one more variable to the previous one. Based on the nested model comparison, adding wt and qsec significantly improves the fit, whereas none of the later additions does (all of their p-values exceed 0.05). This model (fit3, the third fit from the top in the Appendix) gets a p-value of 0.00063431. (See Appendix: Fig.6 (Nested Model Testing) for the entire nested fits results)
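To make a single backward step concrete, the sketch below scores every possible deletion with `drop1`, which reports the AIC of the model after removing each term. It is an illustration only: it fits the full model on `mtcars` directly rather than on the relabelled data used in the report, and the object names are assumptions.

```r
# Assumption: fitted on mtcars directly; 'full' and 'candidates' are illustrative names
full <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp + drat + wt + qsec +
             factor(vs) + gear + carb, data = mtcars)

# AIC of the model after deleting each term in turn;
# the "<none>" row is the current model with nothing deleted
candidates <- drop1(full)
print(candidates[order(candidates$AIC), c("Df", "AIC")])

# A backward step deletes the term with the lowest AIC,
# provided that value is below the AIC of the "<none>" row
best <- rownames(candidates)[which.min(candidates$AIC)]
print(best)
```

Repeating this step on the reduced model until "<none>" has the lowest AIC is what a backward search automates.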
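As a worked check of the fit3 row, the F statistic in a sequential comparison of several models is the mean square gained by the added term divided by the residual mean square of the largest model. The sketch below reproduces it from the numbers in Appendix Fig.6; the variable names are illustrative.

```r
# Values copied from Appendix Fig.6 (Nested Model Testing)
sum_sq_fit3 <- 109.033768   # drop in RSS when qsec is added (fit2 -> fit3)
df_fit3     <- 1            # degrees of freedom used by that addition
rss_largest <- 133.3235     # RSS of the largest model (fit10)
df_largest  <- 20           # residual degrees of freedom of fit10

# F = (Sum of Sq / Df) / (RSS of largest model / its residual Df)
f_stat <- (sum_sq_fit3 / df_fit3) / (rss_largest / df_largest)
p_val  <- pf(f_stat, df_fit3, df_largest, lower.tail = FALSE)

print(round(f_stat, 4))
print(signif(p_val, 5))
```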
4.1 Low Variance Inflation Factors (VIF)
One way to measure multicollinearity is through the Variance Inflation Factor (VIF). The lower the VIF, the less a regressor overlaps with the others. The VIFs of the three regressors are all below 5, which suggests no serious multicollinearity.
library(car);VIF.value<-round(vif(fit3),3);VIF.value
## factor(am) wt qsec
## 2.541 2.483 1.364
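For a regressor with a single coefficient, the VIF equals 1 / (1 - R²), where R² comes from regressing that regressor on the remaining ones. A minimal sketch for wt, assuming the model is fitted on `mtcars` directly and using illustrative object names:

```r
# Auxiliary regression: explain wt with the other two regressors of the model
aux <- lm(wt ~ factor(am) + qsec, data = mtcars)
r2  <- summary(aux)$r.squared

# VIF of wt: how much the variance of its coefficient is inflated
vif_wt <- 1 / (1 - r2)
print(round(vif_wt, 3))
```

This needs no extra package and should agree with the wt entry reported by `vif(fit3)`.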
4.2 Low Standard Error, significant p-value, and high R-squared of the model
The last step to confirm model validity is to compare the variation around the fitted line (residual standard error), the overall model significance (model p-value), and the share of explained variation (adjusted R-squared) across the candidate models. The results show that “fit3” has the lowest model p-value (1.2104e-11), with a residual standard error of 2.459 and an adjusted R-squared of 0.834. Only fit6 scores marginally better on the last two measures (2.435 and 0.837), at the cost of three additional terms, so fit3 is retained as the best fit. See Appendix: Fig.7 (Fits parameters) for the entire table of fits parameters.
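The code behind the fits table is not included in the Appendix. One way such a row could be computed is sketched below; `fit_stats` is a hypothetical helper, and the model is fitted on `mtcars` directly.

```r
# Hypothetical helper: overall p-value, residual standard error, adjusted R-squared
fit_stats <- function(fit) {
  s <- summary(fit)
  f <- s$fstatistic                      # F value, numerator df, denominator df
  c(pv     = unname(pf(f[1], f[2], f[3], lower.tail = FALSE)),
    sdErr  = s$sigma,
    adjRsq = s$adj.r.squared)
}

fit3   <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)
stats3 <- fit_stats(fit3)
print(signif(stats3, 5))
```

Applying the helper to each candidate fit and binding the results row-wise would give a table of the same shape as Fig.7.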
kable(summary(fit3)$coefficients,align = 'c')
The coefficients table (Appendix: Fig.8 Final model coefficients) of the selected model (fit3) shows that the three regressors (transmission type, wt, and qsec) all have a significant effect on the output (mpg), with p-values below 0.05. The intercept (9.618) is the expected MPG of an automatic car at wt = 0 and qsec = 0, so it serves as a baseline rather than a realistic value (its own p-value is 0.178). Holding wt and qsec constant, manual cars average 2.936 mpg more than automatic cars. MPG decreases by 3.917 for each additional 1000 lb of weight (wt) and increases by 1.226 for each additional second of quarter-mile time (qsec).
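To illustrate how the coefficients combine, the sketch below evaluates the fitted equation for one car. The coefficients are copied from Fig.8; the car itself (3000 lb, 18 s quarter mile) is hypothetical and chosen only for the arithmetic.

```r
# Coefficients copied from Appendix Fig.8 (Final model coefficients)
b <- c(intercept = 9.617781, manual = 2.935837, wt = -3.916504, qsec = 1.225886)

# Hypothetical car: wt is in units of 1000 lb, qsec in seconds
wt_car   <- 3.0
qsec_car <- 18.0

# Fitted MPG for an automatic car, then the manual shift added on top
mpg_auto   <- b[["intercept"]] + b[["wt"]] * wt_car + b[["qsec"]] * qsec_car
mpg_manual <- mpg_auto + b[["manual"]]

print(round(c(Auto = mpg_auto, Manual = mpg_manual), 2))
```

The two values differ by exactly the manual coefficient, whatever weight and quarter-mile time are chosen.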
ci<-confint(fit3,level=0.95);kable(ci,align = 'c')
Based on the confidence interval, we are 95% confident that, holding wt and qsec constant, the mean MPG of manual cars exceeds that of automatic cars by between 0.046 and 5.826 mpg. See Appendix: Fig.9 (Model statistics and CI) for confidence intervals of each of the significant variables.
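The interval for the transmission coefficient can be reproduced by hand as estimate ± t critical value × standard error. The sketch below uses the estimate and standard error from Fig.8 and the 28 residual degrees of freedom of fit3 from Fig.6; the variable names are illustrative.

```r
# Values copied from Appendix Fig.8 (row factor(am)Manual) and Fig.6 (fit3 row)
est      <- 2.935837    # estimated manual minus automatic difference
se       <- 1.4109045   # its standard error
df_resid <- 28          # residual degrees of freedom of fit3

# Two-sided 95% interval: estimate +/- t quantile * standard error
t_crit    <- qt(0.975, df_resid)
ci_manual <- est + c(-1, 1) * t_crit * se
print(round(ci_manual, 3))
```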
As explained earlier the VIF of each of the three regressors is below 5. (Transmission = 2.541, wt = 2.483, and qsec = 1.364).
Both Residual vs Fitted and Residual vs Leverage plots (See Appendix: Fig.10 Residuals, leverage, and normality plots) show no specific patterns, and residuals are symmetrical around zero and, hence, randomly distributed.
The points of the model Q-Q plot lie close to the dashed line, which suggests approximately normal residuals. The Cook’s distance plot shows how strongly each individual observation influences the estimated regression coefficients of the model.
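The values behind the Cook's distance plot can also be listed numerically. In the sketch below, the 4/n cutoff is a common rule of thumb and an assumption here, not a threshold used in the report; the model is refitted on `mtcars` directly.

```r
# Refit the selected model (assumption: on mtcars rather than the relabelled copy)
fit3 <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(fit3)

# Assumed rule-of-thumb cutoff: flag observations above 4 / n
threshold   <- 4 / nrow(mtcars)
influential <- sort(cd[cd > threshold], decreasing = TRUE)
print(round(influential, 3))
```

Flagged cars are candidates for a closer look, not automatic exclusions.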
Our hypothesis test showed that manual transmission is better for MPG than automatic: the mean MPG is 24.39 for manual and 17.15 for automatic cars.
When the model is adjusted for other variables, weight (wt) and acceleration (qsec) prove significant in the MPG versus transmission relationship. The final model shows that, holding wt and qsec constant, manual cars average 2.936 mpg more than automatic cars (relative to a baseline intercept of 9.618), with a confidence interval for the MPG difference of (0.046, 5.826) at the 95% confidence level (0.05 \(\alpha\) error rate). Hence, manual-transmission cars remain better than automatic ones for MPG.
| Var | Description |
|---|---|
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (lb/1000) |
| qsec | 1/4 mile time in seconds (acceleration) |
| vs | V/S (V-engine/Straight engine) (0/1) |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
For better readability, the 0/1 levels of the factor variables are converted into text labels.
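The conversion step itself is not shown. A minimal sketch of how `pmtcars` could be built is given below; the label mapping is inferred from Fig.1 and the snapshot in Fig.2 and should be read as an assumption.

```r
# Assumption: pmtcars is a copy of mtcars with am and vs relabelled as factors
pmtcars <- mtcars
pmtcars$am <- factor(pmtcars$am, levels = c(0, 1), labels = c("Auto", "Manual"))
pmtcars$vs <- factor(pmtcars$vs, levels = c(0, 1), labels = c("V", "S"))

# First six rows, matching the layout of the snapshot
print(head(pmtcars))
```

With "Auto" as the first level, the transmission coefficient in later models is reported as `factor(am)Manual`.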
kable(head(pmtcars),align = 'c')
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | V | Manual | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | V | Manual | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | S | Manual | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | S | Auto | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | V | Auto | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | S | Auto | 3 | 1 |
pairs(mtcars,main = "Pair Panel - Mtcars variables", panel=panel.smooth,upper.panel = NULL)
library(MASS);step <- stepAIC(f0, direction="backward", trace=FALSE)
kable(summary(step)$coeff,align = 'c')
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 9.617781 | 6.9595930 | 1.381946 | 0.1779152 |
| factor(am)Manual | 2.935837 | 1.4109045 | 2.080819 | 0.0467155 |
| wt | -3.916504 | 0.7112016 | -5.506882 | 0.0000070 |
| qsec | 1.225886 | 0.2886696 | 4.246676 | 0.0002162 |
fit1 <- lm(mpg ~ factor(am), data = pmtcars)
fit2 <- lm(mpg ~ factor(am)+wt, data = pmtcars)
fit3 <- lm(mpg ~ factor(am)+wt+qsec, data = pmtcars)
fit4 <- lm(mpg ~ factor(am)+wt+qsec+factor(cyl), data = pmtcars)
fit5 <- lm(mpg ~ factor(am)+wt+qsec+factor(cyl)+disp, data = pmtcars)
fit6 <- lm(mpg ~ factor(am)+wt+qsec+factor(cyl)+disp+hp, data = pmtcars)
fit7 <- lm(mpg ~ factor(am)+wt+qsec+factor(cyl)+disp+hp+drat, data = pmtcars)
fit8 <- lm(mpg ~ factor(am)+wt+qsec+factor(cyl)+disp+hp+drat+factor(vs), data = pmtcars)
fit9 <- lm(mpg ~ factor(am)+wt+qsec+factor(cyl)+disp+hp+drat+factor(vs)+gear, data = pmtcars)
fit10 <- lm(mpg ~ factor(am)+wt+qsec+factor(cyl)+disp+hp+drat+factor(vs)+gear+carb, data = pmtcars)
nested<-anova(fit1,fit2,fit3,fit4,fit5,fit6,fit7,fit8,fit9,fit10)
kable(nested,align = 'c')
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 30 | 720.8966 | NA | NA | NA | NA |
| 29 | 278.3197 | 1 | 442.576902 | 66.3914559 | 0.0000001 |
| 28 | 169.2859 | 1 | 109.033768 | 16.3562774 | 0.0006343 |
| 26 | 159.4244 | 2 | 9.861565 | 0.7396722 | 0.4898800 |
| 25 | 157.7339 | 1 | 1.690499 | 0.2535936 | 0.6200576 |
| 24 | 142.3306 | 1 | 15.403276 | 2.3106626 | 0.1441415 |
| 23 | 141.2059 | 1 | 1.124688 | 0.1687157 | 0.6856232 |
| 22 | 139.0230 | 1 | 2.182858 | 0.3274530 | 0.5735394 |
| 21 | 135.2706 | 1 | 3.752430 | 0.5629063 | 0.4618276 |
| 20 | 133.3235 | 1 | 1.947162 | 0.2920960 | 0.5948487 |
| Fit | Model p-value | Residual std. error | Adjusted R-squared |
|---|---|---|---|
| fit3 | 1.2104e-11 | 2.459 | 0.834 |
| fit4 | 3.0067e-10 | 2.476 | 0.831 |
| fit2 | 1.5788e-09 | 3.098 | 0.736 |
| fit5 | 1.5837e-09 | 2.512 | 0.826 |
| fit6 | 2.5657e-09 | 2.435 | 0.837 |
| fit7 | 1.2059e-08 | 2.478 | 0.831 |
| fit8 | 4.8142e-08 | 2.514 | 0.826 |
| fit9 | 1.5991e-07 | 2.538 | 0.823 |
| fit10 | 5.7224e-07 | 2.582 | 0.816 |
| fit1 | 0.00028502 | 4.902 | 0.338 |
kable(summary(fit3)$coefficients,align = 'c')
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 9.617781 | 6.9595930 | 1.381946 | 0.1779152 |
| factor(am)Manual | 2.935837 | 1.4109045 | 2.080819 | 0.0467155 |
| wt | -3.916504 | 0.7112016 | -5.506882 | 0.0000070 |
| qsec | 1.225886 | 0.2886696 | 4.246676 | 0.0002162 |
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -4.6382995 | 23.873860 |
| factor(am)Manual | 0.0457303 | 5.825944 |
| wt | -5.3733342 | -2.459673 |
| qsec | 0.6345732 | 1.817199 |
par(mfrow = c(2, 2),cex=.5);plot(fit3,which=c(1,2,4,5))