In this analysis we explore whether manual or automatic transmission gives cars better mileage. The data used to explore this question is the Motor Trend Car Road Tests data set, which covers 32 cars and records features such as miles per gallon, number of cylinders, weight, quarter-mile time, transmission, and engine type for each car.
Our hypothesis is that there is a significant difference in mileage between automatic and manual transmission. Such a difference would indicate that choosing one transmission over the other may improve mileage.
To understand the data and test our hypothesis, we first look at the distribution in Figure 1. In the graphs, salmon represents automatic transmission and iris blue represents manual transmission. The figure shows a difference in mileage between the two transmissions, but we need to explore this further before drawing any conclusion.
By plotting different pairs of variables we can see whether there is any correlation in the data. Figure 2 shows some strong correlations between miles per gallon and other variables, so we can go ahead and build models to test our hypothesis.
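As a rough numeric companion to the pairs plot, the sketch below ranks the correlations between mileage and the other variables. It assumes the data are R's built-in `mtcars` (the Motor Trend Car Road Tests data) with its original column names such as `mpg` and `wt`; the report itself uses renamed columns and does not show this step.

```r
# Assumption: the report's data correspond to R's built-in mtcars data set.
# Correlation of mpg with every other numeric column.
mpg_cor <- cor(mtcars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Sort by absolute strength, strongest first.
mpg_cor <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_cor, 2))
```

Variables near the top of this ranking are natural candidates for the multivariate models, although strong correlation among the predictors themselves also has to be checked.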
Before fitting any model, note from Figure 3 that the data are not normally distributed: the distribution is highly skewed in the upper quantiles and moderately skewed in the lower quantiles. Given this, together with the small sample of 32 cars, we use the t-statistic rather than the z-statistic.
Let us start with a simple but naive model that uses only transmission to predict mileage (miles per gallon).
simpleModel <- lm(formula = MilesPerGallon ~ Transmission, data = data)
Fitting this model to the data gives an estimated mileage of 17.15 miles per gallon for automatic transmission and 24.39 MPG for manual transmission, an increase of 7.24 MPG, or about 42%. The fitted line below does not show the t-statistic, but it is significant, which means that according to this regression model manual transmission is statistically better than automatic. However, this model explains only 36% of the variance in mileage, so we move on to more complex models. The regression results can be found in the Appendix at the end of this document.
[1] "Mileage <- 17.15 + 7.24 * ManualTransmission"
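The simple model can be reproduced with a short sketch. The `data` frame below is an assumption: it is rebuilt from R's built-in `mtcars` using the column names that appear in the report, with `am` coded as 0 = automatic and 1 = manual.

```r
# Assumption: `data` is mtcars with columns renamed as in the report.
data <- data.frame(
  MilesPerGallon = mtcars$mpg,
  Transmission   = factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")),
  row.names      = rownames(mtcars)
)

# Regress mileage on transmission only.
simpleModel <- lm(MilesPerGallon ~ Transmission, data = data)

# Intercept = mean mileage of automatic cars;
# TransmissionManual = difference between manual and automatic means.
coefs <- coef(simpleModel)
r2    <- summary(simpleModel)$r.squared

print(round(coefs, 2))
print(round(r2, 2))
```

With a single two-level factor as predictor, the intercept is the automatic group mean and the slope is the difference in group means, which is why this model is equivalent to a two-sample comparison.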
We will explore two additional models that use multivariate (multiple-variable) regression to predict mileage, named pModel and adjustedRSquaredModel. pModel is the model with the best p-value, whereas adjustedRSquaredModel was chosen using the adjusted R-squared value together with ANOVA to remove redundant variables and keep the more reliable predictors. pModel uses Transmission, Weight, and MileTime (quarter-mile time) to predict mileage, whereas adjustedRSquaredModel uses Transmission, Horsepower, and Weight. The models are given below:
pModel <- lm(MilesPerGallon ~ Transmission + Weight + MileTime , data)
adjustedRSquaredModel <- lm(MilesPerGallon ~ Transmission + Horsepower + Weight, data)
The regression results for both models can be found in the Appendix at the end of this document. The fitted regression lines for both models are shown below.
PModel Regression Line
Mileage <- 9.62 + 2.94 * ManualTransmission - 3.92 * Weight + 1.23 * MileTime
R Squared Adjusted Regression Line
Mileage <- 34 + 2.08 * ManualTransmission - 0.04 * Horsepower - 2.88 * Weight
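Both fits can be reproduced with the sketch below. As before, the `data` frame is an assumption rebuilt from R's built-in `mtcars` with the report's column names (`wt` as Weight, `hp` as Horsepower, `qsec` as MileTime).

```r
# Assumption: `data` is mtcars with columns renamed as in the report.
data <- data.frame(
  MilesPerGallon = mtcars$mpg,
  Transmission   = factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")),
  Weight         = mtcars$wt,
  Horsepower     = mtcars$hp,
  MileTime       = mtcars$qsec,
  row.names      = rownames(mtcars)
)

pModel <- lm(MilesPerGallon ~ Transmission + Weight + MileTime, data = data)
adjustedRSquaredModel <- lm(MilesPerGallon ~ Transmission + Horsepower + Weight,
                            data = data)

# Coefficients of each fitted regression line.
print(round(coef(pModel), 2))
print(round(coef(adjustedRSquaredModel), 2))

# Adjusted R-squared for a side-by-side comparison.
adj_r2 <- c(
  pModel                = summary(pModel)$adj.r.squared,
  adjustedRSquaredModel = summary(adjustedRSquaredModel)$adj.r.squared
)
print(round(adj_r2, 4))
```

Comparing adjusted rather than plain R-squared penalizes extra predictors, which matters here because both models have the same number of terms but different variables.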
Both models are statistically significant using the t-statistic at the 5% significance level. Both also explain more than 82% of the variance in mileage, so they are suitable models for this data set and the question we are answering.
Because both multivariate models are significant and their regression statistics are very close, the p-value alone cannot help us choose one over the other. To choose between them I used variance inflation factor (VIF) analysis, which tells us how much each variable inflates the variance of the model's coefficient estimates. In both models I found that transmission inflated the variance by 125% to 150%, but since we want to keep transmission in the model, we compare the next variable. The deciding factor is that weight inflates the variance by 150% in pModel versus 275% in adjustedRSquaredModel, so choosing pModel, with its lower VIF score, makes sense.
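The report does not show its VIF code, so the sketch below computes VIFs from first principles in base R: each predictor is regressed on the others and the VIF is 1 / (1 - R²) of that regression. The helper `vif_manual` is a hypothetical name introduced for illustration, and `data` is again an assumed rebuild from `mtcars`. The values it prints are raw VIFs and may be scaled differently from the percentages quoted above.

```r
# Assumption: `data` is mtcars with columns renamed as in the report.
data <- data.frame(
  MilesPerGallon = mtcars$mpg,
  Transmission   = factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")),
  Weight         = mtcars$wt,
  Horsepower     = mtcars$hp,
  MileTime       = mtcars$qsec,
  row.names      = rownames(mtcars)
)

pModel <- lm(MilesPerGallon ~ Transmission + Weight + MileTime, data = data)
adjustedRSquaredModel <- lm(MilesPerGallon ~ Transmission + Horsepower + Weight,
                            data = data)

# Hypothetical helper: VIF of each predictor = 1 / (1 - R^2), where R^2 comes
# from regressing that predictor on all the other predictors in the model.
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vif_p   <- vif_manual(pModel)
vif_adj <- vif_manual(adjustedRSquaredModel)
print(round(vif_p, 2))
print(round(vif_adj, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values mean its coefficient is estimated less precisely because of collinearity.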
We can calculate a confidence interval for the manual transmission coefficient in pModel to see whether the difference is significant. With 28 degrees of freedom, the 95% confidence interval (5% significance level) shows that manual cars get between 0.05 and 5.83 miles per gallon more than automatic cars, holding the other variables constant. The interval excludes zero, so the difference is significant.
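The interval can be worked out by hand as estimate ± t-quantile × standard error. The sketch below does this for the transmission coefficient of pModel, again assuming `data` is rebuilt from R's built-in `mtcars` with the report's column names.

```r
# Assumption: `data` is mtcars with columns renamed as in the report.
data <- data.frame(
  MilesPerGallon = mtcars$mpg,
  Transmission   = factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")),
  Weight         = mtcars$wt,
  MileTime       = mtcars$qsec,
  row.names      = rownames(mtcars)
)

pModel <- lm(MilesPerGallon ~ Transmission + Weight + MileTime, data = data)

# Estimate and standard error of the manual-transmission coefficient.
coef_table <- summary(pModel)$coefficients
est <- coef_table["TransmissionManual", "Estimate"]
se  <- coef_table["TransmissionManual", "Std. Error"]

# Residual degrees of freedom: 32 observations minus 4 coefficients.
df <- df.residual(pModel)

# Two-sided 95% interval: estimate +/- t quantile * standard error.
ci <- est + c(-1, 1) * qt(0.975, df) * se
print(round(ci, 2))
```

The same interval is returned by `confint(pModel)["TransmissionManual", ]`; the manual version just makes the role of the 28 degrees of freedom explicit.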
To analyze pModel more closely, I plotted the hat values and dfbeta to identify any influential points in the data. Influential points are the observations that most strongly affect the slope of the regression line. As Figure 4 shows, about 4 points affect the slope considerably. I omitted these 4 points and repeated the analysis, but the results were similar, so in this case the influential points were not distorting our results.
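A minimal sketch of this influence check follows. The leverage cutoff of twice the average hat value and the choice of the 4 largest dfbeta values are assumptions made for illustration; the report selects its points visually from Figure 4. `data` is again an assumed rebuild from R's built-in `mtcars`.

```r
# Assumption: `data` is mtcars with columns renamed as in the report.
data <- data.frame(
  MilesPerGallon = mtcars$mpg,
  Transmission   = factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")),
  Weight         = mtcars$wt,
  MileTime       = mtcars$qsec,
  row.names      = rownames(mtcars)
)

pModel <- lm(MilesPerGallon ~ Transmission + Weight + MileTime, data = data)

# Leverage of each car and its effect on the transmission coefficient.
h   <- hatvalues(pModel)
dfb <- dfbeta(pModel)[, "TransmissionManual"]

# Assumed rule of thumb: flag leverage above twice the average hat value.
cutoff <- 2 * length(coef(pModel)) / nrow(data)
high_leverage <- names(h)[h > cutoff]

# Assumed selection: the 4 cars that move the transmission slope the most.
most_influential <- names(sort(abs(dfb), decreasing = TRUE))[1:4]
print(high_leverage)
print(most_influential)

# Refit without those cars and compare coefficients side by side.
reduced <- data[!(rownames(data) %in% most_influential), ]
refit   <- lm(MilesPerGallon ~ Transmission + Weight + MileTime, data = reduced)
print(round(rbind(full = coef(pModel), reduced = coef(refit)), 2))
```

If the two rows of coefficients are close, the flagged cars are not driving the conclusion, which is the check described in the paragraph above.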
Finally, I plotted some standard regression diagnostic plots in Figure 5. The Normal Q-Q plot confirms what we discussed earlier: there is moderate to high right skew. There are some leverage points, as already discussed, but removing them did not change the models enough to matter. The residuals plot also shows that the residuals are scattered randomly around 0, with no visible heteroskedasticity.
For pModel, the transmission part of the fitted model is MilesPerGallon = 9.62 + 2.94 * ManualTransmission (the other variables are omitted here for simplicity). Holding all other variables constant, manual cars get 2.94 more miles per gallon than automatic cars. That is about 17% of the 17.15 MPG automatic average from the simple model; the 9.62 intercept is not a meaningful baseline for a percentage, because it corresponds to a car with zero weight and zero quarter-mile time. So yes, based on this data set, manual cars do appear to get better mileage than automatic cars.
Below are the regression results for the single-variable model.
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 17.147368 | 1.124602 | 15.247492 | 0.000000 |
TransmissionManual | 7.244939 | 1.764422 | 4.106127 | 0.000285 |
Below are the results for the two multivariate models: pModel (row pValue) and adjustedRSquaredModel (row Adj R-square).
Model | Intercept | ManualSlope | Manual pValue | DF | Adj. R^2 | F-value | Model pValue |
---|---|---|---|---|---|---|---|
pValue | 9.617781 | 2.935837 | 0.0467155 | 28 | 0.8335561 | 52.74964 | 0 |
Adj R-square | 34.002875 | 2.083710 | 0.1412682 | 28 | 0.8227357 | 48.96003 | 0 |