In this analysis I explore whether manual or automatic transmission is better for MPG (miles per gallon) and quantify the difference in MPG between the two types of transmission.
The data used to explore these questions is the Motor Trend Car Road Tests (mtcars) data set, which ships with R and covers 32 automobiles. You can read more about the data set in its R help page (?mtcars), which also points to the source paper describing how the data were collected. There are 11 features recorded for each car, such as miles per gallon, number of cylinders, weight, 1/4 mile time, transmission, and engine type.
I did some basic pre-processing to make the feature names more descriptive (for example, Transmission instead of am) and converted categorical features such as number of cylinders, engine type, transmission, gears and carburetors from numeric values to factors in R so that they are easier to work with. Because the study is observational, I can only establish correlation, not causation, between mpg and transmission. The data also cover 1973-74 automobiles, so the results will be hard to generalize to cars today.
I will first do some data exploration to understand the data better. Once we have seen some graphs and statistics, we can move on to modeling mpg as a function of transmission using multivariate regression.
Figure 1 in the appendix shows that mpg differs between the two transmission types, but it is too soon to draw any conclusion. Figure 2 in the appendix shows box plots and scatter plots along with correlations. Several variables are highly correlated with mpg, so they may be important for predicting it.
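The pre-processing described above could look roughly like the sketch below. Only the names MilesPerGallon, Transmission, Weight, MileTime, Horsepower and Cylinders appear in this write-up; the other column names are assumptions chosen for illustration.

```r
# Start from the built-in data set; column order is
# mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
data <- mtcars
names(data) <- c("MilesPerGallon", "Cylinders", "Displacement", "Horsepower",
                 "RearAxleRatio", "Weight", "MileTime", "EngineType",
                 "Transmission", "Gears", "Carburetors")  # some names assumed

# Convert numeric codes to labelled factors (am: 0 = automatic, 1 = manual;
# vs: 0 = V-shaped, 1 = straight).
data$Transmission <- factor(data$Transmission, levels = c(0, 1),
                            labels = c("Automatic", "Manual"))
data$EngineType   <- factor(data$EngineType, levels = c(0, 1),
                            labels = c("V-shaped", "Straight"))
for (col in c("Cylinders", "Gears", "Carburetors")) {
  data[[col]] <- factor(data[[col]])
}

str(data)
```

With Automatic as the first factor level, it becomes the reference level in every regression that follows.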
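As a rough numeric companion to the correlation plots, the sketch below ranks the variables by the strength of their correlation with mpg. It uses the raw mtcars column names and treats every column as numeric, which is a simplification compared with the factor-coded data used elsewhere in this analysis.

```r
# Correlation of every other variable with mpg, on the raw numeric columns.
cors <- cor(mtcars)["mpg", ]
cors <- cors[names(cors) != "mpg"]

# Strongest relationships first, regardless of sign.
round(cors[order(abs(cors), decreasing = TRUE)], 2)
```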
In this section I present three models: one simple, naive model, and two built with different selection measures, with the aim of finding a parsimonious model that predicts well and is still simple enough to interpret. Before I build any model, it is important to note that inference from a model built on such a small data set should be treated with caution, because the data do not look normally distributed. The quantile-quantile plot in Figure 3 in the appendix shows a large skew at the ends, away from the normality that regression inference assumes. Using the t-statistic, which is suited to small samples, helps with this condition.
The simplest model, which is naive and too simplistic, uses only transmission to predict mpg:
simpleModel <- lm(formula = MilesPerGallon ~ Transmission, data = data)
The model says that manual transmission cars get 7.24 more miles per gallon than automatic cars on average. This model explains only 36% of the variance in mpg, so we will add more covariates to answer the question better.
This calls for multivariate regression. I used forward selection with two criteria, p-value and adjusted R-squared. I generated one model with the p-value criterion to get the model with the most significant predictors, and one with the adjusted R-squared criterion to get a model with more reliable predictors. I used adjusted R-squared rather than plain R-squared because plain R-squared always increases when a predictor is added, whereas adjusted R-squared balances the number of predictors against the explained variation. The resulting models are:
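A minimal sketch of fitting and reading the simple model is shown below. It rebuilds only the two columns it needs from mtcars, using the descriptive names from this write-up.

```r
# Minimal data frame with the two variables used by the simple model.
data <- with(mtcars, data.frame(
  MilesPerGallon = mpg,
  Transmission   = factor(am, levels = c(0, 1),
                          labels = c("Automatic", "Manual"))
))

simpleModel <- lm(MilesPerGallon ~ Transmission, data = data)

# Slope for manual vs. automatic and the share of variance explained.
manualSlope <- coef(simpleModel)[["TransmissionManual"]]
rSquared    <- summary(simpleModel)$r.squared
round(c(manualSlope = manualSlope, rSquared = rSquared), 2)

# With a single two-level factor, the slope is just the difference in group means.
groupMeans <- tapply(data$MilesPerGallon, data$Transmission, mean)
groupMeans[["Manual"]] - groupMeans[["Automatic"]]
```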
pModel <- lm(MilesPerGallon ~ Transmission + Weight + MileTime, data = data)
rSquaredFullModel <- lm(MilesPerGallon ~ Transmission + Horsepower + Weight + MileTime + Cylinders, data)
rSquaredSimple <- lm(MilesPerGallon ~ Transmission + Horsepower + Weight, data)
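For illustration, forward selection on adjusted R-squared can be sketched as a greedy loop like the one below. This is an assumed implementation, not necessarily the exact procedure used here: it works on the raw mtcars column names, always keeps transmission (am) in the model, and stops when no remaining variable improves adjusted R-squared. The helper adj_r2 is hypothetical, and the selected set is not guaranteed to match the models listed above.

```r
# Raw mtcars with the categorical columns converted to factors.
data <- mtcars
for (col in c("cyl", "vs", "am", "gear", "carb")) {
  data[[col]] <- factor(data[[col]])
}

# Hypothetical helper: adjusted R-squared of mpg regressed on the given terms.
adj_r2 <- function(terms) {
  summary(lm(reformulate(terms, response = "mpg"), data = data))$adj.r.squared
}

selected   <- "am"                                  # transmission is always kept
candidates <- setdiff(names(data), c("mpg", selected))
best       <- adj_r2(selected)

repeat {
  if (length(candidates) == 0) break
  # Score every model that adds one more candidate to the current set.
  scores <- sapply(candidates, function(v) adj_r2(c(selected, v)))
  if (max(scores) <= best) break                    # no improvement: stop
  winner     <- names(which.max(scores))
  best       <- max(scores)
  selected   <- c(selected, winner)
  candidates <- setdiff(candidates, winner)
}

selected
best
```

The p-value variant follows the same loop but adds the candidate with the smallest p-value at each step and stops when none falls below the chosen significance level.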
I compared the models generated above against the simple model, and the improvement was significant in each case. Comparing pModel and rSquaredSimple using anova showed that they are not significantly different, so either model could be used. The full model from adjusted R-squared was not parsimonious because it had too many regressors, so I used anova on it to remove variables that were not significant predictors and arrived at rSquaredSimple, which is easier to analyze with fewer regressors. Below is some important output from the two models; I have suppressed the values for the other covariates. The significance level for these models is 5%.
Model | Intercept | Manual slope | Manual p-value | DF | Adj. R^2 | F-value | Model p-value |
---|---|---|---|---|---|---|---|
pValue (pModel) | 9.6178 | 2.9358 | 0.0467 | 28 | 0.8336 | 52.7496 | < 0.0001 |
Adj R-square (rSquaredSimple) | 34.0029 | 2.0837 | 0.1413 | 28 | 0.8227 | 48.9600 | < 0.0001 |
The reference level (intercept) is automatic transmission. As the output above shows, both models have a significant model p-value, which means both are significant as a whole. Adjusted R-squared is above 82% for both and the two values are very close, so both models are much better than the single-variable regression at 36%. However, in the p-value model the manual transmission p-value is below the 5% significance level, whereas the adjusted R-squared model says that manual transmission is not significant. So we need more diagnostics to select one model.
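The figures in the table above can be pulled out of the fitted models roughly as follows. The helper model_row is a hypothetical name introduced for this sketch, and the data frame is rebuilt from mtcars with the descriptive column names used in this write-up.

```r
data <- with(mtcars, data.frame(
  MilesPerGallon = mpg,
  Transmission   = factor(am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")),
  Weight = wt, MileTime = qsec, Horsepower = hp
))

simpleModel    <- lm(MilesPerGallon ~ Transmission, data = data)
pModel         <- lm(MilesPerGallon ~ Transmission + Weight + MileTime, data = data)
rSquaredSimple <- lm(MilesPerGallon ~ Transmission + Horsepower + Weight, data = data)

# Nested comparison: does adding Weight and MileTime improve on the simple model?
anova(simpleModel, pModel)

# Hypothetical helper: one table row per model.
model_row <- function(model) {
  s <- summary(model)
  f <- s$fstatistic
  c(Intercept    = unname(coef(model)["(Intercept)"]),
    ManualSlope  = s$coefficients["TransmissionManual", "Estimate"],
    ManualPValue = s$coefficients["TransmissionManual", "Pr(>|t|)"],
    DF           = df.residual(model),
    AdjR2        = s$adj.r.squared,
    FValue       = unname(f["value"]),
    ModelPValue  = unname(pf(f["value"], f["numdf"], f["dendf"],
                             lower.tail = FALSE)))
}

modelTable <- rbind(pValue     = model_row(pModel),
                    AdjRSquare = model_row(rSquaredSimple))
round(modelTable, 4)
```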
Figure 4 shows the hat values and dfbeta values used to look for influential points. Both models produced similar graphs, so I have drawn them only for the p-value model. The graph shows 2-3 points with larger values, but they are not too high. I tried removing these points and refitting the models, and the resulting models were similar. Figure 5 shows the diagnostic plots from the model; again both models looked similar, so I have shown the p-value model. The Normal Q-Q plot confirms what we discussed earlier: there is moderate to high right skew. There are some leverage points, as already discussed, but removing them did not change the models enough to matter. The residuals plot shows that the residuals are scattered randomly around 0, with no visible heteroskedasticity.
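A sketch of the influence checks described above is shown below, using hatvalues and dfbeta from base R. The 2p/n leverage cutoff is a common rule of thumb that I am assuming here for illustration; it is not a threshold taken from this analysis.

```r
data <- with(mtcars, data.frame(
  MilesPerGallon = mpg,
  Transmission   = factor(am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")),
  Weight = wt, MileTime = qsec,
  row.names = rownames(mtcars)
))
pModel <- lm(MilesPerGallon ~ Transmission + Weight + MileTime, data = data)

# Leverage of each car and its influence on the transmission coefficient.
leverage  <- hatvalues(pModel)
influence <- dfbeta(pModel)[, "TransmissionManual"]

# Assumed rule of thumb: flag leverage above 2 * (number of coefficients) / n.
cutoff <- 2 * length(coef(pModel)) / nrow(data)
leverage[leverage > cutoff]

# The three cars that move the transmission slope the most when left out.
head(sort(abs(influence), decreasing = TRUE), 3)
```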
To decide between the two models, I use variance inflation factor (VIF) analysis to drop one of them. In both models transmission inflates the variance of its coefficient by roughly 125% to 150%, but since we need transmission in our analysis we accept that. The deciding factor is weight: its variance is inflated by about 150% in pModel versus about 275% in the R-squared model, so in favor of the lower VIF I choose pModel.
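The variance inflation factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other predictors in the model. The sketch below computes it in base R on the raw mtcars columns (am, wt, qsec, hp) rather than assuming a particular package; vif_manual is a hypothetical helper name.

```r
# Hypothetical helper: VIF of each predictor, from its R^2 against the others.
vif_manual <- function(predictors, data) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifP        <- vif_manual(c("am", "wt", "qsec"), mtcars)  # predictors of pModel
vifRSquared <- vif_manual(c("am", "hp", "wt"), mtcars)    # predictors of rSquaredSimple

# Percentage by which each coefficient's variance is inflated.
round(100 * (vifP - 1))
round(100 * (vifRSquared - 1))
```

A VIF of 2.5, for example, corresponds to a 150% increase in variance relative to an uncorrelated predictor.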
For pModel, the model is MilesPerGallon = 9.62 + 2.94 ManualTransmission (I have left out the other variables for simplicity). It shows that, holding all other variables constant, manual cars get 2.94 more miles per gallon than automatic cars. So, based on this data set, the answer to the first question is that manual cars are better than automatic cars. We can calculate a confidence interval for the manual versus automatic difference to see whether it is significant. With 28 degrees of freedom, we can say with 95% confidence that manual cars get between 0.05 and 5.83 more miles per gallon than automatic cars, which answers our second question as well. Because this interval does not include 0, the difference in mpg between manual and automatic transmission is significant at the 5% level.
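The interval quoted above can be reproduced with confint, or by hand as estimate ± t(0.975, 28) × standard error. The sketch below rebuilds the needed columns from mtcars and shows both routes.

```r
data <- with(mtcars, data.frame(
  MilesPerGallon = mpg,
  Transmission   = factor(am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")),
  Weight = wt, MileTime = qsec
))
pModel <- lm(MilesPerGallon ~ Transmission + Weight + MileTime, data = data)

# 95% confidence interval for the manual vs. automatic difference.
ci <- confint(pModel, "TransmissionManual", level = 0.95)
round(ci, 2)

# The same interval by hand: estimate +/- t quantile * standard error.
est <- coef(summary(pModel))["TransmissionManual", c("Estimate", "Std. Error")]
manualCi <- est[["Estimate"]] +
  c(-1, 1) * qt(0.975, df.residual(pModel)) * est[["Std. Error"]]
round(manualCi, 2)
```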
This section contains all the figures used in the above text.