This project asks whether an automatic or a manual transmission gives better fuel economy, measured in miles per gallon (MPG), using the mtcars data set. It explores the relationship between a set of variables and MPG in order to identify the best-fitting model. It begins with some exploratory data analysis and an interpretation of a linear regression that uses only the dummy variable “Transmission (0 = automatic, 1 = manual)” as the predictor.
In Figure 1 (appendix), which compares MPG for each transmission type, we can see that manual transmission is associated with higher MPG.
Here is a summary of a linear regression model of MPG on transmission.
ols <- lm(mpg~am, data=mtcars)
summary(ols)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04
R has chosen automatic transmission (am = 0) as the reference level, so the empirical mean for automatic transmission is the intercept (17.15 mpg) and the empirical mean for manual transmission is the intercept plus the am coefficient (i.e. 24.39 mpg).
From the summary of the coefficient estimates, we can see that moving from an automatic to a manual transmission is associated with a gain of 7.24 mpg, which is exactly the difference between the empirical means of the two transmission types. This is also evident in Figure 1.
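The link between the regression coefficients and the group means can be checked directly. The following is a minimal sketch using only base R; the names `group_means`, `auto_mean` and `manual_mean` are introduced here for illustration and are not part of the original analysis.

```r
# mtcars ships with base R; am is 0 for automatic and 1 for manual
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

ols <- lm(mpg ~ am, data = mtcars)
b <- coef(ols)

# The intercept should match the automatic mean,
# and intercept + slope should match the manual mean
auto_mean <- unname(b["(Intercept)"])
manual_mean <- unname(b["(Intercept)"] + b["am"])

print(round(c(automatic = auto_mean, manual = manual_mean), 2))
print(round(group_means, 2))
```

With a single 0/1 predictor, the fitted values can only take two values, so least squares sets them to the two group means.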
We next want to know whether predictors other than transmission need to be included in the model. As a first step, we check the variance inflation factors (VIFs) of a model that contains all the covariates.
library(car)
fit <- lm(mpg~., data=mtcars)
round(vif(fit), 2)
## cyl disp hp drat wt qsec vs am gear carb
## 15.37 21.62 9.83 3.37 15.16 7.53 4.97 4.65 5.36 7.91
The number of cylinders, displacement (cu. in.) and weight (lb/1000) have the largest VIFs, i.e. the variances of their coefficients are inflated the most relative to a model in which all the covariates are orthogonal. We use a nested model search to find the best model.
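For a model whose terms each use one degree of freedom, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below recomputes the VIFs that way with base R only, as a cross-check on `car::vif`; the names `X` and `manual_vif` are hypothetical helpers introduced for this example.

```r
# Keep the predictors only: drop the response column
X <- mtcars[, setdiff(names(mtcars), "mpg")]

manual_vif <- sapply(names(X), function(v) {
  # Regress predictor v on all remaining predictors
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  # A large R^2 means v is nearly a linear combination of the others
  1 / (1 - r2)
})

print(round(manual_vif, 2))
```

A VIF near 1 indicates a predictor that is nearly orthogonal to the rest, while values above roughly 10 are commonly read as a sign of strong collinearity.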
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl + disp + wt
## Model 3: mpg ~ am + cyl + disp + wt + hp + carb + qsec
## Model 4: mpg ~ am + cyl + disp + wt + hp + carb + qsec + drat + gear +
## vs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 27 188.43 3 532.47 25.2708 3.641e-07 ***
## 3 24 150.96 3 37.46 1.7780 0.1822
## 4 21 147.49 3 3.47 0.1646 0.9190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
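In the analysis of variance table, only the step from Model 1 to Model 2 is significant (F = 25.27, p = 3.6e-07); adding the covariates of Model 3 (p = 0.18) or Model 4 (p = 0.92) gives no significant improvement. We therefore include the covariates used in Model 2.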
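The call that produced the table is not shown in the report. A sketch that builds the four models listed in the table header and compares them might look like this; `fit2` is the name used later in the report, while `fit1`, `fit3`, `fit4` and `tab` are assumed names.

```r
# Each model adds a block of covariates to the previous one
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- update(fit1, . ~ . + cyl + disp + wt)
fit3 <- update(fit2, . ~ . + hp + carb + qsec)
fit4 <- update(fit3, . ~ . + drat + gear + vs)

# Sequential F-tests: each row tests the block added since the row above
tab <- anova(fit1, fit2, fit3, fit4)
print(tab)
```

Each F-test asks whether the drop in residual sum of squares from the added block is larger than chance would explain, so a non-significant row suggests the extra covariates can be left out.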
In Figure 2, we check the residual plot for any suspicious pattern that might undermine the significance levels of the coefficients. The plot shows no such pattern.
The dfbetas measure the standardized change in each coefficient when the i’th point is deleted in fitting the model (the corresponding change in the predicted response is measured by dffits). As the report below shows for the first five cars, none of these points is very influential.
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## -0.091 -0.022 -0.218 -0.072
## Hornet Sportabout
## -0.148
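The report does not show the call behind these numbers, nor which coefficient they refer to. The sketch below assumes `fit2` is the Model 2 fit and looks at the `am` column, since transmission is the coefficient of interest; the 2/sqrt(n) cutoff is a common rule of thumb and is not taken from the original analysis.

```r
# Assumed definition of fit2 (Model 2 from the nested comparison)
fit2 <- lm(mpg ~ am + cyl + disp + wt, data = mtcars)

# One row per car, one column per coefficient
db <- dfbetas(fit2)
print(round(db[1:5, "am"], 3))

# Rule-of-thumb cutoff for a noteworthy standardized change
cutoff <- 2 / sqrt(nrow(mtcars))
flagged <- rownames(db)[abs(db[, "am"]) > cutoff]
print(flagged)
```

Checking all 32 rows against a cutoff, rather than reading off the first five, gives a fuller picture of which cars move the transmission coefficient.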
Similarly, the hat values measure leverage: the potential of each data point to influence the fitted coefficients. The hat values below, for the first five cars, are all well under the common rule-of-thumb cutoff of twice the average leverage (2 × 5/32 ≈ 0.31), so none of these points has much potential for influence.
round(hatvalues(fit2)[1:5],3)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 0.109 0.137 0.103 0.102
## Hornet Sportabout
## 0.176
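Hat values always sum to the number of fitted coefficients, so their average is p/n. A minimal sketch of a leverage check over all cars, assuming `fit2` is the Model 2 fit and using the conventional 2p/n cutoff (an assumption, not something stated in the report):

```r
# Assumed definition of fit2 (Model 2 from the nested comparison)
fit2 <- lm(mpg ~ am + cyl + disp + wt, data = mtcars)

h <- hatvalues(fit2)
p <- length(coef(fit2))   # number of coefficients, including the intercept
n <- nrow(mtcars)

# Average leverage is p/n; 2p/n is a common flagging threshold
print(round(c(mean_leverage = mean(h), cutoff = 2 * p / n), 3))

# Cars whose leverage exceeds the threshold, if any
high_leverage <- h[h > 2 * p / n]
print(round(high_leverage, 3))
```

High leverage alone does not make a point influential; it only means the point could pull the fit strongly if its response were unusual.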
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 40.898313414 3.60154037 11.3557837 8.677574e-12
## am           0.129065571 1.32151163  0.0976651 9.229196e-01
## cyl         -1.784173258 0.61819218 -2.8861142 7.581533e-03
## disp         0.007403833 0.01208067  0.6128661 5.450930e-01
## wt          -3.583425472 1.18650433 -3.0201537 5.468412e-03
Here we carry out the statistical inference. A confidence interval around each coefficient estimate conveys the variability and uncertainty of that estimate.
When the other variables are included, the results differ sharply from the earlier model. The t-value for the transmission coefficient is very small (0.098, p = 0.92), so after adjusting for cylinders, displacement and weight there is no evidence of a difference in MPG between automatic and manual transmissions. We therefore fail to reject the null hypothesis that the population mean difference in MPG between a manual and an automatic transmission is 0.
If we nevertheless quantify the MPG difference between automatic and manual transmissions in this model, it is 0.13 mpg more for manual transmission. Although not statistically significant, the direction of the result agrees with the earlier case, where we found the manual transmission to be better.
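The same conclusion can be read from a confidence interval for the transmission coefficient. The sketch below assumes `fit2` is the Model 2 fit; it computes the 95% interval with `confint` and again by hand as estimate ± t quantile × standard error. The names `ci`, `est` and `manual_ci` are introduced for illustration.

```r
# Assumed definition of fit2 (Model 2 from the nested comparison)
fit2 <- lm(mpg ~ am + cyl + disp + wt, data = mtcars)

# 95% confidence interval for the am coefficient
ci <- confint(fit2, "am", level = 0.95)
print(round(ci, 2))

# The same interval by hand: estimate +/- t quantile * standard error
est <- coef(summary(fit2))["am", ]
t_crit <- qt(0.975, df = df.residual(fit2))
manual_ci <- unname(est["Estimate"] + c(-1, 1) * t_crit * est["Std. Error"])
print(round(manual_ci, 2))
```

An interval that contains 0 is consistent with failing to reject the null hypothesis at the 5% level.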
In the diagnostic plots, we look for any systematic pattern in the residuals. In the Normal Q-Q plot, we check the normality of the errors. In the standardized residuals vs. leverage plot, we check for points whose deletion would change the coefficients substantially.
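Diagnostic panels of this kind are what R draws by default for a fitted linear model. The sketch below assumes they were produced from `fit2`, which the report does not confirm. It also lists the largest Cook's distances, which summarize the overall change in the fit when a point is deleted.

```r
# Assumed definition of fit2 (Model 2 from the nested comparison)
fit2 <- lm(mpg ~ am + cyl + disp + wt, data = mtcars)

# Default panels: residuals vs fitted, normal Q-Q,
# scale-location, and standardized residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit2)
par(op)

# Cook's distance per car: overall change in the fit when that car is deleted
cd <- cooks.distance(fit2)
print(round(head(sort(cd, decreasing = TRUE), 3), 3))
```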