This report explores the relationship between a set of variables and the miles per gallon (MPG) of cars with automatic and manual transmissions. We found that, in general, manual cars have better MPG than automatic cars. Assuming transmission is independent of all other variables, manual cars average 7.25 MPG more than automatic cars. However, when other relevant variables are accounted for in a better-fitting model, manual cars average only 1.56 MPG more than automatic cars.
We first load the data, sort the data frame by transmission, and convert transmission into a factor.
data(mtcars)
str(mtcars)
mtcars <- mtcars[order(mtcars$am), ]  # sort rows by transmission
mtcars$am <- factor(mtcars$am)        # 0 = automatic, 1 = manual
Figure 1 is a boxplot showing that manual cars have higher MPG overall than automatic cars. Figures 2 and 3 show the relationships between MPG and all other variables (split by manual and automatic). Clearly, MPG also depends on other variables, irrespective of transmission.
Our null hypothesis is that there is no difference in mean MPG between manual and automatic transmissions, assuming independence from other variables. To test this hypothesis, we perform a two-sample t-test (Welch's, the default in R) at a significance level of alpha = 0.05. The p-value is below 0.05, so we reject the null hypothesis and conclude that there is a statistically significant difference in MPG between manual and automatic transmissions.
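The plotting code for Figure 1 is not included in this report. A minimal sketch of how such a boxplot could be drawn with base graphics is shown here; the working copy `cars`, the level labels, and the axis titles are illustrative assumptions rather than the original figure code.

```r
# Sketch only: not the original figure code.
data(mtcars)
cars <- mtcars  # hypothetical working copy, so mtcars itself is left unchanged
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Boxplot of MPG split by transmission type
boxplot(mpg ~ am, data = cars,
        xlab = "Transmission", ylab = "Miles per gallon (MPG)",
        main = "MPG by transmission type")

# Group means, which the boxplot medians roughly track
group_means <- tapply(cars$mpg, cars$am, mean)
print(round(group_means, 2))
```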
t.test(mpg ~ am, data = mtcars)[["p.value"]]
## [1] 0.001373638
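Beyond the p-value, the same test object also carries the group means and a confidence interval for their difference. The sketch below shows how these could be inspected; the working copy `cars` and the object name `tt` are illustrative assumptions.

```r
# Sketch only: inspects more of the t-test result than the p-value.
data(mtcars)
cars <- mtcars  # hypothetical working copy
cars$am <- factor(cars$am)

# Welch two-sample t-test (var.equal = FALSE is the default in t.test)
tt <- t.test(mpg ~ am, data = cars)

print(tt$estimate)  # mean MPG in each transmission group
print(tt$conf.int)  # 95% CI for mean(automatic) - mean(manual)
print(tt$p.value)   # same p-value as reported above
```

Because the interval is computed for automatic minus manual, an interval lying entirely below zero corresponds to higher MPG for manual cars.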
The pairwise relationships among the variables can be explored in more detail through the pairs plot (Figure 4 in the Appendix).
Assume that MPG also depends on other variables but, for simplicity, that those variables are independent of one another. We compare the R-squared values obtained by regressing MPG on each variable separately. Of all variables, “disp”, “cyl”, “wt”, and “hp” have the highest R-squared values (in fact, higher than “am”), meaning each of these explains more of the variation in MPG than the others (assuming a linear relationship). We therefore include these additional variables in our new model and ignore the rest.
##          am disp  cyl   wt   hp carb qsec gear   vs drat
## MPG_r2 0.36 0.72 0.73 0.75 0.60 0.30 0.18 0.23 0.44 0.46
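The code that produced the table above is not shown. One way it could be computed is to fit a separate single-predictor regression for each variable and collect the R-squared values, as sketched below; the names `cars` and `predictors` are assumptions introduced for illustration.

```r
# Sketch only: one possible way to reproduce the R-squared table.
data(mtcars)
cars <- mtcars  # hypothetical working copy
cars$am <- factor(cars$am)

predictors <- c("am", "disp", "cyl", "wt", "hp",
                "carb", "qsec", "gear", "vs", "drat")

# Fit mpg ~ <variable> for each predictor and keep the R-squared
MPG_r2 <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = cars)
  summary(fit)$r.squared
})

# One-row matrix, rounded to two decimals as in the table
print(round(rbind(MPG_r2), 2))
```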
To test this new model, we compare three nested models with F-tests using anova(): MPG vs AM (our initial model), MPG vs AM+DISP+CYL+WT+HP (our preferred model), and MPG vs all variables.
The very small p-value for the second model (6.8e-07) indicates that the additional variables significantly improve the fit over our initial model, i.e. they do contribute to explaining MPG. The large p-value for the third model (0.81) indicates that adding the remaining variables gives no significant further improvement. This supports the second model as the better basis for estimating the relationship between MPG and transmission.
fit1 <- lm(mpg ~ am, data = mtcars)
fit5 <- lm(mpg ~ am + disp + cyl + wt + hp, data = mtcars)
fit10 <- lm(mpg ~ am + disp + cyl + wt + hp + carb + qsec + gear + vs + drat, data = mtcars)
anova(fit1, fit5, fit10)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + disp + cyl + wt + hp
## Model 3: mpg ~ am + disp + cyl + wt + hp + carb + qsec + gear + vs + drat
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1     30 720.90
## 2     26 163.12  4    557.78 19.8538 6.809e-07 ***
## 3     21 147.49  5     15.63  0.4449    0.8121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
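To make the table easier to read, the F statistics can be recomputed by hand from the RSS and residual degrees of freedom. The sketch below copies the rounded figures from the table, so the results agree only up to rounding. It assumes the convention that, when several models are compared at once, anova() divides by the residual mean square of the largest model; the variable names are illustrative.

```r
# Sketch only: recomputes the F statistics from the rounded table values.
rss    <- c(720.90, 163.12, 147.49)  # residual sums of squares, models 1 to 3
res_df <- c(30, 26, 21)              # residual degrees of freedom

# Denominator: residual mean square of the largest model (model 3)
scale <- rss[3] / res_df[3]

# Numerator: drop in RSS per added parameter between consecutive models
drop_rss <- -diff(rss)     # 557.78 and 15.63
drop_df  <- -diff(res_df)  # 4 and 5

f_stat <- (drop_rss / drop_df) / scale
p_val  <- pf(f_stat, drop_df, res_df[3], lower.tail = FALSE)

print(round(f_stat, 2))
print(signif(p_val, 3))
```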
If we regress MPG on AM alone (ignoring dependence on other variables), the coefficient for transmission is 7.25, i.e. manual cars have higher fuel economy than automatic cars by 7.25 MPG. If we fit the new model instead, the coefficient for transmission drops to 1.56, i.e. manual cars have higher fuel economy than automatic cars by only 1.56 MPG. This can be explained by looking at the p-values of the “wt” and “hp” variables, which are much smaller than that of “am”: once weight and horsepower are accounted for, the transmission coefficient is no longer statistically significant (p = 0.29).
summary(fit1)$coefficients[,c(1,4)]
##              Estimate     Pr(>|t|)
## (Intercept) 17.147368 1.133983e-15
## am1          7.244939 2.850207e-04
summary(fit5)$coefficients[,c(1,4)]
##                Estimate     Pr(>|t|)
## (Intercept) 38.20279869 9.084987e-11
## am1          1.55649163 2.898430e-01
## disp         0.01225708 3.047194e-01
## cyl         -1.10637984 1.139322e-01
## wt          -3.30262301 7.256888e-03
## hp          -0.02796002 5.509659e-02
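A confidence interval gives a more direct sense of how uncertain the transmission coefficient is in the preferred model. The sketch below refits the same model on a working copy and extracts the 95% interval for “am1”; the names `cars` and `ci` are illustrative assumptions.

```r
# Sketch only: 95% confidence interval for the transmission coefficient.
data(mtcars)
cars <- mtcars  # hypothetical working copy
cars$am <- factor(cars$am)

# Same formula as the preferred model above
fit5 <- lm(mpg ~ am + disp + cyl + wt + hp, data = cars)

ci <- confint(fit5, "am1", level = 0.95)
print(ci)
```

Since the p-value for “am1” is above 0.05, this interval includes zero, which is another way of saying that the data are compatible with no transmission effect once the other variables are in the model.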
The initial model’s adjusted R-squared value is 0.3385, whereas the preferred model’s value is 0.8273, so the preferred model fits the data considerably better than the model using transmission alone. The effect of transmission on fuel economy is therefore not as dramatic as the initial model suggested.
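The adjusted R-squared values quoted above can be read directly from the model summaries. A minimal sketch, refitting both models on a working copy (the names `cars` and `adj_r2` are illustrative assumptions):

```r
# Sketch only: extracts adjusted R-squared for the two models compared above.
data(mtcars)
cars <- mtcars  # hypothetical working copy
cars$am <- factor(cars$am)

fit1 <- lm(mpg ~ am, data = cars)
fit5 <- lm(mpg ~ am + disp + cyl + wt + hp, data = cars)

adj_r2 <- sapply(list(initial = fit1, preferred = fit5),
                 function(f) summary(f)$adj.r.squared)
print(round(adj_r2, 4))
```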