Motor Trend is particularly interested in the following two questions:
Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions
This report uses the mtcars dataset to answer these two questions. The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
First, the data is loaded and an exploratory analysis is performed.
data(mtcars)
head(mtcars,2)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
Each entry contains miles per gallon (mpg), number of cylinders (cyl), displacement (disp), gross horsepower (hp), rear axle ratio (drat), weight (wt), 1/4 mile time (qsec), V/straight engine (vs), transmission (am), number of forward gears (gear), and number of carburetors (carb).
If we take miles per gallon as the dependent variable, pairs of the remaining variables that are strongly correlated with each other should be handled carefully to prevent overfitting.
Among these variables, cyl, vs, and am can be treated as categorical; gear and carb can also be treated as categorical; the rest are purely numerical. We first investigate the numerical ones. We compute the absolute correlation matrix, convert it to a dissimilarity 1 - |r|, and take Euclidean distances between the rows of that matrix, so a small value means two variables have similar correlation profiles.
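# Keep only the numerical variables
mtcarsNum <- data.frame(mpg=mtcars$mpg, disp=mtcars$disp, hp=mtcars$hp,
                        drat=mtcars$drat, wt=mtcars$wt, qsec=mtcars$qsec)
mtcarsNumCor <- cor(mtcarsNum)
# Dissimilarity: 0 for perfectly (anti)correlated pairs, 1 for uncorrelated pairs
dissimilarity <- 1-abs(mtcarsNumCor)
# Euclidean distance between the rows of the dissimilarity matrix
dis <- dist(dissimilarity,method = "euclidean")
print(dis)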
##            mpg      disp        hp      drat        wt
## disp 0.2195026
## hp   0.5337582 0.5373739
## drat 0.6787464 0.6788544 1.0034293
## wt   0.3330890 0.3317454 0.7777210 0.5318162
## qsec 1.2961403 1.3130441 0.8875936 1.4675313 1.4691905
The distances indicate that:
All numerical variables except qsec are closely related to mpg.
disp is closely related to hp, drat, and wt, and thus should not be included.
hp and drat are also closely related to wt, and thus should not be included.
So we include only weight (wt) in the model.
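The grouping read off the distance table can also be viewed as a tree. The following is a minimal sketch, not part of the original analysis: it assumes that 1 - |r| is used directly as the pairwise distance (via `as.dist`) and that average-linkage clustering is acceptable, so its numbers will differ from the `dist()` output above. The names `cars`, `num_vars`, `cor_dist`, and `var_tree` are illustrative.

```r
# Local copy so the report's own objects are left untouched.
cars <- datasets::mtcars
num_vars <- cars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# Assumption: use 1 - |r| itself as the distance between two variables
# (the report instead takes Euclidean distances between rows of this matrix).
cor_dist <- as.dist(1 - abs(cor(num_vars)))

# Average-linkage clustering; variables that merge low in the tree are
# strongly correlated with each other.
var_tree <- hclust(cor_dist, method = "average")
plot(var_tree, main = "Numeric variables clustered by 1 - |correlation|",
     xlab = "", sub = "")
```

Variables that join near the bottom of the tree are candidates for being represented by a single regressor.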
As stated at the beginning of this report, transmission must be one of the regressors, so we use a scatterplot matrix to investigate how the other candidate regressors, the factors (cyl, vs, gear, carb) and the numerical variable (wt), relate to transmission type.
# One-sided formula: every term becomes one variable in the scatterplot matrix
pairs(~ am + cyl + vs + wt + gear + carb, data = mtcars,
      panel=panel.smooth, upper.panel = NULL, lty = 2,
      main="mtcars regressors")
mtcarsXs <- data.frame(am=mtcars$am, cyl=mtcars$cyl, vs=mtcars$vs,
                       wt=mtcars$wt, gear=mtcars$gear, carb=mtcars$carb)
mtcarsXsCor <- cor(mtcarsXs)
dissimilarity <- 1-abs(mtcarsXsCor)
dis <- dist(dissimilarity,method = "euclidean")
print(dis)
##             am       cyl        vs        wt      gear
## cyl  1.0899008
## vs   1.4468220 0.5769495
## wt   0.7664692 0.4550234 0.9133605
## gear 0.3820390 1.0285131 1.3570794 0.7660955
## carb 1.5093260 0.9464097 0.6958084 1.1043363 1.3243923
As with the numerical variables, we conclude that we should:
Drop gear, since it is closely related to am.
Include carb, because it is far from am, cyl, vs, and wt.
Include vs, because it is far from am under this distance.
Keep wt and drop cyl, because cyl is close to both vs and wt.
The final model therefore contains transmission (am), number of carburetors (carb), V/straight engine (vs), and weight (wt).
Before estimating regression models to answer the two questions raised at the beginning of this report, we convert the categorical variables am and vs into factors.
mtcars$am <- factor(mtcars$am,levels=c(0,1),
                    labels=c("Automatic","Manual"))
mtcars$vs <- factor(mtcars$vs,levels=c(0,1),
                    labels=c("V-engine","straight engine"))
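We first build a simple regression model with a single regressor, mpg = \(\beta_{0}\) + \(\beta_{1}\)am, and then add the other regressors one at a time. The last model, fit5, adds gear to check whether a variable we decided to drop would have helped.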
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- update(fit1, mpg ~ am + carb)
fit3 <- update(fit1, mpg ~ am + carb + vs)
fit4 <- update(fit1, mpg ~ am + carb + vs + wt)
fit5 <- update(fit1, mpg ~ am + carb + vs + wt + gear)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The p-value of the F-statistic (2.85e-4 < 0.05) lets us reject, at the 5% level, the hypothesis that the slope \(\beta_{1}\) is zero, and the t-tests show that both the slope and the intercept \(\beta_{0}\) differ significantly from zero. Unfortunately, the model explains only about 36% of the total variance in mpg, so we need to look for a better model.
This model also indicates that the mean mpg with a manual transmission is about 7.245 (= \((\beta_{0}+\beta_{1})-\beta_{0}\)) higher than with an automatic transmission, which agrees with common sense. This answers the first question.
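With a single two-level factor as the regressor, the intercept is the mean of the reference group and the slope is the difference between the two group means. A minimal sketch of that check follows; it works on a local copy named `cars` (an illustrative name) so the report's `mtcars` object is not modified.

```r
# Local copy with am recoded as in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Mean mpg per transmission type and the gap between them.
group_means <- tapply(cars$mpg, cars$am, mean)
mean_gap <- unname(group_means["Manual"] - group_means["Automatic"])

# Slope of the one-regressor model mpg ~ am.
fit_am <- lm(mpg ~ am, data = cars)
slope <- unname(coef(fit_am)["amManual"])

print(group_means)
print(c(mean_gap = mean_gap, slope = slope))
```

The two printed values should agree, which is why the 7.245 in the summary can be read directly as the difference in mean mpg.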
Since we added regressors step by step, we can examine the variance inflation factor of the transmission coefficient in each model, and run nested-model tests to see how the residual variance changes as each variable is added.
library(car)
vifam <- data.frame(fit2=vif(fit2)[1], fit3=vif(fit3)[1],
fit4=vif(fit4)[1], fit5=vif(fit5)[1])
print(vifam)
##        fit2     fit3     fit4     fit5
## am 1.003321 1.067446 2.792822 3.881481
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + carb
## Model 3: mpg ~ am + carb + vs
## Model 4: mpg ~ am + carb + vs + wt
## Model 5: mpg ~ am + carb + vs + wt + gear
##   Res.Df    RSS Df Sum of Sq       F   Pr(>F)
## 1     30 720.90
## 2     29 333.68  1    387.22 55.9205 6.12e-08 ***
## 3     28 245.65  1     88.03 12.7125 0.001436 **
## 4     27 183.48  1     62.18  8.9792 0.005935 **
## 5     26 180.04  1      3.44  0.4968 0.487172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistics show that each added variable is significant until gear enters the model, which is consistent with our correlation analysis.
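The adjusted R-squared of each nested model gives a complementary view of the same comparison. The following is a sketch under the assumption that adjusted R-squared is an acceptable secondary criterion here; the report itself relies on the nested F-tests. The names `cars`, `formulas`, and `adj_r2` are illustrative.

```r
# Local copy with the same factor coding as the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
cars$vs <- factor(cars$vs, levels = c(0, 1),
                  labels = c("V-engine", "straight engine"))

# The five nested model formulas from the report.
formulas <- list(
  fit1 = mpg ~ am,
  fit2 = mpg ~ am + carb,
  fit3 = mpg ~ am + carb + vs,
  fit4 = mpg ~ am + carb + vs + wt,
  fit5 = mpg ~ am + carb + vs + wt + gear
)

# Adjusted R-squared penalises regressors that add little explanatory power.
adj_r2 <- sapply(formulas,
                 function(f) summary(lm(f, data = cars))$adj.r.squared)
print(round(adj_r2, 4))
```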
In the fourth regression model, the coefficient of the transmission term is 3.0699233. This means that, holding the number of carburetors, the engine shape (V or straight), and the weight fixed, a manual transmission is still associated with about 3.07 more mpg than an automatic one. This answers the second question.
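A point estimate alone does not show how precisely the difference is determined. The sketch below is not part of the original analysis; it assumes a 95% level and refits model 4 on a local copy (`cars` and `fit4_sketch` are illustrative names) to extract a confidence interval for the transmission coefficient with `confint`.

```r
# Local copy with the same factor coding as the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
cars$vs <- factor(cars$vs, levels = c(0, 1),
                  labels = c("V-engine", "straight engine"))

# Refit model 4 and pull out the manual-vs-automatic coefficient.
fit4_sketch <- lm(mpg ~ am + carb + vs + wt, data = cars)
am_est <- coef(fit4_sketch)[["amManual"]]

# 95% confidence interval for that coefficient (the level is an assumption).
am_ci <- confint(fit4_sketch, parm = "amManual", level = 0.95)

print(am_est)
print(am_ci)
```

If the interval includes zero, the adjusted difference would not be distinguishable from no difference at that level, so the interval is worth reporting next to the estimate.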
Our final model is model 4 (mpg = \(\beta_{0}\) + \(\beta_{1}\)am + \(\beta_{2}\)carb + \(\beta_{3}\)vs + \(\beta_{4}\)wt). Its residuals are plotted as follows:
plot(resid(fit4), main="residual of regression model 4", type="b", xlab="", ylab="")
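Plotting residuals against observation order mainly reveals ordering effects. A common complement, sketched here as an assumption rather than as part of the original report, is to plot residuals against fitted values, where curvature or a funnel shape would suggest a misspecified model or non-constant variance. The names `cars` and `fit4_sketch` are illustrative.

```r
# Local copy with the same factor coding as the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))
cars$vs <- factor(cars$vs, levels = c(0, 1),
                  labels = c("V-engine", "straight engine"))

fit4_sketch <- lm(mpg ~ am + carb + vs + wt, data = cars)

# Residuals vs fitted values, with a reference line at zero.
plot(fitted(fit4_sketch), resid(fit4_sketch),
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs fitted values, model 4")
abline(h = 0, lty = 2)
```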