Summary

Motor Trend is particularly interested in the following two questions:

  1. Is an automatic or manual transmission better for MPG?

  2. Quantify the MPG difference between automatic and manual transmissions

This report uses the mtcars dataset to answer these two questions. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Method

First, the data are loaded and an exploratory analysis is performed:

data(mtcars)
head(mtcars,2)
              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Each entry contains miles per gallon (mpg), number of cylinders (cyl), displacement (disp), gross horsepower (hp), rear axle ratio (drat), weight (wt), 1/4 mile time (qsec), V/straight engine (vs), transmission (am), number of forward gears (gear), and number of carburetors (carb).
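A quick way to see which of these columns behave like categories is to count the distinct values each one takes. The sketch below is an illustrative check, not part of the original analysis; the name `n_unique` is introduced here only for the example.

```r
data(mtcars)

# Count the distinct values in each column; columns with only a handful
# of distinct values are candidates for categorical treatment
n_unique <- sapply(mtcars, function(x) length(unique(x)))
print(n_unique)
```

Columns such as vs and am take only two values, while continuous measurements such as wt and qsec take many.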

If we take miles per gallon as the dependent variable, pairs of the remaining variables that are strongly correlated with each other should be handled carefully, so that redundant regressors do not lead to overfitting.

Among these variables, cyl, vs, and am can be treated as categorical; the number of gears (gear) and the number of carburetors (carb) can also be treated as categorical; the others are purely numerical. We first investigate the numerical ones.

mtcarsNum <- data.frame(mpg=mtcars$mpg, disp=mtcars$disp, hp=mtcars$hp,
                        drat=mtcars$drat, wt=mtcars$wt, qsec=mtcars$qsec)
mtcarsNumCor <- cor(mtcarsNum)
dissimilarity <- 1-abs(mtcarsNumCor)
dis <- dist(dissimilarity,method = "euclidean")
print(dis)
##            mpg      disp        hp      drat        wt
## disp 0.2195026                                        
## hp   0.5337582 0.5373739                              
## drat 0.6787464 0.6788544 1.0034293                    
## wt   0.3330890 0.3317454 0.7777210 0.5318162          
## qsec 1.2961403 1.3130441 0.8875936 1.4675313 1.4691905

Based on these distances, we include only weight (wt) in the model.
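The distances printed above are Euclidean distances between rows of the dissimilarity matrix 1 - |r|. As a cross-check, the pairwise dissimilarities themselves can be read directly. The sketch below is an assumption about a useful extra check rather than the report's own procedure; `num_vars`, `r`, and `pair_dis` are names introduced here.

```r
data(mtcars)

num_vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
r <- cor(mtcars[, num_vars])

# Pairwise dissimilarity 1 - |r|: values near 0 mean the two variables
# are strongly (positively or negatively) correlated
pair_dis <- as.dist(1 - abs(r))
print(round(pair_dis, 3))
```

Pairs with a small value here carry largely overlapping information, which is the motivation for keeping only one variable from such a pair.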

As stated at the beginning of this report, transmission is the regressor of interest, so we use a panel scatter plot to investigate how the other potential regressors, including the factors (cyl, vs, gear, carb) and the numerical variable (wt), relate to transmission type.

pairs(~ am + cyl + vs + wt + gear + carb, data = mtcars,
      panel = panel.smooth, upper.panel = NULL, lty = 2,
      main = "mtcars regressors")

mtcarsXs <- data.frame(am=mtcars$am, cyl=mtcars$cyl, vs=mtcars$vs,
                        wt=mtcars$wt, gear=mtcars$gear, carb=mtcars$carb)
mtcarsXsCor <- cor(mtcarsXs)
dissimilarity <- 1-abs(mtcarsXsCor)
dis <- dist(dissimilarity,method = "euclidean")
print(dis)
##             am       cyl        vs        wt      gear
## cyl  1.0899008                                        
## vs   1.4468220 0.5769495                              
## wt   0.7664692 0.4550234 0.9133605                    
## gear 0.3820390 1.0285131 1.3570794 0.7660955          
## carb 1.5093260 0.9464097 0.6958084 1.1043363 1.3243923

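As with the numerical variables, we drop regressors that are closely related to ones already kept (for example, gear lies closest to am, at a distance of 0.38). This leaves transmission (am), number of carburetors (carb), V/straight engine (vs), and weight (wt) in our model.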

Before estimating regression models to answer the two questions raised at the beginning of this report, we convert the categorical variables am and vs into factors.

mtcars$am <- factor(mtcars$am,levels=c(0,1),
      labels=c("Automatic","Manual"))
mtcars$vs <- factor(mtcars$vs,levels=c(0,1),
   labels=c("V-engine","straight engine"))

We first build a simple regression model with a single regressor (\(\beta_{1}\)am+\(\beta_{0}\)), and then add the other regressors one at a time:

fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- update(fit1, mpg ~ am + carb)
fit3 <- update(fit1, mpg ~ am + carb + vs)
fit4 <- update(fit1, mpg ~ am + carb + vs + wt)
fit5 <- update(fit1, mpg ~ am + carb + vs + wt + gear)

summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The p-value of the F-statistic (2.85e-4 < 0.05) lets us reject the hypothesis that the slope \(\beta_{1}\) is zero, and the t-tests show that both the slope \(\beta_{1}\) and the intercept \(\beta_{0}\) differ significantly from zero. Unfortunately, this model explains only about 36% of the total variance in mpg, so we need to look for a better model.
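The 36% figure is the multiple R-squared, which can be reproduced by hand as one minus the ratio of the residual to the total sum of squares. A minimal sketch, assuming a fresh copy of mtcars with am left as its 0/1 code (this gives the same fit as the two-level factor); `rss`, `tss`, and `r2` are names introduced here.

```r
data(mtcars)

fit <- lm(mpg ~ am, data = mtcars)               # am kept as 0/1 here

rss <- sum(resid(fit)^2)                         # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares
r2  <- 1 - rss / tss                             # share of variance explained
print(r2)
```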

This model also indicates that the mean mpg with a manual transmission is about 7.245 (=\(\beta_{1}+\beta_{0}-\beta_{0}\)) higher than with an automatic transmission, which agrees with common sense. This answers the first question.
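With a single two-level regressor, the slope is simply the difference between the two group means. The sketch below checks this on a fresh copy of mtcars, assuming am is left as its 0/1 code (0 = automatic, 1 = manual); `group_means`, `diff_means`, and `slope` are names introduced here.

```r
data(mtcars)

# Mean mpg per transmission type; names are "0" (automatic) and "1" (manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
diff_means  <- group_means[["1"]] - group_means[["0"]]

# Slope of the one-regressor model
slope <- coef(lm(mpg ~ am, data = mtcars))[["am"]]

print(c(diff_means = diff_means, slope = slope))
```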

Since we added the extra regressors one at a time, we can track the variance inflation of the transmission coefficient and use nested model testing to check how the residual variance changes as each variable is added.
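For reference, the variance inflation factor of a regressor is 1/(1 - R²), where R² comes from regressing that regressor on the other regressors in the model. A minimal by-hand sketch for am in the four-regressor model, assuming a fresh copy of mtcars with am and vs left as 0/1 codes; `r2_am` and `vif_am` are names introduced here.

```r
data(mtcars)

# R^2 from regressing am (0/1) on the other regressors of the model
# mpg ~ am + carb + vs + wt
r2_am <- summary(lm(am ~ carb + vs + wt, data = mtcars))$r.squared

# Variance inflation factor of the am coefficient
vif_am <- 1 / (1 - r2_am)
print(vif_am)
```

A value near 1 means the other regressors explain little of am; larger values mean the standard error of the am coefficient is inflated by its overlap with them.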

library(car)
vifam <- data.frame(fit2=vif(fit2)[1], fit3=vif(fit3)[1], 
                    fit4=vif(fit4)[1], fit5=vif(fit5)[1])
print(vifam)
##        fit2     fit3     fit4     fit5
## am 1.003321 1.067446 2.792822 3.881481
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + carb
## Model 3: mpg ~ am + carb + vs
## Model 4: mpg ~ am + carb + vs + wt
## Model 5: mpg ~ am + carb + vs + wt + gear
##   Res.Df    RSS Df Sum of Sq       F   Pr(>F)    
## 1     30 720.90                                  
## 2     29 333.68  1    387.22 55.9205 6.12e-08 ***
## 3     28 245.65  1     88.03 12.7125 0.001436 ** 
## 4     27 183.48  1     62.18  8.9792 0.005935 ** 
## 5     26 180.04  1      3.44  0.4968 0.487172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-tests show that each added variable significantly reduces the residual sum of squares until gear is added (p = 0.487), which is consistent with our correlation analysis of the regressors.
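The last row of the table can be reproduced by hand: the F-statistic compares the drop in residual sum of squares from adding gear with the residual mean square of the larger model. A minimal sketch, assuming a fresh copy of mtcars with am and vs left as 0/1 codes; `rss4`, `rss5`, `df_extra`, `f_stat`, and `p_val` are names introduced here.

```r
data(mtcars)

fit4 <- lm(mpg ~ am + carb + vs + wt, data = mtcars)
fit5 <- lm(mpg ~ am + carb + vs + wt + gear, data = mtcars)

rss4 <- deviance(fit4)                              # RSS without gear
rss5 <- deviance(fit5)                              # RSS with gear
df_extra <- df.residual(fit4) - df.residual(fit5)   # regressors added (1)

# Partial F-statistic and its upper-tail p-value
f_stat <- ((rss4 - rss5) / df_extra) / (rss5 / df.residual(fit5))
p_val  <- pf(f_stat, df_extra, df.residual(fit5), lower.tail = FALSE)
print(c(F = f_stat, p = p_val))
```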

In the fourth regression model, the coefficient of the transmission term is 3.0699233. This means that, holding the number of carburetors, the engine type (V or straight), and the weight fixed, a manual transmission is still estimated to give about 3.07 more mpg than an automatic transmission. This answers the second question.
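A point estimate is easier to judge alongside its uncertainty. The sketch below extracts the transmission coefficient of the four-regressor model together with a 95% confidence interval. It is an illustrative addition, not part of the original analysis, and assumes a fresh copy of mtcars with am and vs left as 0/1 codes; `est` and `ci` are names introduced here.

```r
data(mtcars)

fit4 <- lm(mpg ~ am + carb + vs + wt, data = mtcars)

# Estimated manual-minus-automatic difference, other regressors held fixed
est <- coef(fit4)[["am"]]

# 95% confidence interval for that coefficient
ci <- confint(fit4, "am", level = 0.95)

print(est)
print(ci)
```

If the interval includes zero, the adjusted difference is not distinguishable from zero at the 5% level, even though the point estimate is positive.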

Conclusion

Our final model is model 4 (mpg = \(\beta_{0}\) + \(\beta_{1}\)am + \(\beta_{2}\)carb + \(\beta_{3}\)vs + \(\beta_{4}\)wt). The residuals of this model are plotted below:

plot(resid(fit4),  main="residual of regression model 4", type="b", xlab="", ylab="")
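A residuals-versus-fitted plot is a common complement to the index plot above, since it makes any trend or change in spread across the fitted values easier to see. The sketch below is an assumption about a useful extra check, not part of the original report. It refits model 4 on a fresh copy of mtcars (am and vs as 0/1 codes) so that it runs on its own.

```r
data(mtcars)

fit4 <- lm(mpg ~ am + carb + vs + wt, data = mtcars)

# Residuals against fitted values; a patternless band around zero
# suggests the linear form is adequate
plot(fitted(fit4), resid(fit4),
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs fitted, regression model 4")
abline(h = 0, lty = 2)   # zero reference line
```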