1. Overview

We performed a regression analysis of the mtcars data, which comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Because the covariates are strongly multicollinear, the 1st main question is not answerable. We selected 2 models based on the multicollinearity in the data and on the p values of the coefficients. Both models are significant overall, but each is limited in that it cannot explain every variable significantly.

2. Analysis

About the mtcars data set

The mtcars data set is composed of these variables.

* mpg : Miles/(US) gallon
* cyl : Number of cylinders
* disp : Displacement (cu.in.)
* hp : Gross horsepower
* drat : Rear axle ratio
* wt : Weight (1000 lbs)
* qsec : 1/4 mile time
* vs : Engine (0 for V-shaped, and 1 for straight)
* am : Transmission (0 for automatic, and 1 for manual)
* gear : Number of forward gears
* carb : Number of carburetors

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Every variable is stored as numeric. The variable am, which both main questions refer to, takes only the values 0 and 1. Appendix 1, ‘Pairs plot of mtcars data’, shows the scatter plots and correlations between the variables, drawn with the GGally package. Almost every variable is correlated with the others, which is why we were concerned about multicollinearity. We used lm() to fit a linear model with mpg as the outcome and all other variables as covariates, and the car package to compute the VIF of each covariate. Regressing with mpg as the outcome is reasonable because the points in the Q-Q plot in Appendix 2 lie close to the line; that is, mpg is approximately normally distributed. The last few quantiles lie above the line because mpg is skewed to the right. (See the accompanying box plot.)
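As a numerical companion to the pairs plot, the following sketch tabulates am and ranks the covariates by the absolute value of their correlation with mpg. It uses only base R. The object names am_counts and cor_with_mpg are introduced here for illustration and are not used elsewhere in the analysis.

```r
# Count how many cars fall in each transmission group (0 = automatic, 1 = manual).
am_counts <- table(mtcars$am)
am_counts

# Pearson correlation of every covariate with mpg, strongest first.
cor_all <- cor(mtcars)
cor_with_mpg <- cor_all["mpg", colnames(cor_all) != "mpg"]
cor_with_mpg <- cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)]
round(cor_with_mpg, 2)
```

Several covariates having large absolute correlations with mpg, and with each other in the pairs plot, is the pattern that motivates the VIF check.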

library(car) ; fit_1st <- lm(mpg ~ ., mtcars) ; vif(fit_1st)
##       cyl      disp        hp      drat        wt      qsec        vs 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873 
##        am      gear      carb 
##  4.648487  5.357452  7.908747
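The VIF of a covariate can also be obtained without the car package: regress that covariate on all the other covariates and compute 1 / (1 - R²) from that auxiliary fit. The sketch below does this for every covariate. manual_vif is a hypothetical helper name introduced only for illustration, and the result should agree with vif(fit_1st) for a model like this one, in which every term is a single numeric column.

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# covariate j on all remaining covariates (the outcome is excluded).
manual_vif <- function(data, outcome) {
  covariates <- setdiff(names(data), outcome)
  sapply(covariates, function(v) {
    others <- setdiff(covariates, v)
    aux_fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux_fit)$r.squared)
  })
}

vif_by_hand <- manual_vif(mtcars, outcome = "mpg")
round(vif_by_hand, 2)
```

Read this way, a VIF of about 21.6 for disp means that more than 95% of the variance of disp is explained by the other covariates.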

The VIF values generally exceed 3.3 and are especially high for cyl and disp. Since multicollinearity inflates the standard errors of the coefficients and lowers the accuracy of the linear model, we should build the model from the variables that have lower p values than the others. This is also why the 1st question, “Is an automatic or manual transmission better for MPG?”, is not answerable. When we examine mpg, we cannot consider the am variable alone; the other variables must be taken into account as well.

Models

summary(fit_1st)$coef[,4]
## (Intercept)         cyl        disp          hp        drat          wt 
##  0.51812440  0.91608738  0.46348865  0.33495531  0.63527790  0.06325215 
##        qsec          vs          am        gear        carb 
##  0.27394127  0.88142347  0.23398971  0.66520643  0.81217871

As shown above, wt, am, and qsec have lower p values than the other variables. We therefore built a 2nd linear model with these 3 variables as covariates and mpg as the outcome. Before fitting it, we look at the relationship among these 4 variables in the scatter plot in Appendix 3. mpg is higher when am is 1 than when it is 0, and it increases as qsec increases. Finally, mpg decreases as wt increases.
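The ranking of the p values is easier to check when they are sorted. This sketch refits the full model so that it runs on its own; fit_full, p_values, and p_sorted are names introduced here for illustration.

```r
# Refit the full model (same specification as fit_1st) and extract the p values.
fit_full <- lm(mpg ~ ., data = mtcars)
p_values <- summary(fit_full)$coefficients[, "Pr(>|t|)"]

# Drop the intercept and sort from smallest to largest p value.
p_sorted <- sort(p_values[names(p_values) != "(Intercept)"])
round(p_sorted, 3)

# The three covariates with the smallest p values.
names(p_sorted)[1:3]
```

None of these p values is below 0.05 in the full model, which is consistent with the inflated standard errors expected under multicollinearity.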

fit_2nd <- lm(mpg ~ am + wt + qsec, mtcars) ; sm2nd <- summary(fit_2nd) ; sm2nd$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## am           2.935837  1.4109045  2.080819 4.671551e-02
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
(F2nd <- 1-pf(sm2nd$fstatistic[1], sm2nd$fstatistic[2], sm2nd$fstatistic[3]))
##        value 
## 1.210454e-11

As we can see, all 3 covariates are significant in this model at the 0.05 level; only the intercept is not. We take this to mean that the model is better suited to explaining the manual transmission data than the automatic data. In this model, the intercept 9.62 is the value of mpg when am, wt, and qsec are all 0, and mpg increases by 2.94 when am changes from 0 to 1 with wt and qsec held fixed; i.e., the adjusted difference in mpg between manual and automatic transmissions is about 2.94. In addition, per unit change, mpg is more sensitive to am and wt than to qsec. F2nd is the p value of the F statistic of fit_2nd, and it indicates that the model as a whole is significant. However, in Appendix 4, the residuals versus fitted plot shows a roughly quadratic pattern, and the scale-location plot also has a trend. To address these, we applied log() to the wt variable.
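To make the meaning of the am coefficient concrete, the sketch below predicts mpg for two hypothetical cars that share the same wt and qsec and differ only in am. Setting wt and qsec to their sample means is an arbitrary choice made for illustration. The sketch also reports a 95% confidence interval for the am coefficient with confint().

```r
fit_2nd <- lm(mpg ~ am + wt + qsec, data = mtcars)

# Two hypothetical cars: identical weight and quarter-mile time,
# different transmission (0 = automatic, 1 = manual).
new_cars <- data.frame(
  am   = c(0, 1),
  wt   = mean(mtcars$wt),
  qsec = mean(mtcars$qsec)
)
pred <- predict(fit_2nd, newdata = new_cars)

# The gap between the two predictions equals the am coefficient.
am_gap <- unname(pred[2] - pred[1])
am_gap

# 95% confidence interval for the am coefficient.
confint(fit_2nd, "am", level = 0.95)
```

Because the model is linear with no interaction terms, the gap is the same whatever common values of wt and qsec are chosen.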

fit_3rd <- lm(mpg ~ am + log(wt) + qsec, mtcars) ; sm3rd <- summary(fit_3rd) ; sm3rd$coef
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  16.462531  6.5532803  2.512105 1.804195e-02
## am            1.597278  1.3232425  1.207094 2.374949e-01
## log(wt)     -14.197957  2.0577796 -6.899649 1.687593e-07
## qsec          1.059212  0.2600559  4.073015 3.453888e-04
(F3rd <- 1-pf(sm3rd$fstatistic[1], sm3rd$fstatistic[2], sm3rd$fstatistic[3]))
##        value 
## 3.259615e-13

Conversely, in this model every term except am is significant. That is, we consider this 3rd model better than the 2nd model for predicting the automatic transmission data. The difference between automatic and manual transmissions is about 1.60 in this model. Also, mpg is most sensitive to log(wt) and least sensitive to qsec. The p value of this model’s F statistic is also significant. In Appendix 5, the trend lines in the residuals versus fitted plot and the scale-location plot are flatter than those of the 2nd model.
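One way to compare the 2nd and 3rd models numerically, in addition to the diagnostic plots, is to look at adjusted R² and AIC. Both models have the same outcome and the same number of coefficients, so the comparison is direct. This is a supplementary sketch rather than part of the analysis above; adj_r2 and mpg_change_10pct are names introduced here. The last lines translate the log(wt) coefficient into the expected change in mpg for a 10% increase in weight.

```r
fit_2nd <- lm(mpg ~ am + wt + qsec, data = mtcars)
fit_3rd <- lm(mpg ~ am + log(wt) + qsec, data = mtcars)

# Adjusted R-squared: higher means more variance explained
# after penalising the number of covariates.
adj_r2 <- c(
  fit_2nd = summary(fit_2nd)$adj.r.squared,
  fit_3rd = summary(fit_3rd)$adj.r.squared
)
adj_r2

# AIC: lower is better; comparable here because both models share the outcome mpg.
AIC(fit_2nd, fit_3rd)

# Expected change in mpg when wt increases by 10%, other covariates held fixed.
mpg_change_10pct <- coef(fit_3rd)[["log(wt)"]] * log(1.10)
mpg_change_10pct
```

With a logged covariate, the coefficient describes the effect of a proportional change in weight, which is why it is not directly comparable in size to the wt coefficient of the 2nd model.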

3. Result

Since multicollinearity exists, whether mpg is better with an automatic or a manual transmission is not answerable. The difference between automatic and manual transmissions is about 2.94 in fit_2nd and 1.60 in fit_3rd. However, each model has a shortcoming in explaining both transmissions.

Appendix

Appendix 1 : Pairs plot of mtcars data

library(GGally) ; ggpairs(mtcars, lower=list(continuous="smooth"))

Appendix 2 : Q-Q plot and Box plot of mpg

par(mfrow=c(1,2)) ; qqnorm(mtcars$mpg) ; qqline(mtcars$mpg) ; boxplot(mtcars$mpg)

Appendix 3 : Scatter plot of mtcars data

library(ggplot2) ; g <- ggplot(mtcars, aes(x=qsec, y=mpg))
g + geom_point(aes(color=wt)) + facet_wrap(~am, ncol=2)

Appendix 4 : Diagnostic plot for 2nd linear model

par(mfrow=c(2,2))
plot(fit_2nd)

Appendix 5 : Diagnostic plot for 3rd linear model

par(mfrow=c(2,2))
plot(fit_3rd)