We performed a regression analysis of the mtcars data, which comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Since these data have a multicollinearity problem, the first main question is not answerable on its own. We selected two models, based on the data's multicollinearity and on the p-values. Both models are significant overall, but in each of them some coefficients remain insignificant.
## mtcars data set

The data set mtcars is composed of the following variables:

* mpg : Miles/(US) gallon
* cyl : Number of cylinders
* disp : Displacement (cubic inches)
* hp : Gross horsepower
* drat : Rear axle ratio
* wt : Weight (1000 lbs)
* qsec : 1/4 mile time
* vs : Engine shape (0 for V-shaped, and 1 for straight)
* am : Transmission (0 for automatic, and 1 for manual)
* gear : Number of forward gears
* carb : Number of carburetors
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Every variable is stored as num. The variable am, which appears in both main questions, takes only the values 0 and 1. Appendix 1, ‘Pairs plot of mtcars data’, shows the scatter plots and correlations between the variables, drawn with the GGally package. Every variable is related to the others, which is why we were concerned about multicollinearity. We used lm() to fit a linear model with mpg as the outcome and all other variables as covariates, and the car package to compute the VIFs of that model. Regressing on mpg as the outcome is reasonable because the points in the Q-Q plot in Appendix 2 lie fairly close to the line; mpg is approximately normally distributed. The last few quantiles sit above the line because mpg is right-skewed (see the accompanying box plot).
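As a numerical complement to the Q-Q plot, a Shapiro-Wilk test could be run on mpg; this test is not part of the original analysis and is only a sketch of how the "approximately normal" impression might be checked.

# Shapiro-Wilk normality test for mpg (not in the original analysis);
# a p-value above 0.05 would be consistent with treating mpg as roughly normal.
shapiro.test(mtcars$mpg)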
# Full model: mpg as outcome, all other variables as covariates; then check VIFs
library(car) ; fit_1st <- lm(mpg ~ . , mtcars) ; vif(fit_1st)
## cyl disp hp drat wt qsec vs
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873
## am gear carb
## 4.648487 5.357452 7.908747
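For reference, each VIF above can be reproduced by hand: regress one covariate on the remaining covariates and apply 1/(1 - R^2). A minimal sketch for cyl (the helper object r2_cyl is ours, not part of the original code):

# VIF of cyl by hand: R^2 from regressing cyl on the other covariates (mpg excluded)
r2_cyl <- summary(lm(cyl ~ . - mpg, data = mtcars))$r.squared
1 / (1 - r2_cyl)  # should match vif(fit_1st)["cyl"], about 15.37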
The VIF values generally exceed 3.3, and they are especially high for cyl and disp. Since multicollinearity reduces the precision of a linear model's coefficient estimates, we should build a model from the variables whose p-values are smaller than the others'. This is also why the first question, “Is an automatic or manual transmission better for MPG?”, is not answerable on its own: when we reason about mpg, we cannot consider the am variable alone and ignore the other variables.

### Models
summary(fit_1st)$coef[,4]
## (Intercept) cyl disp hp drat wt
## 0.51812440 0.91608738 0.46348865 0.33495531 0.63527790 0.06325215
## qsec vs am gear carb
## 0.27394127 0.88142347 0.23398971 0.66520643 0.81217871
As shown above, wt, am, and qsec have smaller p-values than the other variables, so we built a second linear model with these three variables as covariates and mpg as the outcome. Before fitting it, consider the relationships among these four variables in the scatter plot in Appendix 3: mpg is higher when am is 1 than when it is 0, it increases as qsec increases, and it is inversely related to wt.
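The same relationships can be summarised numerically with a correlation matrix; this is a small supplementary check rather than part of the original output.

# Pairwise correlations of the four variables in the 2nd model:
# positive for am and qsec, negative for wt, matching the scatter plot.
round(cor(mtcars[, c("mpg", "am", "wt", "qsec")]), 2)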
# 2nd model: mpg on am, wt and qsec
fit_2nd <- lm(mpg ~ am + wt + qsec, mtcars) ; sm2nd <- summary(fit_2nd) ; sm2nd$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## am 2.935837 1.4109045 2.080819 4.671551e-02
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
# p-value of the overall F statistic of fit_2nd (fstatistic = value, numdf, dendf)
(F2nd <- 1-pf(sm2nd$fstatistic[1], sm2nd$fstatistic[2], sm2nd$fstatistic[3]))
## value
## 1.210454e-11
As we can see, all three covariates are significant in this model, while the intercept is not; in this sense the model describes the manual-transmission data better than the automatic. Holding wt and qsec fixed, mpg increases by about 2.94 when am changes from 0 (automatic) to 1 (manual); i.e. the estimated difference in mpg between automatic and manual transmissions is about 2.94. (The intercept, 9.62, is the expected mpg when all covariates are zero.) Judging by the coefficient magnitudes, mpg is more sensitive to changes in am and wt than to changes in qsec. F2nd is the p-value of the overall F statistic of fit_2nd, so the model as a whole is significant. However, in Appendix 4 the residuals-vs-fitted plot shows a quadratic-looking pattern and the scale-location plot also has a trend. To address this, we applied log() to the wt variable.
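Before refitting with log(wt), confidence intervals give a sense of the uncertainty around the 2.94 estimate for am. This is a one-line addition, not part of the original output:

# 95% confidence intervals for the fit_2nd coefficients;
# the row for am brackets the quoted 2.94 difference.
confint(fit_2nd)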
# 3rd model: wt replaced by log(wt)
fit_3rd <- lm(mpg ~ am + log(wt) + qsec, mtcars) ; sm3rd <- summary(fit_3rd) ; sm3rd$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.462531 6.5532803 2.512105 1.804195e-02
## am 1.597278 1.3232425 1.207094 2.374949e-01
## log(wt) -14.197957 2.0577796 -6.899649 1.687593e-07
## qsec 1.059212 0.2600559 4.073015 3.453888e-04
# p-value of the overall F statistic of fit_3rd
(F3rd <- 1-pf(sm3rd$fstatistic[1], sm3rd$fstatistic[2], sm3rd$fstatistic[3]))
## value
## 3.259615e-13
Conversely, in this model every variable except am is significant; in that sense the third model is better than the second for describing the automatic-transmission data. The estimated difference between automatic and manual transmissions is about 1.60 here. Judging again by the coefficient magnitudes, mpg is most sensitive to log(wt) and least sensitive to qsec. The p-value of this model's F statistic is also highly significant. In Appendix 5, the lines in the two diagnostic plots are flatter than in the corresponding plots for the second model.
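Since the two models differ only in using wt versus log(wt), they are not nested, so an F test between them is not appropriate; a hedged way to compare them numerically (not done in the original analysis) is via adjusted R^2 and AIC.

# Adjusted R^2 (higher is better) and AIC (lower is better) for the two models.
c(adj_r2_2nd = sm2nd$adj.r.squared, adj_r2_3rd = sm3rd$adj.r.squared)
AIC(fit_2nd, fit_3rd)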
Because of the multicollinearity, the question of whether an automatic or manual transmission is better for mpg cannot be answered on its own. The estimated difference between automatic and manual transmissions is about 2.94 in fit_2nd and about 1.60 in fit_3rd. However, each model has shortcomings in explaining both transmission types.

# Appendix

## Appendix 1 : Pairs plot of mtcars data
library(GGally) ; ggpairs(mtcars, lower=list(continuous="smooth"))
## Appendix 2 : Normal Q-Q plot and box plot of mpg

par(mfrow=c(1,2)) ; qqnorm(mtcars$mpg) ; qqline(mtcars$mpg) ; boxplot(mtcars$mpg)
## Appendix 3 : Scatter plot of mtcars data

library(ggplot2) ; g <- ggplot(mtcars, aes(x=qsec, y=mpg))
g + geom_point(aes(color=wt)) + facet_wrap(~am, ncol=2)
## Appendix 4 : Diagnostic plots of fit_2nd

par(mfrow=c(2,2))
plot(fit_2nd)
## Appendix 5 : Diagnostic plots of fit_3rd

par(mfrow=c(2,2))
plot(fit_3rd)