This analysis examines the popular mtcars dataset of a collection of cars by exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). It proposes to answer the following questions:
The scenario of this analysis is the following:
The mtcars dataset contains 32 observations and 11 variables. The am variable means automatic(0) versus manual(1) transmission.
head(mtcars,5)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
The appendix A gives an overview of the variables of the dataset, their values histogram and their binary relationships with each other with the help of scatter plots and correlation values.
A quick look at the relation between mpg and am variables prematurely says that manual(1) transmission has in average higher MPG than automatic(0) transmission.
ggplot(mtcars, aes(y=mpg, x=am)) + geom_point(size = 1)
This section starts to fit simple linear models for the MPG outcome using the am only or all variables, and finally defines a optimized linear model based on a minimal subset of well selected variables. Model evaluation is based on the following factor:
linear.model.all <- lm(mpg ~ ., data = mtcars)
coefficients(linear.model.all)
## (Intercept) cyl disp hp drat wt
## 12.30337416 -0.11144048 0.01333524 -0.02148212 0.78711097 -3.71530393
## qsec vs am gear carb
## 0.82104075 0.31776281 2.52022689 0.65541302 -0.19941925
confint(linear.model.all, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -26.62259745 51.22934576
## cyl -2.28468553 2.06180457
## disp -0.02380146 0.05047194
## hp -0.06675236 0.02378812
## drat -2.61383350 4.18805545
## wt -7.65495413 0.22434628
## qsec -0.69883421 2.34091571
## vs -4.05880242 4.69432805
## am -1.75681208 6.79726585
## gear -2.44999107 3.76081711
## carb -1.92290442 1.52406591
The coefficients of the all-variables model are not significant and close to the zero value as shown by the 95% confidence interval.
summary(linear.model.all)$r.squared
## [1] 0.8690158
The R2 indicator gives 0.8690158, this large value joined to poor significance of the coefficients tends to suggest overfitting with inflated estimated standard error due to multiple variables put together.
linear.model.am <- lm(mpg ~ am, data = mtcars)
coefficients(linear.model.am)
## (Intercept) am
## 17.147368 7.244939
t(confint(linear.model.am, level=0.95))
## (Intercept) am
## 2.5 % 14.85062 3.64151
## 97.5 % 19.44411 10.84837
The coefficient of the am regressor is significantly over the zero value. This simple model says manual transmission has, in average, a higher MPG than automatic transmission by a factor of 7.2449393.
summary(linear.model.am)$r.squared
## [1] 0.3597989
The R2 indicator gives 0.3597989, this small value means MPG does not only depend on the transmission.
None of both naiv linear models gives satisfying results. Omitting variables results in bias in the coeficients of interest, unless their regressors are uncorrelated with the omitted ones, this last point is not the case here. Including variables that we should not have increases standard errors of the regression variables.
A better model should include a well-selected subset of the variables. The variable selection is done stepwise, backward and forward. The metric is the AIC criteria:
When comparing models fitted by maximum likelihood to the same data, in a stepwise way, the smaller the AIC, the better the fit.
Forward stepwise selection starts with a intercept-only model, and then adds variables to the model, one at the time, testing the addition of each variable using the AIC fit criterion. At each step the next candidate variable that gives the most statistically significant improvement to the fit is added to the model. This process is repeated until no variable improves the model to a statistically significant extent.
Backward method, on the other hand, starts with the full model, and then removes the least statistically significant variable, one at the time.
fit <- stepAIC(linear.model.all, direction="both", steps=32, trace=FALSE)
fit
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Coefficients:
## (Intercept) wt qsec am
## 9.618 -3.917 1.226 2.936
The R2 indicator gives 0.8496636.
This last model says a manual transmission has, in average, 2.9358372 MPG more than an automatic transmission car, holding values of weights (wt) and 1/4 mile time (qsec) constant.
The model fit could be improved by adding some interaction between the selected variables. There are four interaction model candidates:
fit.int <- lm(mpg ~ wt * qsec * am, data = mtcars)
fit.int.aq <- lm(mpg ~ wt + qsec * am, data = mtcars)
fit.int.aw <- lm(mpg ~ qsec + wt * am, data = mtcars)
fit.int.wq <- lm(mpg ~ am + qsec * wt, data = mtcars)
The R2 indicators for each model candidate are the following:
c(fit.int.sum$r.squared, fit.int.aq.sum$r.squared, fit.int.aw.sum$r.squared, fit.int.wq.sum$r.squared)
## [1] 0.9039610 0.8721092 0.8958514 0.8560765
The interaction model that includes all the selected variables is the one with the highest R2 indicator value: 0.903961. Nevertheless the one having the second highest R2 indicator 0.8958514 is chosen for simpler interpretability.
The regression model for the MPG outcome based on a subset of the variables is the following:
fit.int.aw.sum$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.723053 5.8990407 1.648243 0.1108925394
## qsec 1.016974 0.2520152 4.035366 0.0004030165
## wt -2.936531 0.6660253 -4.409038 0.0001488947
## am 14.079428 3.4352512 4.098515 0.0003408693
## wt:am -4.141376 1.1968119 -3.460340 0.0018085763
This last model keeps the signs of the coefficients and says a manual transmission has, in average, 14.0794278 MPG more than an automatic transmission car, holding values of weights (wt) and 1/4 mile time (qsec) constant.
t(confint(fit.int.aw, level=0.95))
## (Intercept) qsec wt am wt:am
## 2.5 % -2.380779 0.4998811 -4.303102 7.030875 -6.597032
## 97.5 % 21.826884 1.5340661 -1.569960 21.127981 -1.685721
Some plots for residual diagnostics are shown in Figure 2. There is no particular pattern in residuals vs fitted, scale-location, and residuals vs leverage plots. For QQ-plot, it seems that the residual is a little bit right skewed, but it still can be seen as normal from Shapiro-Wilk normality test.
We use 2∗k/n as a threshold for hat values, and there are four high leverage points, but according to dfbeta(), they are not so influential to our model.
shapiro.test(fit.int.aw$res)
##
## Shapiro-Wilk normality test
##
## data: fit.int.aw$res
## W = 0.94444, p-value = 0.1001
round(hatvalues(fit.int.aw)[hatvalues(fit.int.aw) > 2*5/32], 2)
## Merc 230 Lincoln Continental Lotus Europa
## 0.35 0.32 0.33
## Maserati Bora
## 0.37
round(dfbeta(fit.int.aw)[which(hatvalues(fit.int.aw) > 2*5/32), ], 2)
## (Intercept) qsec wt am wt:am
## Merc 230 1.53 -0.09 0.01 0.46 -0.19
## Lincoln Continental 1.87 -0.03 -0.37 -1.14 0.30
## Lotus Europa 0.12 -0.01 0.00 0.10 -0.04
## Maserati Bora 0.38 -0.02 -0.01 -1.37 0.64
The different indicators says the proposed model fits the data pretty well. Four high leverage points have been discovered that do not influence the model too much. Nevertheless they have to be surveyed when doing inference or prediction.
Using this model, in average, has more than 2.9358372 (14.0794278) miles per gallon more than an automatic transmission car, holding values of weights and 1/4 mile time constant.
ggpairs(mtcars, lower = list(continuous = "smooth"), diag = list(continuous = "barDiag"))
par(mfrow = c(2, 2)); plot(fit.int.aw)