Synopsis

This analysis examines the popular mtcars dataset, a collection of cars, by exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. It proposes to answer the following questions:

  • Is an automatic or a manual transmission better for MPG?
  • What is the MPG difference between automatic and manual transmissions?

Executive summary

The analysis proceeds as follows:

  1. Perform exploratory analysis to summarize the relationships between the variables of the dataset.
  2. Fit a linear regression model by subset selection, optimizing the AIC and R2 indicators.
  3. Check the model with residual analysis and diagnostics.
  4. Answer both questions in the conclusion.

Exploratory analysis

The mtcars dataset contains 32 observations and 11 variables. The am variable encodes the transmission type: automatic (0) or manual (1).
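As a quick check of this encoding and of the group sizes (not shown in the original report), the number of cars per transmission type can be tabulated:

table(mtcars$am)  # counts per transmission type (0 = automatic, 1 = manual)

This gives 19 automatic and 13 manual cars.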

head(mtcars,5)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Appendix A gives an overview of the dataset variables: their histograms and their pairwise relationships, shown with scatter plots and correlation values.

A first look at the relationship between the mpg and am variables suggests that a manual (1) transmission has, on average, a higher MPG than an automatic (0) transmission.

library(ggplot2)  # for ggplot()
ggplot(mtcars, aes(y = mpg, x = am)) + geom_point(size = 1)  # mpg by transmission type
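As a quick numeric complement to this plot (a minimal sketch, not part of the original report), the group means of mpg by transmission type can be computed directly:

# Mean mpg per transmission type (0 = automatic, 1 = manual)
aggregate(mpg ~ am, data = mtcars, FUN = mean)

The resulting means, about 17.1 MPG for automatic and 24.4 MPG for manual, match the intercept and intercept-plus-slope of the am-only model fitted below.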

Regression modelling

This section first fits simple linear models for the MPG outcome, using either the am variable alone or all variables, and then defines an optimized linear model based on a minimal subset of well-selected variables. Model evaluation is based on the R2 indicator, the significance of the coefficients (95% confidence intervals), and the AIC criterion.

All- and one-variable models

linear.model.all <- lm(mpg ~ ., data = mtcars)
coefficients(linear.model.all)
## (Intercept)         cyl        disp          hp        drat          wt 
## 12.30337416 -0.11144048  0.01333524 -0.02148212  0.78711097 -3.71530393 
##        qsec          vs          am        gear        carb 
##  0.82104075  0.31776281  2.52022689  0.65541302 -0.19941925
confint(linear.model.all, level=0.95)
##                    2.5 %      97.5 %
## (Intercept) -26.62259745 51.22934576
## cyl          -2.28468553  2.06180457
## disp         -0.02380146  0.05047194
## hp           -0.06675236  0.02378812
## drat         -2.61383350  4.18805545
## wt           -7.65495413  0.22434628
## qsec         -0.69883421  2.34091571
## vs           -4.05880242  4.69432805
## am           -1.75681208  6.79726585
## gear         -2.44999107  3.76081711
## carb         -1.92290442  1.52406591

The coefficients of the all-variables model are not significant: each 95% confidence interval contains zero.

summary(linear.model.all)$r.squared
## [1] 0.8690158

The R2 indicator is 0.8690158. This large value, combined with the poor significance of the coefficients, suggests overfitting, with estimated standard errors inflated by putting many correlated variables together.

linear.model.am <- lm(mpg ~ am, data = mtcars)
coefficients(linear.model.am)
## (Intercept)          am 
##   17.147368    7.244939
t(confint(linear.model.am, level=0.95))
##        (Intercept)       am
## 2.5 %     14.85062  3.64151
## 97.5 %    19.44411 10.84837

The coefficient of the am regressor is significantly above zero. This simple model says that a manual transmission has, on average, 7.2449393 MPG more than an automatic transmission.

summary(linear.model.am)$r.squared
## [1] 0.3597989

The R2 indicator is 0.3597989; this small value means MPG does not depend on the transmission alone.

Neither of these naive linear models gives satisfying results. Omitting variables biases the coefficients of interest unless the included regressors are uncorrelated with the omitted ones, which is not the case here. Including variables that should be left out inflates the standard errors of the regression coefficients.
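One way to make this inflation visible, as an additional check not in the original report, is to compute variance inflation factors for the all-variables model with the car package:

library(car)           # provides vif()
vif(linear.model.all)  # values above roughly 5-10 flag strongly collinear regressors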

A better model

A better model should include a well-selected subset of the variables. The variable selection is done stepwise, backward and forward. The metric is the AIC criterion:

  • AIC = −2 log L(M) + 2k
  • L(M): the maximum of the likelihood function of model M
  • k: number of parameters
  • n: number of observations (32)

When comparing models fitted by maximum likelihood to the same data, the smaller the AIC, the better the fit.
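As an illustration of this formula (a small check, not part of the original report), the AIC of the am-only model can be computed by hand and compared with R's built-in AIC():

ll <- logLik(linear.model.am)  # maximized log-likelihood of the model
k <- attr(ll, "df")            # estimated parameters: 2 coefficients + residual variance
-2 * as.numeric(ll) + 2 * k    # the formula above
AIC(linear.model.am)           # built-in AIC, same value

Note that stepAIC() used below relies on extractAIC(), whose values differ from AIC() by an additive constant depending only on n; this does not affect the ranking of models fitted to the same data.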

Stepwise regression

Forward stepwise selection starts with an intercept-only model and then adds variables to the model, one at a time, testing the addition of each variable with the AIC criterion. At each step, the candidate variable that gives the largest improvement in fit is added. This process is repeated until no variable improves the model any further.

The backward method, on the other hand, starts with the full model and removes variables one at a time, at each step dropping the variable whose removal most improves (lowers) the AIC.
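For illustration, a forward-only search could be run as follows (a minimal sketch; the report itself uses direction = "both" just below):

library(MASS)                           # provides stepAIC()
fit.null <- lm(mpg ~ 1, data = mtcars)  # intercept-only starting model
stepAIC(fit.null,
        scope = list(lower = ~ 1,
                     upper = ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb),
        direction = "forward", trace = FALSE)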

fit <- stepAIC(linear.model.all, direction="both", steps=32, trace=FALSE)
fit
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt         qsec           am  
##       9.618       -3.917        1.226        2.936

The R2 indicator gives 0.8496636.
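This value comes from the summary of the stepwise-selected model:

summary(fit)$r.squared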

This last model says a manual transmission has, on average, 2.9358372 MPG more than an automatic transmission car, holding weight (wt) and 1/4 mile time (qsec) constant.

Interaction models

The model fit could be improved by adding some interaction between the selected variables. There are four interaction model candidates:

fit.int <- lm(mpg ~ wt * qsec * am, data = mtcars)
fit.int.aq <- lm(mpg ~ wt + qsec * am, data = mtcars)
fit.int.aw <- lm(mpg ~ qsec + wt * am, data = mtcars)
fit.int.wq <- lm(mpg ~ am + qsec * wt, data = mtcars)
# Store the model summaries; their r.squared and coefficients are used below
fit.int.sum <- summary(fit.int); fit.int.aq.sum <- summary(fit.int.aq)
fit.int.aw.sum <- summary(fit.int.aw); fit.int.wq.sum <- summary(fit.int.wq)

The R2 indicators for each model candidate are the following:

c(fit.int.sum$r.squared, fit.int.aq.sum$r.squared, fit.int.aw.sum$r.squared, fit.int.wq.sum$r.squared)
## [1] 0.9039610 0.8721092 0.8958514 0.8560765

The interaction model that includes all the selected variables has the highest R2 value: 0.903961. Nevertheless, the model with the second highest R2, 0.8958514, is chosen for its simpler interpretability.
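As an additional check not in the original report, the chosen interaction model can be compared with the stepwise model by a nested-model F test; since only the wt:am term is added, the F-test p-value equals the p-value of that coefficient in the table below.

# fit (mpg ~ wt + qsec + am) is nested in fit.int.aw (mpg ~ qsec + wt * am)
anova(fit, fit.int.aw)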

The regression model for the MPG outcome based on a subset of the variables is the following:

fit.int.aw.sum$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.723053  5.8990407  1.648243 0.1108925394
## qsec         1.016974  0.2520152  4.035366 0.0004030165
## wt          -2.936531  0.6660253 -4.409038 0.0001488947
## am          14.079428  3.4352512  4.098515 0.0003408693
## wt:am       -4.141376  1.1968119 -3.460340 0.0018085763

  • when am = 0 (automatic), the slope of wt is negative, -2.9365309, and the intercept is 9.7230526
  • when am = 1 (manual), the slope of wt stays negative, -2.9365309 - 4.1413764 = -7.0779073, and the intercept becomes 9.7230526 + 14.0794278 = 23.8024804 (recomputed in the sketch below)
  • in both cases the slope of qsec is 1.0169736, very close to one.
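The manual-transmission intercept and wt slope quoted above can be recomputed directly from the fitted coefficients (a small verification sketch):

cf <- coef(fit.int.aw)
cf["(Intercept)"] + cf["am"]  # intercept when am = 1: 9.723053 + 14.079428
cf["wt"] + cf["wt:am"]        # wt slope when am = 1: -2.936531 - 4.141376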

This last model keeps the signs of the coefficients and says a manual transmission has, on average, 14.0794278 MPG more than an automatic transmission car, holding weight (wt) and 1/4 mile time (qsec) constant; because of the wt:am interaction, this advantage shrinks by 4.141376 MPG per unit of wt (1000 lbs).

t(confint(fit.int.aw, level=0.95))
##        (Intercept)      qsec        wt        am     wt:am
## 2.5 %    -2.380779 0.4998811 -4.303102  7.030875 -6.597032
## 97.5 %   21.826884 1.5340661 -1.569960 21.127981 -1.685721

Residual analysis

Some plots for residual diagnostics are shown in Figure 2 (Appendix B). There is no particular pattern in the residuals vs fitted, scale-location, and residuals vs leverage plots. The Q-Q plot suggests the residuals are slightly right-skewed, but the Shapiro-Wilk normality test does not reject normality.

We use 2k/n as a threshold for the hat values; there are four high-leverage points, but according to dfbeta() they are not very influential on the model.
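For reference, the threshold itself is computed as follows (a quick sketch, not part of the original output):

k <- length(coef(fit.int.aw))  # 5 coefficients: intercept, qsec, wt, am, wt:am
n <- nrow(mtcars)              # 32 observations
2 * k / n                      # 0.3125, i.e. the 2*5/32 used below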

shapiro.test(fit.int.aw$res)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit.int.aw$res
## W = 0.94444, p-value = 0.1001
round(hatvalues(fit.int.aw)[hatvalues(fit.int.aw) > 2*5/32], 2)
##            Merc 230 Lincoln Continental        Lotus Europa 
##                0.35                0.32                0.33 
##       Maserati Bora 
##                0.37
round(dfbeta(fit.int.aw)[which(hatvalues(fit.int.aw) > 2*5/32), ], 2)
##                     (Intercept)  qsec    wt    am wt:am
## Merc 230                   1.53 -0.09  0.01  0.46 -0.19
## Lincoln Continental        1.87 -0.03 -0.37 -1.14  0.30
## Lotus Europa               0.12 -0.01  0.00  0.10 -0.04
## Maserati Bora              0.38 -0.02 -0.01 -1.37  0.64

Conclusion

The different indicators say the proposed model fits the data quite well. Four high-leverage points have been identified; they do not influence the model much, but they should be kept in mind when doing inference or prediction.

According to these models, a manual transmission car gets, on average, 2.9358372 MPG (14.0794278 MPG for the interaction model) more than an automatic transmission car, holding weight and 1/4 mile time constant.

Appendix

A. Variables relationships

library(GGally)  # provides ggpairs()
ggpairs(mtcars, lower = list(continuous = "smooth"), diag = list(continuous = "barDiag"))

B. Residuals analysis

par(mfrow = c(2, 2)); plot(fit.int.aw)  # the four standard lm diagnostic plots (Figure 2)