Course Project Regression Models

Executive summary

In this report the analysis of the mtcars data is documented, which has been used to analyse and quantify the effect transmission type of cars (automatic or manual) has on miles per gallon. After selecting the best fit, the model’s estimate shows that manual types achieve better levels of miles per gallon, with an estimated coefficient of 2.936, significant at an alpha level of 0.05.

Library and data

The analyses contained in this report use the mtcars data and the usdm, ggplot2, caret, lmtest and car packages. Make sure to have these packages installed prior to running the code in this document. The data and packages can then be loaded as follows:

# Library
library(ggplot2)
library(usdm)
library(caret)
library(lmtest)
library(car)

# Data
df <- mtcars

Exploratory data analysis

To analyze what the data looks like, the str functions is used, as shown here:

str(df)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

As can be seen from the output of str, all variables in the set are numeric. It is apparent, however, that several of these variables are not continuous but categorical, hence should be coded as factors. This is confirmed by the documentation on this dataset, which can be accessed by using ?mtcars (in which the descriptions of the variables can also be found). Accordingly, the cyl, vs, am, gear and carb variables are transformed to factors. In addition, the levels of the am variable are transformed to “Automatic” and “Manual”, for easier interpretability.

df[, c(2, 8, 10, 11)] <- lapply(df[, c(2, 8, 10, 11)], as.factor)
df$am <- factor(df$am, labels = c("Automatic", "Manual"))

To get an idea of the potential difference in miles per gallon between the two types of transmission, boxplots can be used, the results of which can be found in appendix A. As can be seen in the figure, a manual transmission appears to be better for miles per gallon than an automatic transmission. To further analyse this relation, the subsequent part of this report will use statistical models and techniques to quantify this relation and evaluate if there truly is a significant difference.

Statistical analysis

Potential variance- and collinearity problems

Before attempting to build a model, there are two potential problems which should be dealt with preliminarily:

Zero- or near-zero variance;
Multicollinearity.

To check whether any of the predictors have (near-)zero variance, the nearZeroVar function of the caret package is used, as shown in appendix B. The output shows that none of the variables have (near-)zero variance, so this will not pose a problem in the estimation of a linear model.

In order to check for multicollinearity, the Variance Inflation Factor (VIF) is computed for each of the predictors. Whenever one or multiple of the calculated VIF’s exceed the threshold of 10, the predictor with the highest VIF is removed. The process is then repeated, resulting in a stepwise removal process which stops when all remaining predictors have a VIF lower than 10. To do so, the vifstep function of the usdm package is used, as can be seen in appendix C.

As appears from the output, two of the ten variables have been removed due to collinearity problems, namely disp (displacement) and cyl (number of cylinders). Hence, the eight remaining predictors will be used to estimate a linear model of the data.

Finding the best model

In order to find the best fit, the Akaike Information Criterion (AIC) will be used to estimate the relative quality of the estimated models, which shall be generated in both a forward and backward stepwise fashion. The forward method starts with the simplest model (mpg ~ am) and expands it with the best, additional predictors until adding any further predictors does not improve the model, as measured by the AIC. The backward method starts with the full model, which means that it includes all eight potential predictors, and tries to remove the least influential predictor, as long as this does not significantly affect the AIC.

To do so, the step function, which is available in base R, can be used as follows. For the sake of brevity, the process of predictor addition or removal is not shown here, but can be found in appendix D.

# Simplest and full model
nullmdl <- lm(mpg ~ am, data=df)
fullmdl <- lm(mpg ~ am + hp + drat + wt + qsec + vs + gear + carb, data=df)

# Forward method
forfit <- step(nullmdl, scope=list(lower=nullmdl, upper=fullmdl), direction="forward", trace=0)
formula(forfit)

## mpg ~ am + hp + wt + qsec

# Backward method
backfit <- step(fullmdl, scope=list(lower=nullmdl, upper=fullmdl), direction="backward", trace=0) 
formula(backfit)

## mpg ~ am + wt + qsec

As can be seen above, the methods result in two different models: one where am, wt and qsec are used as predictors (backward method), and one where hp is included in addition to the three previously mentioned predictors. To check if the inclusion of hp significantly improves the model, the anova function can be used as follows:

anova(backfit, forfit)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am + wt + qsec
## Model 2: mpg ~ am + hp + wt + qsec
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 169.29                           
## 2     27 160.07  1    9.2195 1.5551 0.2231

As the anova shows, the addition of hp does not significantly improve the model (p > 0.05), hence the more parsimonious, backward model is chosen as best fit. Accordingly, the final model is shown below:

summary(backfit)

## 
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## amManual      2.9358     1.4109   2.081 0.046716 *  
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The model confirms what was previously found in the exploratory analysis: a manual transmission is significantly better for miles per gallon than an automatic transmission, where the estimated difference is 2.936 with a 95% confidence interval between 0.046 and 5.826 (p < 0.05).

Residual diagnostics

To check if the estimated model does not violate the assumptions which are required to estimate OLS models, residual plots have been generated, which can be found in appendix E. In the first plot, which shows the residuals versus fitted values plot, the points seem to be randomly scattered, confirming independence. In the second, normal Q-Q plot, the points appear to fall approximately on the 45 degrees line, suggesting that the normality condition is met. In the third, Scale-Location plot, the points do not seem to be more spread out for higher, fitted values, although the estimated fit does suggest the opposite. Hence, the assumption of homoskedasticity might be violated. Lastly, the Cook’s distance plot does not show any influential outliers.

To check whether the assumption of homoskedasticity has been violated, the Breusch-Pagan and NCV tests are run on the final model (using the lmtest and car packages), as can be seen in appendix F. Neither of the two tests is significant, hence the null hypothesis of constant variance is not rejected. Accordingly, the assumption of homoskedasticity has not been violated.

Conclusion

In order to investigate the effect of transmission type on miles per gallon, a linear regression model has been estimated. The estimated coefficient was equal to 2.936 and significant (p < 0.05), which means that manual transmission types achieve better levels of miles per gallon than automatic transmission types. The model has been thoroughly checked on potential problems and violations of assumptions, the results of which indicated that multicollinearity posed a potential problem. This was dealt with through a stepwise VIF analysis prior to the estimation of models.

Appendix

A.

g <- ggplot(df, aes(y=mpg, x=am))
g + 
      geom_boxplot(aes(fill=am)) +
      labs(x="Transmission type",
           y="Miles per gallon",
           title="Boxplot of miles per gallon per transmission type") +
      guides(fill=FALSE)

B.

nearZeroVar(df, saveMetrics=TRUE)

##      freqRatio percentUnique zeroVar   nzv
## mpg   1.000000        78.125   FALSE FALSE
## cyl   1.272727         9.375   FALSE FALSE
## disp  1.500000        84.375   FALSE FALSE
## hp    1.000000        68.750   FALSE FALSE
## drat  1.000000        68.750   FALSE FALSE
## wt    1.500000        90.625   FALSE FALSE
## qsec  1.000000        93.750   FALSE FALSE
## vs    1.285714         6.250   FALSE FALSE
## am    1.461538         6.250   FALSE FALSE
## gear  1.250000         9.375   FALSE FALSE
## carb  1.000000        18.750   FALSE FALSE

C.

vifstep(mtcars[,-1])

## 2 variables from the 10 input variables have collinearity problem: 
##  
## disp cyl 
## 
## After excluding the collinear variables, the linear correlation coefficients ranges between: 
## min correlation ( carb ~ am ):  0.05753435 
## max correlation ( gear ~ am ):  0.7940588 
## 
## ---------- VIFs of the remained variables -------- 
##   Variables      VIF
## 1        hp 6.015788
## 2      drat 3.111501
## 3        wt 6.051127
## 4      qsec 5.918682
## 5        vs 4.270956
## 6        am 4.285815
## 7      gear 4.690187
## 8      carb 4.290468

D.

# Simplest and full model
nullmdl <- lm(mpg ~ am, data=df)
fullmdl <- lm(mpg ~ am + hp + drat + wt + qsec + vs + gear + carb, data=df)

# Forward method
forfit <- step(nullmdl, scope=list(lower=nullmdl, upper=fullmdl), direction="forward")

## Start:  AIC=103.67
## mpg ~ am
## 
##        Df Sum of Sq    RSS     AIC
## + hp    1    475.46 245.44  71.194
## + wt    1    442.58 278.32  75.217
## + qsec  1    368.26 352.63  82.790
## + vs    1    367.41 353.49  82.867
## + carb  5    407.78 313.12  86.987
## + drat  1    147.26 573.64  98.361
## + gear  2    150.89 570.00 100.157
## <none>              720.90 103.672
## 
## Step:  AIC=71.19
## mpg ~ am + hp
## 
##        Df Sum of Sq    RSS    AIC
## + wt    1    65.148 180.29 63.323
## + vs    1    26.560 218.88 69.529
## <none>              245.44 71.194
## + drat  1    13.089 232.35 71.440
## + qsec  1     6.879 238.56 72.284
## + gear  2    12.139 233.30 73.571
## + carb  5    44.383 201.06 74.811
## 
## Step:  AIC=63.32
## mpg ~ am + hp + wt
## 
##        Df Sum of Sq    RSS    AIC
## + qsec  1   20.2246 160.07 61.515
## + vs    1   11.3276 168.96 63.246
## <none>              180.29 63.323
## + drat  1    3.3262 176.97 64.727
## + gear  2    0.9508 179.34 67.154
## + carb  5   15.5500 164.74 70.436
## 
## Step:  AIC=61.52
## mpg ~ am + hp + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## <none>              160.07 61.515
## + drat  1    1.4278 158.64 63.229
## + vs    1    0.2486 159.82 63.466
## + gear  2    1.7637 158.30 65.161
## + carb  5    6.3933 153.67 70.211

# Backward method
backfit <- step(fullmdl, scope=list(lower=nullmdl, upper=fullmdl), direction="backward")

## Start:  AIC=76.17
## mpg ~ am + hp + drat + wt + qsec + vs + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  5    9.4484 153.62 68.201
## - gear  2    1.4607 145.64 72.492
## - vs    1    0.3500 144.53 74.247
## - qsec  1    1.2879 145.47 74.454
## - drat  1    6.3375 150.51 75.546
## - hp    1    7.2711 151.45 75.744
## <none>              144.18 76.170
## - wt    1   10.2497 154.43 76.367
## 
## Step:  AIC=68.2
## mpg ~ am + hp + drat + wt + qsec + vs + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  2     4.934 158.56 65.212
## - vs    1     0.537 154.16 66.313
## - drat  1     4.148 157.77 67.054
## <none>              153.62 68.201
## - qsec  1    10.154 163.78 68.249
## - hp    1    12.395 166.02 68.684
## - wt    1    35.315 188.94 72.822
## 
## Step:  AIC=65.21
## mpg ~ am + hp + drat + wt + qsec + vs
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1     0.080 158.64 63.229
## - drat  1     1.259 159.82 63.466
## - qsec  1     9.251 167.81 65.027
## - hp    1     9.269 167.83 65.030
## <none>              158.56 65.212
## - wt    1    54.964 213.52 72.736
## 
## Step:  AIC=63.23
## mpg ~ am + hp + drat + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     1.428 160.07 61.515
## - hp    1     9.248 167.89 63.042
## <none>              158.64 63.229
## - qsec  1    18.326 176.97 64.727
## - wt    1    68.320 226.96 72.689
## 
## Step:  AIC=61.52
## mpg ~ am + hp + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ am + wt + qsec
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - qsec  1    109.03 278.32 75.217
## - wt    1    183.35 352.63 82.790

E.

par(mfrow=c(2,2))
plot(backfit)

F.

bptest(backfit)

## 
##  studentized Breusch-Pagan test
## 
## data:  backfit
## BP = 6.1871, df = 3, p-value = 0.1029

ncvTest(backfit)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.55815    Df = 1     p = 0.2119363