Analysis of the mtcars dataset

This is my sandbox for playing around with one of R’s built-in datasets, specifically in the context of building linear models. A few disclaimers:

  1. I don’t know what I’m doing! I know how to call functions like lm that can build linear models, but I don’t yet know when or where doing so is valid. This is a sandbox, not an assertion of “here is the right way to do things.” If you see something that looks weird or wrong, please let me know.
  2. This document will continue to change as I learn new things or want to try new things out.

The simplest possible example of linear regression

The steps in creating a linear regression plot in R are:

  1. Create a formula, which contains a left side (response variable) and a right side (explanatory variable or variables)
  2. Create a linear model with the lm function
  3. Plot the response and explanatory variables
  4. Plot the regression line
# Step 1: create a formula.
# Normally, this is just a parameter in the call to lm.
# If the response variable is mpg, and the explanatory variable is cyl,
# then the formula is:
# mpg ~ cyl
# So the two sides are separated by the tilde operator.
# We can explicitly create and store a formula:
mtcars.cyl.formula <- mpg ~ cyl
# Step 2: create a model.
mtcars.cyl.lm <- lm(formula=mtcars.cyl.formula, data=mtcars)
# Step 3: Plot the response and explanatory variables
# This uses the same formula used to create the model
plot(mtcars.cyl.formula, data=mtcars)
# Step 4: Plot the regression line
# The abline function normally takes both a (intercept) and b (slope)
# arguments, but if a is a linear model, then abline will interrogate the 
# model for a coefficients member, and this will be a vector of
# two values, the first of which will be intercept and the second slope
# so this will work:
# abline(mtcars.cyl.lm$coefficients[1], mtcars.cyl.lm$coefficients[2])
# But this is simpler:
abline(mtcars.cyl.lm)

plot of chunk unnamed-chunk-2

If we did not store a formula, or for that matter a model, the shortest version of the code would be:

plot(mpg ~ cyl, data=mtcars)
abline(lm(mpg ~ cyl, data=mtcars))

Refining a model

This section assumes that you have familiarity with statistical inference, especially the concepts of a P-value and \(R^2\).

We can use several features in multiple linear regression, but not all of those features may prove worthy of remaining in the model. The best model in terms of predictiveness may exclude some features. Here is how to refine the model. It is probably best to take a quick peak at the correlation of each feature.

sapply(mtcars[,2:11], function(x,y)abs(cor(x,y)), mtcars$mpg)
##    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
## 0.8522 0.8476 0.7762 0.6812 0.8677 0.4187 0.6640 0.5998 0.4803 0.5509

These are arranged in descending order of their absolute values:

  • wt
  • cyl
  • disp
  • hp
  • drat
  • vs
  • am
  • carb
  • gear
  • qsec

We will begin with a model that includes all of the features in mtcars other than mpg as the explanatory variables. We will then use backward elimination, which means that we will drop one feature at a time (whichever has the highest P-value) and then build a new model. We keep doing this as long as \(R^2\) increases when a feature is dropped. The first time thta \(R^2\) decreases, we add that feature back, and we have our final model.

# The left side of the formula is mpg
# The right side is all columns except mpg
mtcars.refine.formula <- reformulate(names(mtcars)[-1], "mpg")
# quick look at the formula
mtcars.refine.formula
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
## 
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.45  -1.60  -0.12   1.22   4.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  12.3034    18.7179    0.66    0.518  
## cyl          -0.1114     1.0450   -0.11    0.916  
## disp          0.0133     0.0179    0.75    0.463  
## hp           -0.0215     0.0218   -0.99    0.335  
## drat          0.7871     1.6354    0.48    0.635  
## wt           -3.7153     1.8944   -1.96    0.063 .
## qsec          0.8210     0.7308    1.12    0.274  
## vs            0.3178     2.1045    0.15    0.881  
## am            2.5202     2.0567    1.23    0.234  
## gear          0.6554     1.4933    0.44    0.665  
## carb         -0.1994     0.8288   -0.24    0.812  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.807 
## F-statistic: 13.9 on 10 and 21 DF,  p-value: 3.79e-07

The adjusted \(R^2\) is 0.8066. Now we drop vs, the feature with the highest P-value, and try again.

mtcars.refine.formula <- reformulate(c("cyl","disp","hp","wt","qsec","vs","am","gear","carb"), "mpg")
mtcars.refine.formula
## mpg ~ cyl + disp + hp + wt + qsec + vs + am + gear + carb
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
## 
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.586 -1.669 -0.268  1.190  4.670 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  16.1123    16.6638    0.97    0.344  
## cyl          -0.2501     0.9868   -0.25    0.802  
## disp          0.0143     0.0174    0.82    0.420  
## hp           -0.0224     0.0213   -1.05    0.305  
## wt           -3.8689     1.8345   -2.11    0.047 *
## qsec          0.8083     0.7175    1.13    0.272  
## vs            0.3479     2.0665    0.17    0.868  
## am            2.6750     1.9956    1.34    0.194  
## gear          0.7093     1.4628    0.48    0.633  
## carb         -0.1171     0.7966   -0.15    0.884  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.6 on 22 degrees of freedom
## Multiple R-squared:  0.868,  Adjusted R-squared:  0.813 
## F-statistic:   16 on 9 and 22 DF,  p-value: 1.01e-07

This caused \(R^2\) to increase to 0.8134, so drat is excluded from the final model. The highest P-value in the new model is for carb, so we try dropping it.

mtcars.refine.formula <- reformulate(c("cyl","disp","hp","wt","qsec","vs","am","gear"), "mpg")
mtcars.refine.formula
## mpg ~ cyl + disp + hp + wt + qsec + vs + am + gear
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
## 
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.487 -1.707 -0.339  1.179  4.727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  16.1947    16.2963    0.99   0.3307   
## cyl          -0.2765     0.9495   -0.29   0.7735   
## disp          0.0160     0.0127    1.27   0.2179   
## hp           -0.0240     0.0178   -1.35   0.1898   
## wt           -4.0535     1.3084   -3.10   0.0051 **
## qsec          0.8380     0.6735    1.24   0.2259   
## vs            0.3748     2.0142    0.19   0.8540   
## am            2.6841     1.9517    1.38   0.1823   
## gear          0.6123     1.2777    0.48   0.6363   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.55 on 23 degrees of freedom
## Multiple R-squared:  0.867,  Adjusted R-squared:  0.821 
## F-statistic: 18.8 on 8 and 23 DF,  p-value: 2.23e-08

\(R^2\) rose again, to 0.8213, so carb is excluded. The highest P-value is now for cyl so we try dropping it.

mtcars.refine.formula <- reformulate(c("disp","hp","wt","qsec","vs","am","gear"), "mpg")
mtcars.refine.formula
## mpg ~ disp + hp + wt + qsec + vs + am + gear
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
## 
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.413 -1.786 -0.311  1.118  4.669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  13.1394    12.2308    1.07   0.2934   
## disp          0.0154     0.0122    1.26   0.2199   
## hp           -0.0260     0.0162   -1.60   0.1220   
## wt           -4.0825     1.2795   -3.19   0.0039 **
## qsec          0.9060     0.6196    1.46   0.1566   
## vs            0.5895     1.8383    0.32   0.7512   
## am            2.8725     1.8060    1.59   0.1248   
## gear          0.7426     1.1738    0.63   0.5329   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 24 degrees of freedom
## Multiple R-squared:  0.867,  Adjusted R-squared:  0.828 
## F-statistic: 22.3 on 7 and 24 DF,  p-value: 4.67e-09

\(R^1\) rose again to 0.8281, so cyl is out. The next candidate is vs.

mtcars.refine.formula <- reformulate(c("disp","hp","wt","qsec","am","gear"), "mpg")
mtcars.refine.formula
## mpg ~ disp + hp + wt + qsec + am + gear
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
## 
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.27  -1.83  -0.22   1.06   4.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  11.3696    10.7174    1.06    0.299   
## disp          0.0149     0.0119    1.25    0.222   
## hp           -0.0252     0.0157   -1.60    0.122   
## wt           -4.1872     1.2147   -3.45    0.002 **
## qsec          1.0277     0.4810    2.14    0.043 * 
## am            2.8167     1.7650    1.60    0.123   
## gear          0.8009     1.1387    0.70    0.488   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.45 on 25 degrees of freedom
## Multiple R-squared:  0.866,  Adjusted R-squared:  0.834 
## F-statistic:   27 on 6 and 25 DF,  p-value: 8.9e-10

\(R^2\) rose again. gear is the next candidate.

mtcars.refine.formula <- reformulate(c("disp","hp","wt","qsec","am"), "mpg")
mtcars.refine.formula
## mpg ~ disp + hp + wt + qsec + am
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
## 
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.54  -1.74  -0.32   1.17   4.55 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  14.3619     9.7408    1.47   0.1524   
## disp          0.0112     0.0106    1.06   0.2990   
## hp           -0.0212     0.0145   -1.46   0.1564   
## wt           -4.0843     1.1941   -3.42   0.0021 **
## qsec          1.0069     0.4754    2.12   0.0439 * 
## am            3.4705     1.4858    2.34   0.0275 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.43 on 26 degrees of freedom
## Multiple R-squared:  0.864,  Adjusted R-squared:  0.838 
## F-statistic:   33 on 5 and 26 DF,  p-value: 1.84e-10

Another increase in \(R^2\). disp is the next candidate.

mtcars.refine.formula <- reformulate(c("hp","wt","qsec","am"), "mpg")
mtcars.refine.formula
## mpg ~ hp + wt + qsec + am
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
## 
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.497 -1.590 -0.112  1.180  4.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  17.4402     9.3189    1.87   0.0721 . 
## hp           -0.0176     0.0142   -1.25   0.2231   
## wt           -3.2381     0.8899   -3.64   0.0011 **
## qsec          0.8106     0.4389    1.85   0.0757 . 
## am            2.9255     1.3971    2.09   0.0458 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.43 on 27 degrees of freedom
## Multiple R-squared:  0.858,  Adjusted R-squared:  0.837 
## F-statistic: 40.7 on 4 and 27 DF,  p-value: 4.59e-11

This time, \(R^2\) dropped, from 0.8375 to 0.8368. This tells us that disp is a significant predictor and should remian in the model. Because the remaining features all have P-values less than disp, we can be confident that this is our final, best model for prediction using a linear model:

## mpg ~ disp + hp + wt + qsec + am

We saw earlier that cyl had the second-highest correlation of any of the features. Why would this feature be dropped, when features with much lower correlation were retained? I have no mathematical proof, but the answer is probably collinearity. We use \(R^2\) as the criterion for the quality of a given model, and it penalizes a model based on size; the more features, the larger the penalty. This means that if a model contains two features that are like echoes of each other, then having both provides no new information about the model, but \(R^2\) is reduced because of the penalty due to both features being present.

What I can’t say that I understand is why cyl would be the feature eliminated. This seems to suggest that it is collinear with a feature that has a higher correlation, and the only feature above cyl is wt.

We can test this theory with a plot that shows the correlation between each pair of features.

pairs(mtcars, lower.panel=panel.cor, upper.panel = panel.smooth)

plot of chunk unnamed-chunk-12

This does not entirely clarify the situation. The correlation between cyl and wt is 0.78, which is hardly the largest number shown. The correlation between cyl and disp is 0.9, so it seems likely that this is where the collinearity exists. But disp had a (slightly) lower correlation value than cyl. Yet, in our first model summary, the P-value for cyl was 0.9161, and for disp it was only 0.4635. I don’t understand this yet, or maybe I did understand once, and have forgotten. There is clearly more involved than a feature’s correlation value. But we can see that cyl and disp are highly correlated to each other, and disp has the lower P-value; and that is why cyl got booted from the model.

One other important thing to note is that the upper panel in this plot shows a smoothing line for each plot, which may indicate how applicable a feature is to linear regression. Not all of these features are good candidates for linear regression, so this entire exercise has been for purposes of illustration rather than hard analysis.

Analyzing the linear model

Now some early attempts to analyze a linear model to determine its value or validity.

Plots of the linear model

You can simply feed a linear model into the plot function and it will create four plots for you. Of course, the trick is knowing what these plots are, how they are built, and what they tell you.

par(mfrow=c(2,2))
plot(mtcars.refine.model)

plot of chunk unnamed-chunk-13

par(mfrow=c(1,1))

This plots fitted values against observed values.

plot(mtcars.refine.model$fitted, mtcars$mpg, ylab="Observed", xlab="Fitted")

plot of chunk unnamed-chunk-14

There should be a strong correlation between fitted and observed.

cor(mtcars.refine.model$fitted, mtcars$mpg)
## [1] 0.9262

This is a residual plot; it plots the fitted values against the residuals, the same as in the 4x4 plot above, except that outliers are not labeled.

plot(mtcars.refine.model$fitted, mtcars.refine.model$residuals, xlab="Fitted", ylab="Residuals")
abline(a=0,b=0, lty=3)

plot of chunk unnamed-chunk-16

Cook’s Distance is a commonly-used measure of the influence of the points in a linear model. The question for me is how to interpret. If there are no outliers, all is good?

plot(cooks.distance(mtcars.refine.model))

plot of chunk unnamed-chunk-17

Questions remain on what to do with these in terms of assessing the model. Stepwise reduction gave us the best model; do these other tools tells us that this “best” model just isn’t good enough? How do we use R to determine whether the data is amenable to linear regression, or for that matter, logistic regression, as opposed to just eyeballing it?

ANOVA

One tool we can use to analyze the model is ANOVA.

anova(lm(mpg ~ ., data=mtcars))
## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq F value Pr(>F)    
## cyl        1    818     818  116.42  5e-10 ***
## disp       1     38      38    5.35 0.0309 *  
## hp         1      9       9    1.33 0.2610    
## drat       1     16      16    2.34 0.1406    
## wt         1     77      77   11.03 0.0032 ** 
## qsec       1      4       4    0.56 0.4617    
## vs         1      0       0    0.02 0.8932    
## am         1     14      14    2.06 0.1659    
## gear       1      1       1    0.14 0.7137    
## carb       1      0       0    0.06 0.8122    
## Residuals 21    147       7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And here is the ANOVA table for the refined model.

anova(mtcars.refine.model)
## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## hp         1    678     678  114.43 3.3e-11 ***
## wt         1    253     253   42.61 5.3e-07 ***
## qsec       1      9       9    1.52   0.229    
## am         1     26      26    4.38   0.046 *  
## Residuals 27    160       6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variance Inflation Factor

I also just read about something else, variance inflation factor, using the vif function in the car package. The following builds a model with all of the features. In theory, any feature shown in the result with a value of 5 or higher shows signs of collinearity.

library(car)
## Warning: package 'car' was built under R version 3.1.2
vif(lm(mpg ~ ., data=mtcars))
##    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
## 15.374 21.620  9.832  3.375 15.165  7.528  4.966  4.648  5.357  7.909

We have some features here that are in our final model. So this is not, apparently, the whole story. Now let’s try vif on our final model.

vif(lm(mpg ~ disp + hp + wt + qsec + am, data=mtcars))
##  disp    hp    wt  qsec    am 
## 9.072 5.195 7.171 3.791 2.887
pairs(mtcars[,c("mpg","disp","hp","wt","qsec","am")], lower.panel=panel.cor, upper.panel = panel.smooth)

plot of chunk unnamed-chunk-21

Appendix

Some other miscellaneous topics.

Example using reformulate

The reformulate function allows you to use character vectors to specify one or both sides of a formula. The termlabels parameter must be a character vector, and contains the names of the terms in the right side (explanatory) or a formula. The response parameter can be a character string, or else a symbol or call giving the left side (response) of formula.

This might be useful if you need the right side of a formula to be a subset of fields, and you have the indices for that subset, but it would tedious to build the list by hand.

# mpg is the response variable
# disp is the explanatory variable
mtcars.disp.formula <- reformulate(c("disp"), "mpg")
# or, assuming we know the index of the column:
mtcars.disp.formula <- reformulate(names(mtcars)[3], names(mtcars[1]))
# here is the traditional way of creating the formula
mtcars.disp.lm <- lm(mpg ~ disp, data=mtcars)
# and now with reformulate
mtcars.disp.lm <- lm(mtcars.disp.formula, data=mtcars)
plot(mpg ~ disp, data=mtcars)
abline(mtcars.disp.lm)

plot of chunk unnamed-chunk-22

mtcars.cyl.lm <- lm(mpg ~ cyl, data=mtcars)
mtcars.disp.lm <- lm(mpg ~ disp, data=mtcars)
mtcars.hp.lm <- lm(mpg ~ hp, data=mtcars)
par(mfrow=c(2,2))
plot(mpg~cyl, data=mtcars)
abline(mtcars.cyl.lm)
plot(mpg~disp, data=mtcars)
abline(mtcars.disp.lm)
plot(mpg~hp, data=mtcars)
abline(mtcars.hp.lm)

plot of chunk unnamed-chunk-23

Some miscellaneous functions

The following are defined as R chunks earlier in the document but they are not displayed. So they are shown here.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
panel.lmline = function (x, y, col = par("col"), bg = NA, pch = par("pch"), 
    cex = 1, col.smooth = "red", ...) 
{
    points(x, y, pch = pch, col = col, bg = bg, cex = cex)
    ok <- is.finite(x) & is.finite(y)
    if (any(ok)) 
       abline(lm(y[ok] ~ x[ok]), 
           col = col.smooth, ...)
}