This is my sandbox for playing around with one of R’s built-in datasets, specifically in the context of building linear models. A few disclaimers:
R has a function called lm that can build linear models, but I don’t yet know when or where doing so is valid. This is a sandbox, not an assertion of “here is the right way to do things.” If you see something that looks weird or wrong, please let me know.
The steps in creating a linear regression plot in R are:
# Step 1: create a formula.
# Normally, this is just a parameter in the call to lm.
# If the response variable is mpg, and the explanatory variable is cyl,
# then the formula is:
# mpg ~ cyl
# So the two sides are separated by the tilde operator.
# We can explicitly create and store a formula:
mtcars.cyl.formula <- mpg ~ cyl
# Step 2: create a model.
mtcars.cyl.lm <- lm(formula=mtcars.cyl.formula, data=mtcars)
# Step 3: Plot the response and explanatory variables
# This uses the same formula used to create the model
plot(mtcars.cyl.formula, data=mtcars)
# Step 4: Plot the regression line
# The abline function normally takes a (intercept) and b (slope) as
# arguments, but if a is a linear model, abline will interrogate the
# model for its coefficients member, a vector of two values: the
# intercept first, then the slope. So this will work:
# abline(mtcars.cyl.lm$coefficients[1], mtcars.cyl.lm$coefficients[2])
# But this is simpler:
abline(mtcars.cyl.lm)
If we did not store a formula, or for that matter a model, the shortest version of the code would be:
plot(mpg ~ cyl, data=mtcars)
abline(lm(mpg ~ cyl, data=mtcars))
This section assumes that you have familiarity with statistical inference, especially the concepts of a P-value and \(R^2\).
We can use several features in multiple linear regression, but not all of them may prove worthy of remaining in the model. The best model in terms of predictiveness may exclude some features. Here is how to refine the model. It is probably best to start with a quick peek at the correlation of each feature with mpg.
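# Absolute correlation of each feature (columns 2:11) with mpg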
sapply(mtcars[,2:11], function(x,y)abs(cor(x,y)), mtcars$mpg)
## cyl disp hp drat wt qsec vs am gear carb
## 0.8522 0.8476 0.7762 0.6812 0.8677 0.4187 0.6640 0.5998 0.4803 0.5509
Sorted in descending order of absolute correlation, the order is: wt, cyl, disp, hp, drat, vs, am, carb, gear, qsec.
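If you want R to produce that ordering for you (these are the same values shown above, just sorted):
sort(sapply(mtcars[,2:11], function(x,y)abs(cor(x,y)), mtcars$mpg), decreasing=TRUE)
## wt cyl disp hp drat vs am carb gear qsec
## 0.8677 0.8522 0.8476 0.7762 0.6812 0.6640 0.5998 0.5509 0.4803 0.4187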
We will begin with a model that includes all of the features in mtcars other than mpg as the explanatory variables. We will then use backward elimination, which means that we will drop one feature at a time (whichever has the highest P-value) and then build a new model. We keep doing this as long as the adjusted \(R^2\) increases when a feature is dropped. The first time that the adjusted \(R^2\) decreases, we add that feature back, and we have our final model.
# The left side of the formula is mpg
# The right side is all columns except mpg
mtcars.refine.formula <- reformulate(names(mtcars)[-1], "mpg")
# quick look at the formula
mtcars.refine.formula
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
##
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.45 -1.60 -0.12 1.22 4.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3034 18.7179 0.66 0.518
## cyl -0.1114 1.0450 -0.11 0.916
## disp 0.0133 0.0179 0.75 0.463
## hp -0.0215 0.0218 -0.99 0.335
## drat 0.7871 1.6354 0.48 0.635
## wt -3.7153 1.8944 -1.96 0.063 .
## qsec 0.8210 0.7308 1.12 0.274
## vs 0.3178 2.1045 0.15 0.881
## am 2.5202 2.0567 1.23 0.234
## gear 0.6554 1.4933 0.44 0.665
## carb -0.1994 0.8288 -0.24 0.812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.807
## F-statistic: 13.9 on 10 and 21 DF, p-value: 3.79e-07
The adjusted \(R^2\) is 0.8066. Now we drop drat and try again.
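As an aside, rather than reading this value off the printed summary each time, it can be pulled out directly:
summary(mtcars.refine.model)$adj.r.squared
## [1] 0.8066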
mtcars.refine.formula <- reformulate(c("cyl","disp","hp","wt","qsec","vs","am","gear","carb"), "mpg")
mtcars.refine.formula
## mpg ~ cyl + disp + hp + wt + qsec + vs + am + gear + carb
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
##
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.586 -1.669 -0.268 1.190 4.670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.1123 16.6638 0.97 0.344
## cyl -0.2501 0.9868 -0.25 0.802
## disp 0.0143 0.0174 0.82 0.420
## hp -0.0224 0.0213 -1.05 0.305
## wt -3.8689 1.8345 -2.11 0.047 *
## qsec 0.8083 0.7175 1.13 0.272
## vs 0.3479 2.0665 0.17 0.868
## am 2.6750 1.9956 1.34 0.194
## gear 0.7093 1.4628 0.48 0.633
## carb -0.1171 0.7966 -0.15 0.884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.6 on 22 degrees of freedom
## Multiple R-squared: 0.868, Adjusted R-squared: 0.813
## F-statistic: 16 on 9 and 22 DF, p-value: 1.01e-07
This caused the adjusted \(R^2\) to increase to 0.8134, so drat is excluded from the final model. The highest P-value in the new model is for carb, so we try dropping it.
mtcars.refine.formula <- reformulate(c("cyl","disp","hp","wt","qsec","vs","am","gear"), "mpg")
mtcars.refine.formula
## mpg ~ cyl + disp + hp + wt + qsec + vs + am + gear
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
##
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.487 -1.707 -0.339 1.179 4.727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.1947 16.2963 0.99 0.3307
## cyl -0.2765 0.9495 -0.29 0.7735
## disp 0.0160 0.0127 1.27 0.2179
## hp -0.0240 0.0178 -1.35 0.1898
## wt -4.0535 1.3084 -3.10 0.0051 **
## qsec 0.8380 0.6735 1.24 0.2259
## vs 0.3748 2.0142 0.19 0.8540
## am 2.6841 1.9517 1.38 0.1823
## gear 0.6123 1.2777 0.48 0.6363
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.55 on 23 degrees of freedom
## Multiple R-squared: 0.867, Adjusted R-squared: 0.821
## F-statistic: 18.8 on 8 and 23 DF, p-value: 2.23e-08
The adjusted \(R^2\) rose again, to 0.8213, so carb is excluded. cyl now has one of the highest P-values, so we try dropping it.
mtcars.refine.formula <- reformulate(c("disp","hp","wt","qsec","vs","am","gear"), "mpg")
mtcars.refine.formula
## mpg ~ disp + hp + wt + qsec + vs + am + gear
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
##
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.413 -1.786 -0.311 1.118 4.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.1394 12.2308 1.07 0.2934
## disp 0.0154 0.0122 1.26 0.2199
## hp -0.0260 0.0162 -1.60 0.1220
## wt -4.0825 1.2795 -3.19 0.0039 **
## qsec 0.9060 0.6196 1.46 0.1566
## vs 0.5895 1.8383 0.32 0.7512
## am 2.8725 1.8060 1.59 0.1248
## gear 0.7426 1.1738 0.63 0.5329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.5 on 24 degrees of freedom
## Multiple R-squared: 0.867, Adjusted R-squared: 0.828
## F-statistic: 22.3 on 7 and 24 DF, p-value: 4.67e-09
The adjusted \(R^2\) rose again, to 0.8281, so cyl is out. The next candidate is vs.
mtcars.refine.formula <- reformulate(c("disp","hp","wt","qsec","am","gear"), "mpg")
mtcars.refine.formula
## mpg ~ disp + hp + wt + qsec + am + gear
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
##
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.27 -1.83 -0.22 1.06 4.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3696 10.7174 1.06 0.299
## disp 0.0149 0.0119 1.25 0.222
## hp -0.0252 0.0157 -1.60 0.122
## wt -4.1872 1.2147 -3.45 0.002 **
## qsec 1.0277 0.4810 2.14 0.043 *
## am 2.8167 1.7650 1.60 0.123
## gear 0.8009 1.1387 0.70 0.488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.45 on 25 degrees of freedom
## Multiple R-squared: 0.866, Adjusted R-squared: 0.834
## F-statistic: 27 on 6 and 25 DF, p-value: 8.9e-10
The adjusted \(R^2\) rose again, to 0.834. gear is the next candidate.
mtcars.refine.formula <- reformulate(c("disp","hp","wt","qsec","am"), "mpg")
mtcars.refine.formula
## mpg ~ disp + hp + wt + qsec + am
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
##
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.54 -1.74 -0.32 1.17 4.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.3619 9.7408 1.47 0.1524
## disp 0.0112 0.0106 1.06 0.2990
## hp -0.0212 0.0145 -1.46 0.1564
## wt -4.0843 1.1941 -3.42 0.0021 **
## qsec 1.0069 0.4754 2.12 0.0439 *
## am 3.4705 1.4858 2.34 0.0275 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.43 on 26 degrees of freedom
## Multiple R-squared: 0.864, Adjusted R-squared: 0.838
## F-statistic: 33 on 5 and 26 DF, p-value: 1.84e-10
Another increase in the adjusted \(R^2\), this time to 0.8375. disp is the next candidate.
mtcars.refine.formula <- reformulate(c("hp","wt","qsec","am"), "mpg")
mtcars.refine.formula
## mpg ~ hp + wt + qsec + am
mtcars.refine.model <- lm(mtcars.refine.formula, data=mtcars)
summary(mtcars.refine.model)
##
## Call:
## lm(formula = mtcars.refine.formula, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.497 -1.590 -0.112 1.180 4.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.4402 9.3189 1.87 0.0721 .
## hp -0.0176 0.0142 -1.25 0.2231
## wt -3.2381 0.8899 -3.64 0.0011 **
## qsec 0.8106 0.4389 1.85 0.0757 .
## am 2.9255 1.3971 2.09 0.0458 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.43 on 27 degrees of freedom
## Multiple R-squared: 0.858, Adjusted R-squared: 0.837
## F-statistic: 40.7 on 4 and 27 DF, p-value: 4.59e-11
This time, the adjusted \(R^2\) dropped, from 0.8375 to 0.8368. This tells us that disp contributes to the model’s predictiveness and should remain in the model. Because the remaining features all have lower P-values than disp, none of them is a better candidate for elimination, so we can be reasonably confident that this is our final, best model for prediction using a linear model:
## mpg ~ disp + hp + wt + qsec + am
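Doing this by hand was instructive, but the procedure is mechanical enough to automate. Here is a sketch of the same backward-elimination-by-adjusted-\(R^2\) idea (my own sketch; run blindly, it may not retrace the exact sequence of drops above):
# Repeatedly drop the feature with the highest P-value, keeping the
# drop only if the adjusted R-squared improves.
features <- names(mtcars)[-1]
repeat {
  model <- lm(reformulate(features, "mpg"), data=mtcars)
  adj.r2 <- summary(model)$adj.r.squared
  p <- summary(model)$coefficients[-1, 4]   # P-values, intercept excluded
  candidate <- names(which.max(p))
  reduced <- lm(reformulate(setdiff(features, candidate), "mpg"), data=mtcars)
  if (summary(reduced)$adj.r.squared <= adj.r2) break
  features <- setdiff(features, candidate)
}
reformulate(features, "mpg")
R also ships with a step function that performs stepwise selection, e.g. step(lm(mpg ~ ., data=mtcars), direction="backward"), but it uses AIC rather than the adjusted \(R^2\) as its criterion, so it will not necessarily choose the same model.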
We saw earlier that cyl had the second-highest correlation of any of the features. Why would this feature be dropped, when features with much lower correlation were retained? I have no mathematical proof, but the answer is probably collinearity. We use the adjusted \(R^2\) as the criterion for the quality of a given model, and it penalizes a model based on size; the more features, the larger the penalty. This means that if a model contains two features that are like echoes of each other, then having both provides no new information to the model, but the adjusted \(R^2\) is reduced because of the penalty due to both features being present.
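For reference, the penalty comes straight from the definition of the adjusted \(R^2\): \(\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), where \(n\) is the number of observations and \(p\) is the number of features. Adding a feature increases \(p\), so the raw \(R^2\) must rise enough to offset the shrinking denominator. As a check, the first model above had \(R^2 = 0.869\) with \(n = 32\) and \(p = 10\), and \(1 - (1 - 0.869)\frac{31}{21} \approx 0.807\), matching its reported adjusted \(R^2\).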
What I can’t claim to understand is why cyl would be the feature eliminated. This seems to suggest that it is collinear with a feature that has a higher correlation, and the only feature above cyl is wt.
We can test this theory with a plot that shows the correlation between each pair of features (the panel.cor helper used here is defined at the end of this document).
pairs(mtcars, lower.panel=panel.cor, upper.panel = panel.smooth)
This does not entirely clarify the situation. The correlation between cyl and wt is 0.78, which is hardly the largest number shown. The correlation between cyl and disp is 0.9, so it seems likely that this is where the collinearity exists. But disp had a (slightly) lower correlation with mpg than cyl. Yet, in our first model summary, the P-value for cyl was 0.9161, while for disp it was only 0.4635. I don’t understand this yet, or maybe I understood it once and have forgotten. There is clearly more involved than a feature’s correlation value. But we can see that cyl and disp are highly correlated with each other, and disp has the lower P-value; and that is why cyl got booted from the model.
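One way to probe this, though I have not chased it to a conclusion, is to compare each feature’s solo fit against a joint fit (output not shown):
# Each of cyl and disp predicts mpg well on its own...
summary(lm(mpg ~ cyl, data=mtcars))$coefficients
summary(lm(mpg ~ disp, data=mtcars))$coefficients
# ...but in a joint fit, each coefficient is estimated holding the other,
# nearly collinear, feature fixed, which inflates both P-values.
summary(lm(mpg ~ cyl + disp, data=mtcars))$coefficients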
One other important thing to note is that the upper panels in this plot show a smoothing line for each pair of features, which may indicate how applicable a feature is to linear regression. Not all of these features are good candidates for linear regression, so this entire exercise has been for purposes of illustration rather than hard analysis.
Now some early attempts to analyze a linear model to determine its value or validity.
You can simply feed a linear model into the plot function and it will create four diagnostic plots for you: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. Of course, the trick is knowing what these plots are, how they are built, and what they tell you.
par(mfrow=c(2,2))
plot(mtcars.refine.model)
par(mfrow=c(1,1))
This plots the observed values against the fitted values.
plot(mtcars.refine.model$fitted, mtcars$mpg, ylab="Observed", xlab="Fitted")
There should be a strong correlation between fitted and observed; in fact, for a least-squares fit, the square of this correlation is exactly the model’s multiple \(R^2\).
cor(mtcars.refine.model$fitted, mtcars$mpg)
## [1] 0.9262
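And indeed, \(0.9262^2 \approx 0.858\), matching the multiple \(R^2\) reported by the most recent summary above.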
This is a residual plot; it plots the residuals against the fitted values, the same as the first panel in the 2x2 grid above, except that the outliers are not labeled.
plot(mtcars.refine.model$fitted, mtcars.refine.model$residuals, xlab="Fitted", ylab="Residuals")
abline(a=0,b=0, lty=3)
Cook’s Distance is a commonly used measure of the influence of individual points on a linear model. The question for me is how to interpret it: if there are no outliers, is all good?
plot(cooks.distance(mtcars.refine.model))
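Common rules of thumb (heuristics, not hard laws) flag a point as influential if its distance exceeds \(4/n\), or, more conservatively, 1. A quick check under the \(4/n\) rule:
# Flag observations whose Cook's distance exceeds 4/n
cd <- cooks.distance(mtcars.refine.model)
cd[cd > 4 / nrow(mtcars)]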
Questions remain on what to do with these in terms of assessing the model. Stepwise reduction gave us the best model; do these other tools tell us that this “best” model just isn’t good enough? How do we use R to determine whether the data is amenable to linear regression, or for that matter logistic regression, as opposed to just eyeballing it?
One tool we can use to analyze the model is ANOVA. Note that anova on an lm object computes sequential (Type I) sums of squares: each row tests a feature after accounting only for the features listed above it, which is why these P-values differ from the t-test P-values reported by summary.
anova(lm(mpg ~ ., data=mtcars))
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 1 818 818 116.42 5e-10 ***
## disp 1 38 38 5.35 0.0309 *
## hp 1 9 9 1.33 0.2610
## drat 1 16 16 2.34 0.1406
## wt 1 77 77 11.03 0.0032 **
## qsec 1 4 4 0.56 0.4617
## vs 1 0 0 0.02 0.8932
## am 1 14 14 2.06 0.1659
## gear 1 1 1 0.14 0.7137
## carb 1 0 0 0.06 0.8122
## Residuals 21 147 7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And here is the ANOVA table for mtcars.refine.model. (Note that this variable currently holds the last model we fitted, mpg ~ hp + wt + qsec + am, not the chosen final model that retains disp.)
anova(mtcars.refine.model)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 678 678 114.43 3.3e-11 ***
## wt 1 253 253 42.61 5.3e-07 ***
## qsec 1 9 9 1.52 0.229
## am 1 26 26 4.38 0.046 *
## Residuals 27 160 6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I also just read about something else: the variance inflation factor, available via the vif function in the car package. The following computes it for a model with all of the features. As a rule of thumb, any feature shown in the result with a value of 5 or higher shows signs of collinearity.
library(car)
## Warning: package 'car' was built under R version 3.1.2
vif(lm(mpg ~ ., data=mtcars))
## cyl disp hp drat wt qsec vs am gear carb
## 15.374 21.620 9.832 3.375 15.165 7.528 4.966 4.648 5.357 7.909
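As I understand it, the VIF for a feature is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that feature on all of the other explanatory variables. A hand-rolled check for wt, which should come out close to the 15.165 shown above:
# Regress wt on every column except mpg and wt itself
wt.r2 <- summary(lm(wt ~ . - mpg, data=mtcars))$r.squared
1 / (1 - wt.r2)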
Several features with values above 5 made it into our final model, so VIF alone is apparently not the whole story. Now let’s try vif on our final model.
vif(lm(mpg ~ disp + hp + wt + qsec + am, data=mtcars))
## disp hp wt qsec am
## 9.072 5.195 7.171 3.791 2.887
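Even in the final model, disp, hp, and wt all come in above the rule-of-thumb cutoff of 5, which fits the earlier observation that the best-predicting model can still tolerate some collinearity. A pairs plot of just the final-model features tells the same story: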
pairs(mtcars[,c("mpg","disp","hp","wt","qsec","am")], lower.panel=panel.cor, upper.panel = panel.smooth)
Some other miscellaneous topics.
The reformulate function allows you to use character vectors to specify one or both sides of a formula. The termlabels parameter must be a character vector containing the names of the terms on the right side (explanatory) of the formula. The response parameter can be a character string, or else a symbol or call, giving the left side (response) of the formula.
This might be useful if you need the right side of a formula to be a subset of fields, and you have the indices for that subset, but it would be tedious to build the list by hand.
# mpg is the response variable
# disp is the explanatory variable
mtcars.disp.formula <- reformulate(c("disp"), "mpg")
# or, assuming we know the index of the column:
mtcars.disp.formula <- reformulate(names(mtcars)[3], names(mtcars)[1])
# here is the traditional way of creating the formula
mtcars.disp.lm <- lm(mpg ~ disp, data=mtcars)
# and now with reformulate
mtcars.disp.lm <- lm(mtcars.disp.formula, data=mtcars)
plot(mpg ~ disp, data=mtcars)
abline(mtcars.disp.lm)
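And here is the subset-of-indices case mentioned above, with the indices chosen arbitrarily for illustration:
# Build a formula from a subset of column indices
mtcars.subset.formula <- reformulate(names(mtcars)[c(3, 4, 7)], "mpg")
mtcars.subset.formula
## mpg ~ disp + hp + qsec
Finally, a few single-feature fits, plotted side by side: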
mtcars.cyl.lm <- lm(mpg ~ cyl, data=mtcars)
mtcars.disp.lm <- lm(mpg ~ disp, data=mtcars)
mtcars.hp.lm <- lm(mpg ~ hp, data=mtcars)
par(mfrow=c(2,2))
plot(mpg~cyl, data=mtcars)
abline(mtcars.cyl.lm)
plot(mpg~disp, data=mtcars)
abline(mtcars.disp.lm)
plot(mpg~hp, data=mtcars)
abline(mtcars.hp.lm)
The following functions are defined in R chunks earlier in the document, but are not displayed there, so they are shown here. panel.cor is adapted from the examples in ?pairs; panel.smooth, used in the pairs calls above, is built into base R graphics.
# panel.cor: a pairs() panel function that prints the absolute correlation
# of x and y in the panel, scaling the text size so that stronger
# correlations are drawn larger.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    # save the coordinate system and restore it on exit
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    # format r to the requested number of digits
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
# panel.lmline: a pairs() panel function that plots the points and
# overlays the least-squares regression line of y on x.
panel.lmline <- function (x, y, col = par("col"), bg = NA, pch = par("pch"),
                          cex = 1, col.smooth = "red", ...)
{
    points(x, y, pch = pch, col = col, bg = bg, cex = cex)
    # fit the line only to finite (non-NA) pairs
    ok <- is.finite(x) & is.finite(y)
    if (any(ok))
        abline(lm(y[ok] ~ x[ok]), col = col.smooth, ...)
}