This post provides some quick examples of tools for model selection, the diagnosis of multicollinearity, and the calculation of the partial coefficient of determination. These are examples we discussed in class.
## LOAD LIBRARIES
library(car)
library(leaps)
library(foreign)
Load only the data table (the dBase .dbf file) from the USA county shapefile. We've used this file before; it includes county-level returns from the 2004 presidential election and a bunch of demographic data.
USA <- read.dbf("/PATH_TO_THE_FILE/2004_Election_Counties.dbf")
lm1 <- lm(Bush_pct ~ pcturban + pctfemhh + pctpoor + HISP_LAT + MEDAGE2000,
USA)
## Calculate variance inflation factors
vif(lm1)
## pcturban pctfemhh pctpoor HISP_LAT MEDAGE2000
## 1.736 2.206 2.147 1.287 1.282
The model lm1 seems OK in terms of multicollinearity. However, it explains only ~20% of the variance in the percentage of people who voted for George Bush in 2004.
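You can confirm that figure yourself by pulling the \( R^2 \) directly out of the fitted model (a quick check, nothing beyond the lm1 object above):

summary(lm1)$r.squared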
To illustrate the utility of vif(), add some multicollinearity to the model by including \( pctpoor^2 \):
lm2 <- lm(Bush_pct ~ pcturban + pctfemhh + pctpoor + I(pctpoor^2) + HISP_LAT +
MEDAGE2000, USA)
vif(lm2)
## pcturban pctfemhh pctpoor I(pctpoor^2) HISP_LAT
## 1.751 2.206 10.664 10.114 1.291
## MEDAGE2000
## 1.356
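For the record, a standard remedy (not part of the original example) is to center pctpoor before squaring it, which should pull the inflated VIFs back down. A minimal sketch, assuming the same USA data frame; pctpoor.c is a new column created here:

# Center pctpoor, then square the centered version
USA$pctpoor.c <- USA$pctpoor - mean(USA$pctpoor, na.rm = TRUE)
lm2c <- lm(Bush_pct ~ pcturban + pctfemhh + pctpoor.c + I(pctpoor.c^2) +
    HISP_LAT + MEDAGE2000, USA)
vif(lm2c)  # the VIFs for the two poverty terms should drop sharply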
## regsubsets()
In the code below we use regsubsets() from the leaps library to search all possible 1-, 2-, 3-, …, 20-variable models. Please don't forget our discussion of the pros and cons of these theory-free automated model selection procedures.
d <- regsubsets(Bush_pct ~ ., nbest = 1, nvmax = 20, data = USA[, c(16:38, 13)])
d.fit <- summary(d)
# Plot Layout
layout(matrix(c(1, 2, 3, 3), 2, 2, byrow = TRUE), respect = TRUE)
# Generate scree plots for the fit statistics; include the regsubsets plot
# at the bottom.
plot(1:20, y = d.fit$adjr2, type = "o", xlab = "Number of Parameters", ylab = "Adj. r^2")
plot(1:20, y = d.fit$bic, type = "o", xlab = "Number of Parameters", ylab = "BIC")
plot(d)
The 10-variable model seems to balance fit and parsimony (based on my read of the scree plots of BIC and adjusted \( R^2 \)). However, regsubsets makes finding the variables in this model a pain: you could try reading the regsubsets plot, or you could type:
# Here I am looking at the 10th row of the regsubsets object. It contains
# the variables in the best 10-variable model.
summary(d)$which[10, ]
## (Intercept) MDratio hosp pcthisp pcturban urbrural
## TRUE FALSE TRUE TRUE FALSE FALSE
## pctfemhh pcincome pctpoor pctlt9ed pcthsed pctcoled
## TRUE FALSE TRUE FALSE FALSE TRUE
## unemploy pctwhtcl homevalu rent popdens crowded
## TRUE FALSE TRUE FALSE FALSE TRUE
## ginirev SmokecurM SmokevrM SmokecurF SmokevrF Obese
## TRUE FALSE FALSE FALSE TRUE FALSE
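If you'd rather not read that table by eye, the selected variable names can be pulled out programmatically (a small convenience, not in the original output):

# Keep only the columns flagged TRUE in row 10, then drop the intercept
names(which(summary(d)$which[10, ]))[-1]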
Partial coefficients of determination allow us to assess the marginal contribution of a variable given that other variables are already in the model. See lecture notes for a complete discussion.
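As a reminder, for a predictor \( x_k \) the partial coefficient of determination compares the error sum of squares of the model with and without that variable:

\[ R^2_{x_k \mid \text{others}} = \frac{SSE_{reduced} - SSE_{full}}{SSE_{reduced}} \]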
Here we will build the 10-variable model suggested by regsubsets and check the marginal extra sum of squares for each variable using the lmSupport library.
lm10 <- lm(Bush_pct ~ hosp + pcthisp + pctfemhh + pctpoor + pctcoled + unemploy +
homevalu + crowded + ginirev + SmokevrF, USA)
summary(lm10)
##
## Call:
## lm(formula = Bush_pct ~ hosp + pcthisp + pctfemhh + pctpoor +
## pctcoled + unemploy + homevalu + crowded + ginirev + SmokevrF,
## data = USA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.50 -6.44 0.66 6.84 29.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.45e+01 1.23e+00 44.44 < 2e-16 ***
## hosp 9.54e-02 2.39e-02 3.99 6.7e-05 ***
## pcthisp -1.03e-02 2.11e-03 -4.85 1.3e-06 ***
## pctfemhh -1.29e+00 4.72e-02 -27.41 < 2e-16 ***
## pctpoor -5.31e-01 4.48e-02 -11.84 < 2e-16 ***
## pctcoled -3.94e-01 3.97e-02 -9.94 < 2e-16 ***
## unemploy -7.26e-01 7.36e-02 -9.87 < 2e-16 ***
## homevalu -6.54e-05 8.47e-06 -7.72 1.5e-14 ***
## crowded 9.29e-01 9.54e-02 9.75 < 2e-16 ***
## ginirev 1.24e+02 4.71e+00 26.37 < 2e-16 ***
## SmokevrF -2.50e+01 3.06e+00 -8.15 5.4e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.77 on 3100 degrees of freedom
## Multiple R-squared: 0.416, Adjusted R-squared: 0.414
## F-statistic: 221 on 10 and 3100 DF, p-value: <2e-16
vif(lm10)
## hosp pcthisp pctfemhh pctpoor pctcoled unemploy homevalu crowded
## 1.145 1.797 2.191 4.357 2.192 1.733 2.537 3.024
## ginirev SmokevrF
## 3.180 1.739
# plot(lm10)  # uncomment to see the standard residual diagnostic plots
By these very quick-and-dirty diagnostics the model seems OK and explains around 40% of the variance in votes for Bush. We can test each individual predictor with term-wise deletion using drop1():
drop1(lm10, test = "F")
As expected, all terms are significant. However, it might be useful to see how much of the variance each of these variables explains given that the others are already in the model:
library(lmSupport)
lm.sumSquares(lm10)
## SS dR-sqr pEta-sqr df F p-value
## (Intercept) 188300 0.3719 0.3891 1 1974.47 0e+00
## hosp 1519 0.0030 0.0051 1 15.93 1e-04
## pcthisp 2243 0.0044 0.0075 1 23.52 0e+00
## pctfemhh 71654 0.1415 0.1951 1 751.35 0e+00
## pctpoor 13366 0.0264 0.0433 1 140.16 0e+00
## pctcoled 9421 0.0186 0.0309 1 98.79 0e+00
## unemploy 9287 0.0183 0.0305 1 97.38 0e+00
## homevalu 5686 0.0112 0.0189 1 59.62 0e+00
## crowded 9057 0.0179 0.0297 1 94.97 0e+00
## ginirev 66296 0.1309 0.1832 1 695.17 0e+00
## SmokevrF 6330 0.0125 0.0210 1 66.37 0e+00
## Error (SSE) 295638 NA NA 3100 NA NA
## Total (SST) 506358 NA NA NA NA NA
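As an aside, the dR-sqr and pEta-sqr columns can be reproduced by hand from the marginal sums of squares that drop1() reports. Here is a sketch (my own, not from the original post; it skips the (Intercept) row):

d1 <- drop1(lm10)  # marginal sums of squares for each term
ss <- d1[["Sum of Sq"]][-1]  # drop the <none> row
sse <- deviance(lm10)  # residual sum of squares of the full model
y <- model.response(model.frame(lm10))
sst <- sum((y - mean(y))^2)  # total sum of squares
ss/sst  # dR-sqr: the drop in R^2 when each term is removed
ss/(ss + sse)  # pEta-sqr: partial R^2 given the other terms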
With the exception of pctfemhh and ginirev, most variables have small partial coefficients of determination: 8 of the 10 variables are significant but not very powerful predictors. In fact, a model containing only those two variables, pctfemhh and ginirev, explains about 30% of the variation in votes for Bush.
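That last claim is easy to check (a quick sketch; lm2var is just my name for the reduced model):

lm2var <- lm(Bush_pct ~ pctfemhh + ginirev, USA)
summary(lm2var)$r.squared  # should come out near 0.30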