Misc Examples from Lecture:

Model Selection Machines, Extra Sum of Squares and Multicollinearity

This post provides some quick examples of tools for model selection, the diagnosis of multicollinearity, and the calculation of the partial coefficient of determination, all of which we discussed in class.

## LOAD LIBRARIES
library(car)
library(leaps)
library(foreign)

Load only the data table (the .dbf file) from the USA county shapefile. We've used this file before; it includes county-level returns from the 2004 presidential election and a collection of demographic variables.

USA <- read.dbf("/PATH_TO_THE_FILE/2004_Election_Counties.dbf")

Variance Inflation Factors

lm1 <- lm(Bush_pct ~ pcturban + pctfemhh + pctpoor + HISP_LAT + MEDAGE2000, 
    USA)

## Calculate variance inflation factors
vif(lm1)
##   pcturban   pctfemhh    pctpoor   HISP_LAT MEDAGE2000 
##      1.736      2.206      2.147      1.287      1.282

The model lm1 seems OK in terms of multicollinearity. However, it explains only ~20% of the variance in the percentage of votes for George Bush in 2004.
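As a reminder of what vif() reports, here is a quick manual check. This is just a sketch: the VIF for pcturban is \( 1/(1-R^2) \), where \( R^2 \) comes from regressing pcturban on the other predictors in lm1.

# Manual VIF for pcturban: regress it on the remaining predictors
r2 <- summary(lm(pcturban ~ pctfemhh + pctpoor + HISP_LAT + MEDAGE2000, USA))$r.squared
1/(1 - r2)  # should match vif(lm1)["pcturban"], ~1.74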

To illustrate the utility of vif(), we can introduce some multicollinearity by adding a \( pctpoor^2 \) term to the model:

lm2 <- lm(Bush_pct ~ pcturban + pctfemhh + pctpoor + I(pctpoor^2) + HISP_LAT + 
    MEDAGE2000, USA)
vif(lm2)
##     pcturban     pctfemhh      pctpoor I(pctpoor^2)     HISP_LAT 
##        1.751        2.206       10.664       10.114        1.291 
##   MEDAGE2000 
##        1.356
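The inflated VIFs come from pctpoor and its square being strongly correlated. One common remedy, sketched below as an aside (we did not cover this in class), is to center pctpoor before squaring:

# Centering pctpoor before squaring usually reduces the x vs. x^2 correlation
USA$pctpoor_c <- USA$pctpoor - mean(USA$pctpoor, na.rm = TRUE)
lm2c <- lm(Bush_pct ~ pcturban + pctfemhh + pctpoor_c + I(pctpoor_c^2) + HISP_LAT + 
    MEDAGE2000, USA)
vif(lm2c)  # the VIFs for the two poverty terms should drop substantially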

Model Selection with regsubsets()

In the code below we use regsubsets() from the leaps library to search all possible models and report the best model of each size, from 1 to 20 variables. Please don't forget our discussion of the pros and cons of these theory-free automated model selection procedures.

d <- regsubsets(Bush_pct ~ ., nbest = 1, nvmax = 20, data = USA[, c(16:38, 13)])
d.fit <- summary(d)
# Plot Layout
layout(matrix(c(1, 2, 3, 3), 2, 2, byrow = TRUE), respect = TRUE)

# Generate scree plots for the fit statistics and include the regsubsets
# plot at the bottom.
plot(1:20, y = d.fit$adjr2, type = "o", xlab = "Number of Parameters", ylab = "Adj. r^2")
plot(1:20, y = d.fit$bic, type = "o", xlab = "Number of Parameters", ylab = "BIC")
plot(d)

[Figure: scree plots of adjusted \( R^2 \) and BIC by model size, with the regsubsets selection plot below.]
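Rather than relying on the eyeball alone, you can ask which model size each criterion prefers; these may or may not agree with each other (or with my read of the plots):

# Model size minimizing BIC and maximizing adjusted R^2
which.min(d.fit$bic)
which.max(d.fit$adjr2)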

The 10-variable model seems to balance fit and parsimony (based on my read of the scree plots of BIC and adjusted \( R^2 \)). However, regsubsets makes finding the variables in this model a pain: you could try reading the regsubsets plot, or you could type:

# Here I am looking at the 10th row of the regsubsets object. It contains
# the variables in the best 10-variable model.
summary(d)$which[10, ]
## (Intercept)     MDratio        hosp     pcthisp    pcturban    urbrural 
##        TRUE       FALSE        TRUE        TRUE       FALSE       FALSE 
##    pctfemhh    pcincome     pctpoor    pctlt9ed     pcthsed    pctcoled 
##        TRUE       FALSE        TRUE       FALSE       FALSE        TRUE 
##    unemploy    pctwhtcl    homevalu        rent     popdens     crowded 
##        TRUE       FALSE        TRUE       FALSE       FALSE        TRUE 
##     ginirev   SmokecurM    SmokevrM   SmokecurF    SmokevrF       Obese 
##        TRUE       FALSE       FALSE       FALSE        TRUE       FALSE
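If you just want the names of the selected variables rather than the TRUE/FALSE table, a one-liner does it (the [-1] drops the intercept, which is always included):

# Names of the variables in the best 10-variable model
names(which(summary(d)$which[10, ]))[-1]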

The Partial Coefficient of Determination

Partial coefficients of determination allow us to assess the marginal contribution of a variable given that other variables are already in the model. See lecture notes for a complete discussion.
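Formally, for a variable \( X_k \) added to a model that already contains the other predictors, the partial coefficient of determination is \( R^2_{k \mid \text{others}} = \frac{SSE(\text{reduced}) - SSE(\text{full})}{SSE(\text{reduced})} \), i.e. the share of the previously unexplained variation that \( X_k \) accounts for.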

Here we will build the 10-variable model suggested by regsubsets and check the marginal extra sum of squares for each variable using the lmSupport library.

lm10 <- lm(Bush_pct ~ hosp + pcthisp + pctfemhh + pctpoor + pctcoled + unemploy + 
    homevalu + crowded + ginirev + SmokevrF, USA)
summary(lm10)
## 
## Call:
## lm(formula = Bush_pct ~ hosp + pcthisp + pctfemhh + pctpoor + 
##     pctcoled + unemploy + homevalu + crowded + ginirev + SmokevrF, 
##     data = USA)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.50  -6.44   0.66   6.84  29.97 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.45e+01   1.23e+00   44.44  < 2e-16 ***
## hosp         9.54e-02   2.39e-02    3.99  6.7e-05 ***
## pcthisp     -1.03e-02   2.11e-03   -4.85  1.3e-06 ***
## pctfemhh    -1.29e+00   4.72e-02  -27.41  < 2e-16 ***
## pctpoor     -5.31e-01   4.48e-02  -11.84  < 2e-16 ***
## pctcoled    -3.94e-01   3.97e-02   -9.94  < 2e-16 ***
## unemploy    -7.26e-01   7.36e-02   -9.87  < 2e-16 ***
## homevalu    -6.54e-05   8.47e-06   -7.72  1.5e-14 ***
## crowded      9.29e-01   9.54e-02    9.75  < 2e-16 ***
## ginirev      1.24e+02   4.71e+00   26.37  < 2e-16 ***
## SmokevrF    -2.50e+01   3.06e+00   -8.15  5.4e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 9.77 on 3100 degrees of freedom
## Multiple R-squared: 0.416,   Adjusted R-squared: 0.414 
## F-statistic:  221 on 10 and 3100 DF,  p-value: <2e-16
vif(lm10)
##     hosp  pcthisp pctfemhh  pctpoor pctcoled unemploy homevalu  crowded 
##    1.145    1.797    2.191    4.357    2.192    1.733    2.537    3.024 
##  ginirev SmokevrF 
##    3.180    1.739
# plot(lm10)

By very quick and dirty diagnostics the model seems OK, and it explains around 40% of the variance in votes for Bush. We can test each individual predictor with term-wise deletion using drop1():

drop1(lm10, test = "F")
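Since car is already loaded, Anova() produces the same marginal (Type II) F tests in a single table; this is just an alternative route to the drop1() results:

# Type II (marginal) tests; equivalent to the drop1 F tests for this model
Anova(lm10)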

As expected, all terms are significant. However, it might be useful to see how much of the variance each variable explains given that the others are already in the model:

library(lmSupport)
lm.sumSquares(lm10)
##                 SS dR-sqr pEta-sqr   df       F p-value
## (Intercept) 188300 0.3719   0.3891    1 1974.47   0e+00
## hosp          1519 0.0030   0.0051    1   15.93   1e-04
## pcthisp       2243 0.0044   0.0075    1   23.52   0e+00
## pctfemhh     71654 0.1415   0.1951    1  751.35   0e+00
## pctpoor      13366 0.0264   0.0433    1  140.16   0e+00
## pctcoled      9421 0.0186   0.0309    1   98.79   0e+00
## unemploy      9287 0.0183   0.0305    1   97.38   0e+00
## homevalu      5686 0.0112   0.0189    1   59.62   0e+00
## crowded       9057 0.0179   0.0297    1   94.97   0e+00
## ginirev      66296 0.1309   0.1832    1  695.17   0e+00
## SmokevrF      6330 0.0125   0.0210    1   66.37   0e+00
## Error (SSE) 295638     NA       NA 3100      NA      NA
## Total (SST) 506358     NA       NA   NA      NA      NA
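As a sanity check on the pEta-sqr column, the partial \( R^2 \) for a single variable can be computed by hand from the extra sum of squares. For ginirev (a sketch using deviance(), which returns the residual sum of squares of an lm fit):

# Partial R^2 for ginirev: extra SS / (extra SS + SSE of the full model)
lm9 <- update(lm10, . ~ . - ginirev)
ss_extra <- deviance(lm9) - deviance(lm10)
ss_extra/(ss_extra + deviance(lm10))  # ~0.183, matching pEta-sqr above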

With the exception of pctfemhh and ginirev, the variables have small partial coefficients of determination: 8 of the 10 are statistically significant but not very powerful predictors. In fact, a model with only those two variables explains about 30% of the variation in votes for Bush.
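You can verify that last claim directly (per the discussion above, this should come out near 0.30):

# R^2 for the two-variable model
summary(lm(Bush_pct ~ pctfemhh + ginirev, USA))$r.squared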