Multiple Regression

Voting patterns can be complex and difficult to understand. What factors shape voting behavior at the county level? The answer to this question has significant implications for the balance of power in local, state, and federal governments. Wouldn't it be nice if we could predict voting behavior, if we could tell who was likely to win an election based on the characteristics of a county's population? If we figured this out I'm sure we'd be popular on cable news shows. Using regression to predict voting behavior is a tall order, but we can use regression to describe how a set of variables is associated with voting. What characteristics of a county are significantly associated with turnout for Republican candidates? Which factors are the most important, the least important, or irrelevant to voting behavior?

The answers to these questions are unknown, or at least honest and intelligent people can disagree about them. Instead of developing a regression equation with one intercept and one slope, we are going to fit a multiple regression model. In a multiple regression model the b's (or \( \mathbf{\beta} \)'s) describe the strength of the relationship between each variable and the outcome. Each regression coefficient describes the relationship between one of the x's and Y independent of all of the other variables in the model. You can read a regression coefficient in the following way: "Holding all else constant, if variable m increases by 1 unit, Y will increase by \( \beta_{m} \) units," where \( \beta_{m} \) is the regression coefficient for variable m.
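Written out as an equation, a multiple regression model with m predictor variables is

\( Y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{m}x_{m} + \varepsilon , \)

where \( \beta_{0} \) is the intercept, each \( \beta_{j} \) is the slope for variable \( x_{j} \), and \( \varepsilon \) is the error term. The fitted model replaces the \( \beta \)'s with the estimated coefficients (the b's) and drops the error term.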

A little knowledge is a dangerous thing. Today, you'll learn to fit a multiple regression with spatial data without a full understanding of the mathematical and theoretical issues that arise in this case. Our goal today is to gain some exposure to a really useful technique and see how that technique is practiced in applied geographic analysis. This exercise is just the tip of the iceberg, meant to motivate interest in the potential of statistics to provide useful insights into complex problems.

## INSTALL LIBRARIES FOR SPATIAL DATA
install.packages("sp")
## Error: trying to use CRAN without setting a mirror
library(sp)
library(maptools)
library(classInt)
library(RColorBrewer)

# READ DATA
USA <- readShapePoly("/Users/seth/Dropbox/Teaching/Principles and Methods of GIS/2009/County Voting/2004_Election_Counties.shp")

Plot Data

## IT'S A MAP
plot(USA)

[Figure: map of US counties]

# Remove counties with no votes
USA <- USA[USA$Total > 0, ]

The USA object you just created is a SpatialPolygonsDataFrame, an object made up of several pieces (slots); we won't discuss them all here, but you can see them by typing "USA." SpatialPolygonsDataFrames have special behavior: if you use the plot() function they draw a map. If you want to see only the attribute data you have to access the "data slot." This is done using the "@" symbol.

# List the pieces of the file

slotNames(USA)

# summarize the file

summary(USA)

# summarize the data

summary(USA@data)

# plotting the data slot is like plotting a regular table; with this many
# columns the default pairs plot overflows the figure margins, so plot a
# few columns at a time
plot(USA@data[, c("pcturban", "pctpoor", "pctfemhh")])
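If you just want to inspect the attribute table, the data slot behaves like an ordinary data frame, so the usual data-frame tools work on it. A quick sketch (the columns shown are ones used later in this exercise):

# first few rows of selected columns from the attribute table
head(USA@data[, c("Bush", "Total", "pcturban")])

# number of counties (rows) in the attribute table
nrow(USA@data)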

Making Maps in R

Drawing a map in R is way more complicated than it needs to be. Here I'm presenting a simple method for making thematic maps; because it is simple, this method doesn't allow you to easily create legends and marginalia. The most robust method for creating maps is the spplot() function from the sp package, which is built on the "lattice" package.
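For comparison, here is a minimal spplot() sketch; it maps the Total column (any numeric column in the attribute table would do) and draws the legend automatically:

# lattice-based thematic map with an automatic legend
spplot(USA, "Total", main = "Total votes cast by county, 2004")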

Decide what colors we want on our map. To do this we have to select a palette. The package “RColorBrewer”, which you should have already loaded using a call to library(), contains several palettes especially designed for cartography and data visualization:

display.brewer.all()

[Figure: all RColorBrewer palettes]


# let's make a 7-color 'Spectral' palette

pal7 <- brewer.pal(7, "Spectral")

# to see the colors

display.brewer.pal(7, "Spectral")

[Figure: the 7-color Spectral palette]

The next step is to associate the data with colors so that we can "see" the data on the map. To do this we have to divide the data into groups, which is done with the "classInt" package. We will divide the counties into seven groups: each group corresponds to one color in our palette, and each group contains the same number of counties. Each group will be displayed with a different color on the map, and within each group the percent of people who voted for Bush will be similar.

# create a column that holds the percent of

# all votes that went to G.W. Bush in 2004

USA$BushPct <- USA$Bush/USA$Total

# create categories

cats7 <- classIntervals(USA$BushPct, n = 7, style = "quantile")

cats7
## style: quantile
## [0.09308,0.4746)  [0.4746,0.5415)  [0.5415,0.5907)  [0.5907,0.6336) 
##              444              444              444              444 
##  [0.6336,0.6801)  [0.6801,0.7421)  [0.7421,0.9283] 
##              444              444              444

# the output shows the range of BushPct within each category; each group
# has about 444 counties (one seventh of the total)
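Quantile breaks are just one option; the style argument to classIntervals() accepts other classification schemes (for example "equal" for equal-width intervals or "jenks" for natural breaks). A quick sketch if you want to compare:

# equal-width intervals instead of quantiles
cats7equal <- classIntervals(USA$BushPct, n = 7, style = "equal")
cats7equal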

Now connect our categories to our palette with findColours(). It is spelled the un-American way, with a "u"; since it's free we'll have to forgive their royalist "u." Once the categories and the palette have been connected we can draw a map. Note that in the call to plot() we do not specify a variable; the variable was already chosen when we built the categories with classIntervals():

SevenColors <- findColours(cats7, pal7)

# draw the map using the specified colors
plot(USA, col = SevenColors)

[Figure: map of counties colored by BushPct quantile class]

This map would earn you an F in cartography!!!

Those of you who have taken cartography will notice that this map isn't very good: it uses a diverging color scheme to represent continuous data. Diverging schemes like "Spectral" are usually used to show deviations from a central value, such as the mean. Let's convert our BushPct column into standard units so that the mean is equal to zero and we really are mapping deviations from the mean.

# Create a new column to hold the standardized percent bush

USA$BushPctZ <- (USA$BushPct - mean(USA$BushPct, na.rm = TRUE))/sd(USA$BushPct, 
    na.rm = TRUE)

# Create new categories with the standardized column and map it
pal7 <- brewer.pal(7, "Spectral")
cats7 <- classIntervals(USA$BushPctZ, n = 7, style = "quantile")

SevenColors <- findColours(cats7, pal7)

plot(USA, col = SevenColors)

[Figure: map of counties colored by standardized BushPct (BushPctZ)]

In this map the red areas have below-average BushPct and the blue areas are above average. Let's not dwell on the map too much; this is a class on regression, not cartography. Our goal is to try to explain the pattern on this map with regression analysis. Are the patterns of high and low values due to the characteristics of the population in those red and blue regions?

Multiple Regression

We will try to predict the percent of votes for George Bush based upon the percent urban, the percent of households that are female headed, the percent poor, the Hispanic/Latino population (HISP_LAT), and the median age.

lm1 <- lm(BushPct ~ pcturban + pctfemhh + pctpoor + HISP_LAT + MEDAGE2000, USA)

summary(lm1)
## 
## Call:
## lm(formula = BushPct ~ pcturban + pctfemhh + pctpoor + HISP_LAT + 
##     MEDAGE2000, data = USA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5687 -0.0733  0.0102  0.0808  0.2708 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.82e-01   2.43e-02   32.23   <2e-16 ***
## pcturban    -1.32e-05   8.88e-05   -0.15    0.882    
## pctfemhh    -1.35e-02   5.35e-04  -25.19   <2e-16 ***
## pctpoor      2.97e-03   3.55e-04    8.36   <2e-16 ***
## HISP_LAT    -2.91e-04   1.87e-04   -1.56    0.119    
## MEDAGE2000  -1.25e-03   5.65e-04   -2.21    0.027 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.11 on 3102 degrees of freedom
## Multiple R-squared: 0.242,   Adjusted R-squared: 0.241 
## F-statistic:  198 on 5 and 3102 DF,  p-value: <2e-16

Do the signs of the coefficients align with your expectations? Now we should calculate the residuals and make some diagnostic plots.

lm1.resid <- resid(lm1)

plot(lm1.resid ~ USA$BushPct)

[Figure: lm1 residuals plotted against BushPct]

These residuals look really bad! Let's try again.

lmBush <- lm(BushPct ~ pctfemhh + homevalu, data = USA)
summary(lmBush)
## 
## Call:
## lm(formula = BushPct ~ pctfemhh + homevalu, data = USA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6082 -0.0731  0.0089  0.0779  0.3558 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.77e-01   5.60e-03   138.9   <2e-16 ***
## pctfemhh    -1.01e-02   3.61e-04   -27.9   <2e-16 ***
## homevalu    -7.60e-07   6.01e-08   -12.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.109 on 3105 degrees of freedom
## Multiple R-squared: 0.256,   Adjusted R-squared: 0.255 
## F-statistic:  533 on 2 and 3105 DF,  p-value: <2e-16

lmBush.resid <- resid(lmBush)

plot(lmBush.resid ~ USA$BushPct)

[Figure: lmBush residuals plotted against BushPct]

The residuals still look bad! What might be the problem?

There might be a variable missing from our analysis. Are there geographic patterns in the residuals? What might cause these patterns? Sometimes mapping the residuals can suggest a missing (omitted) variable. What are the implications of these patterns for the regression model? For now let's just focus on describing the geographic patterns; to do this we have to create a map of the residuals:

## add a new column to USA to hold the residuals

USA$resid <- resid(lmBush)

pal3 <- brewer.pal(3, "Spectral")
cats3 <- classIntervals(USA$resid, n = 3, style = "quantile")

ThreeColors <- findColours(cats3, pal3)

plot(USA, col = ThreeColors)

[Figure: map of lmBush residuals in three quantile classes]

There is a clear pattern to the residuals, which suggests some sort of missing variable, but what? Continue this exercise with the aim of finding the best possible model. Remember to make all of the diagnostic plots discussed in lecture (a sketch of these follows the list):

1. Plot the residuals against the model variables.
2. Plot the residuals against the fitted values (use predict(model) to get fitted values).
3. Plot the residuals against any omitted variable; this can be especially helpful.
4. Check the residuals for normality (QQ plot and/or Shapiro-Wilk test).
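As a starting point, here is a minimal sketch of those diagnostics for lmBush; MEDAGE2000 stands in here for a candidate omitted variable, but swap in whichever variable you are considering:

# 1. residuals against a model variable
plot(lmBush.resid ~ USA$pctfemhh)

# 2. residuals against the fitted values
plot(lmBush.resid ~ predict(lmBush))

# 3. residuals against a candidate omitted variable
plot(lmBush.resid ~ USA$MEDAGE2000)

# 4. check the residuals for normality
qqnorm(lmBush.resid)
qqline(lmBush.resid)
shapiro.test(lmBush.resid)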

To compare two models you can use anova(mdl1, mdl2) as discussed in lecture.
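Remember that the anova() F-test is only meaningful when one model is nested inside the other (the smaller model's variables are a subset of the larger model's). A minimal sketch, testing whether adding pctpoor improves lmBush:

# fit a larger model that contains lmBush plus one extra variable
lmBush2 <- lm(BushPct ~ pctfemhh + homevalu + pctpoor, data = USA)

# a small p-value suggests the added variable improves the fit
anova(lmBush, lmBush2)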

The person with the best model wins!!

The person with the best map is cool too.