Spatial Regression

Disclaimer: The contents of this document come from An Introduction to Spatial Econometrics in R (Sarmiento-Barbieri, 2016). This document is prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any question, contact the instructor, Yongsung Lee, Ph.D. via yongsung.lee(at)gatech.edu.

This document is also published on RPubs.

library(maptools) # read in shapefiles to an R session 
library(sp) 
library(spdep)
library(tmap)
library(RColorBrewer)

Motivation

Multivariate regression helps estimate the (separate) effects of explanatory variables on the dependent variable of interest under certain assumptions (e.g., no omitted variable and no correlation among residuals).
When we estimate regression models on a set of observations, which are spatially related, we may have spatial autocorrelation, or spatial dependence between its dependent variable or between its residuals. As a result, our models produce either/both biased estimates or/and inefficient standard errors. In any case, these models cannot help statistically infer underlying population generating processes.
Thus, we need to handle spatial dependence between observations.

Learning Objectives

Learn how to test the presence of spatial dependence in a data set
Learn how to estimate regression models, which handles spatial dependence
- Spatial lag model (i.e., SAR), when the dependent variable is spatially correlated
- Spatial error model (i.e., SEM), when the residuals are spatially correlated

3 Introduction to spatial analysis in `R`

3.3 Reading and Mapping spatial data in `R`

Save three input files from the Ninth Annual Midwest Graduate Student Summit on Applied Economics, Regional, and Urban Studies (AERUS), held on April 23rd-24th, 2016 at the University of Illinois at Urbana Champaign.

chi.poly <- maptools::readShapePoly('foreclosures.shp')
class(chi.poly)

## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

# check variables for individual census tracts 
str(slot(chi.poly,"data"))

## 'data.frame':    897 obs. of  16 variables:
##  $ SP_ID     : Factor w/ 897 levels "1","10","100",..: 1 112 223 334 445 556 667 778 887 2 ...
##  $ fips      : Factor w/ 897 levels "17031010100",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ est_fcs   : int  43 129 55 21 64 56 107 43 7 51 ...
##  $ est_mtgs  : int  904 2122 1151 574 1427 1241 1959 830 208 928 ...
##  $ est_fcs_rt: num  4.76 6.08 4.78 3.66 4.48 4.51 5.46 5.18 3.37 5.5 ...
##  $ res_addr  : int  2530 3947 3204 2306 5485 2994 3701 1694 443 1552 ...
##  $ est_90d_va: num  12.61 12.36 10.46 5.03 8.44 ...
##  $ bls_unemp : num  8.16 8.16 8.16 8.16 8.16 8.16 8.16 8.16 8.16 8.16 ...
##  $ county    : Factor w/ 1 level "Cook County": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fips_num  : num  1.7e+10 1.7e+10 1.7e+10 1.7e+10 1.7e+10 ...
##  $ totpop    : int  5391 10706 6649 5325 10944 7178 10799 5403 1089 3634 ...
##  $ tothu     : int  2557 3981 3281 2464 5843 3136 3875 1768 453 1555 ...
##  $ huage     : int  61 53 56 60 54 58 48 57 61 48 ...
##  $ oomedval  : int  169900 147000 119800 151500 143600 145900 153400 170500 215900 114700 ...
##  $ property  : num  646 914 478 509 641 612 678 332 147 351 ...
##  $ violent   : num  433 421 235 159 240 266 272 146 78 84 ...
##  - attr(*, "data_types")= chr  "C" "C" "N" "N" ...

Here are explanations on selected variables:

est_fcs: estimated count of foreclosure starts from Jan. 2007 through June 2008
est_mtgs: estimated number of active mortgages from Jan. 2007 through June 2008
est_fcs_rt: number of foreclosure starts divided by number of mortgages times 100
bls_unemp: June 2008 place or county unemployment rate
totpop: total population from 2000 Census
violent: number of violent crimes reported between Jan. 2007 through December 2008
property: number of property crimes reported between Jan. 2007 through December 2008

summary(chi.poly@data$violent)

plot(chi.poly)

tmap_mode("view")
tm_basemap("Stamen.TonerLite", alpha = 0.25) + 
  tm_shape(chi.poly) + 
  tm_fill(col = "violent", alpha = 0.75, style = "quantile", n = 10, 
          palette = "OrRd") + 
  tm_borders(col = "white", lwd = 0.5)

4 Spatial Econometrics in `R`

4.1 OLS (as a reference model)

Ordinary Least Square models do not take (possible) spatial dependence into account.

chi.ols<-lm(violent~est_fcs_rt+bls_unemp, data=chi.poly@data)
summary(chi.ols)

## 
## Call:
## lm(formula = violent ~ est_fcs_rt + bls_unemp, data = chi.poly@data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -892.02  -77.02  -23.73   41.90 1238.22 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -18.627     45.366  -0.411    0.681    
## est_fcs_rt    28.298      1.435  19.720   <2e-16 ***
## bls_unemp     -0.308      5.770  -0.053    0.957    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.3 on 894 degrees of freedom
## Multiple R-squared:  0.3141, Adjusted R-squared:  0.3126 
## F-statistic: 204.7 on 2 and 894 DF,  p-value: < 2.2e-16

4.2 Modeling Spatial Dependence

We have a few ways of allowing spatial dependence among observations:
1. Rook, or ploy2nb(..., queen=FALSE): two units are close to one another if they share a side
2. Queen, or poly2nb(..., queen=TRUE): two units are close if they share a side or an edge
3. Distance-based: first extract the centroids of polygons by coordinates and identify the neighbors (within a given distance threshold) for each polygon by dnearneigh

# create a list of neighbors for each polygon 
list.queen <- poly2nb(chi.poly, queen=TRUE)
# next, supplement the neighbor list with spatial weights 
W <- nb2listw(list.queen, style="W", zero.policy=TRUE) # "W" row-standardize weights 
W

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 897 
## Number of nonzero links: 6140 
## Percentage nonzero weights: 0.7631036 
## Average number of links: 6.845039 
## 
## Weights style: W 
## Weights constants summary:
##     n     nn  S0       S1       S2
## W 897 804609 897 274.4893 3640.864

plot(W,coordinates(chi.poly))

coords <- coordinates(chi.poly)
W_dist <- dnearneigh(coords,0,1,longlat = FALSE)

As for spatial dependence, we can choose one or the other modeling approach:
1. Spatial Autoregressive (SAR) Models
- Allows the values of neighbors to affect that of a given polygon (e.g., Home values)
2. Spatial Error Models (SEM)
- Allows the residuals of observations to be correlated (e.g., unobserved neighborhood characteristics affect the value of neighboring houses)

4.3 Testing for spatial autocorrelation

4.3.1 Moran’s I Test

lm.morantest tests the presence/absence of correlation among residuals (i.e., spatial autocorrelation), but it doesn’t guide us to right alternatives.

moran.lm<-lm.morantest(chi.ols, W, alternative="two.sided")
print(moran.lm)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = violent ~ est_fcs_rt + bls_unemp, data =
## chi.poly@data)
## weights: W
## 
## Moran I statistic standard deviate = 11.785, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.2142252370    -0.0020099108     0.0003366648

4.3.2 Lagrange Multiplier Test

lm.LMtests tests against various alternative models, which helps determine more/less desirable alternatives
1. LMerr: simple LM test for error dependence
2. LMlag: simple LM test for a missing spatially lagged dependent variable
3. RLMerr: robust test for error dependence in the possible presence of a missing lagged dependent variable
4. RLMlag: robust test for a missing lagged dependent variable in the possible presence of error dependence
5. SARMA: a portmanteau test for seasonal autoregressive moving average (SARMA, in fact LMerr + RLMlag)

LM <- lm.LMtests(chi.ols, W, test="all")
print(LM)

## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = violent ~ est_fcs_rt + bls_unemp, data =
## chi.poly@data)
## weights: W
## 
## LMerr = 134.52, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = violent ~ est_fcs_rt + bls_unemp, data =
## chi.poly@data)
## weights: W
## 
## LMlag = 182.18, df = 1, p-value < 2.2e-16
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = violent ~ est_fcs_rt + bls_unemp, data =
## chi.poly@data)
## weights: W
## 
## RLMerr = 0.00066762, df = 1, p-value = 0.9794
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = violent ~ est_fcs_rt + bls_unemp, data =
## chi.poly@data)
## weights: W
## 
## RLMlag = 47.653, df = 1, p-value = 5.089e-12
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = violent ~ est_fcs_rt + bls_unemp, data =
## chi.poly@data)
## weights: W
## 
## SARMA = 182.18, df = 2, p-value < 2.2e-16

4.4 Running Spatial Regressions

4.4.1 SAR Models

# the first two inputs are the same as chi.ols (the model specificaiotn and the data)   
# additionally, a spatial weight matrix enters 
chi.sar <-lagsarlm(violent~est_fcs_rt+bls_unemp, data=chi.poly@data, W)
summary(chi.sar)

## 
## Call:
## lagsarlm(formula = violent ~ est_fcs_rt + bls_unemp, data = chi.poly@data, 
##     listw = W)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -519.127  -65.003  -15.226   36.423 1184.193 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -93.7885    41.3162  -2.270  0.02321
## est_fcs_rt   15.6822     1.5600  10.053  < 2e-16
## bls_unemp     8.8949     5.2447   1.696  0.08989
## 
## Rho: 0.49037, LR test value: 141.33, p-value: < 2.22e-16
## Asymptotic standard error: 0.039524
##     z-value: 12.407, p-value: < 2.22e-16
## Wald statistic: 153.93, p-value: < 2.22e-16
## 
## Log likelihood: -5738.047 for lag model
## ML residual variance (sigma squared): 20200, (sigma: 142.13)
## Number of observations: 897 
## Number of parameters estimated: 5 
## AIC: 11486, (AIC for lm: 11625)
## LM test for residual autocorrelation
## test value: 8.1464, p-value: 0.0043146

Check how the two models, OLS and SAR differ. First, save the residuals from the two models as separate variables. Next, plot these residuals on the map to see how much/less spatial dependence is present (the less, the better, because they are residuals).

chi.poly@data$chi.ols.res <- resid(chi.ols) #residuals ols
chi.poly@data$chi.sar.res <- resid(chi.sar) #residual sar

Below, we use sp::spplot to visually examine the distribution of residuals from two models.

sp::spplot(
  chi.poly, 
  "chi.ols.res", 
  at = seq(min(chi.poly@data$chi.ols.res,na.rm=TRUE), max(chi.poly@data$chi.ols.res,na.rm=TRUE), length=12),
  col.regions = rev(brewer.pal(11,"RdBu"))
)

sp::spplot(
  chi.poly,
  "chi.sar.res",
  at = seq(min(chi.poly@data$chi.sar.res,na.rm=TRUE),max(chi.poly@data$chi.sar,na.rm=TRUE), length=12),
  col.regions=rev(brewer.pal(11,"RdBu"))
)

4.4.2 SEM Models

chi.err <- errorsarlm(violent~est_fcs_rt+bls_unemp, data=chi.poly@data, W)
summary(chi.err)

## 
## Call:
## errorsarlm(formula = violent ~ est_fcs_rt + bls_unemp, data = chi.poly@data, 
##     listw = W)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -650.506  -64.355  -22.646   35.461 1206.346 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.2624    43.0509 -0.0293   0.9766
## est_fcs_rt   19.4620     1.9450 10.0062   <2e-16
## bls_unemp     4.0380     5.5134  0.7324   0.4639
## 
## Lambda: 0.52056, LR test value: 109.68, p-value: < 2.22e-16
## Asymptotic standard error: 0.042291
##     z-value: 12.309, p-value: < 2.22e-16
## Wald statistic: 151.51, p-value: < 2.22e-16
## 
## Log likelihood: -5753.875 for error model
## ML residual variance (sigma squared): 20796, (sigma: 144.21)
## Number of observations: 897 
## Number of parameters estimated: 5 
## AIC: 11518, (AIC for lm: 11625)