Exploratory Spatial Data Analysis (ESDA)

The first step in ESDA is to map your dependent variable. Map the COVID-19 case rates

Basic multiple linear regression

Let’s examine the zip code characteristics associated with the number of COVID-19 cases per 1,000 residents. Our dependent variable is covidrate and our independent variables are percent black (pblk), percent Hispanic (phisp), median household income (medincome), total population (totp), and percent of residents 65 years old and older (p65old).

Call:
lm(formula = covidrate ~ pblk + phisp + medincome + totp + p65old, 
    data = zctanyc)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.5625  -4.2955  -0.8762   3.7455  15.4670 

Coefficients:
               Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) 12.57451167  3.21496495   3.911  0.000132 ***
pblk         0.09056056  0.02013347   4.498 0.0000126 ***
phisp        0.12825817  0.03007490   4.265 0.0000331 ***
medincome   -0.00004588  0.00001737  -2.642  0.009012 ** 
totp        -0.00004357  0.00001757  -2.480  0.014097 *  
p65old       0.36634418  0.09276337   3.949  0.000114 ***
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.48 on 171 degrees of freedom
Multiple R-squared:  0.3767,    Adjusted R-squared:  0.3585 
F-statistic: 20.67 on 5 and 171 DF,  p-value: 0.0000000000000004017

It appears that all variables are associated with Covid-19 rates. Tools and approaches for testing OLS assumptions are not addressed here because it is beyond the scope of this analysis. Let’s focus on spatial exploratory analysis next.

Second, mapping the residuals from the OLS regression model to see if there is visual evidence of spatial autocorrelation in the error term.

The exploratory maps show signs of spatial dependency, so further analysis is necessary.

Let’s first examine Spatial Autocorrelation using the Moran scatterplot. It looks like there is a spatial association.

The map and Moran scatter plot provide descriptive visualizations of clustering (autocorrelation) in COVID-19 case rates, so we can calculate a global index of spatial autocorrelation: the Global Moran’s I test to get the p-value. A rule of thumb is a spatial autocorrelation higher than 0.3 and lower than -0.3 is meaningful. First, for the dependent variable, with Monte-Carlo simulation:

    Monte-Carlo simulation of Moran I

data:  zctanyc$covidrate 
weights: nycw  
number of simulations + 1: 1000 

statistic = 0.7491, observed rank = 1000, p-value =
0.001
alternative hypothesis: greater
Second, for the OLS regression residuals:

    Global Moran I for regression residuals

data:  
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw

Moran I statistic standard deviate = 12.578, p-value
< 0.00000000000000022
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
     0.633692666     -0.021814188      0.002715825 

Both the dependent variable and the residuals indicate spatial autocorrelation and are statistically significant.

Based on the exploratory mapping, Moran scatterplot, and the global Moran’s I, there appears to be spatial autocorrelation in the dependent variable. This means that if there is a spatial lag process going on and we fit an OLS model the regression coefficients will be biased and inefficient. That is, the coefficient sizes and signs are not close to their true value and its standard errors are underestimated.

There are two standard types of spatial regression models: a spatial lag model (SLM), which models dependency in the outcome, and a spatial error model (SEM), which models dependency in the residuals.

Let’s start with the SLM:


Call:
lagsarlm(formula = covidrate ~ pblk + phisp + medincome + totp + 
    p65old, data = zctanyc, listw = nycw, zero.policy = TRUE)

Residuals:
     Min       1Q   Median       3Q      Max 
-7.70893 -2.21115 -0.51621  2.03252 12.76569 

Type: lag 
Regions with no neighbours included:
 59 
Coefficients: (asymptotic standard errors) 
                Estimate   Std. Error z value Pr(>|z|)
(Intercept)  2.181482373  2.056079141  1.0610 0.288694
pblk         0.040057707  0.013213379  3.0316 0.002433
phisp        0.062861580  0.020085387  3.1297 0.001750
medincome   -0.000014023  0.000010870 -1.2900 0.197057
totp        -0.000023471  0.000010986 -2.1365 0.032641
p65old       0.186119523  0.059389122  3.1339 0.001725

Rho: 0.70864, LR test value: 132.91, p-value: < 0.000000000000000222
Asymptotic standard error: 0.046717
    z-value: 15.169, p-value: < 0.000000000000000222
Wald statistic: 230.09, p-value: < 0.000000000000000222

Log likelihood: -482.7463 for lag model
ML residual variance (sigma squared): 11.661, (sigma: 3.4148)
Number of observations: 177 
Number of parameters estimated: 8 
AIC: 981.49, (AIC for lm: 1112.4)
LM test for residual autocorrelation
test value: 0.54636, p-value: 0.45981

All variables, except for the median income, continue to be statistically significant. The lag parameter is Rho, whose value is 0.71 and statistically significant across all tests. This indicates that the spatial lag in the dependent variable is not completely accounted for through the demographic and socioeconomic variables already included in the model. This likely shows that a spatial lag on the dependent variable is needed.

Spatial error model (SEM)

The spatial error model incorporates spatial dependence in the errors. If there is a spatial error process going on and we fit an OLS model our coefficients will be unbiased but inefficient. That is, the coefficient size and sign are asymptotically correct but its standard errors are underestimated.

Call:
errorsarlm(formula = covidrate ~ pblk + phisp + medincome + totp + 
    p65old, data = zctanyc, listw = nycw, zero.policy = TRUE)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2334 -1.9737 -0.3498  1.6442 13.4324 

Type: error 
Regions with no neighbours included:
 59 
Coefficients: (asymptotic standard errors) 
                Estimate   Std. Error z value    Pr(>|z|)
(Intercept) 12.226926447  2.689463319  4.5462 0.000005461
pblk         0.061604533  0.022419247  2.7478    0.005999
phisp        0.122124864  0.029625877  4.1222 0.000037521
medincome   -0.000019168  0.000014531 -1.3191    0.187120
totp        -0.000020288  0.000011160 -1.8180    0.069068
p65old       0.188907342  0.070122943  2.6939    0.007061

Lambda: 0.78837, LR test value: 136.64, p-value: < 0.000000000000000222
Asymptotic standard error: 0.042485
    z-value: 18.556, p-value: < 0.000000000000000222
Wald statistic: 344.34, p-value: < 0.000000000000000222

Log likelihood: -480.8819 for error model
ML residual variance (sigma squared): 10.804, (sigma: 3.287)
Number of observations: 177 
Number of parameters estimated: 8 
AIC: 977.76, (AIC for lm: 1112.4)

The percent black, percent Hispanic, and percent of residents 65 years old and older continue to be statistically significant. The lag error parameter Lambda is positive and significant, indicating the need to control for spatial autocorrelation in the error term.

One way of deciding which model is appropriate is to compare goodness of fit measures, specifically the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). AIC and BIC rely on the log likelihood, which captures the probability of the observed data given the model parameters. The lower the AIC and BIC, the better model fit. In both cases, SEM seems to be the best fit.

Models AIC BIC
OLS 1112.40 1134.63
SLM 981.49 1006.90
SEM 977.76 1003.17

Another way is to run formal hypothesis tests comparing the models using Lagrange Multiplier tests.

The non robust LM tests both reject the null, so let’s turn to the robust LM tests. As both reject the null, let’s choose the one that is “More significant” (higher LM test statistic): RLMlag = 8.5867 vs RLMerr = 11.124 → the SEM is the better model in this case.

    Lagrange multiplier diagnostics for spatial
    dependence

data:  
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw

LMerr = 139.96, df = 1, p-value < 0.00000000000000022


    Lagrange multiplier diagnostics for spatial
    dependence

data:  
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw

LMlag = 137.42, df = 1, p-value < 0.00000000000000022


    Lagrange multiplier diagnostics for spatial
    dependence

data:  
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw

RLMerr = 11.124, df = 1, p-value = 0.0008521


    Lagrange multiplier diagnostics for spatial
    dependence

data:  
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw

RLMlag = 8.5867, df = 1, p-value = 0.003386


    Lagrange multiplier diagnostics for spatial
    dependence

data:  
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw

SARMA = 148.55, df = 2, p-value < 0.00000000000000022