Exploratory Spatial Data Analysis (ESDA)
The first step in ESDA is to map your dependent variable. Map the COVID-19 case ratesBasic multiple linear regression
Let’s examine the zip code characteristics associated with the number of COVID-19 cases per 1,000 residents. Our dependent variable is covidrate and our independent variables are percent black (pblk), percent Hispanic (phisp), median household income (medincome), total population (totp), and percent of residents 65 years old and older (p65old).
Call:
lm(formula = covidrate ~ pblk + phisp + medincome + totp + p65old,
data = zctanyc)
Residuals:
Min 1Q Median 3Q Max
-10.5625 -4.2955 -0.8762 3.7455 15.4670
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.57451167 3.21496495 3.911 0.000132 ***
pblk 0.09056056 0.02013347 4.498 0.0000126 ***
phisp 0.12825817 0.03007490 4.265 0.0000331 ***
medincome -0.00004588 0.00001737 -2.642 0.009012 **
totp -0.00004357 0.00001757 -2.480 0.014097 *
p65old 0.36634418 0.09276337 3.949 0.000114 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.48 on 171 degrees of freedom
Multiple R-squared: 0.3767, Adjusted R-squared: 0.3585
F-statistic: 20.67 on 5 and 171 DF, p-value: 0.0000000000000004017
It appears that all variables are associated with Covid-19 rates. Tools and approaches for testing OLS assumptions are not addressed here because it is beyond the scope of this analysis. Let’s focus on spatial exploratory analysis next.
The exploratory maps show signs of spatial dependency, so further analysis is necessary.
Monte-Carlo simulation of Moran I
data: zctanyc$covidrate
weights: nycw
number of simulations + 1: 1000
statistic = 0.7491, observed rank = 1000, p-value =
0.001
alternative hypothesis: greater
Second, for the OLS regression residuals:
Global Moran I for regression residuals
data:
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw
Moran I statistic standard deviate = 12.578, p-value
< 0.00000000000000022
alternative hypothesis: greater
sample estimates:
Observed Moran I Expectation Variance
0.633692666 -0.021814188 0.002715825
Both the dependent variable and the residuals indicate spatial autocorrelation and are statistically significant.
Based on the exploratory mapping, Moran scatterplot, and the global Moran’s I, there appears to be spatial autocorrelation in the dependent variable. This means that if there is a spatial lag process going on and we fit an OLS model the regression coefficients will be biased and inefficient. That is, the coefficient sizes and signs are not close to their true value and its standard errors are underestimated.
There are two standard types of spatial regression models: a spatial lag model (SLM), which models dependency in the outcome, and a spatial error model (SEM), which models dependency in the residuals.
Let’s start with the SLM:
Call:
lagsarlm(formula = covidrate ~ pblk + phisp + medincome + totp +
p65old, data = zctanyc, listw = nycw, zero.policy = TRUE)
Residuals:
Min 1Q Median 3Q Max
-7.70893 -2.21115 -0.51621 2.03252 12.76569
Type: lag
Regions with no neighbours included:
59
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.181482373 2.056079141 1.0610 0.288694
pblk 0.040057707 0.013213379 3.0316 0.002433
phisp 0.062861580 0.020085387 3.1297 0.001750
medincome -0.000014023 0.000010870 -1.2900 0.197057
totp -0.000023471 0.000010986 -2.1365 0.032641
p65old 0.186119523 0.059389122 3.1339 0.001725
Rho: 0.70864, LR test value: 132.91, p-value: < 0.000000000000000222
Asymptotic standard error: 0.046717
z-value: 15.169, p-value: < 0.000000000000000222
Wald statistic: 230.09, p-value: < 0.000000000000000222
Log likelihood: -482.7463 for lag model
ML residual variance (sigma squared): 11.661, (sigma: 3.4148)
Number of observations: 177
Number of parameters estimated: 8
AIC: 981.49, (AIC for lm: 1112.4)
LM test for residual autocorrelation
test value: 0.54636, p-value: 0.45981
All variables, except for the median income, continue to be statistically significant. The lag parameter is Rho, whose value is 0.71 and statistically significant across all tests. This indicates that the spatial lag in the dependent variable is not completely accounted for through the demographic and socioeconomic variables already included in the model. This likely shows that a spatial lag on the dependent variable is needed.
Spatial error model (SEM)
The spatial error model incorporates spatial dependence in the errors. If there is a spatial error process going on and we fit an OLS model our coefficients will be unbiased but inefficient. That is, the coefficient size and sign are asymptotically correct but its standard errors are underestimated.
Call:
errorsarlm(formula = covidrate ~ pblk + phisp + medincome + totp +
p65old, data = zctanyc, listw = nycw, zero.policy = TRUE)
Residuals:
Min 1Q Median 3Q Max
-7.2334 -1.9737 -0.3498 1.6442 13.4324
Type: error
Regions with no neighbours included:
59
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.226926447 2.689463319 4.5462 0.000005461
pblk 0.061604533 0.022419247 2.7478 0.005999
phisp 0.122124864 0.029625877 4.1222 0.000037521
medincome -0.000019168 0.000014531 -1.3191 0.187120
totp -0.000020288 0.000011160 -1.8180 0.069068
p65old 0.188907342 0.070122943 2.6939 0.007061
Lambda: 0.78837, LR test value: 136.64, p-value: < 0.000000000000000222
Asymptotic standard error: 0.042485
z-value: 18.556, p-value: < 0.000000000000000222
Wald statistic: 344.34, p-value: < 0.000000000000000222
Log likelihood: -480.8819 for error model
ML residual variance (sigma squared): 10.804, (sigma: 3.287)
Number of observations: 177
Number of parameters estimated: 8
AIC: 977.76, (AIC for lm: 1112.4)
The percent black, percent Hispanic, and percent of residents 65 years old and older continue to be statistically significant. The lag error parameter Lambda is positive and significant, indicating the need to control for spatial autocorrelation in the error term.
One way of deciding which model is appropriate is to compare goodness of fit measures, specifically the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). AIC and BIC rely on the log likelihood, which captures the probability of the observed data given the model parameters. The lower the AIC and BIC, the better model fit. In both cases, SEM seems to be the best fit.
Models | AIC | BIC |
---|---|---|
OLS | 1112.40 | 1134.63 |
SLM | 981.49 | 1006.90 |
SEM | 977.76 | 1003.17 |
Another way is to run formal hypothesis tests comparing the models using Lagrange Multiplier tests.
The non robust LM tests both reject the null, so let’s turn to the robust LM tests. As both reject the null, let’s choose the one that is “More significant” (higher LM test statistic): RLMlag = 8.5867 vs RLMerr = 11.124 → the SEM is the better model in this case.
Lagrange multiplier diagnostics for spatial
dependence
data:
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw
LMerr = 139.96, df = 1, p-value < 0.00000000000000022
Lagrange multiplier diagnostics for spatial
dependence
data:
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw
LMlag = 137.42, df = 1, p-value < 0.00000000000000022
Lagrange multiplier diagnostics for spatial
dependence
data:
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw
RLMerr = 11.124, df = 1, p-value = 0.0008521
Lagrange multiplier diagnostics for spatial
dependence
data:
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw
RLMlag = 8.5867, df = 1, p-value = 0.003386
Lagrange multiplier diagnostics for spatial
dependence
data:
model: lm(formula = covidrate ~ pblk + phisp +
medincome + totp + p65old, data = zctanyc)
weights: nycw
SARMA = 148.55, df = 2, p-value < 0.00000000000000022