Exploratory Spatial Data Analysis (ESDA)
ESDA techniques can help detect spatial patterns in data, lead to the formulation of hypotheses based on the geography of the data, and help assess spatial models. ESDA also helps determine whether the OLS model needs to incorporate spatial dependency. In this section, the goal is to visually detect spatial dependency, or autocorrelation, in the outcome variable, mental health prevalence. The outcome appears to cluster, so further analysis is necessary.
As a first step, let’s examine the relationship between poor mental health and the unemployment rate unempr, the percent of residents who moved in the past year pmob, the percent of 25 year olds with a college degree pcol, percent poverty ppov, percent non-Hispanic black pnhblk, percent Hispanic phisp, and the log of population size log(tpop).
Call:
lm(formula = MHLTH_CrudePrev ~ unempr + pmob + pcol + ppov +
pnhblk + phisp + log(tpop), data = sea.tracts)
Residuals:
Min 1Q Median 3Q Max
-1.47690 -0.40419 -0.01758 0.42427 2.33766
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.088319 1.393019 6.524 1.53e-09 ***
unempr 0.090346 0.025929 3.484 0.000681 ***
pmob 0.004222 0.007091 0.595 0.552608
pcol -0.034845 0.006390 -5.453 2.54e-07 ***
ppov 0.149402 0.009736 15.346 < 2e-16 ***
pnhblk -0.015599 0.009847 -1.584 0.115693
phisp 0.028165 0.015996 1.761 0.080719 .
log(tpop) 0.222953 0.164192 1.358 0.176949
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6642 on 125 degrees of freedom
Multiple R-squared: 0.919, Adjusted R-squared: 0.9144
F-statistic: 202.5 on 7 and 125 DF, p-value: < 2.2e-16
It appears that higher unemployment and poverty rates are associated with higher levels of poor mental health, whereas a higher percent college educated is associated with lower levels. Tools and approaches for testing OLS assumptions are not addressed here because they are beyond the scope of this analysis. Let’s focus on spatial exploratory analysis next.
The residuals appear to cluster, so further analysis is necessary. We test for spatial autocorrelation with the global Moran’s I, first for the dependent variable:
Monte-Carlo simulation of Moran I
data: sea.tracts$MHLTH_CrudePrev
weights: seaw
number of simulations + 1: 1000
statistic = 0.39728, observed rank = 1000, p-value =
0.001
alternative hypothesis: greater
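To make the Monte Carlo logic concrete, here is a minimal base R sketch of a permutation test for Moran's I. It does not use the Seattle tract data or the seaw weights: the 5 x 5 grid, the rook-contiguity weights, the simulated variable, and the helper name moran_i are all assumptions made for illustration.

```r
# Toy illustration only: a 5 x 5 grid with rook contiguity, not the Seattle tracts.
set.seed(42)
n_side <- 5
n <- n_side^2
coords <- expand.grid(row = 1:n_side, col = 1:n_side)

# Binary rook-contiguity matrix: cells one step apart share an edge.
d <- as.matrix(dist(coords, method = "manhattan"))
W <- (d == 1) * 1
W <- W / rowSums(W)   # row-standardize so each row sums to 1

# Global Moran's I for a numeric vector x and a weights matrix W (hypothetical helper).
moran_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# A variable with a north-south trend plus noise, so neighbors tend to look alike.
x <- coords$row + rnorm(n, sd = 0.5)

# Permutation test: shuffle values across locations to build the null distribution.
nsim <- 999
i_obs <- moran_i(x, W)
i_perm <- replicate(nsim, moran_i(sample(x), W))

# One-sided pseudo p-value for the alternative "greater".
p_value <- (sum(i_perm >= i_obs) + 1) / (nsim + 1)

cat("Moran's I:", round(i_obs, 3), " pseudo p-value:", p_value, "\n")
```

With 999 permutations the smallest attainable pseudo p-value is 1/1000 = 0.001, which is the value reported above when the observed statistic ranks highest among all 1000 values.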
Second, for the OLS regression residuals:
Global Moran I for regression residuals
data:
model: lm(formula = MHLTH_CrudePrev ~ unempr + pmob +
pcol + ppov + pnhblk + phisp + log(tpop), data =
sea.tracts)
weights: seaw
Moran I statistic standard deviate = 3.9687, p-value
= 3.614e-05
alternative hypothesis: greater
sample estimates:
Observed Moran I Expectation Variance
0.175864778 -0.025326838 0.002570005
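As a quick check on how the pieces of this output fit together, the standard deviate can be recomputed from the three sample estimates. This is only arithmetic on the printed values; the variable names are made up for the sketch, and the normal approximation is assumed for the one-sided p-value.

```r
# Values copied from the residual Moran's I output above.
i_obs <- 0.175864778   # observed Moran's I
i_exp <- -0.025326838  # expectation under the null
i_var <- 0.002570005   # variance under the null

# Standard deviate: distance from the null expectation in standard-error units.
z <- (i_obs - i_exp) / sqrt(i_var)

# One-sided p-value for the alternative "greater", using the normal approximation.
p <- pnorm(z, lower.tail = FALSE)

cat("standard deviate:", round(z, 4), " p-value:", signif(p, 4), "\n")
```

Up to rounding, this should match the reported standard deviate of 3.9687 and p-value of 3.614e-05.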
Both the dependent variable and the residuals show spatial autocorrelation. The Moran’s I for the residuals is weaker (0.18 versus 0.40 for the dependent variable) but still statistically significant.
Based on the exploratory mapping, Moran scatterplot, and the global Moran’s I, there appears to be spatial autocorrelation in the dependent variable. If a spatial lag process is at work and we fit an OLS model, the regression coefficients will be biased and inefficient. That is, the coefficient sizes and signs may not be close to their true values, and their standard errors are underestimated.
There are two standard types of spatial regression models: a spatial lag model (SLM), which models dependency in the outcome, and a spatial error model (SEM), which models dependency in the residuals.
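In the usual textbook notation (assumed here, not taken from the source), the two models can be written as follows, where W is the spatial weights matrix (seaw in the output), X holds the covariates, and the error term is independent noise:

$$
\text{SLM:}\quad y = \rho W y + X\beta + \varepsilon
$$

$$
\text{SEM:}\quad y = X\beta + u, \qquad u = \lambda W u + \varepsilon
$$

The parameters $\rho$ and $\lambda$ correspond to the Rho and Lambda values reported in the model summaries. Setting either one to zero reduces the model to ordinary least squares.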
Let’s start with the SLM:
Call:
lagsarlm(formula = MHLTH_CrudePrev ~ unempr + pmob + pcol + ppov +
pnhblk + phisp + log(tpop), data = sea.tracts, listw = seaw)
Residuals:
Min 1Q Median 3Q Max
-1.494101 -0.412011 -0.012753 0.446070 2.279226
Type: lag
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.5003409 1.5662067 6.0658 1.313e-09
unempr 0.0904386 0.0251047 3.6025 0.0003152
pmob 0.0039999 0.0068667 0.5825 0.5602229
pcol -0.0360700 0.0065844 -5.4781 4.299e-08
ppov 0.1516731 0.0100004 15.1666 < 2.2e-16
pnhblk -0.0161411 0.0095358 -1.6927 0.0905151
phisp 0.0277670 0.0155579 1.7848 0.0743011
log(tpop) 0.2280948 0.1590558 1.4341 0.1515565
Rho: -0.034898, LR test value: 0.38247, p-value: 0.53628
Asymptotic standard error: 0.055763
z-value: -0.62583, p-value: 0.53142
Wald statistic: 0.39167, p-value: 0.53142
Log likelihood: -129.9789 for lag model
ML residual variance (sigma squared): 0.41332, (sigma: 0.6429)
Number of observations: 133
Number of parameters estimated: 10
AIC: 279.96, (AIC for lm: 278.34)
LM test for residual autocorrelation
test value: 14.719, p-value: 0.00012481
The unemployment rate, percent college educated, and percent poverty continue to be statistically significant. The lag parameter is Rho, whose value is quite small at -0.035 and not statistically significant in any of the tests reported (LR, z, and Wald). This suggests that the spatial clustering in the dependent variable is accounted for by the demographic and socioeconomic variables already in the model, so a spatial lag on the dependent variable is likely not needed. The LM test for residual autocorrelation (14.72, p < 0.001), however, indicates that spatial dependence remains in the lag model’s residuals.
Spatial error model (SEM)
The spatial error model incorporates spatial dependence in the errors. If a spatial error process is at work and we fit an OLS model, our coefficients will be unbiased but inefficient. That is, the coefficient sizes and signs are asymptotically correct, but their standard errors are underestimated.
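A small simulation can illustrate this claim. The sketch below is a toy example in base R, not the analysis in this section: the 10 x 10 grid, the rook weights, lambda = 0.8, and the true slope of 1 are all hypothetical choices. It generates data from a spatial error process, fits OLS repeatedly, and compares the actual spread of the slope estimates with the standard error that OLS reports.

```r
# Toy simulation: OLS under a spatial error process (all settings hypothetical).
set.seed(1)
n_side <- 10
n <- n_side^2
coords <- expand.grid(row = 1:n_side, col = 1:n_side)

# Row-standardized rook-contiguity weights on the grid.
W <- (as.matrix(dist(coords, method = "manhattan")) == 1) * 1
W <- W / rowSums(W)

lambda <- 0.8                           # assumed strength of error dependence
beta <- 1                               # assumed true slope
A_inv <- solve(diag(n) - lambda * W)    # spatial filter (I - lambda W)^{-1}

# A fixed predictor that is itself spatially clustered.
x <- as.vector(A_inv %*% rnorm(n))

# Repeatedly draw spatially correlated errors and fit OLS.
sims <- replicate(500, {
  u <- as.vector(A_inv %*% rnorm(n))
  y <- 2 + beta * x + u
  coef(summary(lm(y ~ x)))["x", c("Estimate", "Std. Error")]
})

mean_slope <- mean(sims["Estimate", ])          # should sit near the true slope
empirical_sd <- sd(sims["Estimate", ])          # actual sampling variability
mean_reported_se <- mean(sims["Std. Error", ])  # what OLS claims on average

cat("mean slope:", round(mean_slope, 3),
    " empirical SD:", round(empirical_sd, 3),
    " mean OLS SE:", round(mean_reported_se, 3), "\n")
```

In this setup the average slope stays close to the true value, while the empirical spread of the estimates exceeds the standard error OLS reports, which is the sense in which inference from OLS is too optimistic.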
Call:
errorsarlm(formula = MHLTH_CrudePrev ~ unempr + pmob + pcol +
ppov + pnhblk + phisp + log(tpop), data = sea.tracts, listw = seaw)
Residuals:
Min 1Q Median 3Q Max
-1.365103 -0.419244 -0.015275 0.413074 1.964454
Type: error
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.6061151 1.3058441 8.8878 < 2.2e-16
unempr 0.0780587 0.0229153 3.4064 0.0006583
pmob 0.0147918 0.0077040 1.9200 0.0548564
pcol -0.0429317 0.0068995 -6.2224 4.896e-10
ppov 0.1442649 0.0096688 14.9206 < 2.2e-16
pnhblk -0.0041949 0.0103829 -0.4040 0.6861972
phisp 0.0175817 0.0146865 1.1971 0.2312555
log(tpop) -0.0286693 0.1502948 -0.1908 0.8487186
Lambda: 0.51295, LR test value: 14.233, p-value: 0.00016152
Asymptotic standard error: 0.098797
z-value: 5.1919, p-value: 2.0814e-07
Wald statistic: 26.956, p-value: 2.0814e-07
Log likelihood: -123.0536 for error model
ML residual variance (sigma squared): 0.35148, (sigma: 0.59286)
Number of observations: 133
Number of parameters estimated: 10
AIC: 266.11, (AIC for lm: 278.34)
The unemployment rate, percent college educated, and percent poverty continue to be statistically significant. The spatial error parameter Lambda (0.51) is positive and significant, indicating the need to control for spatial autocorrelation in the error.
One way of deciding which model is appropriate is to compare the Akaike Information Criterion (AIC), a fit statistic that rewards goodness of fit and penalizes the number of estimated parameters. A lower value indicates a better-fitting model.
| Model | AIC |
|---|---|
| OLS | 278.34 |
| SLM | 279.96 |
| SEM | 266.11 |
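The AIC values for the two spatial models can be reproduced from the log likelihoods and parameter counts printed in their summaries, using AIC = 2k - 2 log L. The sketch below is only arithmetic on those reported numbers; the object names are made up for illustration.

```r
# Log likelihoods and parameter counts copied from the model summaries above.
loglik <- c(SLM = -129.9789, SEM = -123.0536)
k      <- c(SLM = 10, SEM = 10)

# AIC = 2k - 2 * log-likelihood; smaller is better.
aic <- 2 * k - 2 * loglik
print(round(aic, 2))

# Differences relative to the OLS AIC reported in both summaries.
aic_ols <- 278.34
print(round(aic - aic_ols, 2))
```

Because both spatial models estimate the same number of parameters, the gap between them comes entirely from the log likelihood, and the SEM is the only one of the two that improves on OLS.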