Exploratory Spatial Data Analysis (ESDA) extends traditional data analysis by explicitly including location. While standard statistics ask “How much?”, spatial statistics ask “How much and where?”
Tobler’s First Law of Geography states: “Everything is related to everything else, but near things are more related than distant things.”
In this module, we apply 16 statistical methods to test this law.
The Data: We are using the Somali Health and Demographic Survey (SHDS) 2020. This was a historic milestone—the first-ever comprehensive health survey conducted in Somalia.
The Variable: We focus on Measles Vaccination coverage among children. Measles is highly contagious, so understanding if vaccination coverage is clustered (herd immunity) or fragmented (outbreak risk) is critical for public health planning.
# Load Libraries
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, sf, spdep, haven, classInt, RColorBrewer, scales, gridExtra)
# Disable S2 geometry for planar assumptions
sf_use_s2(FALSE)# 1. Load Data
if(file.exists("measles5.dta")) {
df <- read_dta("measles5.dta")
} else {
stop("Data file not found. Please ensure measles5.dta is in the folder.")
}
somalia_shp <- st_read("som_admbnda_adm1_ocha_20250108.shp", quiet = TRUE)
# 2. Data Cleaning & Aggregation
region_stats <- df %>%
mutate(
dv_binary = case_when(as_factor(dv) == "Yes" ~ 1, as_factor(dv) == "No" ~ 0),
Region = str_trim(as_factor(Region))
) %>%
drop_na(dv_binary) %>%
group_by(Region) %>%
summarise(
cases = sum(dv_binary),
population = n(),
prevalence = cases / population
) %>%
ungroup()
# 3. Merge with Shapefile
map_data <- somalia_shp %>%
mutate(ADM1_EN = str_trim(ADM1_EN)) %>%
left_join(region_stats, by = c("ADM1_EN" = "Region")) %>%
filter(!is.na(prevalence))A Spatial Weights Matrix (\(W\)) defines the “neighborhood” structure. It determines which regions are allowed to influence each other in the mathematical calculations.
We use Queen Contiguity: \(w_{ij} = 1\) if regions share a border or point, \(0\) otherwise.
# Create Neighbors (Queen)
nb <- poly2nb(map_data, queen = TRUE)
# Handle islands if necessary (KNN k=1)
if(any(card(nb) == 0)){
coords <- st_coordinates(st_centroid(map_data))
nb <- knn2nb(knearneigh(coords, k=1))
}
# Create Weights List (Row Standardized)
listw <- nb2listw(nb, style = "W", zero.policy = TRUE)
# Plotting
coords <- st_coordinates(st_centroid(map_data))
plot(st_geometry(map_data), border = "grey", main = "1. Connectivity Graph (The Network)")
plot(nb, coords, add = TRUE, col = "red", pch = 19, cex = 0.6)Interpretation of Results: The graph shows a fully connected network across Somalia. The connections (red lines) represent the pathways through which we assume vaccination behaviors or healthcare access “spill over” from one region to another.
A thematic map shading areas in proportion to the statistical variable.
\[ Rate_i = \frac{y_i}{n_i} \]
ggplot(data = map_data) +
geom_sf(aes(fill = prevalence), color = "black", size = 0.2) +
scale_fill_distiller(palette = "YlGnBu", direction = 1, labels = percent) +
labs(title = "2. Raw Prevalence (Visual Inspection)") +
theme_void()Interpretation of Results: We observe a distinct North-South divide. The northern regions appear to have darker blue shading (higher vaccination rates), while the southern/central regions are lighter (lower rates). This suggests potential inequality in health access.
A Bayesian technique that “shrinks” unstable rates in small populations toward the global mean to reduce statistical noise.
\[ \hat{\theta}_i^{EB} = w_i t_i + (1-w_i) \mu \]
eb_res <- EBlocal(ri = map_data$cases, ni = map_data$population, nb = nb, zero.policy = TRUE)
map_data$eb_rate <- eb_res$est
ggplot(map_data) +
geom_sf(aes(fill = eb_rate), color = "black", size = 0.2) +
scale_fill_distiller(palette = "YlGnBu", direction = 1, labels = percent) +
labs(title = "3. EBS Smoothed Rates (Noise Reduction)") +
theme_void()Interpretation of Results: Comparing Plot 2 and Plot 3, the patterns remain largely consistent. This indicates that the sample sizes in the Somalia DHS 2020 were likely large enough that the raw rates are reliable, and the North-South divide is a real phenomenon, not a statistical artifact.
Classifies data into equal-sized groups (e.g., quintiles) to visualize relative rankings.
# Create Quantiles (5 groups)
breaks_qt <- classIntervals(map_data$prevalence, n = 5, style = "quantile")
map_data$quantile_cat <- cut(map_data$prevalence, breaks_qt$brks, include.lowest = TRUE)
ggplot(map_data) +
geom_sf(aes(fill = quantile_cat), color = "black", size = 0.2) +
scale_fill_brewer(palette = "Blues", name = "Quintiles") +
labs(title = "4. Quantile Map (Ranking)") +
theme_void()Interpretation of Results: The map clearly demarcates the “Top Tier” (Dark Blue) regions in the North and the “Bottom Tier” (Light Blue) in the South. This ranking is useful for prioritizing resource allocation to the bottom 20% of regions.
A single statistic measuring the degree of spatial autocorrelation for the entire study area.
\[ I = \frac{N}{\sum_i \sum_j w_{ij}} \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2} \]
gmoran <- moran.test(map_data$prevalence, listw, zero.policy = TRUE)
z_score <- gmoran$statistic
p_value <- gmoran$p.value
moran_i <- gmoran$estimate[1]
# Normal Curve Data
x_vals <- seq(-4, 4, length.out = 1000)
df_norm <- data.frame(x = x_vals, y = dnorm(x_vals))
ggplot(df_norm, aes(x, y)) +
geom_line(size = 1) +
geom_area(data = subset(df_norm, x >= z_score), fill = "red", alpha = 0.5) +
geom_vline(xintercept = z_score, color = "red", linetype="dashed") +
annotate("text", x = 2.5, y = 0.3, label = paste("p-value:", format.pval(p_value)), color="red", fontface="bold") +
labs(title = "5. Global Moran's I (Systemic Test)",
subtitle = paste("Moran's I:", round(moran_i, 3), "| Z-Score:", round(z_score, 2)),
x = "Z-Score", y = "Density") +
theme_minimal()Interpretation of Results: Moran’s I = 0.662: This indicates a very strong positive autocorrelation. Z-Score = 3.44: This is well beyond the critical value of 1.96. P-value < 0.001: We are >99.9% confident that vaccination rates in Somalia are not random. High rates cluster with high rates, and low rates cluster with low rates.
Visualizes the relationship between a region’s value and its neighbors’ average.
map_data$lag_prev <- lag.listw(listw, map_data$prevalence, zero.policy = TRUE)
mean_prev <- mean(map_data$prevalence)
map_data <- map_data %>%
mutate(quadrant = case_when(
prevalence >= mean_prev & lag_prev >= mean_prev ~ "High-High",
prevalence <= mean_prev & lag_prev <= mean_prev ~ "Low-Low",
prevalence >= mean_prev & lag_prev <= mean_prev ~ "High-Low",
prevalence <= mean_prev & lag_prev >= mean_prev ~ "Low-High"
))
ggplot(map_data, aes(x = prevalence, y = lag_prev, color = quadrant)) +
geom_vline(xintercept = mean_prev, linetype = "dashed") +
geom_hline(yintercept = mean(map_data$lag_prev), linetype = "dashed") +
geom_point(size = 3) +
geom_text(aes(label = ADM1_EN), vjust = -0.5, size = 3, show.legend = FALSE) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(title = "6. Moran Scatterplot (Association)") + theme_minimal()Interpretation of Results: Most points fall in the Top-Right (High-High) and Bottom-Left (Low-Low) quadrants. This confirms the strong clustering found in Method 5. Regions like “Nugaal” or “Bari” (likely in the North) appear in the High-High quadrant, while regions in the South appear in Low-Low.
Shows how spatial autocorrelation fades as distance increases.
cor_results <- sp.correlogram(nb, map_data$prevalence, order = 4, method = "I", style = "W", zero.policy = TRUE)
cor_df <- data.frame(Lag = 1:4, Morans_I = cor_results$res[, 1],
P_Value = 2 * (1 - pnorm(abs(cor_results$res[, 1] - cor_results$res[, 2]) / cor_results$res[, 3]))) %>%
mutate(Significance = ifelse(P_Value < 0.05, "Significant", "Not Significant"))
p_corr <- ggplot(cor_df, aes(x = Lag, y = Morans_I)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray") + geom_line() +
geom_point(aes(color = Significance), size = 4) +
scale_color_manual(values = c("Significant" = "red", "Not Significant" = "blue")) +
labs(title = "8. Spatial Correlogram", y = "Moran's I") + theme_minimal()
plot(p_corr)Interpretation of Results: The plot shows high autocorrelation at Lag 1 (immediate neighbors) which drops significantly at Lag 2 and crosses zero (becomes random/negative) by Lag 3. This suggests the “neighborhood effect” of vaccination performance extends to immediate neighbors and neighbors-of-neighbors, but not across the whole country.
Decomposes the Global I to identify the specific locations of clusters.
\[ I_i = \frac{(x_i - \bar{x})}{S^2} \sum_j w_{ij} (x_j - \bar{x}) \]
lisa <- localmoran(map_data$prevalence, listw, zero.policy = TRUE)
map_data$Pr <- lisa[,5]
map_data$lisa_cluster <- case_when(
map_data$quadrant == "High-High" & map_data$Pr <= 0.05 ~ "High-High",
map_data$quadrant == "Low-Low" & map_data$Pr <= 0.05 ~ "Low-Low",
TRUE ~ "Not Significant"
)
ggplot(map_data) +
geom_sf(aes(fill = lisa_cluster), color = "black", size = 0.1) +
scale_fill_manual(values = c("High-High"="red", "Low-Low"="blue", "Not Significant"="white")) +
labs(title = "8. LISA Cluster Map (Cluster Location)") + theme_void()Interpretation of Results: Red (High-High): We see a significant cluster of high vaccination in the Northern regions. Blue (Low-Low): We see a significant cluster of low vaccination in the Southern regions. White: Central regions appear to be transition zones with no significant clustering.
Measures the intensity of clustering using Z-scores.
\[ G_i^* = \frac{\sum_j w_{ij} x_j - \bar{X} \sum_j w_{ij}}{S \sqrt{\dots}} \]
gi_stat <- localG(map_data$prevalence, listw, zero.policy = TRUE)
map_data$gi_cat <- cut(as.numeric(gi_stat),
breaks = c(-Inf, -1.96, 1.96, Inf),
labels = c("Cold Spot", "Not Sig", "Hot Spot"))
ggplot(map_data) +
geom_sf(aes(fill = gi_cat), color = "black", size = 0.1) +
scale_fill_manual(values = c("Cold Spot"="blue", "Hot Spot"="red", "Not Sig"="white")) +
labs(title = "9. Getis-Ord Gi* (Intensity)") + theme_void()Interpretation of Results: This map reinforces the LISA results but focuses on intensity. The Red areas in the North are “Hotspots” where high vaccination rates are intensely concentrated (Z > 1.96). The Blue areas in the South are “Coldspots” where low rates are intensely concentrated.
Clustering statistics for binary (Categorical) data.
\[ BB = \sum_i \sum_j w_{ij} x_i x_j \]
global_mean <- mean(map_data$prevalence, na.rm = TRUE)
map_data$Coverage_Class <- factor(ifelse(map_data$prevalence > global_mean, "High", "Low"))
listw_binary <- nb2listw(nb, style = "B", zero.policy = TRUE)
ggplot(data = map_data) +
geom_sf(aes(fill = Coverage_Class), color = "black", size = 0.2) +
scale_fill_manual(values = c("High" = "forestgreen", "Low" = "darkorange")) +
labs(title = "10. Join Count Binary Map (Categorical Clustering)") + theme_void()Interpretation of Results: Even when simplifying the data to just “Above Average” (Green) and “Below Average” (Orange), the spatial divide is stark. The Green regions are contiguous in the North, and the Orange regions are contiguous in the South, confirming that the clustering is robust to data simplification.
Maps statistical outliers based on the Box Plot principles.
probs <- c(0, 0.25, 0.5, 0.75, 1)
cuts <- quantile(map_data$prevalence, probs)
iqr <- diff(cuts[c(2, 4)])
lower_fence <- cuts[2] - 1.5 * iqr
upper_fence <- cuts[4] + 1.5 * iqr
map_data$Hinge_Cat <- cut(map_data$prevalence,
breaks = c(-Inf, lower_fence, cuts[2], cuts[3], cuts[4], upper_fence, Inf),
labels = c("Lower Outlier", "< 25%", "25-50%", "50-75%", "> 75%", "Upper Outlier"),
include.lowest = TRUE)
ggplot(map_data) +
geom_sf(aes(fill = Hinge_Cat), color = "black", size = 0.1) +
scale_fill_brewer(palette = "Spectral", direction = -1) +
labs(title = "11. Hinge Map (Outlier Detection)") + theme_void()Interpretation of Results: This map identifies if any region is performing exceptionally poorly or well compared to the national distribution. If Red or Blue regions appear, they are the statistical outliers that require immediate investigation.
Compares the probability density functions (distribution shapes) across different spatial regimes.
map_data$Regime <- case_when(
grepl("Hot Spot", map_data$gi_cat) ~ "Hot Spot Regime",
grepl("Cold Spot", map_data$gi_cat) ~ "Cold Spot Regime",
TRUE ~ "Transition Zone"
)
ggplot(map_data, aes(x = prevalence, fill = Regime)) +
geom_density(alpha = 0.6) +
scale_fill_manual(values = c("Cold Spot Regime" = "blue", "Hot Spot Regime" = "red", "Transition Zone" = "grey")) +
labs(title = "12. Density Analysis (Distribution Comparison)", x = "Prevalence") +
theme_minimal()Interpretation of Results: The Red Curve (Hotspots) is shifted far to the right (high prevalence), while the Blue Curve (Coldspots) is shifted to the left. The lack of overlap between these curves suggests that the North and South of Somalia are operating under two completely different public health realities.
Maps the average value of a region’s neighbors.
\[ Lag_i = \sum_j w_{ij} x_j \]
ggplot(map_data) +
geom_sf(aes(fill = lag_prev), color = "black", size = 0.2) +
scale_fill_distiller(palette = "YlGnBu", direction = 1, labels = percent) +
labs(title = "13. Spatial Lag Map (Neighborhood Context)",
subtitle = "Average Rate of Neighbors") +
theme_void()Interpretation of Results: This map smooths out the local variations and shows the regional trends. It highlights the “environment” a region is situated in. A region in the South might have a moderate rate, but this map shows it is surrounded by a “low rate environment.”
Measures dissimilarity to highlight edges where values change abruptly.
\[ C_i = \sum_j w_{ij} (x_i - x_j)^2 \]
# Calculate Local Geary using localC
geary_loc <- localC(map_data$prevalence, listw)
map_data$geary_c <- geary_loc
ggplot(map_data) +
geom_sf(aes(fill = geary_c), color = "black", size = 0.2) +
scale_fill_viridis_c(option = "magma", direction = -1) +
labs(title = "14. Local Geary's C (Edge Detection)",
subtitle = "Darker = High Dissimilarity") +
theme_void()Interpretation of Results: Darker areas on this map indicate “friction” zones—borders where a high-vaccination region touches a low-vaccination region. These are often areas of inequality or administrative barriers.
Maps only the P-values from the LISA analysis.
map_data$Sig_Cat <- cut(map_data$Pr,
breaks = c(-Inf, 0.01, 0.05, 0.1, Inf),
labels = c("p < 0.01 (Critical)", "p < 0.05 (Warning)", "p < 0.1 (Watch)", "Not Sig"))
ggplot(map_data) +
geom_sf(aes(fill = Sig_Cat), color = "black", size = 0.2) +
scale_fill_manual(values = c("p < 0.01 (Critical)" = "darkred",
"p < 0.05 (Warning)" = "red",
"p < 0.1 (Watch)" = "pink",
"Not Sig" = "grey90")) +
labs(title = "15. Significance Map (Confidence Levels)") + theme_void()Interpretation of Results: The dark red areas indicate regions where the clustering is so strong that the probability of it happening by chance is less than 1% (\(p < 0.01\)). These are the most critical areas for policy intervention.
Boxplot of the variable grouped by the quality of the neighborhood.
# Categorize the Lag (Neighborhood Quality)
map_data$Lag_Quality <- cut(map_data$lag_prev,
breaks = quantile(map_data$lag_prev, probs = c(0, 0.33, 0.66, 1)),
labels = c("Low Tier Neighbors", "Mid Tier Neighbors", "High Tier Neighbors"),
include.lowest = TRUE)
ggplot(map_data, aes(x = Lag_Quality, y = prevalence, fill = Lag_Quality)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.1, alpha = 0.3) +
scale_fill_brewer(palette = "Pastel1") +
labs(title = "16. Conditional Boxplot (Neighborhood Effect)",
y = "Region Value", x = "Neighborhood Status") +
theme_minimal()Interpretation of Results: We observe a clear “step-up” pattern. Regions surrounded by “High Tier Neighbors” have a significantly higher median vaccination rate than those surrounded by “Low Tier Neighbors.” This confirms that geography is a strong predictor of vaccination success in Somalia.
| Method | Best Concise Definition | Research Question | General Tool Analogy | Health Tool Analogy |
|---|---|---|---|---|
| 1. Spatial Weights | Defining neighbor relationships. | Who is connected to whom? | The Network Cables | The Nervous System |
| 2. Choropleth Map | Visualizing data values using color. | What does the landscape look like? | The Snapshot Photo | Visual Inspection |
| 3. Empirical Bayes | Adjusting rates for small populations. | Is that rate real or just noise? | The Noise Filter | Corrective Glasses |
| 4. Quantile Map | Grouping data into equal categories. | How does this region rank? | The Ranking Board | BMI Categories |
| 5. Global Moran’s I | Statistic measuring total clustering. | Is the system clustered or random? | The Smoke Alarm | Body Thermometer |
| 6. Moran Scatterplot | Plotting region vs. neighbors. | Is the region acting like its peers? | The Mirror | Growth Chart |
| 7. Spatial Correlogram | Autocorrelation over distance. | How far does the influence travel? | The Ripple Effect | Sneeze Radius |
| 8. LISA (Local Moran) | Finding specific cluster locations. | Where exactly are the clusters? | The Metal Detector | The X-Ray |
| 9. Getis-Ord Gi* | Z-scores of intensity. | How intense is the concentration? | The Heat Sensor | Pain Scale |
| 10. Join Count | Clustering for binary data. | Do similar types stick together? | The Magnet Test | Contact Tracing |
| 11. Hinge Map | Mapping statistical outliers. | Who are the extreme anomalies? | The Bouncer | Triage Nurse |
| 12. Density by Regime | Comparing distributions by zone. | Is the makeup of a Hotspot different? | The DNA Test | The Biopsy |
| 13. Spatial Lag Map | Mapping neighbor averages. | What is the surrounding context? | The Background Check | Family History |
| 14. Local Geary’s C | Measuring dissimilarity/edges. | Where are the sharp boundaries? | The Edge Detector | Reflex Hammer |
| 15. Significance Map | Mapping P-values only. | Where are we statistically confident? | The Confidence Flag | Red Flag Chart |
| 16. Conditional Plot | Boxplot by neighbor quality. | Does the environment affect the outcome? | The Environment Test | Environmental Check |
```