1 Introduction

Exploratory Spatial Data Analysis (ESDA) extends traditional data analysis by explicitly including location. While standard statistics ask “How much?”, spatial statistics ask “How much and where?”

Tobler’s First Law of Geography states: “Everything is related to everything else, but near things are more related than distant things.”

In this module, we apply 16 statistical methods to test this law.

Data Source Note

The Data: We are using the Somali Health and Demographic Survey (SHDS) 2020. This was a historic milestone—the first-ever comprehensive health survey conducted in Somalia.

The Variable: We focus on Measles Vaccination coverage among children. Measles is highly contagious, so understanding if vaccination coverage is clustered (herd immunity) or fragmented (outbreak risk) is critical for public health planning.

2 Data Preparation

# Load Libraries
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, sf, spdep, haven, classInt, RColorBrewer, scales, gridExtra)

# Disable S2 geometry for planar assumptions
sf_use_s2(FALSE)

# 1. Load Data
if(file.exists("measles5.dta")) {
  df <- read_dta("measles5.dta")
} else {
  stop("Data file not found. Please ensure measles5.dta is in the folder.")
}

somalia_shp <- st_read("som_admbnda_adm1_ocha_20250108.shp", quiet = TRUE)

# 2. Data Cleaning & Aggregation
region_stats <- df %>%
  mutate(
    dv_binary = case_when(as_factor(dv) == "Yes" ~ 1, as_factor(dv) == "No" ~ 0),
    Region = str_trim(as_factor(Region)) 
  ) %>%
  drop_na(dv_binary) %>%
  group_by(Region) %>%
  summarise(
    cases = sum(dv_binary),      
    population = n(),            
    prevalence = cases / population
  ) %>%
  ungroup()

# 3. Merge with Shapefile
map_data <- somalia_shp %>%
  mutate(ADM1_EN = str_trim(ADM1_EN)) %>%
  left_join(region_stats, by = c("ADM1_EN" = "Region")) %>%
  filter(!is.na(prevalence))

3 Part 1: The Foundation

3.1 Method 1: Spatial Weights Connectivity

3.1.1 Definition

A Spatial Weights Matrix (\(W\)) defines the “neighborhood” structure. It determines which regions are allowed to influence each other in the mathematical calculations.

3.1.2 Research Question

General: How does information, trade, or contagion flow? Who is connected to whom?

3.1.3 Mathematical Logic

We use Queen Contiguity: \(w_{ij} = 1\) if regions share a border or point, \(0\) otherwise.

# Create Neighbors (Queen)
nb <- poly2nb(map_data, queen = TRUE)

# Handle islands if necessary (KNN k=1)
if(any(card(nb) == 0)){
  coords <- st_coordinates(st_centroid(map_data))
  nb <- knn2nb(knearneigh(coords, k=1))
}

# Create Weights List (Row Standardized)
listw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# Plotting
coords <- st_coordinates(st_centroid(map_data))
plot(st_geometry(map_data), border = "grey", main = "1. Connectivity Graph (The Network)")
plot(nb, coords, add = TRUE, col = "red", pch = 19, cex = 0.6)

Interpretation of Results: The graph shows a fully connected network across Somalia. The connections (red lines) represent the pathways through which we assume vaccination behaviors or healthcare access “spill over” from one region to another.

4 Part 2: Basic Visualization Methods

4.1 Method 2: Choropleth Map

4.1.1 Definition

A thematic map shading areas in proportion to the statistical variable.

4.1.2 Research Question

General: What is the spatial distribution? Are there visible patterns of high or low values?

4.1.3 Mathematical Formula

\[ Rate_i = \frac{y_i}{n_i} \]

ggplot(data = map_data) +
  geom_sf(aes(fill = prevalence), color = "black", size = 0.2) +
  scale_fill_distiller(palette = "YlGnBu", direction = 1, labels = percent) +
  labs(title = "2. Raw Prevalence (Visual Inspection)") +
  theme_void()

Interpretation of Results: We observe a distinct North-South divide. The northern regions appear to have darker blue shading (higher vaccination rates), while the southern/central regions are lighter (lower rates). This suggests potential inequality in health access.

4.2 Method 3: Empirical Bayes Smoothing (EBS)

4.2.1 Definition

A Bayesian technique that “shrinks” unstable rates in small populations toward the global mean to reduce statistical noise.

4.2.2 Research Question

General: Is the extreme value in a small region real, or just noise due to small sample size?

4.2.3 Mathematical Formula

\[ \hat{\theta}_i^{EB} = w_i t_i + (1-w_i) \mu \]

eb_res <- EBlocal(ri = map_data$cases, ni = map_data$population, nb = nb, zero.policy = TRUE)
map_data$eb_rate <- eb_res$est

ggplot(map_data) +
  geom_sf(aes(fill = eb_rate), color = "black", size = 0.2) +
  scale_fill_distiller(palette = "YlGnBu", direction = 1, labels = percent) +
  labs(title = "3. EBS Smoothed Rates (Noise Reduction)") +
  theme_void()

Interpretation of Results: Comparing Plot 2 and Plot 3, the patterns remain largely consistent. This indicates that the sample sizes in the Somalia DHS 2020 were likely large enough that the raw rates are reliable, and the North-South divide is a real phenomenon, not a statistical artifact.

4.3 Method 4: Quantile Map

4.3.1 Definition

Classifies data into equal-sized groups (e.g., quintiles) to visualize relative rankings.

4.3.2 Research Question

General: How does a region rank relative to others? (Top Tier vs Bottom Tier).

# Create Quantiles (5 groups)
breaks_qt <- classIntervals(map_data$prevalence, n = 5, style = "quantile")
map_data$quantile_cat <- cut(map_data$prevalence, breaks_qt$brks, include.lowest = TRUE)

ggplot(map_data) +
  geom_sf(aes(fill = quantile_cat), color = "black", size = 0.2) +
  scale_fill_brewer(palette = "Blues", name = "Quintiles") +
  labs(title = "4. Quantile Map (Ranking)") +
  theme_void()

Interpretation of Results: The map clearly demarcates the “Top Tier” (Dark Blue) regions in the North and the “Bottom Tier” (Light Blue) in the South. This ranking is useful for prioritizing resource allocation to the bottom 20% of regions.

5 Part 3: Global Clustering Methods

5.1 Method 5: Global Moran’s I

5.1.1 Definition

A single statistic measuring the degree of spatial autocorrelation for the entire study area.

5.1.2 Research Question

General: Is the system clustered, dispersed, or random?

5.1.3 Mathematical Formula

\[ I = \frac{N}{\sum_i \sum_j w_{ij}} \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2} \]

gmoran <- moran.test(map_data$prevalence, listw, zero.policy = TRUE)
z_score <- gmoran$statistic
p_value <- gmoran$p.value
moran_i <- gmoran$estimate[1]

# Normal Curve Data
x_vals <- seq(-4, 4, length.out = 1000)
df_norm <- data.frame(x = x_vals, y = dnorm(x_vals))

ggplot(df_norm, aes(x, y)) +
  geom_line(size = 1) +
  geom_area(data = subset(df_norm, x >= z_score), fill = "red", alpha = 0.5) +
  geom_vline(xintercept = z_score, color = "red", linetype="dashed") +
  annotate("text", x = 2.5, y = 0.3, label = paste("p-value:", format.pval(p_value)), color="red", fontface="bold") +
  labs(title = "5. Global Moran's I (Systemic Test)",
       subtitle = paste("Moran's I:", round(moran_i, 3), "| Z-Score:", round(z_score, 2)),
       x = "Z-Score", y = "Density") +
  theme_minimal()

Interpretation of Results: Moran’s I = 0.662: This indicates a very strong positive autocorrelation. Z-Score = 3.44: This is well beyond the critical value of 1.96. P-value < 0.001: We are >99.9% confident that vaccination rates in Somalia are not random. High rates cluster with high rates, and low rates cluster with low rates.

5.2 Method 6: Moran Scatterplot

5.2.1 Definition

Visualizes the relationship between a region’s value and its neighbors’ average.

5.2.2 Research Question

General: Which regions follow the trend (High-High, Low-Low) and which are outliers?

map_data$lag_prev <- lag.listw(listw, map_data$prevalence, zero.policy = TRUE)
mean_prev <- mean(map_data$prevalence)

map_data <- map_data %>%
  mutate(quadrant = case_when(
    prevalence >= mean_prev & lag_prev >= mean_prev ~ "High-High",
    prevalence <= mean_prev & lag_prev <= mean_prev ~ "Low-Low",
    prevalence >= mean_prev & lag_prev <= mean_prev ~ "High-Low",
    prevalence <= mean_prev & lag_prev >= mean_prev ~ "Low-High"
  ))

ggplot(map_data, aes(x = prevalence, y = lag_prev, color = quadrant)) +
  geom_vline(xintercept = mean_prev, linetype = "dashed") +
  geom_hline(yintercept = mean(map_data$lag_prev), linetype = "dashed") +
  geom_point(size = 3) +
  geom_text(aes(label = ADM1_EN), vjust = -0.5, size = 3, show.legend = FALSE) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "6. Moran Scatterplot (Association)") + theme_minimal()

Interpretation of Results: Most points fall in the Top-Right (High-High) and Bottom-Left (Low-Low) quadrants. This confirms the strong clustering found in Method 5. Regions like “Nugaal” or “Bari” (likely in the North) appear in the High-High quadrant, while regions in the South appear in Low-Low.

5.3 Method 7: Spatial Correlogram

5.3.1 Definition

Shows how spatial autocorrelation fades as distance increases.

5.3.2 Research Question

General: How far does the influence travel? What is the range of the cluster?

cor_results <- sp.correlogram(nb, map_data$prevalence, order = 4, method = "I", style = "W", zero.policy = TRUE)
cor_df <- data.frame(Lag = 1:4, Morans_I = cor_results$res[, 1], 
                     P_Value = 2 * (1 - pnorm(abs(cor_results$res[, 1] - cor_results$res[, 2]) / cor_results$res[, 3]))) %>%
  mutate(Significance = ifelse(P_Value < 0.05, "Significant", "Not Significant"))

p_corr <- ggplot(cor_df, aes(x = Lag, y = Morans_I)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") + geom_line() +
  geom_point(aes(color = Significance), size = 4) +
  scale_color_manual(values = c("Significant" = "red", "Not Significant" = "blue")) +
  labs(title = "8. Spatial Correlogram", y = "Moran's I") + theme_minimal()
plot(p_corr)

Interpretation of Results: The plot shows high autocorrelation at Lag 1 (immediate neighbors) which drops significantly at Lag 2 and crosses zero (becomes random/negative) by Lag 3. This suggests the “neighborhood effect” of vaccination performance extends to immediate neighbors and neighbors-of-neighbors, but not across the whole country.

6 Part 4: Local Clustering Methods

6.1 Method 8: Local Moran’s I (LISA)

6.1.1 Definition

Decomposes the Global I to identify the specific locations of clusters.

6.1.2 Research Question

General: Where exactly are the Hotspots and Coldspots located?

6.1.3 Mathematical Formula

\[ I_i = \frac{(x_i - \bar{x})}{S^2} \sum_j w_{ij} (x_j - \bar{x}) \]

lisa <- localmoran(map_data$prevalence, listw, zero.policy = TRUE)
map_data$Pr <- lisa[,5] 
map_data$lisa_cluster <- case_when(
  map_data$quadrant == "High-High" & map_data$Pr <= 0.05 ~ "High-High",
  map_data$quadrant == "Low-Low" & map_data$Pr <= 0.05 ~ "Low-Low",
  TRUE ~ "Not Significant"
)

ggplot(map_data) +
  geom_sf(aes(fill = lisa_cluster), color = "black", size = 0.1) +
  scale_fill_manual(values = c("High-High"="red", "Low-Low"="blue", "Not Significant"="white")) +
  labs(title = "8. LISA Cluster Map (Cluster Location)") + theme_void()

Interpretation of Results: Red (High-High): We see a significant cluster of high vaccination in the Northern regions. Blue (Low-Low): We see a significant cluster of low vaccination in the Southern regions. White: Central regions appear to be transition zones with no significant clustering.

6.2 Method 9: Getis-Ord Gi*

6.2.1 Definition

Measures the intensity of clustering using Z-scores.

6.2.2 Research Question

General: How intense is the concentration? (95% vs 99% confidence).

6.2.3 Mathematical Formula

\[ G_i^* = \frac{\sum_j w_{ij} x_j - \bar{X} \sum_j w_{ij}}{S \sqrt{\dots}} \]

gi_stat <- localG(map_data$prevalence, listw, zero.policy = TRUE)
map_data$gi_cat <- cut(as.numeric(gi_stat),
                       breaks = c(-Inf, -1.96, 1.96, Inf),
                       labels = c("Cold Spot", "Not Sig", "Hot Spot"))

ggplot(map_data) +
  geom_sf(aes(fill = gi_cat), color = "black", size = 0.1) +
  scale_fill_manual(values = c("Cold Spot"="blue", "Hot Spot"="red", "Not Sig"="white")) +
  labs(title = "9. Getis-Ord Gi* (Intensity)") + theme_void()

Interpretation of Results: This map reinforces the LISA results but focuses on intensity. The Red areas in the North are “Hotspots” where high vaccination rates are intensely concentrated (Z > 1.96). The Blue areas in the South are “Coldspots” where low rates are intensely concentrated.

6.3 Method 10: Join Count Analysis

6.3.1 Definition

Clustering statistics for binary (Categorical) data.

6.3.2 Research Question

General: If we simplify the world into “High” vs “Low”, do similar types stick together?

6.3.3 Mathematical Formula

\[ BB = \sum_i \sum_j w_{ij} x_i x_j \]

global_mean <- mean(map_data$prevalence, na.rm = TRUE)
map_data$Coverage_Class <- factor(ifelse(map_data$prevalence > global_mean, "High", "Low"))
listw_binary <- nb2listw(nb, style = "B", zero.policy = TRUE) 

ggplot(data = map_data) +
  geom_sf(aes(fill = Coverage_Class), color = "black", size = 0.2) +
  scale_fill_manual(values = c("High" = "forestgreen", "Low" = "darkorange")) +
  labs(title = "10. Join Count Binary Map (Categorical Clustering)") + theme_void()

Interpretation of Results: Even when simplifying the data to just “Above Average” (Green) and “Below Average” (Orange), the spatial divide is stark. The Green regions are contiguous in the North, and the Orange regions are contiguous in the South, confirming that the clustering is robust to data simplification.

7 Part 5: Advanced & Distributional Methods

7.1 Method 11: Hinge Map (Box Map)

7.1.1 Definition

Maps statistical outliers based on the Box Plot principles.

7.1.2 Research Question

General: Who are the extreme anomalies?

7.1.3 Mathematical Formula

Lower Outlier: \(< Q1 - 1.5 \times IQR\)
Upper Outlier: \(> Q3 + 1.5 \times IQR\)

probs <- c(0, 0.25, 0.5, 0.75, 1)
cuts <- quantile(map_data$prevalence, probs)
iqr <- diff(cuts[c(2, 4)])
lower_fence <- cuts[2] - 1.5 * iqr
upper_fence <- cuts[4] + 1.5 * iqr
map_data$Hinge_Cat <- cut(map_data$prevalence, 
                          breaks = c(-Inf, lower_fence, cuts[2], cuts[3], cuts[4], upper_fence, Inf),
                          labels = c("Lower Outlier", "< 25%", "25-50%", "50-75%", "> 75%", "Upper Outlier"), 
                          include.lowest = TRUE)

ggplot(map_data) +
  geom_sf(aes(fill = Hinge_Cat), color = "black", size = 0.1) +
  scale_fill_brewer(palette = "Spectral", direction = -1) +
  labs(title = "11. Hinge Map (Outlier Detection)") + theme_void()

Interpretation of Results: This map identifies if any region is performing exceptionally poorly or well compared to the national distribution. If Red or Blue regions appear, they are the statistical outliers that require immediate investigation.

7.2 Method 12: Density Analysis by Regime

7.2.1 Definition

Compares the probability density functions (distribution shapes) across different spatial regimes.

7.2.2 Research Question

General: Is the makeup of a Hotspot fundamentally different from a Coldspot?

map_data$Regime <- case_when(
  grepl("Hot Spot", map_data$gi_cat) ~ "Hot Spot Regime",
  grepl("Cold Spot", map_data$gi_cat) ~ "Cold Spot Regime",
  TRUE ~ "Transition Zone"
)

ggplot(map_data, aes(x = prevalence, fill = Regime)) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(values = c("Cold Spot Regime" = "blue", "Hot Spot Regime" = "red", "Transition Zone" = "grey")) +
  labs(title = "12. Density Analysis (Distribution Comparison)", x = "Prevalence") + 
  theme_minimal()

Interpretation of Results: The Red Curve (Hotspots) is shifted far to the right (high prevalence), while the Blue Curve (Coldspots) is shifted to the left. The lack of overlap between these curves suggests that the North and South of Somalia are operating under two completely different public health realities.

7.3 Method 13: Spatial Lag Map

7.3.1 Definition

Maps the average value of a region’s neighbors.

7.3.2 Research Question

General: What is the surrounding context?

7.3.3 Mathematical Formula

\[ Lag_i = \sum_j w_{ij} x_j \]

ggplot(map_data) +
  geom_sf(aes(fill = lag_prev), color = "black", size = 0.2) +
  scale_fill_distiller(palette = "YlGnBu", direction = 1, labels = percent) +
  labs(title = "13. Spatial Lag Map (Neighborhood Context)", 
       subtitle = "Average Rate of Neighbors") +
  theme_void()

Interpretation of Results: This map smooths out the local variations and shows the regional trends. It highlights the “environment” a region is situated in. A region in the South might have a moderate rate, but this map shows it is surrounded by a “low rate environment.”

7.4 Method 14: Local Geary’s C

7.4.1 Definition

Measures dissimilarity to highlight edges where values change abruptly.

7.4.2 Research Question

General: Where are the sharp boundaries or edges?

7.4.3 Mathematical Formula

\[ C_i = \sum_j w_{ij} (x_i - x_j)^2 \]

# Calculate Local Geary using localC
geary_loc <- localC(map_data$prevalence, listw)
map_data$geary_c <- geary_loc

ggplot(map_data) +
  geom_sf(aes(fill = geary_c), color = "black", size = 0.2) +
  scale_fill_viridis_c(option = "magma", direction = -1) +
  labs(title = "14. Local Geary's C (Edge Detection)", 
       subtitle = "Darker = High Dissimilarity") +
  theme_void()

Interpretation of Results: Darker areas on this map indicate “friction” zones—borders where a high-vaccination region touches a low-vaccination region. These are often areas of inequality or administrative barriers.

7.5 Method 15: Significance Map

7.5.1 Definition

Maps only the P-values from the LISA analysis.

7.5.2 Research Question

General: Where are we statistically confident that a pattern exists?

map_data$Sig_Cat <- cut(map_data$Pr, 
                        breaks = c(-Inf, 0.01, 0.05, 0.1, Inf),
                        labels = c("p < 0.01 (Critical)", "p < 0.05 (Warning)", "p < 0.1 (Watch)", "Not Sig"))

ggplot(map_data) +
  geom_sf(aes(fill = Sig_Cat), color = "black", size = 0.2) +
  scale_fill_manual(values = c("p < 0.01 (Critical)" = "darkred", 
                               "p < 0.05 (Warning)" = "red", 
                               "p < 0.1 (Watch)" = "pink", 
                               "Not Sig" = "grey90")) +
  labs(title = "15. Significance Map (Confidence Levels)") + theme_void()

Interpretation of Results: The dark red areas indicate regions where the clustering is so strong that the probability of it happening by chance is less than 1% (\(p < 0.01\)). These are the most critical areas for policy intervention.

7.6 Method 16: Conditional Boxplot

7.6.1 Definition

Boxplot of the variable grouped by the quality of the neighborhood.

7.6.2 Research Question

General: Does the environment affect the outcome?

# Categorize the Lag (Neighborhood Quality)
map_data$Lag_Quality <- cut(map_data$lag_prev, 
                            breaks = quantile(map_data$lag_prev, probs = c(0, 0.33, 0.66, 1)),
                            labels = c("Low Tier Neighbors", "Mid Tier Neighbors", "High Tier Neighbors"),
                            include.lowest = TRUE)

ggplot(map_data, aes(x = Lag_Quality, y = prevalence, fill = Lag_Quality)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "16. Conditional Boxplot (Neighborhood Effect)",
       y = "Region Value", x = "Neighborhood Status") +
  theme_minimal()

Interpretation of Results: We observe a clear “step-up” pattern. Regions surrounded by “High Tier Neighbors” have a significantly higher median vaccination rate than those surrounded by “Low Tier Neighbors.” This confirms that geography is a strong predictor of vaccination success in Somalia.

8 Summary Table: The ESDA Toolbox

Method	Best Concise Definition	Research Question	General Tool Analogy	Health Tool Analogy
1. Spatial Weights	Defining neighbor relationships.	Who is connected to whom?	The Network Cables	The Nervous System
2. Choropleth Map	Visualizing data values using color.	What does the landscape look like?	The Snapshot Photo	Visual Inspection
3. Empirical Bayes	Adjusting rates for small populations.	Is that rate real or just noise?	The Noise Filter	Corrective Glasses
4. Quantile Map	Grouping data into equal categories.	How does this region rank?	The Ranking Board	BMI Categories
5. Global Moran’s I	Statistic measuring total clustering.	Is the system clustered or random?	The Smoke Alarm	Body Thermometer
6. Moran Scatterplot	Plotting region vs. neighbors.	Is the region acting like its peers?	The Mirror	Growth Chart
7. Spatial Correlogram	Autocorrelation over distance.	How far does the influence travel?	The Ripple Effect	Sneeze Radius
8. LISA (Local Moran)	Finding specific cluster locations.	Where exactly are the clusters?	The Metal Detector	The X-Ray
9. Getis-Ord Gi*	Z-scores of intensity.	How intense is the concentration?	The Heat Sensor	Pain Scale
10. Join Count	Clustering for binary data.	Do similar types stick together?	The Magnet Test	Contact Tracing
11. Hinge Map	Mapping statistical outliers.	Who are the extreme anomalies?	The Bouncer	Triage Nurse
12. Density by Regime	Comparing distributions by zone.	Is the makeup of a Hotspot different?	The DNA Test	The Biopsy
13. Spatial Lag Map	Mapping neighbor averages.	What is the surrounding context?	The Background Check	Family History
14. Local Geary’s C	Measuring dissimilarity/edges.	Where are the sharp boundaries?	The Edge Detector	Reflex Hammer
15. Significance Map	Mapping P-values only.	Where are we statistically confident?	The Confidence Flag	Red Flag Chart
16. Conditional Plot	Boxplot by neighbor quality.	Does the environment affect the outcome?	The Environment Test	Environmental Check

```

Module VII: Exploratory Spatial Statistics (ESDA)

Course: Spatial Statistics for Data Science

Abdisalam Hassan Muse (PhD)

2025-12-13