Spatial distribution and Determinants of unimproved sources of drinking water among Somali Households: Spatial and Multilevel Analysis using Nationwide Survey Data

Introduction

This document presents a spatial analysis of unimproved sources of drinking water among households in Somalia using nationwide survey data 2020. Utilizing data from the Demographic and Health Survey (DHS), we explore the spatial distribution of unimproved sources of drinking water and examine patterns of spatial autocorrelation. This analysis will use spatial analysis techniques including spatial mapping, spatial autocorrelation statistics such as Global and Local Moran’s I, and Getis-Ord Gi* statistic. Understanding the geographic distribution ofunimproved sources of drinking water among households in Somalia and the spatial factors is critical for planning targeted public services.

Data and Methods

Libraries

First we load the necessary R libraries for data manipulation, spatial analysis, and visualization.

# `sf`: For handling spatial vector data (shapefiles).
library(sf)

## Warning: package 'sf' was built under R version 4.3.3

## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

# `ggplot2`: For creating static and publication-quality graphics.
library(ggplot2)
# `spdep`: For spatial analysis and spatial autocorrelation.
library(spdep)

## Warning: package 'spdep' was built under R version 4.3.1

## Loading required package: spData

## Warning: package 'spData' was built under R version 4.3.1

## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

# `tmap`: For creating thematic maps.
library(tmap)

## Warning: package 'tmap' was built under R version 4.3.3

## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')

# `haven`: For reading data files from various statistical packages (including .dta).
library(haven)
# `survey`: For complex survey analysis
library(survey)

## Warning: package 'survey' was built under R version 4.3.3

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

# `RColorBrewer`: Provides color schemes for maps and charts.
library(RColorBrewer)

Loading Data

We load the Somalia shapefile and the DHS data, perform some data cleaning and pre-processing.

# Load shapefile
# Specify the file path to the shapefile representing the administrative boundaries of Somalia.
filepath<-"~/gadm41_SOM_1.shp"
# Read the shapefile using st_read() from the sf package.
somalia<-st_read(filepath)

## Reading layer `gadm41_SOM_1' from data source 
##   `C:\Users\hp\Documents\gadm41_SOM_1.shp' using driver `ESRI Shapefile'
## Simple feature collection with 18 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 40.9785 ymin: -1.647082 xmax: 51.4157 ymax: 11.98931
## Geodetic CRS:  WGS 84

# Print the names of the first-level administrative units (regions) in Somalia.
somalia$NAME_1

##  [1] "Awdal"             "Bakool"            "Banaadir"         
##  [4] "Bari"              "Bay"               "Galguduud"        
##  [7] "Gedo"              "Hiiraan"           "Jubbada Dhexe"    
## [10] "Jubbada Hoose"     "Mudug"             "Nugaal"           
## [13] "Sanaag"            "Shabeellaha Dhexe" "Shabeellaha Hoose"
## [16] "Sool"              "Togdheer"          "Woqooyi Galbeed"

# Load DHS data
# Specify the file path to the DHS dataset (.dta file).
water <- read_dta("~/water.dta")
# Print the structure of the 'HV024' column, which likely contains region codes.
table(water$HV024,water$dv)

##     
##         0    1
##   11  439  408
##   12  486  836
##   13  666  596
##   14  297  989
##   15  451  906
##   16  447  422
##   17  311  565
##   18  565  324
##   19  608  245
##   20  324  539
##   21  500  332
##   22 1680   40
##   24  156  143
##   25  154  625
##   26  355  531
##   28  440  443

Calculating Unimproved water Proportion

We calculate the proportion of unimproved sources of drinking water proportion by region, using survey weights from the DHS data.

# Calculate the proportion unimproved sources of drinking water by region
# Apply the function to subsets of the survey data using svyby
# Create a survey design object
options(survey.lonely.psu = "remove")
design <- svydesign(id = ~HV001, strata = ~STRU_CD, weights = ~HV005, data = water,nest=TRUE)
result <- svyby(~dv, ~HV024, FUN = svyciprop, design = design, vartype = NULL)
# Print the resulting data frame showing the mean fuel use per region.
print(result)

##    HV024         dv se.as.numeric(dv)
## 11    11 0.42676883       0.044586878
## 12    12 0.71198977       0.076644530
## 13    13 0.39949998       0.046623999
## 14    14 0.77007411       0.023935665
## 15    15 0.69225961       0.033531644
## 16    16 0.40642037       0.035543746
## 17    17 0.48917276       0.031636619
## 18    18 0.41075271       0.061214642
## 19    19 0.14543827       0.015501695
## 20    20 0.58836273       0.172101114
## 21    21 0.27406782       0.073596023
## 22    22 0.02353069       0.007190857
## 24    24 0.47826087       0.092681596
## 25    25 0.78995401       0.056969945
## 26    26 0.40401631       0.073247081
## 28    28 0.33989798       0.046929558

# Add region names to the result data frame
# Define a character vector with the names of the regions, ordered to match the codes in the result.
labels <- c("Awdal", "Woqooyi Galbeed", "Togdheer", "Sool", "Sanaag", "Bari", "Nugaal", "Mudug",
            "Galguduud", "Hiiraan", "Shabeellaha Dhexe", "Banaadir",  "Bay",
            "Bakool", "Gedo",  "Jubbada Hoose")
# Add a 'Region' column to the result data frame, mapping region names to region codes.
result$Region <- labels
# Sort the result by the 'Region' column
result <- result[order(result$Region), ]
# Print the resulting dataframe that is now labeled and sorted.
print(result)

##    HV024         dv se.as.numeric(dv)            Region
## 11    11 0.42676883       0.044586878             Awdal
## 25    25 0.78995401       0.056969945            Bakool
## 22    22 0.02353069       0.007190857          Banaadir
## 16    16 0.40642037       0.035543746              Bari
## 24    24 0.47826087       0.092681596               Bay
## 19    19 0.14543827       0.015501695         Galguduud
## 26    26 0.40401631       0.073247081              Gedo
## 20    20 0.58836273       0.172101114           Hiiraan
## 28    28 0.33989798       0.046929558     Jubbada Hoose
## 18    18 0.41075271       0.061214642             Mudug
## 17    17 0.48917276       0.031636619            Nugaal
## 15    15 0.69225961       0.033531644            Sanaag
## 21    21 0.27406782       0.073596023 Shabeellaha Dhexe
## 14    14 0.77007411       0.023935665              Sool
## 13    13 0.39949998       0.046623999          Togdheer
## 12    12 0.71198977       0.076644530   Woqooyi Galbeed

Merging Data

The shapefile and the calculated unimproved sources of drinking water proportions are merged using a common identifier, region name. We also calculate the centroids of each region, these will be used for placing labels on the spatial map.

# Merge the shapefile and result data frame
# Merge the shapefile data (`somalia`) with the calculated mean unimproved sources of drinking water data (`result`).
merged_data <- merge(somalia, result, by.x = "NAME_1", by.y = "Region")

# Calculate centroids for label placement
# Calculate the centroids (geometric center points) of each polygon in the merged dataset.
centroid <- st_centroid(merged_data)

## Warning: st_centroid assumes attributes are constant over geometries

# Extract x-coordinates of centroids and add them as column named 'x_position'
merged_data$x_position <- centroid[, 1]
# Extract y-coordinates of centroids and add them as column named 'y_position'
merged_data$y_position <- centroid[, 2]

# Convert SpatialPolygonsDataFrame to sf object
# Convert merged_data which is a SpatialPolygonsDataFrame into an sf object for use in ggplot
merged_data_sf <- st_as_sf(merged_data)

Spatial Autocorrelation Analysis

Spatial Weights Matrix

We create a spatial weights matrix using queen contiguity to define the neighborhood structure of our spatial units. We remove the records with NA values for the unimproved sources of drinking water, which are not useful for the analysis.

# Exploratory Spatial Data Analysis
# Spatial Autocorrelation Analysis

# 1. Create spatial weights matrix: Queen contiguity
# Remove rows with NA values in the `dv` column (mean fuel use)
merged_data <- merged_data[!is.na(merged_data$dv), ]
# Create a neighborhood object based on queen contiguity between regions.
nb <- poly2nb(merged_data, queen = TRUE)
# Create a spatial weights list from the neighborhood object using the 'W' style (row-standardized weights).
nbw <- nb2listw(nb, style = "W")

Global Moran’s I

We calculate the Global Moran’s I statistic to assess overall spatial autocorrelation. This step provides us with a measure of whether the data is randomly distributed, or spatially clustered. A statistically significant Moran’s I indicates significant spatial clustering.

# 2. Calculate Global Moran's I
set.seed(123)
# Calculate Global Moran's I statistic for the unimproved sources of drinking water data.
gmoran <- moran.test(merged_data$dv, nbw, alternative = "greater")
# Print the output of the Moran's I test.
print(gmoran)

## 
##  Moran I test under randomisation
## 
## data:  merged_data$dv  
## weights: nbw    
## 
## Moran I statistic standard deviate = 1.6742, p-value = 0.04704
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.27819967       -0.06666667        0.04242903

# Extract and print the Moran's I statistic.
print(gmoran[["estimate"]][["Moran I statistic"]])

## [1] 0.2781997

# Extract and print the z-score of the test.
print(gmoran[["statistic"]])

## Moran I statistic standard deviate 
##                           1.674246

# Extract and print the p-value of the test.
print(gmoran[["p.value"]])

## [1] 0.04704114

# Interpretation:
# - If p-value < 0.05 (significance level), reject the null hypothesis of no spatial autocorrelation.
# - Conclusion: There is evidence of spatial autocorrelation in fuel use.

# Moran's for MC
# Conduct a Monte Carlo simulation for Moran's I to assess significance.
gmoranMC <- moran.mc(merged_data$dv, nbw, nsim = 999)
# Print the output of the Monte Carlo simulation
print(gmoranMC)

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  merged_data$dv 
## weights: nbw  
## number of simulations + 1: 1000 
## 
## statistic = 0.2782, observed rank = 948, p-value = 0.052
## alternative hypothesis: greater

# Open a new graphics window.
#windows()
# Create a histogram of simulated Moran's I values.
#hist(gmoranMC$res)
# Add a vertical line at the observed Moran's I statistic.
#abline(v = gmoranMC$statistic, col = "red")

Local Moran’s I

Next, we calculate Local Moran’s I to identify specific regions with statistically significant local clusters. This statistic identifies both “hot spots” and “cold spots.”

# 4. Calculate Local Moran's I
# Calculate Local Moran's I for each region.
lmoran <- localmoran(merged_data$dv, nbw,
                     alternative = "greater")
# Print the first few rows of the local Moran's I results
print(head(lmoran))

##            Ii          E.Ii      Var.Ii       Z.Ii Pr(z > E(Ii))
## 1 -0.19066095 -0.0016422912 0.026233505 -1.1670152     0.8783979
## 2  0.23554899 -0.1684796812 0.640430986  0.5048659     0.3068265
## 3  1.86846933 -0.2929507102 3.314089466  1.1872907     0.1175565
## 4 -0.23418775 -0.0043287248 0.019702797 -1.6375615     0.9492434
## 5  0.06000519 -0.0005482829 0.004070725  0.9490808     0.1712898
## 6  0.25423567 -0.1519982616 0.589233326  0.5292155     0.2983280

# 5. Add Local Moran's results to merged_data
# Add the local Moran's I statistic to the merged dataset.
merged_data$lmI <- lmoran[, "Ii"]
# Add the z-scores of local Moran's I to the merged dataset.
merged_data$lmZ <- lmoran[, "Z.Ii"]
# Add p-values of local Moran's I to the merged dataset.
merged_data$lmp <- lmoran[, "Pr(z > E(Ii))"]

Getis-Ord Gi*

We compute the Getis-Ord Gi* statistic to identify spatial clusters of high and low values.

# 6. Calculate Getis-Ord Gi*
# Calculate Getis-Ord Gi* statistic for each region.
local_gi <- localG(merged_data$dv, nbw)
# Add Getis-Ord Gi* statistic to the merged dataset
merged_data$gi <- local_gi

# Convert Getis-Ord Gi* results into z-score to use for plotting
# Calculate z-scores for the Getis-Ord Gi* statistic.
z_gi <- (local_gi - mean(local_gi)) / sd(local_gi)
# Add the z-scores of Gi* to the merged dataset.
merged_data$zgi <- z_gi

# Calculate P-values for Getis-Ord statistic:
# The original localG function doesn't directly give p-values, we need to calculate them based on the z-scores:

merged_data$p_gi <- pnorm(abs(merged_data$zgi), lower.tail = FALSE)*2

Spatial Visualization

We visualize the spatial data using tmap to display spatial patterns of unimproved sources of drinking water and associated statistics.

# Create maps using tmap
# Set tmap mode to "view" for interactive map viewing.
tmap_mode("view")

## tmap mode set to interactive viewing

Figure 1: Spatial Distribution of unimproved sources of drinking water and Spatial Lag

The first visualization shows the spatial distribution of unimproved sources of drinking water, highlighting the spatial differences in the proportion of unimproved sources of drinking water across Somalia. Along with that, the spatial lag map will reveal the influence of the neighbours for the unimproved sources of drinking water distribution.

#################################################
# Figure 1 : Spatial Map of dv and its Spatial Lag Map
# (Visualizing Data & Spatial Context)
#################################################


# Map 1: unimproved sources of drinking water by region
p1 <- tm_shape(merged_data) +
  # Plot polygons filled based on `dv` (unimproved sources of drinking water) using quantile breaks.
  tm_polygons(col = "dv", title = "unimproved sources of drinking water", style = "quantile") +
  # Adjust layout so that the legend is outside of the plot.
  tm_layout(legend.outside = TRUE)

# Calculate spatial lag variable
merged_data$dv_lag <- lag.listw(nbw, merged_data$dv)
# Create the spatial lag map using tmap
p2 <- tm_shape(merged_data) +
  tm_polygons("dv_lag", title = "Spatial Lag of unimproved sources of drinking water", style = "quantile") +
  tm_layout(legend.outside = TRUE)

tmap_arrange(p1, p2, ncol = 2)

Figure 2: Global Moran’s I

The second visualization is the Normal Distribution Curve Plot, which shows the significance level of the global Moran’s I value. This will show the reader how likely the spatial distribution of the data is random.

###################################################
# Figure 2 Global Moran's I (Overall Spatial Dependence)
###################################################

# Create the Normal Distribution Curve Plot
# Generate data for the normal distribution curve
set.seed(123)
x <- seq(-4, 4, length = 200)
y <- dnorm(x)
df <- data.frame(x, y)

# Define critical z-scores
critical_z <- c(-2.58, -1.96, -1.65, 1.65, 1.96, 2.58)

# Define colors for the legend
legend_colors <- c("blue", "lightblue", "yellow", "lightcoral", "red")
legend_labels <- c("< -2.58", "-2.58 - -1.96", "-1.96 - -1.65", "1.65 - 1.96", "1.96 - 2.58", "> 2.58")

# Create the plot
p3 <- ggplot(df, aes(x, y)) +
  geom_line(color = "black") +
  # Add shaded areas for different significance levels
  geom_area(aes(x = ifelse(x < critical_z[1], x, NA)), fill = "blue", alpha = 0.5) +
  geom_area(aes(x = ifelse(x > critical_z[1] & x < critical_z[2], x, NA)), fill = "lightblue", alpha = 0.5) +
  geom_area(aes(x = ifelse(x > critical_z[2] & x < critical_z[3], x, NA)), fill = "yellow", alpha = 0.5) +
  geom_area(aes(x = ifelse(x > critical_z[3] & x < critical_z[4], x, NA)), fill = "yellow", alpha = 0.5) +
  geom_area(aes(x = ifelse(x > critical_z[4] & x < critical_z[5], x, NA)), fill = "lightcoral", alpha = 0.5) +
  geom_area(aes(x = ifelse(x > critical_z[5], x, NA)), fill = "red", alpha = 0.5) +
  geom_vline(xintercept = gmoran[["statistic"]], color = "red", linetype = "dashed", linewidth = 1) +
  # Add critical value lines and labels
  geom_vline(xintercept = critical_z, color = "gray", linetype = "dotted") +
  annotate("text", x = critical_z, y = 0, label = critical_z, vjust = 1.5, size = 3, color = "black") +
  # Add text annotations for z-score and p-value
  annotate("text", x = gmoran[["statistic"]], y = max(y)/2,
           label = paste("z-score:", round(gmoran[["statistic"]], 2)),
           size = 3, color = "red", hjust = 0, vjust = -1) +
  annotate("text", x = 0, y = max(y) * 0.3,
           label = "Random",
           size = 3, color = "black", hjust = 0.5, vjust = -1) +
  annotate("text", x = -3, y = max(y) * 0.3,
           label = "Significant",
           size = 3, color = "black", hjust = 0.5, vjust = -1) +
  annotate("text", x = 3, y = max(y) * 0.3,
           label = "Significant",
           size = 3, color = "black", hjust = 0.5, vjust = -1) +
  # Add text annotations for Moran's I, z-score, and p-value
  annotate("text", x = -3.5, y = max(y) * 0.8,
           label = paste("Moran's Index:", round(gmoran[["estimate"]][["Moran I statistic"]], 5)),
           size = 2.5, color = "black", hjust = 0) +
  annotate("text", x = -3.5, y = max(y) * 0.7,
           label = paste("z-score:", round(gmoran[["statistic"]], 5)),
           size = 2.5, color = "black", hjust = 0) +
  annotate("text", x = -3.5, y = max(y) * 0.6,
           label = paste("p-value:", format(gmoran[["p.value"]], scientific = FALSE, digits = 5)),
           size = 2.5, color = "black", hjust = 0) +
  labs(title = "Normal Distribution Curve", x = "Z-score", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(size=9,face = "bold", color = "blue"),
        legend.position = "right",
        panel.spacing = unit(0.05, "lines"),
        plot.margin = margin(t = 0.2, r = 0.2, b = 0.2, l = 0.2, "cm")
  ) +
  # Add legend using annotate
  annotate("rect", xmin = 4.1, xmax = 4.3, ymin = seq(0.05, 0.45, length.out = length(legend_colors)),
           ymax = seq(0.1, 0.5, length.out = length(legend_colors)), fill = legend_colors, alpha = 0.5) +
  annotate("text", x = 4.4, y = seq(0.075, 0.475, length.out = length(legend_labels)),
           label = legend_labels, hjust = 0, size = 2.5) +
  annotate("text", x = 4.2, y = 0.55, label = "Critical Value (z-score)", hjust = 0.5, size = 2.5, fontface = "bold")+
  coord_cartesian(xlim = c(-4,5), ylim = c(0,0.6))


# Display the plot
#windows()
print(p3)

## Warning: Removed 164 rows containing non-finite values (`stat_align()`).

## Warning: Removed 185 rows containing non-finite values (`stat_align()`).

## Warning: Removed 192 rows containing non-finite values (`stat_align()`).

## Warning: Removed 118 rows containing non-finite values (`stat_align()`).

## Warning: Removed 192 rows containing non-finite values (`stat_align()`).

## Warning: Removed 149 rows containing non-finite values (`stat_align()`).

Figure 3: Local Moran’s I and Significance

The third visualization displays the Local Moran’s I and its p-values, pinpointing regions with statistically significant clusters of high and low unimproved sources of drinking water.

#################################################
# Figure 3 Local Moran's I and its p-values
# (Pinpointing Local Clusters)
#################################################


# Map 4: Local Moran's I statistic
p4 <- tm_shape(merged_data) +
  # Plot polygons filled based on the `lmI` variable using quantile breaks.
  tm_polygons(col = "lmI", title = "Local Moran's I",
              style = "quantile") +
  # Adjust layout so that the legend is outside of the plot.
  tm_layout(legend.outside = TRUE)



# Map 5: P-value for Local Moran's I
p5 <- tm_shape(merged_data) +
  # Plot polygons filled based on `lmp` (p-values) with specified breaks
  tm_polygons(col = "lmp", title = "p-value",
              breaks = c(-Inf, 0.05, Inf)) +
  # Adjust layout so that the legend is outside of the plot.
  tm_layout(legend.outside = TRUE)

tmap_arrange(p4,p5, ncol = 2)

## Variable(s) "lmI" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Figure 4: Getis-Ord Gi* and Significance

The fourth visualization shows the Getis-Ord Gi* statistic and its p-values. This is another way to evaluate clustering, specifically focusing on identifying statistically significant hot spots and cold spots.

#####################################################
# Figure 4: Getis-Ord Gi* and its p-values
# (Hot and Cold Spots, and Significance)
#####################################################

# Map 5: Getis-Ord Gi* statistic
p6 <- tm_shape(merged_data) +
  # Plot polygons filled based on `gi` (Getis-Ord Gi*) using quantile breaks
  tm_polygons(col = "gi", title = "Getis-Ord Gi*",
              style = "quantile") +
  # Adjust layout so that the legend is outside of the plot.
  tm_layout(legend.outside = TRUE)




p7 <- tm_shape(merged_data) +
  # Plot polygons filled based on `p_gi` (p-values of Getis-Ord Gi*) with specified breaks
  tm_polygons(col = "p_gi", title = "P-value of Getis-Ord Gi*",
              breaks = c(-Inf, 0.05, Inf))+
  # Adjust layout so that the legend is outside of the plot.
  tm_layout(legend.outside = TRUE)

tmap_arrange(p6,p7, ncol = 2)

## Variable(s) "gi" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Conclusion

This document provides a detailed spatial analysis of unimproved sources of drinking water among Somali households using nationwide survey dataset 2020. The combined use of spatial mapping, the spatial lag variable, the Global Moran’s I statistic, Local Moran’s I statistics, and Getis-Ord Gi* statistic help reveal significant patterns of spatial dependence and clusters of high and low unimproved sources of drinking water. The findings of this spatial analysis can help to inform policy makers in Somalia.

Future analysis will involve multi-level modeling to examine the individual and community level determinants of unimproved sources of drinking water, accounting for spatial autocorrelation observed in this analysis.

References

Roger Bivand, 2024. spdep: Spatial Dependence: Weighting Schemes, Statistics. R package version 1.3-1.
Martijn Tennekes, 2023. tmap: Thematic Maps. R package version 3.4.1.
Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Thomas Lumley (2023). survey: Analysis of Complex Survey Samples. R package version 4.2.
Edzer Pebesma (2023). sf: Simple Features for R. R package version 1.0-14.

```