Welcome to Module 7! In this module, we’ll delve into the concept of spatial autocorrelation, a core idea in spatial data analysis. We’ll explore how to measure and visualize the extent to which values of a variable are correlated with themselves through space. We’ll also look at how to identify spatial clusters and outliers. Understanding these concepts is crucial when dealing with spatial data, as it allows us to see if spatial patterns exist in your data and avoid violating the independence assumption in statistical tests.
By the end of this module, you should be able to:
R
packages like spdep
,
tmap
and ggplot2
for spatial autocorrelation
and cluster analysis.Spatial autocorrelation refers to the degree to which values at one location are similar to values at nearby locations. This concept is closely linked to Tobler’s First Law of Geography: “Everything is related to everything else, but near things are more related than distant things”.
Global Moran’s I is a single index that summarizes the spatial autocorrelation across an entire study area. It quantifies how similar each region is with its neighbors, and averages all these assessments.
Mathematically, Global Moran’s I is calculated as: \[ I = \frac{n \sum_i \sum_j w_{ij} (Y_i - \bar{Y})(Y_j - \bar{Y})}{(\sum_{i \neq j} w_{ij}) \sum_i (Y_i - \bar{Y})^2} \]
Where:
n
is the number of regions.Y_i
is the value of the variable in region
i
.Ȳ
is the mean of all values.w_ij
is the spatial weight between regions
i
and j
. It reflects the spatial proximity
between regions.Spatial weights are crucial in calculating Moran’s I. They quantify how near regions are to each other. Common weight schemes include:
In this module we will use queen contiguity weights.
To test for spatial autocorrelation, we use a hypothesis testing approach
We test this hypothesis by comparing the observed value of Moran’s I to the distribution of values that would be observed if spatial autocorrelation was absent.
The test statistic is calculated as a z-score:
\[ z = \frac{I - E[I]}{\sqrt{Var[I]}} \]
Where:
I
is the observed Moran’s I value.E[I]
is the expected Moran’s I value under the null
hypothesis.Var[I]
is the variance of Moran’s I under the null
hypothesis.When the number of regions is sufficiently large, the distribution of the test statistic is assumed to be normal with mean 0 and variance 1. An alternative approach to judge significance is Monte Carlo randomization. This method creates random patterns by reassigning the observed values among the areas and calculates the Moran’s I for each of the patterns, providing a randomization distribution for the Moran’s I.
Based on the p-value, we make our decision:
Let’s use the data from the example codes you provided to see how we
can implement the global Moran’s I test in R
.
First we load the necessary libraries and data. Note that we have changed the file paths to be reproducible. Make sure you use your local paths.
# Load libraries
library(sf)
library(ggplot2)
library(spdep)
library(tmap)
library(haven)
library(survey)
library(RColorBrewer)
# Load shapefile
filepath<-"gadm41_SOM_1.shp"
somalia<-st_read(filepath)
somalia$NAME_1
Now, let’s load the fuel use data and generate the variable we will use to do the test of spatial autocorrelation.
# Load DHS data
fueldata <- read_dta("fueluse.dta")
str(fueldata$Region)
# Calculate the mean solid fuel use by region
options(survey.lonely.psu = "remove")
design <- svydesign(id = ~HV001, strata = ~STRU_CD, weights = ~HV005, data = fueldata,nest=TRUE)
result <- svyby(~dv, ~Region, FUN = svyciprop, design = design, vartype = NULL)
print(result)
# Add region names to the result data frame
labels <- c("Awdal", "Woqooyi Galbeed", "Togdheer", "Sool", "Sanaag", "Bari", "Nugaal", "Mudug",
"Galguduud", "Hiiraan", "Shabeellaha Dhexe", "Banaadir", "Bay",
"Bakool", "Gedo", "Jubbada Hoose")
result$Region <- labels
result <- result[order(result$Region), ]
print(result)
# Merge the shapefile and result data frame
merged_data <- merge(somalia, result, by.x = "NAME_1", by.y = "Region")
Now we create the spatial weights matrix based on queen contiguity
and use the moran.test()
function of the spdep
package to calculate Moran’s I.
# Remove rows with NA values in the `dv` column (mean fuel use)
merged_data <- merged_data[!is.na(merged_data$dv), ]
# Create a neighborhood object based on queen contiguity between regions.
nb <- poly2nb(merged_data, queen = TRUE)
# Create a spatial weights list from the neighborhood object using the 'W' style (row-standardized weights).
nbw <- nb2listw(nb, style = "W")
# Calculate Global Moran's I
gmoran <- moran.test(merged_data$dv, nbw, alternative = "greater")
print(gmoran)
print(gmoran[["estimate"]][["Moran I statistic"]])
print(gmoran[["statistic"]])
print(gmoran[["p.value"]])
The results of the global moran’s I test indicate a positive spatial autocorrelation in the use of solid fuel for cooking. This is because the estimated Moran’s I is higher than the expected value and because the p-value is lower than the significance level of 0.05
We can use the moran.mc()
function to check the
significance by using a Monte Carlo approach.
# Moran's for MC
gmoranMC <- moran.mc(merged_data$dv, nbw, nsim = 999)
print(gmoranMC)
windows()
hist(gmoranMC$res)
abline(v = gmoranMC$statistic, col = "red")
The results of the Monte Carlo approach also suggest the existence of positive spatial autocorrelation. The observed statistic is in the tails of the distribution.
A Moran’s I scatterplot helps visualize spatial autocorrelation.
Here’s the code to produce a Moran’s I scatterplot using our fuel data:
# Visualize with Moran's I scatterplot
moran.plot(merged_data$dv, nbw)
The Moran’s scatter plot shows a positive trend, indicating again positive spatial autocorrelation.
While the Global Moran’s I provides a single index to assess the overall spatial autocorrelation across an entire study region, Local Indicators of Spatial Association (LISAs), such as the Local Moran’s I, allow us to identify areas with significant spatial clustering of similar values or areas that are significantly different from its neighbors. It is a local version of the global Moran’s I, calculated for each region or polygon in the dataset, and defined as: \[ I_i = \frac{n(Y_i - \bar{Y})}{\sum_j(Y_j - \bar{Y})^2}\sum_j w_{ij}(Y_j - \bar{Y}) \] Where:
n
is the number of regions.Y_i
is the value of the variable in region
i
.Ȳ
is the mean of all values.w_ij
is the spatial weight between regions
i
and j
.Y_i
at region i
is fixed, and the
remaining values are randomly reassigned over the other regions. This
allows us to find a p-value representing the probability of observing
the given local I statistic if the values were randomly
distributed.Let’s use the localmoran()
function to calculate local
Moran’s I statistic.
# Calculate Local Moran's I
lmoran <- localmoran(merged_data$dv, nbw, alternative = "greater")
print(head(lmoran))
# Add Local Moran's results to merged_data
merged_data$lmI <- lmoran[, "Ii"]
merged_data$lmZ <- lmoran[, "Z.Ii"]
merged_data$lmp <- lmoran[, "Pr(z > E(Ii))"]
The localmoran()
function returns a matrix with the
local Moran’s I statistic, its expectation and variance, and the
p-value. We have added the local Moran’s I, the z-scores, and p-values
to the dataframe to visualize them in a map.
Mapping local Moran’s I statistic, z-scores, or p-values can help visualize spatial clusters.
The following is the code to produce maps of the local Moran’s I
statistics, the z-scores, and p-values. We use tmap
to do
this.
# Create maps using tmap
tmap_mode("view")
# Create thematic maps using tmap
# Map 1: Fuel Use by region
p1 <- tm_shape(merged_data) +
tm_polygons(col = "dv", title = "Fuel Use", style = "quantile") +
tm_layout(legend.outside = TRUE)
# Map 2: Local Moran's I statistic
p2 <- tm_shape(merged_data) +
tm_polygons(col = "lmI", title = "Local Moran's I",
style = "quantile") +
tm_layout(legend.outside = TRUE)
# Map 3: Z-score for Local Moran's I
p3 <- tm_shape(merged_data) +
tm_polygons(col = "lmZ", title = "Z-score",
breaks = c(-Inf, 1.65, Inf)) +
tm_layout(legend.outside = TRUE)
# Map 4: P-value for Local Moran's I
p4 <- tm_shape(merged_data) +
tm_polygons(col = "lmp", title = "p-value",
breaks = c(-Inf, 0.05, Inf)) +
tm_layout(legend.outside = TRUE)
# Arrange the generated tmap objects into a single interactive display
# Arrange maps in a grid with 2 columns
tmap_arrange(p1, p2, p3, p4, ncol = 2)
The maps above help visualizing the spatial pattern of the local Moran’s I statistics and to check which areas are significantly autocorrelated based on the p-values and z-scores.
Getis-Ord Gi* statistic is another measure of local spatial autocorrelation which is especially useful for identifying hot spots or cold spots. Unlike the local Moran’s I, Getis-Ord Gi* focuses on identifying clusters of high and low values.
For a given area i
it is defined as: \[
G_i^* = \frac{\sum_j w_{ij} Y_j}{\sum_j Y_j}
\]
Where:
Y_j
is the value of the variable in region jw_ij
is the spatial weight between regions
i
and j
. A positive z-score for the Getis-Ord
Gi* statistic suggests that area i is part of a hot spot, whereas a
negative z-score suggests that area i is part of a cold spot.Let’s calculate the Getis-Ord Gi* using the localG()
function.
# Calculate Getis-Ord Gi*
local_gi <- localG(merged_data$dv, nbw)
merged_data$gi <- local_gi
# Convert Getis-Ord Gi* results into z-score to use for plotting
z_gi <- (local_gi - mean(local_gi)) / sd(local_gi)
merged_data$zgi <- z_gi
Now, we can add these to our map of the other statistics.
# Create maps using tmap
tmap_mode("view")
# Create thematic maps using tmap
# Map 1: Fuel Use by region
p1 <- tm_shape(merged_data) +
tm_polygons(col = "dv", title = "Fuel Use", style = "quantile") +
tm_layout(legend.outside = TRUE)
# Map 2: Local Moran's I statistic
p2 <- tm_shape(merged_data) +
tm_polygons(col = "lmI", title = "Local Moran's I",
style = "quantile") +
tm_layout(legend.outside = TRUE)
# Map 3: Z-score for Local Moran's I
p3 <- tm_shape(merged_data) +
tm_polygons(col = "lmZ", title = "Z-score",
breaks = c(-Inf, 1.65, Inf)) +
tm_layout(legend.outside = TRUE)
# Map 4: P-value for Local Moran's I
p4 <- tm_shape(merged_data) +
tm_polygons(col = "lmp", title = "p-value",
breaks = c(-Inf, 0.05, Inf)) +
tm_layout(legend.outside = TRUE)
# Map 5: Getis-Ord Gi* statistic
p5 <- tm_shape(merged_data) +
tm_polygons(col = "gi", title = "Getis-Ord Gi*",
style = "quantile") +
tm_layout(legend.outside = TRUE)
# Map 6: Z-score for Getis-Ord Gi* statistic
p6 <- tm_shape(merged_data) +
tm_polygons(col = "zgi", title = "Z-score of Getis-Ord Gi*",
breaks = c(-Inf, 1.65, Inf))+
tm_layout(legend.outside = TRUE)
# Arrange the generated tmap objects into a single interactive display
# Arrange maps in a grid with 2 columns
tmap_arrange(p1, p2, p3, p4, p5, p6, ncol = 2)
The maps with the Gi* statistics helps identifying hot spots and cold spots. You can compare them with the results from the local Moran’s I test.
We can combine the Moran’s I scatterplot information and local Moran’s I statistic to identify the type of spatial clusters.
# Identifying Cluster Types
mp <- moran.plot(as.vector(scale(merged_data$dv)), nbw)
merged_data$quadrant <- NA
# high-high
merged_data[(mp$x >= 0 & mp$wx >= 0) & (merged_data$lmp <= 0.05), "quadrant"]<- 1
# low-low
merged_data[(mp$x <= 0 & mp$wx <= 0) & (merged_data$lmp <= 0.05), "quadrant"]<- 2
# high-low
merged_data[(mp$x >= 0 & mp$wx <= 0) & (merged_data$lmp <= 0.05), "quadrant"]<- 3
# low-high
merged_data[(mp$x <= 0 & mp$wx >= 0) & (merged_data$lmp <= 0.05), "quadrant"]<- 4
# non-significant
merged_data[(merged_data$lmp > 0.05), "quadrant"] <- 5
tm_shape(merged_data) + tm_fill(col = "quadrant", title = "",
breaks = c(1, 2, 3, 4, 5, 6),
palette = c("red", "blue", "lightpink", "skyblue2", "white"),
labels = c("High-High", "Low-Low", "High-Low",
"Low-High", "Non-significant")) +
tm_legend(text.size = 1) + tm_borders(alpha = 0.5) +
tm_layout(frame = FALSE, title = "Clusters") +
tm_layout(legend.outside = TRUE)
The map above shows the clusters based on local Moran’s I and quadrants of the Moran’s I scatter plot.
In this module, we have explored how to quantify and visualize spatial autocorrelation using global and local measures. These techniques can be used to find spatial patterns that can be used in further analysis such as Bayesian spatial models, which will be explored in the next module. You should now have the tools to assess and map the spatial structure in your own datasets.
Here is the code with the example of sanitation access for additional practice and reference.
# Load libraries
# `sf`: For handling spatial vector data (shapefiles).
library(sf)
# `ggplot2`: For creating static and publication-quality graphics.
library(ggplot2)
# `spdep`: For spatial analysis and spatial autocorrelation.
library(spdep)
# `tmap`: For creating thematic maps.
library(tmap)
# `haven`: For reading data files from various statistical packages (including .dta).
library(haven)
# `survey`: For complex survey analysis
library(survey)
# `RColorBrewer`: Provides color schemes for maps and charts.
library(RColorBrewer)
# Load shapefile
# Specify the file path to the shapefile representing the administrative boundaries of Somalia.
filepath<-"gadm41_SOM_1.shp"
# Read the shapefile using st_read() from the sf package.
somalia<-st_read(filepath)
# Print the names of the first-level administrative units (regions) in Somalia.
somalia$NAME_1
# Load DHS data
# Specify the file path to the DHS dataset (.dta file).
sanitationdata <- read_dta("sanitation_access.dta")
# Print the structure of the 'HV024' column, which likely contains region codes.
str(sanitationdata$Region)
# Calculate the mean solid fuel use by region
# Apply the function to subsets of the survey data using svyby
# Create a survey design object
library(survey)
options(survey.lonely.psu = "remove")
design <- svydesign(id = ~HV001, strata = ~STRU_CD, weights = ~HV005, data = sanitationdata,nest=TRUE)
result <- svyby(~dv, ~Region, FUN = svyciprop, design = design, vartype = NULL)
# Print the resulting data frame showing the mean fuel use per region.
print(result)
# Add region names to the result data frame
# Define a character vector with the names of the regions, ordered to match the codes in the result.
labels <- c("Awdal", "Woqooyi Galbeed", "Togdheer", "Sool", "Sanaag", "Bari", "Nugaal", "Mudug",
"Galguduud", "Hiiraan", "Shabeellaha Dhexe", "Banaadir", "Bay",
"Bakool", "Gedo", "Jubbada Hoose")
# Add a 'Region' column to the result data frame, mapping region names to region codes.
result$Region <- labels
# Sort the result by the 'Region' column
result <- result[order(result$Region), ]
# Print the resulting dataframe that is now labeled and sorted.
print(result)
# Merge the shapefile and result data frame
# Merge the shapefile data (`somalia`) with the calculated mean fuel use data (`result`).
merged_data <- merge(somalia, result, by.x = "NAME_1", by.y = "Region")
# Calculate centroids for label placement
# Calculate the centroids (geometric center points) of each polygon in the merged dataset.
centroid <- st_centroid(merged_data)
# Extract x-coordinates of centroids and add them as column named 'x_position'
merged_data$x_position <- centroid[, 1]
# Extract y-coordinates of centroids and add them as column named 'y_position'
merged_data$y_position <- centroid[, 2]
# Convert SpatialPolygonsDataFrame to sf object
# Convert merged_data which is a SpatialPolygonsDataFrame into an sf object for use in ggplot
merged_data_sf <- st_as_sf(merged_data)
# Define a custom color palette using RColorBrewer
colors <- brewer.pal(9, "YlOrRd")
# Plot the spatial data
# Initiate a ggplot object.
p1 <- ggplot() +
# Add polygon layer from merged_data_sf, fill based on the `dv` (fuel use) variable.
geom_sf(data = merged_data_sf, aes(fill = dv)) +
# Define a color gradient fill scale using the defined color palette.
scale_fill_gradientn(colors = colors, na.value = "grey") +
# Customize the legend title.
guides(fill = guide_legend(title = "Unimproved Sanitation Access")) +
# Add region name labels at the centroid location of each region.
geom_sf_text(data = merged_data_sf, aes(label = NAME_1),
cex = 3, hjust = 0.5, vjust = 0.5, color = "black", fontface = "bold") +
# Remove y axis label
ylab(label = "") +
# Remove x axis label
xlab(label = "") +
# Set the plot title.
ggtitle("Spatial Distribution of Unimproved Sanitatino Access among Households in Somalia") +
# Customize the plot theme.
theme(legend.text = element_text(size = 10),
plot.title = element_text(color = "blue", size = 10, face = "bold"),
strip.text.x = element_text(size = 10))
# Print the created ggplot object
print(p1)
# Exploratory Spatial Data Analysis
# Spatial Autocorrelation Analysis
# 1. Create spatial weights matrix: Queen contiguity
# Remove rows with NA values in the `dv` column (mean fuel use)
merged_data <- merged_data[!is.na(merged_data$dv), ]
# Create a neighborhood object based on queen contiguity between regions.
nb <- poly2nb(merged_data, queen = TRUE)
# Create a spatial weights list from the neighborhood object using the 'W' style (row-standardized weights).
nbw <- nb2listw(nb, style = "W")
# 2. Calculate Global Moran's I
# Calculate Global Moran's I statistic for the fuel use data.
gmoran <- moran.test(merged_data$dv, nbw, alternative = "greater")
# Print the output of the Moran's I test.
print(gmoran)
# Extract and print the Moran's I statistic.
print(gmoran[["estimate"]][["Moran I statistic"]])
# Extract and print the z-score of the test.
print(gmoran[["statistic"]])
# Extract and print the p-value of the test.
print(gmoran[["p.value"]])
# Interpretation:
# - If p-value < 0.05 (significance level), reject the null hypothesis of no spatial autocorrelation.
# - Conclusion: There is evidence of spatial autocorrelation in fuel use.
# Moran's for MC
# Conduct a Monte Carlo simulation for Moran's I to assess significance.
gmoranMC <- moran.mc(merged_data$dv, nbw, nsim = 999)
# Print the output of the Monte Carlo simulation
print(gmoranMC)
# Open a new graphics window.
windows()
# Create a histogram of simulated Moran's I values.
hist(gmoranMC$res)
# Add a vertical line at the observed Moran's I statistic.
abline(v = gmoranMC$statistic, col = "red")
# 3. Visualize with Moran's I scatterplot
# Generate a Moran's I scatterplot to visualize spatial autocorrelation.
moran.plot(merged_data$dv, nbw)
# 4. Calculate Local Moran's I
# Calculate Local Moran's I for each region.
lmoran <- localmoran(merged_data$dv, nbw,
alternative = "greater")
# Print the first few rows of the local Moran's I results
print(head(lmoran))
# 5. Add Local Moran's results to merged_data
# Add the local Moran's I statistic to the merged dataset.
merged_data$lmI <- lmoran[, "Ii"]
# Add the z-scores of local Moran's I to the merged dataset.
merged_data$lmZ <- lmoran[, "Z.Ii"]
# Add p-values of local Moran's I to the merged dataset.
merged_data$lmp <- lmoran[, "Pr(z > E(Ii))"]
# 6. Calculate Getis-Ord Gi*
# Calculate Getis-Ord Gi* statistic for each region.
local_gi <- localG(merged_data$dv, nbw)
# Add Getis-Ord Gi* statistic to the merged dataset
merged_data$gi <- local_gi
# Convert Getis-Ord Gi* results into z-score to use for plotting
# Calculate z-scores for the Getis-Ord Gi* statistic.
z_gi <- (local_gi - mean(local_gi)) / sd(local_gi)
# Add the z-scores of Gi* to the merged dataset.
merged_data$zgi <- z_gi
# Create maps using tmap
# Set tmap mode to "view" for interactive map viewing.
tmap_mode("view")
# Create thematic maps using tmap
# Map 1: Fuel Use by region
p1 <- tm_shape(merged_data) +
# Plot polygons filled based on `dv` (fuel use) using quantile breaks.
tm_polygons(col = "dv", title = "Fuel Use", style = "quantile") +
# Adjust layout so that the legend is outside of the plot.
tm_layout(legend.outside = TRUE)
# Map 2: Local Moran's I statistic
p2 <- tm_shape(merged_data) +
# Plot polygons filled based on the `lmI` variable using quantile breaks.
tm_polygons(col = "lmI", title = "Local Moran's I",
style = "quantile") +
# Adjust layout so that the legend is outside of the plot.
tm_layout(legend.outside = TRUE)
# Map 3: Z-score for Local Moran's I
p3 <- tm_shape(merged_data) +
# Plot polygons filled based on `lmZ` (z-scores) with specified breaks
tm_polygons(col = "lmZ", title = "Z-score",
breaks = c(-Inf, 1.65, Inf)) +
# Adjust layout so that the legend is outside of the plot.
tm_layout(legend.outside = TRUE)
# Map 4: P-value for Local Moran's I
p4 <- tm_shape(merged_data) +
# Plot polygons filled based on `lmp` (p-values) with specified breaks
tm_polygons(col = "lmp", title = "p-value",
breaks = c(-Inf, 0.05, Inf)) +
# Adjust layout so that the legend is outside of the plot.
tm_layout(legend.outside = TRUE)
# Map 5: Getis-Ord Gi* statistic
p5 <- tm_shape(merged_data) +
# Plot polygons filled based on `gi` (Getis-Ord Gi*) using quantile breaks
tm_polygons(col = "gi", title = "Getis-Ord Gi*",
style = "quantile") +
# Adjust layout so that the legend is outside of the plot.
tm_layout(legend.outside = TRUE)
# Map 6: Z-score for Getis-Ord Gi* statistic
p6 <- tm_shape(merged_data) +
# Plot polygons filled based on `zgi` (z-scores) with specified breaks
tm_polygons(col = "zgi", title = "Z-score of Getis-Ord Gi*",
breaks = c(-Inf, 1.65, Inf))+
# Adjust layout so that the legend is outside of the plot.
tm_layout(legend.outside = TRUE)
# Arrange the generated tmap objects into a single interactive display
# Arrange maps in a grid with 2 columns
tmap_arrange(p1, p2, p3, p4, p5, p6, ncol = 2)
arranged_map<-tmap_arrange(p1, p2, p3, p4, p5, p6, ncol = 2)
tmap_save(arranged_map, filename = "arranged_map.png", width=10, height=12, units = "in")
The module above explains the methods used to perform the analysis of spatial autocorrelation. It provides all the theory and the codes.