Module XI: Geostatistical Data Analysis

Introduction

In our previous modules, we’ve explored various ways to visualize and analyze disease patterns. Now, we’ll delve into the powerful world of geostatistical data and spatial interpolation. These techniques allow us to understand and predict the spread of phenomena (like diseases, environmental factors, etc.) across continuous landscapes, even where we don’t have direct measurements.

11.1 What are Geostatistical Data?

Geostatistical data consists of measurements of a spatially continuous phenomenon taken at specific locations. The key feature is the spatial continuity; we assume the phenomenon exists everywhere, not just at the sampled points.

Examples:

Disease prevalence at survey locations (e.g., % of children with malaria at different villages)
Mosquito density at sampling sites
Environmental factors like temperature, rainfall, or vegetation indices.

Why do we need this? Often, we want to understand and map disease risk for entire regions, not just where we sampled. Geostatistical methods help us “fill in the gaps.”

11.2 The Idea of Random Processes

Geostatistics is based on the idea that our data is a partial realization of a random process occurring across a region. A random process is a collection of spatially related random variables. Think of our data as “snapshots” of this landscape at specific places.

11.3 Key Concepts: Stationarity and Isotropy

To model spatial data effectively, we often assume the following:

Stationarity:

Strict Stationarity: The pattern of the phenomenon looks the same no matter where you move it in the study area.
Weak (Second-Order) Stationarity: A more realistic assumption: the average level of the phenomenon is consistent across the region, and how much nearby points are similar depends only on the distance between them. This involves two assumptions: the mean is constant across the region ( E[Z(s)] = μ ) and the covariance between two locations depends only on the distance between them (Cov(Z(s), Z(s+h)) = C(h) ).

Isotropy: The spatial relationship (e.g., correlation) depends only on distance and not on the direction between locations. If the spatial relationship varies by direction, we would have anisotropy.

11.4 Gaussian Random Fields (GRFs)

A powerful tool is the Gaussian Random Field (GRF). A GRF is a random process where the values at any set of locations follow a normal (Gaussian) distribution.

Why Gaussian? * Many real-world phenomena approximate a normal distribution. * Makes the models we use mathematically tractable.

11.5 Covariance and Correlation Functions

We use covariance functions (or correlation functions) to describe how spatial locations are related. Covariance measures how two variables change together. A high covariance between two locations suggests that the phenomenon is probably similar there.

Correlation is a standardized form of covariance, ranging from -1 to 1.

Matérn Family: A flexible family of correlation functions often used in geostatistics. This is defined by:

Corr(h) = 1/(2^(ν-1)Γ(ν)) (h/ϕ)^ν * K_ν(h/ϕ)

where:

h: distance between locations
ϕ: Parameter controlling how fast correlation decays with distance (related to the range)
ν: Smoothness parameter
K_ν is the modified Bessel function

Special cases: * Exponential: ν = 0.5 * Squared exponential or Gaussian: ν -> ∞

11.6 Simulating Random Fields

Let’s use the geoR package to simulate Gaussian random fields. We’ll generate realizations with different values for the range parameter (ϕ) of the Matérn correlation function, to show how spatial variability changes.

# Install necessary packages if not installed
# install.packages(c("geoR", "ggplot2", "sf"))

library(geoR)

## --------------------------------------------------------------
##  Analysis of Geostatistical Data
##  For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
##  geoR version 1.9-4 (built on 2024-02-14) is now loaded
## --------------------------------------------------------------

library(ggplot2)
library(sf)

## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

covmodel <- "matern"
sigma2 <- 1

# Plot covariance function
curve(cov.spatial(x, cov.pars = c(sigma2, 1),
                  cov.model = covmodel), lty = 1,
      from = 0, to = 1, ylim = c(0, 1), main = "Matérn",
      xlab = "distance", ylab = "Covariance(distance)")
curve(cov.spatial(x, cov.pars = c(sigma2, 0.2),
                  cov.model = covmodel), lty = 2, add = TRUE)
curve(cov.spatial(x, cov.pars = c(sigma2, 0.01),
                  cov.model = covmodel), lty = 3, add = TRUE)

legend(cex = 1.5, "topright", lty = c(1, 1, 2, 3),
       col = c("white", "black", "black", "black"),
       lwd = 2, bty = "n", inset = .01,
       c(expression(paste(sigma^2, " = 1 ")),
         expression(paste(phi, " = 1")),
         expression(paste(phi, " = 0.2")),
         expression(paste(phi, " = 0.01"))))

# Simulate Gaussian random field in a regular grid 32 X 32
sim1 <- grf(1024, grid = "reg",
            cov.model = covmodel, cov.pars = c(sigma2, 1))

## grf: generating grid  32  *  32  with  1024  points
## grf: process with  1  covariance structure(s)
## grf: nugget effect is: tausq= 0 
## grf: covariance model 1 is: exponential(sigmasq=1, phi=1)
## grf: decomposition algorithm used is:  cholesky 
## grf: End of simulation procedure. Number of realizations: 1

sim2 <- grf(1024, grid = "reg",
            cov.model = covmodel, cov.pars = c(sigma2, 0.2))

## grf: generating grid  32  *  32  with  1024  points
## grf: process with  1  covariance structure(s)
## grf: nugget effect is: tausq= 0 
## grf: covariance model 1 is: exponential(sigmasq=1, phi=0.2)
## grf: decomposition algorithm used is:  cholesky 
## grf: End of simulation procedure. Number of realizations: 1

sim3 <- grf(1024, grid = "reg",
            cov.model = covmodel, cov.pars = c(sigma2, 0.01))

## grf: generating grid  32  *  32  with  1024  points
## grf: process with  1  covariance structure(s)
## grf: nugget effect is: tausq= 0 
## grf: covariance model 1 is: exponential(sigmasq=1, phi=0.01)
## grf: decomposition algorithm used is:  cholesky 
## grf: End of simulation procedure. Number of realizations: 1

# Plot Gaussian random field
par(mfrow = c(1, 3), mar = c(2, 2, 2, 0.2))
image(sim1, main = expression(paste(phi, " = 1")), cex.main = 2,
      col = gray(seq(1, 0.1, l = 30)), xlab = "", ylab = "")
image(sim2, main = expression(paste(phi, " = 0.2")), cex.main = 2,
      col = gray(seq(1, 0.1, l = 30)), xlab = "", ylab = "")
image(sim3, main = expression(paste(phi, " = 0.01")), cex.main=2,
      col = gray(seq(1, 0.1, l = 30)), xlab = "", ylab = "")

The figure above shows that when ϕ (related to range) is small, the generated field exhibits less spatial correlation (patchier), while a higher value (ϕ) has a higher spatial correlation and smoother maps.

11.7 The Variogram

The variogram (or semivariogram) is another crucial tool to explore spatial data. The variogram is defined as:

Var[Z(s_i) - Z(s_j)] = 2γ(s_i - s_j)

Where: * Z(s_i) and Z(s_j) are the values at locations s_i and s_j respectively. * γ(h) is the semivariogram.

Under the assumption of intrinsic stationarity (constant mean), this formula can be written as:

2γ(h) = E[(Z(s + h) - Z(s))^2]

Where h is the lag distance.

Empirical Variogram: A graph that plots the average of the squared differences in values between pairs of locations against their distances. The semivariogram is estimated as follows: 2^γ(h)=1/|N(h)| * ∑(Z(s_i)-Z(s_j))^2

where |N(h)| is the number of distinct pairs of points with a lag vector h.

Key Features of the Variogram: * Range: The distance at which spatial correlation effectively disappears. * Sill: The plateau value of the variogram. * Nugget effect: The discontinuity or “jump” at the origin, usually due to measurement error or very local variation.

11.8 Example: Rainfall Data in Paraná, Brazil

Let’s look at a practical example using the parana dataset, which contains rainfall measurements in Paraná state, Brazil.

First, we’ll create a map of the rainfall data:

library(geoR)
library(ggplot2)
library(sf)
d <- st_as_sf(data.frame(x = parana$coords[, 1],
                         y = parana$coords[, 2],
                         value = parana$data),
              coords = c("x", "y"))

ggplot(d) + geom_sf(aes(color = value), size = 2) +
scale_color_gradient(low = "blue", high = "orange") +
geom_path(data = data.frame(parana$border), aes(east, north)) +
theme_bw()

Now, we will estimate and plot the empirical variogram using the data. We will show the variogram cloud (scatterplot of all pairs) and the averaged empirical variogram.

plot(variog(coords = st_coordinates(d), data = d$value,
            option = "cloud", max.dist = 400), main="Variogram Cloud")

## variog: computing omnidirectional variogram

plot(variog(parana, max.dist = 400), main= "Empirical Variogram")

## variog: computing omnidirectional variogram

The left plot is the variogram cloud, while the right plot is the empirical variogram, calculated by averaging the variogram cloud values. Both plots depict how rainfall differences change with distance. The empirical variogram plot will be used to model spatial dependencies.

11.9 Summary and Preview

In this module, you’ve learned:

Geostatistical data is spatially continuous, and allows us to predict values where we don’t have measurements.
We assume our data is generated by a random spatial process.
Stationarity and isotropy help to simplify models.
Gaussian random fields are commonly used to describe spatial patterns in disease.
The variogram is a tool to explore spatial patterns and understand the range of spatial correlation.

In the next modules, we will see how these ideas will allow us to make spatial predictions using the method called Kriging.

Homework Exercise: Spatial Analysis in Somalia

Using what you learned in this module, perform a similar analysis using a hypothetical dataset for Somalia:

Create a hypothetical dataset for Somalia. This should include:
- Spatial coordinates (longitude and latitude)
- A variable of interest (e.g., simulated disease prevalence, rainfall, etc.). You can use a simulation procedure like the one used in the previous sections.
- Consider an area that resembles Somalia’s geographical area.
Create a map of your hypothetical variable in Somalia.
Calculate and plot the empirical variogram for your dataset. Show both the cloud and averaged variograms.
Describe the variogram:
- What do you observe in terms of spatial structure (range, sill, nugget effect)?
- Does the variogram suggest there is spatial autocorrelation?
Write a short report summarizing your findings.

This exercise will give you practical experience applying the geostatistical concepts learned in this module.

Important Note for Learners:

This module covers a lot of new terms and ideas. Take your time, and review the definitions and examples carefully.
Try the code examples to see how different parameters affect the results.
Pay attention to the assumptions underlying these methods (stationarity, isotropy) and think about whether they’re appropriate for your own data. ```