Geostatistics with R: Spatial Data Fundamentals with R

Introduction

This document explores fundamental concepts in spatial data analysis using R. We’ll work with simulated datasets to understand key ideas like spatial correlation and variograms.

Setup

knitr::opts_chunk$set(echo = TRUE)
library(sp)

## Warning: package 'sp' was built under R version 4.3.3

library(gstat)

## Warning: package 'gstat' was built under R version 4.3.3

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Install and load necessary packages
packages <- c("sp", "gstat", "ggplot2", "dplyr", "viridis", "spdep")
installed_packages <- rownames(installed.packages())

for (pkg in packages) {
  if (!pkg %in% installed_packages) {
    install.packages(pkg, dependencies = TRUE)
  }
  library(pkg, character.only = TRUE)
}

## Warning: package 'viridis' was built under R version 4.3.3

## Loading required package: viridisLite

## Warning: package 'spdep' was built under R version 4.3.3

## Loading required package: spData

## Warning: package 'spData' was built under R version 4.3.3

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

## Loading required package: sf

## Warning: package 'sf' was built under R version 4.3.3

## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

set.seed(123)
n <- 100  # number of points
x <- runif(n, 0, 10)
y <- runif(n, 0, 10)
z <- sin(x) + cos(y) + rnorm(n, 0, 0.1)  # spatially correlated data

data <- data.frame(x = x, y = y, z = z)
coordinates(data) <- ~x+y

Visualizing the Data

Let’s visualize the simulated spatial data.

data_df <- as.data.frame(data)  # Convert to data frame for ggplot

ggplot(data_df, aes(x = x, y = y, color = z)) +
  geom_point() +
  scale_color_viridis_c() +
  theme_minimal() +
  labs(title = "Simulated Spatial Data",
       x = "X coordinate",
       y = "Y coordinate",
       color = "Z value")

Spatial Correlation

We use Moran’s I to measure spatial autocorrelation.

Moran’s I: A Measure of Spatial Autocorrelation

Moran’s I is a statistical measure that quantifies the degree to which a variable is clustered, dispersed, or randomly distributed in space. It assesses the similarity of values in neighboring locations.

Mathematical Formulation

Moran’s I is calculated as:

\[ I = \frac{N}{S_0} \cdot \frac{\sum_{i} \sum_{j} w_{ij} (z_i - \bar{z})(z_j - \bar{z})}{\sum_{i} (z_i - \bar{z})^2} \]

Where:

\(N\) is the total number of spatial units (e.g., counties, census tracts).
\(z_i\) is the value of the variable of interest (e.g., crime rate) in spatial unit \(i\).
\(\bar{z}\) is the mean of the variable across all spatial units.
\(w_{ij}\) is the spatial weight between spatial units \(i\) and \(j\). This is often based on adjacency (1 if neighbors, 0 otherwise) or distance.
\(S_0\) is the sum of all spatial weights: \[ S_0 = \sum_{i} \sum_{j} w_{ij} \]

Interpretation

\(I > 0\): Indicates positive spatial autocorrelation (clustering of similar values).
\(I < 0\): Indicates negative spatial autocorrelation (clustering of dissimilar values).
\(I = 0\): Indicates no spatial autocorrelation (random distribution).

The magnitude of \(I\) reflects the strength of spatial autocorrelation:

Closer to 1: Stronger positive autocorrelation.
Closer to -1: Stronger negative autocorrelation.
Closer to 0: Weaker spatial autocorrelation.

Statistical Significance

To assess the statistical significance of Moran’s I, you need to:

Calculate the expected value (\(E(I)\)) and variance (\(Var(I)\)) of Moran’s I under the null hypothesis of no spatial autocorrelation.
Calculate a Z-score: \[ Z = \frac{I - E(I)}{\sqrt{Var(I)}} \]
Compare the Z-score to a standard normal distribution. A large Z-score (positive or negative) indicates that the observed Moran’s I is unlikely to occur by chance alone, suggesting statistically significant spatial autocorrelation.

coords <- as.matrix(data@coords)
nb <- knn2nb(knearneigh(coords, k = 4))
lw <- nb2listw(nb, style = "W")

moran_test <- moran.test(data$z, lw)
moran_test

## 
##  Moran I test under randomisation
## 
## data:  data$z  
## weights: lw    
## 
## Moran I statistic standard deviate = 11.405, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.738644364      -0.010101010       0.004310062

Variogram

We calculate and plot the empirical variogram of the data. The Empirical Variogram: Quantifying Spatial Dependence The empirical variogram is a fundamental tool in geostatistics for quantifying the degree of spatial dependence or dissimilarity between data points as a function of distance. It provides a visual and quantitative representation of how a variable’s values change with increasing separation in space.

Mathematical Formulation The empirical variogram is calculated as:

\[ \hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} [Z(x_i) - Z(x_i + h)]^2 \]

The variogram is a fundamental tool in geostatistics for quantifying the degree of spatial dependence or dissimilarity between data points as a function of their separation distance.

The variogram, denoted as \(\gamma(h)\), is defined as half the average squared difference between paired data values separated by a distance \(h\):

\[ \gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} [Z(x_i) - Z(x_i + h)]^2 \]

where:

\(h\) is the separation distance (lag) between pairs of points.
\(Z(x_i)\) is the value of the variable of interest at location \(x_i\).
\(N(h)\) is the number of pairs of points separated by distance \(h\).

The empirical variogram is an estimate of the theoretical variogram calculated from the available data. It involves the following steps:

Group pairs of data points into distance classes (bins) based on their separation distance.
For each bin, compute the average squared difference between the paired values.
Plot the average squared differences (semivariances) against the corresponding distances (lags).

The empirical variogram plot reveals important characteristics of the spatial structure of the data: The empirical variogram is typically fitted with a theoretical variogram model to describe the spatial structure more smoothly and allow for predictions at unsampled locations. Common models include:

vgm_emp <- variogram(z ~ 1, data)
plot(vgm_emp, main = "Empirical Variogram")

A scatterplot where the x-axis represents the distances (lags) between pairs of data points, and the y-axis represents the corresponding semivariances (half the average squared difference between values at those distances). Theoretical Basis of the Empirical Variogram

The empirical variogram is a non-parametric estimate of the underlying spatial correlation structure in your data. Here’s the theoretical underpinning:

Spatial Dependence: Geostatistics assumes that things that are close together tend to be more similar than things that are far apart (Tobler’s First Law of Geography).

This concept is known as spatial dependence or spatial autocorrelation.

Variogram as a Measure of Dissimilarity:

The variogram quantifies this spatial dependence by measuring the dissimilarity between values as a function of distance. Specifically, for a given lag distance (h), the variogram calculates the average squared difference between all pairs of data points separated by that distance.

Empirical Estimation: The empirical variogram is a sample-based estimate of this theoretical function. It is calculated by binning the data into distance classes and computing the average squared difference within each bin. Interpretation of the Plot

Nugget Effect: The y-intercept of the variogram represents the nugget effect, which measures the microscale variation or measurement error in your data.

Sill: The plateau reached by the variogram is the sill, which represents the overall variance of the process.

Range: The distance at which the variogram levels off (approximately) is the range, which indicates the distance beyond which values are no longer spatially correlated. Important Considerations

Binning Choice: The choice of distance bins and the number of pairs within each bin can affect the shape of the empirical variogram. It’s important to choose appropriate binning parameters for your data.

Model Fitting: The empirical variogram is often used as a basis for fitting a theoretical variogram model, which can then be used for kriging (spatial prediction) and other geostatistical analyses.

We fit a theoretical variogram model to the empirical variogram.

vgm_model <- fit.variogram(vgm_emp, model = vgm("Sph", nugget = 0, range = 1, psill = 1))

## Warning in fit.variogram(object, model, fit.sills = fit.sills, fit.ranges =
## fit.ranges, : No convergence after 200 iterations: try different initial
## values?

plot(vgm_emp, model = vgm_model, main = "Fitted Variogram Model")

vgm_model

##   model    psill    range
## 1   Nug  0.00000  0.00000
## 2   Sph 19.95329 97.46547

Finally, we use kriging for spatial interpolation.

Why Kriging?

Kriging provides not only predicted values but also an estimate of prediction uncertainty (kriging variance).

It considers both the distance and the spatial correlation structure when making predictions, making it more powerful than simpler interpolation methods.

Kriging is a geostatistical method for spatial interpolation that aims to provide the of an unobserved value at a given location. It leverages the spatial correlation structure captured in the variogram to determine optimal weights for neighboring observations.

Let \(Z(\mathbf{x})\) be the random variable representing the value of interest (e.g., temperature) at location \(\mathbf{x}\). We want to predict the value at an unobserved location \(\mathbf{x}_0\), denoted as \(\hat{Z}(\mathbf{x}_0)\). Kriging does this using a linear combination of the observed values \(Z(\mathbf{x}_i)\) at neighboring locations \(\mathbf{x}_i\):

\[\begin{equation} \hat{Z}(\mathbf{x}_0) = \sum_{i=1}^n \lambda_i Z(\mathbf{x}_i) \end{equation}\]

where:

\(\lambda_i\) are the assigned to each observed value.
\(n\) is the number of neighboring observations used for prediction.

Kriging weights are determined by optimizing two conditions:

The expected value of the prediction error should be zero. \[\begin{equation} E[\hat{Z}(\mathbf{x}_0) - Z(\mathbf{x}_0)] = 0 \end{equation}\]
The variance of the prediction error should be minimized. \[\begin{equation} Var[\hat{Z}(\mathbf{x}_0) - Z(\mathbf{x}_0)] = \min \end{equation}\]

These conditions lead to a system of linear equations (the ) that can be solved to obtain the optimal kriging weights. The variogram model (or covariance function) provides the necessary information about spatial correlation for solving this system.

The kriging variance, denoted as \(\sigma^2_k(\mathbf{x}_0)\), quantifies the uncertainty of the prediction at \(\mathbf{x}_0\). It is calculated as:

\[\begin{equation} \sigma^2_k(\mathbf{x}_0) = Var[\hat{Z}(\mathbf{x}_0) - Z(\mathbf{x}_0)] = C(0) - \sum_{i=1}^n \lambda_i C(\mathbf{x}_0 - \mathbf{x}_i) \end{equation}\]

where:

\(C(\mathbf{h})\) is the covariance function at lag distance \(\mathbf{h}\), which can be derived from the variogram model.

# Define a grid for prediction
grd <- expand.grid(x = seq(0, 10, by = 0.1), y = seq(0, 10, by = 0.1))
coordinates(grd) <- ~x+y
gridded(grd) <- TRUE

# Perform ordinary kriging
kriged <- krige(z ~ 1, data, grd, model = vgm_model)

## [using ordinary kriging]

# Convert to a data frame for plotting
kriged_df <- as.data.frame(kriged)
names(kriged_df)[3] <- "predicted"

# Plot the kriging result
ggplot(kriged_df, aes(x = x, y = y, fill = predicted)) +
  geom_raster() +
  scale_fill_viridis_c() +
  theme_minimal() +
  labs(title = "Kriging Interpolation",
       x = "X coordinate",
       y = "Y coordinate",
       fill = "Predicted Z")

Simulating Spatial Data

First, we simulate some spatial data using a more complex spatial model.

set.seed(123)
n <- 200  # number of points
x <- runif(n, 0, 10)
y <- runif(n, 0, 10)
z <- sin(x) * cos(y) + rnorm(n, 0, 0.1)  # spatially correlated data with interaction term

data <- data.frame(x = x, y = y, z = z)
coordinates(data) <- ~x+y

Visualizing the Data

Let’s visualize the simulated spatial data using enhanced plotting techniques.

data_df <- as.data.frame(data)  # Convert to data frame for ggplot

ggplot(data_df, aes(x = x, y = y, color = z)) +
  geom_point(size = 3) +
  scale_color_viridis_c(option = "C") +
  theme_minimal(base_size = 15) +
  labs(title = "Simulated Spatial Data",
       x = "X Coordinate",
       y = "Y Coordinate",
       color = "Z Value") +
  theme(legend.position = "right")

Spatial Correlation

We use Moran’s I to measure spatial autocorrelation.

coords <- as.matrix(data@coords)
nb <- knn2nb(knearneigh(coords, k = 4))
lw <- nb2listw(nb, style = "W")

moran_test <- moran.test(data$z, lw)
moran_test

## 
##  Moran I test under randomisation
## 
## data:  data$z  
## weights: lw    
## 
## Moran I statistic standard deviate = 16.853, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.782143740      -0.005025126       0.002181511

Visualizing Spatial Correlation

Let’s visualize the spatial correlation using a correlogram.

correlog <- sp.correlogram(nb, data$z, order = 10, method = "I", style = "W")
plot(correlog, main = "Spatial Correlogram of Z Values")

Variogram

We calculate and plot the empirical variogram of the data

vgm_emp <- variogram(z ~ 1, data, cloud = FALSE, cutoff = 5, width = 0.5)
plot(vgm_emp, main = "Empirical Variogram")

Fitting a Variogram Model

We fit a theoretical variogram model to the empirical variogram.

vgm_model <- fit.variogram(vgm_emp, model = vgm("Sph", nugget = 0.1, range = 1, psill = 0.9))
plot(vgm_emp, model = vgm_model, main = "Fitted Variogram Model")

vgm_model

##   model     psill    range
## 1   Nug 0.0000000 0.000000
## 2   Sph 0.3067406 3.218348

Kriging

Finally, we use kriging for spatial interpolation and visualize the result

# Define a grid for prediction
grd <- expand.grid(x = seq(0, 10, by = 0.1), y = seq(0, 10, by = 0.1))
coordinates(grd) <- ~x+y
gridded(grd) <- TRUE

# Perform ordinary kriging
kriged <- krige(z ~ 1, data, grd, model = vgm_model)

## [using ordinary kriging]

# Convert to a data frame for plotting
kriged_df <- as.data.frame(kriged)
names(kriged_df)[3] <- "predicted"

# Plot the kriging result
ggplot(kriged_df, aes(x = x, y = y, fill = predicted)) +
  geom_raster() +
  scale_fill_viridis_c(option = "C") +
  theme_minimal(base_size = 15) +
  labs(title = "Kriging Interpolation",
       x = "X Coordinate",
       y = "Y Coordinate",
       fill = "Predicted Z") +
  theme(legend.position = "right")

In this document, we explored advanced concepts in spatial data analysis using R. We simulated spatial data, visualized it using enhanced plots, and performed spatial correlation analysis, variogram estimation, and kriging interpolation.

Explanation of Flowchart Steps

Simulating Spatial Data
- Generate Random Coordinates: Generate random spatial coordinates within a specified range.
- Generate Spatially Correlated Values: Create spatially correlated values using a mathematical model.
- Create Data Frame and Set Coordinates: Combine the coordinates and values into a data frame and set the coordinates for spatial analysis.
Visualizing the Data
- Convert to Data Frame for ggplot: Convert the spatial data object to a data frame format compatible with ggplot2.
- Plot Simulated Data: Create a scatter plot of the simulated spatial data using ggplot2 with enhanced aesthetics.
Spatial Correlation Analysis
- Calculate Spatial Neighbors: Determine spatial neighbors using the k-nearest neighbors method.
- Create Spatial Weights: Generate spatial weights based on the spatial neighbors.
- Calculate Moran’s I: Calculate Moran’s I statistic to measure spatial autocorrelation.
- Plot Correlogram: Visualize the spatial correlation with a correlogram.
Variogram Analysis
- Calculate Empirical Variogram: Compute the empirical variogram to describe spatial dependence.
- Plot Empirical Variogram: Visualize the empirical variogram.
- Fit Theoretical Variogram Model: Fit a theoretical variogram model to the empirical variogram data.
- Plot Fitted Variogram Model: Visualize the fitted variogram model alongside the empirical variogram.
Kriging Interpolation
- Define Prediction Grid: Define a grid over the spatial domain for prediction.
- Perform Ordinary Kriging: Use ordinary kriging to interpolate the spatial data over the prediction grid.
- Convert Kriging Result to Data Frame: Convert the kriging result to a data frame format for plotting.
- Plot Kriging Result: Create a raster plot of the kriging interpolation results using ggplot2.

This flowchart and its explanation provide an overview of the steps and logical flow in the spatial data analysis process using R. Each step builds on the previous ones to create a comprehensive analysis and visualization of spatial data.

Geostatistics with R: Spatial Data Fundamentals with R

Dr. Debashis Chatterjee

5/28/2024

Introduction

Setup

Visualizing the Data

Spatial Correlation

Moran’s I: A Measure of Spatial Autocorrelation

Mathematical Formulation

Interpretation

Statistical Significance

Variogram

Variogram as a Measure of Dissimilarity:

Why Kriging?

Simulating Spatial Data

Visualizing the Data

Spatial Correlation

Visualizing Spatial Correlation

Variogram

Fitting a Variogram Model

Kriging

Explanation of Flowchart Steps