This document explores fundamental concepts in spatial data analysis using R. We’ll work with simulated datasets to understand key ideas like spatial correlation and variograms.
knitr::opts_chunk$set(echo = TRUE)
library(sp)
## Warning: package 'sp' was built under R version 4.3.3
library(gstat)
## Warning: package 'gstat' was built under R version 4.3.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Install and load necessary packages
packages <- c("sp", "gstat", "ggplot2", "dplyr", "viridis", "spdep")
installed_packages <- rownames(installed.packages())
for (pkg in packages) {
if (!pkg %in% installed_packages) {
install.packages(pkg, dependencies = TRUE)
}
library(pkg, character.only = TRUE)
}
## Warning: package 'viridis' was built under R version 4.3.3
## Loading required package: viridisLite
## Warning: package 'spdep' was built under R version 4.3.3
## Loading required package: spData
## Warning: package 'spData' was built under R version 4.3.3
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
## Loading required package: sf
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
set.seed(123)
n <- 100 # number of points
x <- runif(n, 0, 10)
y <- runif(n, 0, 10)
z <- sin(x) + cos(y) + rnorm(n, 0, 0.1) # spatially correlated data
data <- data.frame(x = x, y = y, z = z)
coordinates(data) <- ~x+y
Let’s visualize the simulated spatial data.
data_df <- as.data.frame(data) # Convert to data frame for ggplot
ggplot(data_df, aes(x = x, y = y, color = z)) +
geom_point() +
scale_color_viridis_c() +
theme_minimal() +
labs(title = "Simulated Spatial Data",
x = "X coordinate",
y = "Y coordinate",
color = "Z value")
We use Moran’s I to measure spatial autocorrelation.
Moran’s I is a statistical measure that quantifies the degree to which a variable is clustered, dispersed, or randomly distributed in space. It assesses the similarity of values in neighboring locations.
Moran’s I is calculated as:
\[ I = \frac{N}{S_0} \cdot \frac{\sum_{i} \sum_{j} w_{ij} (z_i - \bar{z})(z_j - \bar{z})}{\sum_{i} (z_i - \bar{z})^2} \]
Where:
The magnitude of \(I\) reflects the strength of spatial autocorrelation:
To assess the statistical significance of Moran’s I, you need to:
coords <- as.matrix(data@coords)
nb <- knn2nb(knearneigh(coords, k = 4))
lw <- nb2listw(nb, style = "W")
moran_test <- moran.test(data$z, lw)
moran_test
##
## Moran I test under randomisation
##
## data: data$z
## weights: lw
##
## Moran I statistic standard deviate = 11.405, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.738644364 -0.010101010 0.004310062
We calculate and plot the empirical variogram of the data. The Empirical Variogram: Quantifying Spatial Dependence The empirical variogram is a fundamental tool in geostatistics for quantifying the degree of spatial dependence or dissimilarity between data points as a function of distance. It provides a visual and quantitative representation of how a variable’s values change with increasing separation in space.
Mathematical Formulation The empirical variogram is calculated as:
\[ \hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} [Z(x_i) - Z(x_i + h)]^2 \]
The variogram is a fundamental tool in geostatistics for quantifying the degree of spatial dependence or dissimilarity between data points as a function of their separation distance.
The variogram, denoted as \(\gamma(h)\), is defined as half the average squared difference between paired data values separated by a distance \(h\):
\[ \gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} [Z(x_i) - Z(x_i + h)]^2 \]
where:
The empirical variogram is an estimate of the theoretical variogram calculated from the available data. It involves the following steps:
vgm_emp <- variogram(z ~ 1, data)
plot(vgm_emp, main = "Empirical Variogram")
A scatterplot where the x-axis represents the distances (lags) between pairs of data points, and the y-axis represents the corresponding semivariances (half the average squared difference between values at those distances). Theoretical Basis of the Empirical Variogram
The empirical variogram is a non-parametric estimate of the underlying spatial correlation structure in your data. Here’s the theoretical underpinning:
Spatial Dependence: Geostatistics assumes that things that are close together tend to be more similar than things that are far apart (Tobler’s First Law of Geography).
This concept is known as spatial dependence or spatial autocorrelation.
The variogram quantifies this spatial dependence by measuring the dissimilarity between values as a function of distance. Specifically, for a given lag distance (h), the variogram calculates the average squared difference between all pairs of data points separated by that distance.
Empirical Estimation: The empirical variogram is a sample-based estimate of this theoretical function. It is calculated by binning the data into distance classes and computing the average squared difference within each bin. Interpretation of the Plot
Nugget Effect: The y-intercept of the variogram represents the nugget effect, which measures the microscale variation or measurement error in your data.
Sill: The plateau reached by the variogram is the sill, which represents the overall variance of the process.
Range: The distance at which the variogram levels off (approximately) is the range, which indicates the distance beyond which values are no longer spatially correlated. Important Considerations
Binning Choice: The choice of distance bins and the number of pairs within each bin can affect the shape of the empirical variogram. It’s important to choose appropriate binning parameters for your data.
Model Fitting: The empirical variogram is often used as a basis for fitting a theoretical variogram model, which can then be used for kriging (spatial prediction) and other geostatistical analyses.
We fit a theoretical variogram model to the empirical variogram.
vgm_model <- fit.variogram(vgm_emp, model = vgm("Sph", nugget = 0, range = 1, psill = 1))
## Warning in fit.variogram(object, model, fit.sills = fit.sills, fit.ranges =
## fit.ranges, : No convergence after 200 iterations: try different initial
## values?
plot(vgm_emp, model = vgm_model, main = "Fitted Variogram Model")
vgm_model
## model psill range
## 1 Nug 0.00000 0.00000
## 2 Sph 19.95329 97.46547
Finally, we use kriging for spatial interpolation.
Kriging provides not only predicted values but also an estimate of prediction uncertainty (kriging variance).
It considers both the distance and the spatial correlation structure when making predictions, making it more powerful than simpler interpolation methods.
Kriging is a geostatistical method for spatial interpolation that aims to provide the of an unobserved value at a given location. It leverages the spatial correlation structure captured in the variogram to determine optimal weights for neighboring observations.
Let \(Z(\mathbf{x})\) be the random variable representing the value of interest (e.g., temperature) at location \(\mathbf{x}\). We want to predict the value at an unobserved location \(\mathbf{x}_0\), denoted as \(\hat{Z}(\mathbf{x}_0)\). Kriging does this using a linear combination of the observed values \(Z(\mathbf{x}_i)\) at neighboring locations \(\mathbf{x}_i\):
\[\begin{equation} \hat{Z}(\mathbf{x}_0) = \sum_{i=1}^n \lambda_i Z(\mathbf{x}_i) \end{equation}\]
where:
Kriging weights are determined by optimizing two conditions:
The expected value of the prediction error should be zero. \[\begin{equation} E[\hat{Z}(\mathbf{x}_0) - Z(\mathbf{x}_0)] = 0 \end{equation}\]
The variance of the prediction error should be minimized. \[\begin{equation} Var[\hat{Z}(\mathbf{x}_0) - Z(\mathbf{x}_0)] = \min \end{equation}\]
These conditions lead to a system of linear equations (the ) that can be solved to obtain the optimal kriging weights. The variogram model (or covariance function) provides the necessary information about spatial correlation for solving this system.
The kriging variance, denoted as \(\sigma^2_k(\mathbf{x}_0)\), quantifies the uncertainty of the prediction at \(\mathbf{x}_0\). It is calculated as:
\[\begin{equation} \sigma^2_k(\mathbf{x}_0) = Var[\hat{Z}(\mathbf{x}_0) - Z(\mathbf{x}_0)] = C(0) - \sum_{i=1}^n \lambda_i C(\mathbf{x}_0 - \mathbf{x}_i) \end{equation}\]
where:
# Define a grid for prediction
grd <- expand.grid(x = seq(0, 10, by = 0.1), y = seq(0, 10, by = 0.1))
coordinates(grd) <- ~x+y
gridded(grd) <- TRUE
# Perform ordinary kriging
kriged <- krige(z ~ 1, data, grd, model = vgm_model)
## [using ordinary kriging]
# Convert to a data frame for plotting
kriged_df <- as.data.frame(kriged)
names(kriged_df)[3] <- "predicted"
# Plot the kriging result
ggplot(kriged_df, aes(x = x, y = y, fill = predicted)) +
geom_raster() +
scale_fill_viridis_c() +
theme_minimal() +
labs(title = "Kriging Interpolation",
x = "X coordinate",
y = "Y coordinate",
fill = "Predicted Z")
First, we simulate some spatial data using a more complex spatial model.
set.seed(123)
n <- 200 # number of points
x <- runif(n, 0, 10)
y <- runif(n, 0, 10)
z <- sin(x) * cos(y) + rnorm(n, 0, 0.1) # spatially correlated data with interaction term
data <- data.frame(x = x, y = y, z = z)
coordinates(data) <- ~x+y
Let’s visualize the simulated spatial data using enhanced plotting techniques.
data_df <- as.data.frame(data) # Convert to data frame for ggplot
ggplot(data_df, aes(x = x, y = y, color = z)) +
geom_point(size = 3) +
scale_color_viridis_c(option = "C") +
theme_minimal(base_size = 15) +
labs(title = "Simulated Spatial Data",
x = "X Coordinate",
y = "Y Coordinate",
color = "Z Value") +
theme(legend.position = "right")
We use Moran’s I to measure spatial autocorrelation.
coords <- as.matrix(data@coords)
nb <- knn2nb(knearneigh(coords, k = 4))
lw <- nb2listw(nb, style = "W")
moran_test <- moran.test(data$z, lw)
moran_test
##
## Moran I test under randomisation
##
## data: data$z
## weights: lw
##
## Moran I statistic standard deviate = 16.853, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.782143740 -0.005025126 0.002181511
Let’s visualize the spatial correlation using a correlogram.
correlog <- sp.correlogram(nb, data$z, order = 10, method = "I", style = "W")
plot(correlog, main = "Spatial Correlogram of Z Values")
We calculate and plot the empirical variogram of the data
vgm_emp <- variogram(z ~ 1, data, cloud = FALSE, cutoff = 5, width = 0.5)
plot(vgm_emp, main = "Empirical Variogram")
We fit a theoretical variogram model to the empirical variogram.
vgm_model <- fit.variogram(vgm_emp, model = vgm("Sph", nugget = 0.1, range = 1, psill = 0.9))
plot(vgm_emp, model = vgm_model, main = "Fitted Variogram Model")
vgm_model
## model psill range
## 1 Nug 0.0000000 0.000000
## 2 Sph 0.3067406 3.218348
Finally, we use kriging for spatial interpolation and visualize the result
# Define a grid for prediction
grd <- expand.grid(x = seq(0, 10, by = 0.1), y = seq(0, 10, by = 0.1))
coordinates(grd) <- ~x+y
gridded(grd) <- TRUE
# Perform ordinary kriging
kriged <- krige(z ~ 1, data, grd, model = vgm_model)
## [using ordinary kriging]
# Convert to a data frame for plotting
kriged_df <- as.data.frame(kriged)
names(kriged_df)[3] <- "predicted"
# Plot the kriging result
ggplot(kriged_df, aes(x = x, y = y, fill = predicted)) +
geom_raster() +
scale_fill_viridis_c(option = "C") +
theme_minimal(base_size = 15) +
labs(title = "Kriging Interpolation",
x = "X Coordinate",
y = "Y Coordinate",
fill = "Predicted Z") +
theme(legend.position = "right")
In this document, we explored advanced concepts in spatial data analysis using R. We simulated spatial data, visualized it using enhanced plots, and performed spatial correlation analysis, variogram estimation, and kriging interpolation.
This flowchart and its explanation provide an overview of the steps and logical flow in the spatial data analysis process using R. Each step builds on the previous ones to create a comprehensive analysis and visualization of spatial data.