library(MASS)
library(terra)
library(sf)Spatial Data Simulation with the Multivariate Normal Distribution
1 Overview
This tutorial shows how to simulate spatially structured data using ideas from the multivariate normal (MVN) distribution. We will:
- simulate correlated multivariate data with
MASS::mvrnorm() - create a spatial grid with spatial packages:
terraandsf - simulate a spatially autocorrelated landscape
- sample points from that landscape
- generate additional covariates for later modeling
The key conceptual link is that a spatial surface can be treated as a multivariate random vector, where each cell in the landscape is one element of the vector. The covariance matrix then controls how strongly values at different locations co-vary.
2 Packages
3 The multivariate normal distribution
A multivariate normal distribution is defined by a mean vector, \(\boldsymbol{\mu}\), and a variance-covariance matrix, \(\boldsymbol{\Sigma}\):
\[ \mathbf{Y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]
For a 3-variable example,
\[ \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{bmatrix} \]
- The mean vector \(\boldsymbol{\mu}\) sets the expected value of each variable.
- The diagonal elements of \(\boldsymbol{\Sigma}\) are variances.
- The off-diagonal elements of \(\boldsymbol{\Sigma}\) are covariances, which determine how variables move together.
4 Simulating a small multivariate normal example
We begin with a simple 3-dimensional MVN example.
mu_vec <- c(-5, 0.2, 1)
covar_mat <- matrix(
c(1.0, 0.7, 0.3,
0.7, 1.0, 0.4,
0.3, 0.4, 1.0),
nrow = 3,
byrow = TRUE
)
one_draw <- MASS::mvrnorm(n = 1, mu = mu_vec, Sigma = covar_mat)
one_draw[1] -5.4715156 -1.3350612 0.8062145
Now simulate many observations:
set.seed(123)
try1 <- MASS::mvrnorm(n = 100, mu = mu_vec, Sigma = covar_mat)
pairs(try1, main = "Simulated 3-variable MVN data")Exercise for students: What would you modify in covar_mat to make the relationship between two variables tighter?
4.1 Interpretation
If you want two variables to be more tightly correlated, you would generally increase the magnitude of the corresponding covariance (or correlation), while keeping the covariance matrix valid (positive definite if you want to use linear algebra jargon).
5 Connecting MVN ideas to spatial simulation
A spatial raster with \(n\) cells can be viewed as an \(n\)-dimensional random vector:
\[ \mathbf{X} = (X_1, X_2, \dots, X_n)^T \]
where each \(X_i\) is the value at one location on the landscape.
If we assume
\[ \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]
then the covariance matrix \(\boldsymbol{\Sigma}\) determines how similar nearby cells are. A common spatial covariance model is:
\[ \text{Cov}(X_i, X_j) = \sigma^2 \exp(-\phi d_{ij}) \]
where:
- \(d_{ij}\) is the distance between cells \(i\) and \(j\)
- \(\sigma^2\) is the marginal variance
- \(\phi\) controls how quickly correlation decays with distance
When \(\phi\) is small, correlation decays slowly and the surface is smoother over long distances. When \(\phi\) is large, correlation decays quickly and the surface becomes patchier at fine scales.
6 Simulating a landscape
We will simulate a square lattice and then use an exponential covariance function to generate a spatially autocorrelated surface.
# Set up a square lattice region
simgrid <- expand.grid(x = 1:50, y = 1:50)
n <- nrow(simgrid)
head(simgrid) x y
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
6.1 Make the grid into spatial objects
We can represent the simulated locations as an sf point object and also create a terra raster later.
simgrid_sf <- st_as_sf(simgrid, coords = c("x", "y"), crs = NA)
simgrid_sfSimple feature collection with 2500 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 1 ymin: 1 xmax: 50 ymax: 50
CRS: NA
First 10 features:
geometry
1 POINT (1 1)
2 POINT (2 1)
3 POINT (3 1)
4 POINT (4 1)
5 POINT (5 1)
6 POINT (6 1)
7 POINT (7 1)
8 POINT (8 1)
9 POINT (9 1)
10 POINT (10 1)
6.2 Distance matrix
distance_mat <- as.matrix(dist(simgrid))
dim(distance_mat)[1] 2500 2500
6.3 Spatial correlation as a function of distance
phi <- 0.7
plot(
1:100,
exp(-phi * 1:100),
type = "l",
xlab = "Distance",
ylab = "Correlation",
main = "Exponential decay in spatial correlation"
)Question for students: How does changing phi change the spatial aggregation in the simulated landscape?
6.4 Simulate the spatial random field
Here, the covariance between two cells depends on the distance between them:
Sigma_spatial <- exp(-phi * distance_mat)
X <- MASS::mvrnorm(n = 1, mu = rep(10, n), Sigma = Sigma_spatial)
head(X) 1 2 3 4 5 6
9.376144 8.633929 8.367130 9.182953 9.007860 9.028344
This means the whole landscape is one draw from a very high-dimensional multivariate normal distribution.
7 Convert the simulated values to a raster
We now combine the coordinates and simulated values, then create a SpatRaster with terra.
sim_df <- data.frame(
x = simgrid$x,
y = simgrid$y,
value = as.numeric(X)
)
Xraster <- terra::rast(sim_df, type = "xyz", crs = "")
Xrasterclass : SpatRaster
size : 50, 50, 1 (nrow, ncol, nlyr)
resolution : 1, 1 (x, y)
extent : 0.5, 50.5, 0.5, 50.5 (xmin, xmax, ymin, ymax)
coord. ref. :
source(s) : memory
name : value
min value : 6.603952
max value : 13.198275
plot(Xraster, main = "Simulated spatial landscape")7.1 Interpretation
- Each raster cell is one component of a multivariate normal vector.
- The mean vector was constant at 10, so the surface fluctuates around
- The covariance matrix created spatial structure: nearby cells tend to have similar values.
8 Convert raster cells to points and sample the landscape
Pretend this surface represents wildlife abundance, a habitat suitability index, vegetation biomass, ghost orchid density, or dragon activity. Because the data are simulated, you can invent the ecological story.
spat_dat <- as.data.frame(Xraster, xy = TRUE, na.rm = TRUE)
head(spat_dat) x y value
1 1 50 10.477432
2 2 50 9.868974
3 3 50 10.756138
4 4 50 10.630947
5 5 50 10.554223
6 6 50 9.462484
Choose a sample size:
sample_size <- 250
sample_rows <- sample(x = seq_len(nrow(spat_dat)), size = sample_size)Plot the landscape and the sampled points:
plot(Xraster, main = "Random sample from the simulated landscape")
points(spat_dat[sample_rows, c("x", "y")], pch = 19, cex = 0.5)We can also store the sample as an sf object:
sample_sf <- st_as_sf(spat_dat[sample_rows, ], coords = c("x", "y"), crs = NA)
sample_sfSimple feature collection with 250 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 1 ymin: 1 xmax: 50 ymax: 50
CRS: NA
First 10 features:
value geometry
675 9.307929 POINT (25 37)
1280 9.893317 POINT (30 25)
900 10.781421 POINT (50 33)
167 10.053942 POINT (17 47)
2180 11.493424 POINT (30 7)
1943 10.069342 POINT (43 12)
173 11.045629 POINT (23 47)
2175 9.219720 POINT (25 7)
1249 11.218943 POINT (49 26)
967 10.862394 POINT (17 31)
8.1 Discussion
This is pure random sampling.
Questions for students:
- What are the benefits of pure random sampling?
- What are the costs?
- Would a stratified or spatially balanced design be more efficient?
9 Simulating additional covariates
We have one main spatial covariate: the simulated surface value at the sampled locations. Now we will create additional variables.
canopy_height <- plogis(spat_dat[sample_rows, "value"])
prey_abundance <- rnbinom(sample_size, mu = canopy_height, size = 1.5)
temperature <- rnorm(n = sample_size, mean = canopy_height, sd = 10)
hunting_pressure <- rnorm(sample_size, mean = spat_dat[sample_rows, "x"], sd = 5)
covariates <- data.frame(
canopy_height = canopy_height,
prey_abundance = prey_abundance,
temperature = temperature,
hunting_pressure = hunting_pressure
)
pairs(covariates, main = "Simulated covariates")9.1 Interpretation
These variables are not jointly multivariate normal, but they are built from quantities that depend on the spatial process.
canopy_heightis a transformed version of the spatial surfaceprey_abundanceis a count variable generated from a negative binomial distributiontemperaturedepends oncanopy_heightplus noisehunting_pressureis linked to the x-coordinate, creating a spatial trend
This mirrors real ecological data, where different variables often have different distributions and are related through latent environmental structure.
10 Linking back to the multivariate normal
The most important conceptual bridge is this:
A spatially autocorrelated landscape can be generated by treating all cells as one multivariate random vector and specifying a covariance matrix that depends on distance.
That is, instead of simulating a few correlated variables, we simulate thousands of correlated locations.
The MVN framework helps us understand:
- why nearby cells look similar
- how covariance creates smooth or patchy surfaces
- how spatial structure can propagate into sampled covariates
11 Marginal and joint ideas in the spatial setting
Each cell value, by itself, has a marginal distribution. But the full landscape is defined by the joint distribution of all cells together.
- The marginal distribution of one cell describes the distribution of values at that cell
- The joint distribution describes how all cells vary together
- Spatial autocorrelation arises because those cell values are not independent
12 Exercises
- Change
phiand re-run the simulation. How does the visual pattern of the raster change? - Increase or decrease the variance of the spatial process by multiplying
Sigma_spatialby a constant. What changes? - Replace the constant mean vector
rep(10, n)with a mean that varies across x or y. How does this change the landscape? - Simulate multiple spatial surfaces and compare them.
- What would be a more efficient way to generate several correlated covariates directly from a multivariate normal distribution?
13 Optional extension: a spatial trend in the mean
So far the mean vector has been constant: \[ \mu_i = 10 \]
But in many applications, the mean depends on location or covariates:
\[ \mu_i = \beta_0 + \beta_1 x_i + \beta_2 y_i \]
This separates large-scale trend in the mean from small-scale spatial dependence in the covariance structure.
14 Key takeaways
MASS::mvrnorm()can simulate multivariate normal data and spatial random fields.- A spatial raster can be treated as one realization of a high-dimensional MVN random vector.
- The covariance matrix controls the strength of association among cells.
- In spatial simulation, covariance often depends on distance.
terraandsfprovide modern tools for handling simulated rasters and vector data in R.