Spatial Data Simulation with the Multivariate Normal Distribution

Author

Trevor Caughlin

Published

November 4, 2027

1 Overview

This tutorial shows how to simulate spatially structured data using ideas from the multivariate normal (MVN) distribution. We will:

  1. simulate correlated multivariate data with MASS::mvrnorm()
  2. create a spatial grid with spatial packages: terra and sf
  3. simulate a spatially autocorrelated landscape
  4. sample points from that landscape
  5. generate additional covariates for later modeling

The key conceptual link is that a spatial surface can be treated as a multivariate random vector, where each cell in the landscape is one element of the vector. The covariance matrix then controls how strongly values at different locations co-vary.

2 Packages

library(MASS)
library(terra)
library(sf)

3 The multivariate normal distribution

A multivariate normal distribution is defined by a mean vector, \(\boldsymbol{\mu}\), and a variance-covariance matrix, \(\boldsymbol{\Sigma}\):

\[ \mathbf{Y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]

For a 3-variable example,

\[ \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{bmatrix} \]

  • The mean vector \(\boldsymbol{\mu}\) sets the expected value of each variable.
  • The diagonal elements of \(\boldsymbol{\Sigma}\) are variances.
  • The off-diagonal elements of \(\boldsymbol{\Sigma}\) are covariances, which determine how variables move together.

4 Simulating a small multivariate normal example

We begin with a simple 3-dimensional MVN example.

mu_vec <- c(-5, 0.2, 1)

covar_mat <- matrix(
  c(1.0,  0.7,  0.3,
    0.7,  1.0,  0.4,
    0.3,  0.4,  1.0),
  nrow = 3,
  byrow = TRUE
)

one_draw <- MASS::mvrnorm(n = 1, mu = mu_vec, Sigma = covar_mat)
one_draw
[1] -5.4715156 -1.3350612  0.8062145

Now simulate many observations:

set.seed(123)

try1 <- MASS::mvrnorm(n = 100, mu = mu_vec, Sigma = covar_mat)
pairs(try1, main = "Simulated 3-variable MVN data")

Exercise for students: What would you modify in covar_mat to make the relationship between two variables tighter?

4.1 Interpretation

If you want two variables to be more tightly correlated, you would generally increase the magnitude of the corresponding covariance (or correlation), while keeping the covariance matrix valid (positive definite if you want to use linear algebra jargon).

5 Connecting MVN ideas to spatial simulation

A spatial raster with \(n\) cells can be viewed as an \(n\)-dimensional random vector:

\[ \mathbf{X} = (X_1, X_2, \dots, X_n)^T \]

where each \(X_i\) is the value at one location on the landscape.

If we assume

\[ \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]

then the covariance matrix \(\boldsymbol{\Sigma}\) determines how similar nearby cells are. A common spatial covariance model is:

\[ \text{Cov}(X_i, X_j) = \sigma^2 \exp(-\phi d_{ij}) \]

where:

  • \(d_{ij}\) is the distance between cells \(i\) and \(j\)
  • \(\sigma^2\) is the marginal variance
  • \(\phi\) controls how quickly correlation decays with distance

When \(\phi\) is small, correlation decays slowly and the surface is smoother over long distances. When \(\phi\) is large, correlation decays quickly and the surface becomes patchier at fine scales.

6 Simulating a landscape

We will simulate a square lattice and then use an exponential covariance function to generate a spatially autocorrelated surface.

# Set up a square lattice region
simgrid <- expand.grid(x = 1:50, y = 1:50)
n <- nrow(simgrid)

head(simgrid)
  x y
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1

6.1 Make the grid into spatial objects

We can represent the simulated locations as an sf point object and also create a terra raster later.

simgrid_sf <- st_as_sf(simgrid, coords = c("x", "y"), crs = NA)

simgrid_sf
Simple feature collection with 2500 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 1 ymin: 1 xmax: 50 ymax: 50
CRS:           NA
First 10 features:
       geometry
1   POINT (1 1)
2   POINT (2 1)
3   POINT (3 1)
4   POINT (4 1)
5   POINT (5 1)
6   POINT (6 1)
7   POINT (7 1)
8   POINT (8 1)
9   POINT (9 1)
10 POINT (10 1)

6.2 Distance matrix

distance_mat <- as.matrix(dist(simgrid))
dim(distance_mat)
[1] 2500 2500

6.3 Spatial correlation as a function of distance

phi <- 0.7

plot(
  1:100,
  exp(-phi * 1:100),
  type = "l",
  xlab = "Distance",
  ylab = "Correlation",
  main = "Exponential decay in spatial correlation"
)

Question for students: How does changing phi change the spatial aggregation in the simulated landscape?

6.4 Simulate the spatial random field

Here, the covariance between two cells depends on the distance between them:

Sigma_spatial <- exp(-phi * distance_mat)

X <- MASS::mvrnorm(n = 1, mu = rep(10, n), Sigma = Sigma_spatial)
head(X)
       1        2        3        4        5        6 
9.376144 8.633929 8.367130 9.182953 9.007860 9.028344 

This means the whole landscape is one draw from a very high-dimensional multivariate normal distribution.

7 Convert the simulated values to a raster

We now combine the coordinates and simulated values, then create a SpatRaster with terra.

sim_df <- data.frame(
  x = simgrid$x,
  y = simgrid$y,
  value = as.numeric(X)
)

Xraster <- terra::rast(sim_df, type = "xyz", crs = "")
Xraster
class       : SpatRaster 
size        : 50, 50, 1  (nrow, ncol, nlyr)
resolution  : 1, 1  (x, y)
extent      : 0.5, 50.5, 0.5, 50.5  (xmin, xmax, ymin, ymax)
coord. ref. :  
source(s)   : memory
name        :     value 
min value   :  6.603952 
max value   : 13.198275 
plot(Xraster, main = "Simulated spatial landscape")

7.1 Interpretation

  • Each raster cell is one component of a multivariate normal vector.
  • The mean vector was constant at 10, so the surface fluctuates around
  • The covariance matrix created spatial structure: nearby cells tend to have similar values.

8 Convert raster cells to points and sample the landscape

Pretend this surface represents wildlife abundance, a habitat suitability index, vegetation biomass, ghost orchid density, or dragon activity. Because the data are simulated, you can invent the ecological story.

spat_dat <- as.data.frame(Xraster, xy = TRUE, na.rm = TRUE)
head(spat_dat)
  x  y     value
1 1 50 10.477432
2 2 50  9.868974
3 3 50 10.756138
4 4 50 10.630947
5 5 50 10.554223
6 6 50  9.462484

Choose a sample size:

sample_size <- 250
sample_rows <- sample(x = seq_len(nrow(spat_dat)), size = sample_size)

Plot the landscape and the sampled points:

plot(Xraster, main = "Random sample from the simulated landscape")
points(spat_dat[sample_rows, c("x", "y")], pch = 19, cex = 0.5)

We can also store the sample as an sf object:

sample_sf <- st_as_sf(spat_dat[sample_rows, ], coords = c("x", "y"), crs = NA)
sample_sf
Simple feature collection with 250 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 1 ymin: 1 xmax: 50 ymax: 50
CRS:           NA
First 10 features:
         value      geometry
675   9.307929 POINT (25 37)
1280  9.893317 POINT (30 25)
900  10.781421 POINT (50 33)
167  10.053942 POINT (17 47)
2180 11.493424  POINT (30 7)
1943 10.069342 POINT (43 12)
173  11.045629 POINT (23 47)
2175  9.219720  POINT (25 7)
1249 11.218943 POINT (49 26)
967  10.862394 POINT (17 31)

8.1 Discussion

This is pure random sampling.

Questions for students:

  • What are the benefits of pure random sampling?
  • What are the costs?
  • Would a stratified or spatially balanced design be more efficient?

9 Simulating additional covariates

We have one main spatial covariate: the simulated surface value at the sampled locations. Now we will create additional variables.

canopy_height <- plogis(spat_dat[sample_rows, "value"])
prey_abundance <- rnbinom(sample_size, mu = canopy_height, size = 1.5)
temperature <- rnorm(n = sample_size, mean = canopy_height, sd = 10)
hunting_pressure <- rnorm(sample_size, mean = spat_dat[sample_rows, "x"], sd = 5)

covariates <- data.frame(
  canopy_height = canopy_height,
  prey_abundance = prey_abundance,
  temperature = temperature,
  hunting_pressure = hunting_pressure
)

pairs(covariates, main = "Simulated covariates")

9.1 Interpretation

These variables are not jointly multivariate normal, but they are built from quantities that depend on the spatial process.

  • canopy_height is a transformed version of the spatial surface
  • prey_abundance is a count variable generated from a negative binomial distribution
  • temperature depends on canopy_height plus noise
  • hunting_pressure is linked to the x-coordinate, creating a spatial trend

This mirrors real ecological data, where different variables often have different distributions and are related through latent environmental structure.

10 Linking back to the multivariate normal

The most important conceptual bridge is this:

A spatially autocorrelated landscape can be generated by treating all cells as one multivariate random vector and specifying a covariance matrix that depends on distance.

That is, instead of simulating a few correlated variables, we simulate thousands of correlated locations.

The MVN framework helps us understand:

  • why nearby cells look similar
  • how covariance creates smooth or patchy surfaces
  • how spatial structure can propagate into sampled covariates

11 Marginal and joint ideas in the spatial setting

Each cell value, by itself, has a marginal distribution. But the full landscape is defined by the joint distribution of all cells together.

  • The marginal distribution of one cell describes the distribution of values at that cell
  • The joint distribution describes how all cells vary together
  • Spatial autocorrelation arises because those cell values are not independent

12 Exercises

  1. Change phi and re-run the simulation. How does the visual pattern of the raster change?
  2. Increase or decrease the variance of the spatial process by multiplying Sigma_spatial by a constant. What changes?
  3. Replace the constant mean vector rep(10, n) with a mean that varies across x or y. How does this change the landscape?
  4. Simulate multiple spatial surfaces and compare them.
  5. What would be a more efficient way to generate several correlated covariates directly from a multivariate normal distribution?

13 Optional extension: a spatial trend in the mean

So far the mean vector has been constant: \[ \mu_i = 10 \]

But in many applications, the mean depends on location or covariates:

\[ \mu_i = \beta_0 + \beta_1 x_i + \beta_2 y_i \]

This separates large-scale trend in the mean from small-scale spatial dependence in the covariance structure.

14 Key takeaways

  • MASS::mvrnorm() can simulate multivariate normal data and spatial random fields.
  • A spatial raster can be treated as one realization of a high-dimensional MVN random vector.
  • The covariance matrix controls the strength of association among cells.
  • In spatial simulation, covariance often depends on distance.
  • terra and sf provide modern tools for handling simulated rasters and vector data in R.