Reimplementation of Ripley’s K function using sf library

Author

Filip Piotrowski

Published

June 4, 2025

Introduction

Spatial point pattern analysis (PPA) aims to detect patterns of spatial interaction (e.g., clustering) among points distributed in space. Ripley’s K function is a second-order statistic that provides insights into spatial dependence over varying scales.

To evaluate the significance of spatial patterns, simulation envelopes are often used. They compare the observed K function with simulated K functions under a null hypothesis, typically complete spatial randomness (CSR).

This report details the reimplementation of simulation envelopes for Ripley’s K function using the sf R package. The main reason for this programming exercise is the shortcomings of the standard library used for point pattern analysis in R – spatstat. Called by some researchers as a “dinosaur package” for its longevity, spatstat is well-established in its specialized field. However, it lacks frequent updates. In comparison with the newer sf package, spatstat has a steeper learning curve and a less intuitive interface. Therefore, the purpose of this study is to make an attempt to translate a basic PPA tool into a modern R spatial ecosystem and compare the performance with existing functions.

Ripley’s K function

Ripley’s K function, denoted as \(K(r)\), quantifies the expected number of additional points within a distance \(r\) of an arbitrary point in the pattern, normalized by the point pattern’s intensity (average number of points per unit area).

A key aspect of \(K(r)\) is its ability to reveal patterns beyond complete spatial randomness (CSR). For a CSR process, the theoretical \(K(r)\) is given by \(K_{CSR}(r)=\pi r^2\).

If the observed \(K(r)\) is greater than \(\pi r^2\), it suggests clustering of points.
If the observed \(K(r)\) is less than \(\pi r^2\), it suggests a regular or inhibited pattern.
If the observed \(K(r)\) is equal to \(\pi r^2\), it indicates a random distribution.

Similarly as in the package spatstat, this study adopts a formulation of Ripley’s K found in Baddeley, Rubak, and Turner (2016):

\[ \hat{K}(r) = \frac{a}{n(n-1)}\sum_i\sum_j I(d_{ij}\leq r)e_{ij} \]

where:

\(a\) is the area of the window,
\(n\) is the number of data points,
\(d_{ij}\) is the distance between the two points,
\(I(d_{ij}\leq r)\) is the indicator that equals \(1\) if the distance is less than or equal to \(r\),
\(e_{ij}\) is the edge correction weight, not applied in our case of Monte Carlo simulations.

Code implementation

Prerequisits

Among required packages are spatstat, sf, ggplot2 (for plotting results) and microbenchmark (for comparing implemented function with spatstat equivalents). The implemented functions are tested on a sample dataset nztrees provided in spatstat library.

rm(list = ls())

library(sf)             # spatial manipulation
library(spatstat)       # datasets
library(ggplot2)        # plots
library(microbenchmark) # performance comparison

data("nztrees")
data.ppp <- nztrees
plot(data.ppp)

Data preparation

The first step involves converting spatstat’s ppp object to sf (simple features) format to facilitate modern spatial operations. The ppp object is separated into a spatial data frame containing points (points.sf) and an observation window defining the spatial bounds (window.sf).

pattern.sf <- st_as_sf(data.ppp)
points.sf <- pattern.sf[pattern.sf$label == "point", ]
window.sf <- st_geometry(pattern.sf[pattern.sf$label == "window", ])

Defining optimal radius (\(r\) argument) range

While running a standard spatstat::Kest() function to obtain Ripley’s K, little attention is paid to the optimal range of the radius (\(r\) argument). The idea behind this default value is primarly to minimize edge effects and provide a sensible range for interpretation. Furthermore, at large \(r\) the estimation becomes unstable as fewer point pairs are available.

The choice of the radius range affects also visual interpretation. If the radius range is set to the maximum, it is more difficult to see deviations between empirical and theoretical K functions in a proper scale. An example is shown below.

data("clmfires")
par(mfrow = c(1,2))
plot(Kest(clmfires, rmax = 100))
plot(Kest(clmfires))

par(mfrow = c(1,1))

Careful reading of spatstat’s documentation gives a useful rule of thumb for setting maximum \(r\) value. The answer is hidden inside internal spatstat function rmax.rule and can be formulated as:

\[ r_{\text{max}} = \min\left(\frac{\min(\text{width}, \text{height})}{4}, \sqrt{\frac{1000}{\pi\lambda}} \right) \]

where height and width variables indicate the height and width of the window. The vector of default \(r\) values in spatstat::Kest() function is equally spaced from 0 to \(r_{\max}\) with a default step value of \(r_{\max} / 512\), resulting in \(513\) values.

The whole procedure can be written programmatically as:

area <- st_area(window.sf)
n.points <- nrow(points.sf)
lambda <- n.points / area
bbox <- st_bbox(window.sf)

window.height <- bbox[["ymax"]] - bbox[["ymin"]]
window.width <- bbox[["xmax"]] - bbox[["xmin"]]
r.max <- min(min(window.width, window.height) / 4, 
             sqrt(1000 / (pi * lambda)))
r.args <- seq(from = 0, to = r.max, length.out = 513)

K function implementation

To calculate a standard form of Ripley’s K, KDataFrame() function is defined. Two obligatory arguments should be passed as input: points.sf, window.sf. Additionally, user can pass optional arguments that are necessary in the intermediate steps of computation:

area: area of the window,
n.points: number of points in the dataset
lambda: intensity of the point process
bbox: bounding box of the window
r.args: a vector of of values for the argument \(r\) at which \(K(r)\) should be evaluated

First, the function calculates the distance matrix between points, using sf::st_distance() function. Because the matrix is symmetrical, values in the lower/upper triangle should be removed, as well as diagonal values. Then an empirical CDF of the pair-wise distances is computed using a highly-efficient built-in ecdf() function.

KDataFrame() returns a data frame with two columns: r.args and corresponding K values.

KDataFrame <- function(points.sf, window.sf, area = NULL, n.points = NULL,
                       lambda = NULL, bbox = NULL, r.args = NULL) {
    # data characteristics
    if (is.null(area)) area <- st_area(window.sf)
    if (is.null(n.points)) n.points <- nrow(points.sf)
    if (is.null(lambda)) lambda <- n.points / area
    if (is.null(bbox)) bbox <- st_bbox(window.sf)
    
    
    # range of r values for which the K is computed
    if (is.null(r.args)) {
        window.height <- bbox[["ymax"]] - bbox[["ymin"]]
        window.width <- bbox[["xmax"]] - bbox[["xmin"]]
        r.max <- min(min(window.width, window.height) / 4, 
                     sqrt(1000 / (pi * lambda)))
        r.args <- seq(from = 0, to = r.max, length.out = 513)
    }
    
    # distance CDF
    dist.mx <- st_distance(points.sf)
    dist.mx[lower.tri(dist.mx, diag = TRUE)] <- NA
    dist <- na.omit(as.vector(dist.mx))
    dist.cdf <- ecdf(dist)
    
    # K computation
    K <- area * dist.cdf(r.args)
    df <- data.frame(r = r.args, k = K)
    return(df)
}

To ensure that KDataFrame() function is implemented correctly, its results can be compared visually with spatstat::Kest() plot, as shown below. The edge correction is set to “none”, because it will not be needed while generating simulation envelopes.

k.emp.df <- KDataFrame(points.sf, window.sf) # K results from defined function
k.emp.df$k.theo <- pi * k.emp.df$r^2         # theoretical K values
k.max <- max(k.emp.df$k)

par(mfrow = c(1,2))
# plot - defined function
plot(k.emp.df$r, k.emp.df$k, type = "l", main = "Programmed function")
lines(k.emp.df$r, k.emp.df$k.theo, col = "red", lty = "dashed")
# plot - spatstat function
plot(Kest(data.ppp, correction = "none"), ylim = c(0, k.max))

par(mfrow = c(1,1))

Envelope simulation

The most computationally intensive and programmatically challenging part of this work is generating simulation envelopes. The function KEnvelope() takes two standard arguments with point pattern data (points.sf, window.sf) and a number of simulations to be generated when computing the envelopes (nsim).

The next steps of the procedure are as follows:

a default \(r_{\max}\) and \(r\) values vector are defined,
an empirical Ripley’s K is calculated using KDataFrame(),
for each simulation: a point pattern is generated with the same number of points and its K function is computed,
For each radius value \(r\), the minimum and maximum K values are computed across simulations (i.e., lower and upper bounds of the envelope).

The output is a data frame containing:

k.obs: Observed K values,
k.min, k.max: Minimum and maximum simulated K values for each \(r\).

KEnvelope <- function(points.sf, window.sf, nsim = 99) {
    # data characteristics
    area <- st_area(window.sf)
    n.points <- nrow(points.sf)
    lambda <- n.points / area
    bbox <- st_bbox(window.sf)
    
    # range of r values for which the K is computed
    window.height <- bbox[["ymax"]] - bbox[["ymin"]]
    window.width <- bbox[["xmax"]] - bbox[["xmin"]]
    r.max <- min(min(window.width, window.height) / 4, sqrt(1000 / (pi * lambda)))
    r.args <- seq(from = 0, to = r.max, length.out = 513)
    
    # empirical K
    k.obs.df <- KDataFrame(points.sf, window.sf, area, n.points, 
                           lambda, bbox, r.args)
    
    # empty data frame for simulation results
    sim.df <- data.frame(matrix(ncol = 0, nrow = length(r.args)))
    
    # setting up progess bar
    tick <- 0
    pb <- txtProgressBar(min = 0,      
                         max = nsim,
                         style = 3,
                         width = 80,
                         char = "=")
    
    # main loop 
    for (i in 1:nsim) {
        counter <- 0
        x.coord <- double(n.points)
        y.coord <- double(n.points)
        
        # drawing n points inside the window
        while (counter < n.points) {
            x.drawn <- runif(1, bbox["xmin"], bbox["xmax"])
            y.drawn <- runif(1, bbox["ymin"], bbox["ymax"])
            point.sf <- st_sfc(st_point(c(x.drawn, y.drawn)))
            is.inside <- as.logical(length(st_contains(window.sf, point.sf)[[1]]))
            
            if (!is.inside) next
            
            counter <- counter + 1
            x.coord[counter] <- x.drawn
            y.coord[counter] <- y.drawn
        }
        
        # saving simulated points as sf object
        sim.points.df <- data.frame(x = x.coord, y = y.coord)
        sim.points.sf <- st_as_sf(sim.points.df, coords = c("x", "y"))
        
        # K values for simulated points
        k.df <- KDataFrame(sim.points.sf, window.sf, area, n.points, 
                           lambda, bbox, r.args)
        
        # saving K value of i-th simulation
        sim.df[paste0("sim", i)] <- k.df$k
        
        tick <- tick + 1
        setTxtProgressBar(pb, tick) 
    }
    
    # computing envelope
    envelope.df <- data.frame(
        r = r.args, 
        k.obs = k.obs.df$k,
        k.min = apply(sim.df, 1, min), 
        k.max = apply(sim.df, 1, max)
    )
    
    return(envelope.df)
}

The results of generating simulation envelopes are visualised on the plot below:

The results can be visually compared with the simulation envelope computed with spatstat, as plotted below:

set.seed(2025)
envelope.spatstat <- envelope(data.ppp, fun = Kest)
plot(envelope.spatstat)

Performance comparison

After verifying the correctness of the implemented function’s output, the next step is to assess its performance in comparison to an existing spatstat::envelope() function.

The microbenchmark library allows to measure the execution time both functions under the same conditions. In this case each expression is evaluated only 10 times, because the Monte Carlo simulations are computionally expensive.

The benchmark results demonstrate that the custom implementation is much less efficient, with execution times more than 15 times longer on average. This suggests that while the custom function may be correct, it is not suitable for performance-critical tasks compared to the optimized spatstat alternative.

performance.comparison <- microbenchmark(
    new = KEnvelope(points.sf, window.sf, 99), 
    spatstat = envelope(data.ppp, fun = Kest, nsim = 99), 
    times = 10
)

plot(performance.comparison)

print(performance.comparison)

Unit: milliseconds
     expr       min        lq      mean    median        uq       max neval cld
      new 4170.2048 4307.9626 4975.3350 5017.2680 5527.3929 5973.1281    10  a 
 spatstat  231.8772  237.8948  301.6768  285.4713  365.5789  402.7721    10   b

Discussion

This project successfully demonstrates a full pipeline for generating simulation envelopes of Ripley’s K function using modern R spatial tools. The programmed functions cannot replace spatstat tools due to computational limitations in handling large datasets. However, developing these functions from scratch offers significant educational value by deconstructing the underlying algorithms.

The core strengths include:

modular design separating K computation (KDataFrame) and envelope generation (KEnvelope),
integration with modern spatial packages (sf),
self-contained reimplementation without relying on spatstat’s K function tools.

For future enhancements, the following steps should be taken into consideration:

performance optimisation: distance matrix calculation is \(O(n^2)\), infeasible for large datasets,
CSR simulation: rejection sampling of points inside polygons is inefficient for complex windows
confidence levels: quantile-based envelopes (e.g., 5%–95%) would provide a smoother estimation of significance in comparison with the absolute min/max

References

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2016. Spatial Point Patterns: Methodology and Applications with r. Vol. 1. CRC press Boca Raton.