Overview

In order to statistically interrogate spatial point data, a spatial point pattern analysis (e.g. Cressie 1993; Diggle 2003; Baddeley et al 2016) is required. This form of analysis may arise in many different contexts, for example:

  1. Crime incidences in Chicago, USA
  2. Gorilla nesting sites in a National park in Cameroon
  3. Kimboto trees in Paracou, French Guiana
  4. Galaxies in an astronomical survey

Each example can be analysed for understanding, say:

  1. Do crime patterns centre on specific neighbourhoods? Are patterns driven by poverty? Does this change over time?
  2. Are nests more dense in some areas? Does this relate to vegetation characteristics?
  3. Do trees cluster according to soils characteristics?
  4. Are galaxies clustered in space! Do specific galaxy-types cluster and others not?

Such examples are just 4 of over 40 data sets that are freely available with the spatstat package, a comprehensive package for spatial point pattern analysis, model fitting, simulation and model testing - some of whose functions will be demonstrated here.

Tools available in the spatstat package are designed to work with points stored as ppp objects and not a SpatialPointsDataFrame or sf object.

Critically, a ppp object may or may not have attribute information (referred to as marks - as it constitutes a ‘marked’ point pattern process). Knowing whether or not a function requires an attribute table to be present in the ppp object is crucial to its interpretation.

Marks relay extra information that were collected with each point. For example, male or female burglar, tree height, number of eggs in a bird’s nest, galaxy type, etc.

In this session, we will introduce and demonstrate techniques for patterns generated by the points only (i.e. without marks). Patterns generated by their attributes (i.e. with marks) are not covered.

Core to this Session is the spatstat package, which needs installing:

# library(knitr)
library(spatstat)

Once again, remember to:

  1. Create a seperate folder for this Session
  2. Always write your code into an R script… always!
  3. Save the script to the folder
  4. Set the working directory to the folder location

Having done that you should clear your workspace for this session:

rm(list = ls())

1. Case study: redwood tree data

The redwood data set considered in this session consists of the locations of 62 seedlings and saplings of California redwood trees in the USA. It is fully described using the help function as follows:

# help("redwood")

We can plot the locations of this data (Figure 1) as follows:

plot(redwood, pch=20, cols="grey70", main="Locations of 62 seedlings & saplings")
Redwood data: locations of 62 seedlings and saplings

Redwood data: locations of 62 seedlings and saplings

We can confirm its object-type and look at its properties (including its bounding box or window) as follows:

class(redwood)
summary(redwood)

2. Exploratory Spatial Data Analysis (ESDA)

2.1 Kernel Density Estimation

EDA is key to a judged point pattern analysis. It commonly involves: (a) mapping the intensity of the process, and (b) assessing whether or not the point pattern deviates from random expectations. The core R density function can be used to explore the former using a spatial Kernel Density Estimation (KDE) analysis. Here we get our first insights to whether or not the seedlings & saplings are more dense in some areas and not others.

Further more we conduct the spatial KDE, highlighting the effects of different KDE specifications: (i) Gaussian kernels but with different bandwidth selection procedures, (ii) Gaussian kernels but with different bandwidth adjustments, and (iii) different kernels. It appears specifications (ii) and (iii) play a key role in the resultant surfaces and our interpretations:

# First - all with Gaussian kernels but with different bandwidth selection procedures
density.r.1 <- density(redwood) # default with kernel="gaussian", bw = "nrd0"
density.r.2 <- density(redwood, kernel="gaussian",bw = "nrd")
density.r.3 <- density(redwood, kernel="gaussian",bw = "ucv")
density.r.4 <- density(redwood, kernel="gaussian",bw = "SJ-ste")

# Second - all with Gaussian kernels but with different bandwidth adjustments
density.r.5 <- density(redwood, kernel="gaussian",bw = "ucv",adjust=0.2) # 20%
density.r.6 <- density(redwood, kernel="gaussian",bw = "ucv",adjust=0.4) # 40%
density.r.7 <- density(redwood, kernel="gaussian",bw = "ucv",adjust=0.6) # 60%
density.r.8 <- density(redwood, kernel="gaussian",bw = "ucv",adjust=0.8) # 80%

# Third - with different kernels
density.r.9 <- density(redwood, kernel="quartic")
density.r.10 <- density(redwood, kernel="disc")
density.r.11 <- density(redwood, kernel="epanechnikov")
density.r.12 <- density(redwood, kernel="gaussian")

# help("density")) # for details

Plot results (i), as in Figure 2. Note the parameters passed to par to set the plot layout (2 columns and 2 rows) and to set the margins to 1 inch around each side of each plot. To be able to reset this at the end of the session, we need to save of the default parameters

.pardefault <- par(no.readonly = T)
par(mfrow=c(2,2),mar=c(1,1,1,1))
plot(density.r.1, main = "# 1")
plot(density.r.2, main = "# 2")
plot(density.r.3, main = "# 3")
plot(density.r.4, main = "# 4")
KDE - Gaussian kernels but with different bandwidth selection procedures

KDE - Gaussian kernels but with different bandwidth selection procedures

Plot results (ii) in Figure 3:

par(mfrow=c(2,2),mar=c(1,1,1,1))
plot(density.r.5, main = "20%")
plot(density.r.6, main = "40%")
plot(density.r.7, main = "60%")
plot(density.r.8, main = "80%")
KDE - Gaussian kernels but with different bandwidth adjustments

KDE - Gaussian kernels but with different bandwidth adjustments

Plot results (iii) in Figure 4:

par(mfrow=c(2,2),mar=c(1,1,1,1))
plot(density.r.9, main="quartic")
plot(density.r.10, main="disc")
plot(density.r.11, main="epanechnikov")
plot(density.r.12, main="gaussian")
KDE - each with different kernels

KDE - each with different kernels

Observe the similarities between spatial KDE and Geographically Weighted Models (Sessions 3 and 7), where both use kernel weighting functions.

2.2 Simulations

It is useful to further explore for homogeneity of the seedlings & saplings data, either: (i) through simulation (i.e. randomisation) or (ii) quadrat counts and an associated chi-square test.

The first exploration consists of simulating spatially random point patterns based on the average intensity in the observed point pattern. If the density estimates of the observed and simulated point patterns are similar then there is evidence that the observed point pattern is homogenous. If the real dataset stands out among the simulated ones (which it does!) then there is evidence that an inhomogenous process generated the redwood seedlings & saplings data. It is useful to repeat this exercise a few times for clarity.

# Select a random position for the observed data in the figure.
# Set the random seed for reproducibility:
set.seed(123) 
pos <- sample(1:6,1)

# Simulate 5 CSR point patterns:
simp <- rpoispp(lambda = intensity(redwood),win = Window(redwood),nsim=5)

# Replace the simulated set at the pos'th position by the observed dataset:
tmp <- simp[[pos]]
simp[[pos]] <- redwood
simp[[6]] <- tmp
names(simp)[6] <- "Simulation 6"

# Compute the KDEs:
densp <- density(simp)

Plot the results (Figure 5). Can you detect which one was the observed dataset?

par(mfrow=c(2,3),mar=c(1,1,1,1))
plot(as.listof(densp), zlim=range(unlist(lapply(densp,range))),
     main ="Simulations for homogeneity")
Simulations for homegeneity (with one observed)

Simulations for homegeneity (with one observed)

Remember to re-set the plot parameters after each time you change them:

par(.pardefault)

2.3 Quadrat counts

This EDA divides the sample area into quadrats and counts the number of points per quadrat (Figure 6):

Q <- quadratcount(redwood, nx= 5, ny=5)
plot(redwood, pch=20, cols="grey70", main=NULL)  # Plot points
plot(Q, add=TRUE)  # Add quadrat grid
Data in quadrats

Data in quadrats

Which can also be presented with the density of points within each quadrat (Figure 7), (see help for the ‘intensity’ function):

Q.d <- intensity(Q)
plot(intensity(Q, image=TRUE), main=NULL, las=1)  # Plot density raster
plot(redwood, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=TRUE)  # Add points
Intensity of points in quadrats

Intensity of points in quadrats

A chi-square test can then be used to infer if the redwood pattern is homogenous (p > 0.05) or inhomogenous (p < 0.05):

quadrat.test(redwood)
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  redwood
## X2 = 64.613, df = 24, p-value = 2.774e-05
## alternative hypothesis: two.sided
## 
## Quadrats: 5 by 5 grid of tiles

Thus, the null hypothesis of the redwood seedlings & saplings pattern being generated by a spatial random process is rejected - i.e. strong evidence that the point pattern is inhomogenous (or non-stationary). A strong caveat to the quadrat approach is the subjective choice for the number of quadrats (where the default is 5 by 5 = 25 quadrats).

2.4 Ripley’s K function

A final illustration of a useful EDA procedure for a spatial point pattern is with Ripley’s K function. This counts the number of neighbouring points within an increasing distance from a given focal point. If the point pattern follows Complete Spatial Randomness (CSR) then there is a known relationship between this count number (K) and the distance considered (r). Ripley’s K function is a graphical procedure and is found as follows:

K.Ripley.1 <- envelope(redwood,fun=Kest,funargs=list(correction="border"),global=TRUE)

This can be plotted as in Figure 8:

plot(K.Ripley.1, main="Ripley’s K function")
Plot for Ripley’s K function.

Plot for Ripley’s K function.

Here a constant envelop (or ‘confidence band’ see help(“envelop”)) is calculated following procedures outlined in http://book.spatstat.org/sample-chapters/chapter07.pdf).

In this instance, the observed curve (\(\hat{K}_{obs}(r)\)) is consistently higher than the expected curve (\(K_{theo}(r)\)). This means there are more points than expected within certain distances of each other - i.e. the points are more clustered than expected and the process is not CSR. If \(\hat{K}_{obs}(r)\) were consistently lower than \(K_{theo}(r)\) then the points are more dispersed than expected.

The parameter Kest in the above assumes a homogenous (stationary) intensity function (i.e. specifically, ‘model/parameter stationarity’). However, we again have evidence that the redwood point pattern is not homogenous. This can be accounted for by using a modified version of Ripley’s K through specifying the Kinhom function as follows:

K.Ripley.2 <- envelope(redwood,fun=Kinhom,global = TRUE)

This can be plotted as in Figure 9:

plot(K.Ripley.2, main="Modified Ripley’s K function")
Plot for Modified Ripley’s K function.

Plot for Modified Ripley’s K function.

In this instance, the observed curve (\(\hat{K}_{obs}(r)\)) is lower than the expected curve (\(K_{theo}(r)\)) for large distances. This means more dispersion in the redwood seedlings and saplings than expected under CSR - accounting for the inhomogeneity in the point pattern. The Kinhom function derives an intensity estimate from the data (similar to KDE) by weighting each point based on their estimated intensity.

Alternative confidence bands can also be found using varblock or lohboot functions to derive bootstrapped confidence intervals of the expected K values under CSR.

3. Model fitting

3.1 Poisson Point Process models

In summary, our extensive EDA has suggested that the 62 redwood seedlings and saplings display an inhomogenous spatial pattern where at large distances they are more spread out than expected.

Now we can start to think of ways to formally model the redwood process. This can be achieved using the function ppp, which fits a Poisson Point Process to the data. Formal models are key for a RICHER UNDESTANDING of the point proces where covariates such the coordinates, topography, remote sensing data can be incorporated.

For example, fit an intercept-only model:

mod.intercept <- ppm(redwood ~ 1)
mod.intercept
## Stationary Poisson process
## Intensity: 62
##             Estimate      S.E.  CI95.lo CI95.hi Ztest     Zval
## log(lambda) 4.127134 0.1270001 3.878219 4.37605   *** 32.49709

For example, fit a model, where the spatial coordinates inform the point process:

mod.polynomial <- ppm(redwood ~ polynom(x,y,2))
mod.polynomial
## Nonstationary Poisson process
## 
## Log intensity:  ~x + y + I(x^2) + I(x * y) + I(y^2)
## 
## Fitted trend coefficients:
## (Intercept)           x           y      I(x^2)    I(x * y)      I(y^2) 
##    2.208281    4.491415   -3.949507   -2.443234    3.346023   -2.137482 
## 
##              Estimate      S.E.    CI95.lo   CI95.hi Ztest      Zval
## (Intercept)  2.208281 0.9222806  0.4006437 4.0159173     *  2.394369
## x            4.491415 2.3619729 -0.1379668 9.1207970        1.901552
## y           -3.949507 2.3234694 -8.5034238 0.6044089       -1.699832
## I(x^2)      -2.443234 1.8576133 -6.0840887 1.1976217       -1.315254
## I(x * y)     3.346023 1.8596583 -0.2988407 6.9908858        1.799267
## I(y^2)      -2.137482 1.8356763 -5.7353415 1.4603774       -1.164411

The first model fitted a stationary Poisson Process, while the second model fitted a non-stationary Poisson Process. The first model assumes the intensity (the density of the redwood seedlings and saplings) is \(exp(4.127) = 62.0\) throughout the study area (i.e. the trend is stationary and constant). Observe that the exponential arises as the models are log-linear.

The second model fitted a second-order polynomial of the coordinates of the form: \(log(lambda)=x+y+x_2+x:y+y_2\). Thus, the trend is taken as non-stationary and non-constant as depicted in Figure 10. Observe that all the polynomial terms of this model are not significant, and in this instance, the intercept model should be preferred.

As stated above, the model predictors (or covariates) could be other variables that are similarly available everywhere, not just the coordinates, such as topography (elevation, slope, aspect) or a remotely-sensed image.

plot(mod.polynomial,se=FALSE,how="image")
Surface depicting an example trend component in Poisson Point Process Model

Surface depicting an example trend component in Poisson Point Process Model

Figure 10, maps the predicted intensity (the lambda) from the model with the observed redwood seedlings and saplings sites added. As for every model, some form of model validation is required to assess its value. Here several functions are available via diagnose.ppm. For example, the smoothed residuals from the model fit can be found using (and sticking with the polynomial model for illustration as in Figure 11):

diagnose.ppm(mod.intercept, which = "smooth")
Surface depicting the smoothed raw residuals

Surface depicting the smoothed raw residuals

## Model diagnostics (raw residuals)
## Diagnostics available:
##  smoothed residual field
## range of smoothed field =  [-59.56, 50.85]

3.2 Ripley’s K function

Here, we can identify areas where the model poorly fits the observed point patterns. We can also use the fitted intensity in the Kinhom function to see if the observed point pattern is more or less clustered than expected from the model fit, again using Ripley’s K function, as given in Figure 12 (which takes a minute or two to run…):

K.Ripley.3 <- envelope(redwood,Kinhom,funargs = list(lambda=mod.polynomial),global=TRUE)
plot(K.Ripley.3, main="Modified Ripley’s K function with respect to model fit")
Plot for Modified Ripley’s K function with respect to model fit

Plot for Modified Ripley’s K function with respect to model fit

We see that the observed point pattern is more clustered than expected based on the model, as the observed curve (\(K_{obs}(r)\)) is consistently higher than the expected curve (\(K_{theo}(r)\)). One solution would be to use a clustered Poisson point process model via the function kppm:

mod.polynomial.c <- kppm(redwood ~ polynom(x,y,2))
mod.polynomial.c
## Inhomogeneous cluster point process model
## Fitted to point pattern dataset 'redwood'
## Fitted by minimum contrast
##  Summary statistic: inhomogeneous K-function
## 
## Log intensity:  ~x + y + I(x^2) + I(x * y) + I(y^2)
## 
## Fitted trend coefficients:
## (Intercept)           x           y      I(x^2)    I(x * y)      I(y^2) 
##    2.208281    4.491415   -3.949507   -2.443234    3.346023   -2.137482 
## 
## Cluster model: Thomas process
## Fitted cluster parameters:
##       kappa       scale 
## 26.48878724  0.05192903 
## Mean cluster size:  [pixel image]
plot(mod.polynomial.c, what="statistic", pause=FALSE, main="Inhomogeneous K-function")
Plot for Inhomogeneous K-function

Plot for Inhomogeneous K-function

In Figure 13:

  1. The dashed red line represents the (iso-corrected) observed \(K\) values.
  2. The dotted green line represents the expected \(K\) values based on Poisson model.
  3. The solid black line adds the fitted clustering process (in this case a Thomas process, via kppm) to the Poisson model predictions in (2)

Adding a clustering process to the model, clearly improved the model fit (as the solid black line runs through the dashed red line).

3.3 Simulation

Simulating point patterns from the fitted model is also straightforward. Here we can see if there are differences between the observed and the simulated point pattern (all via KDE surfaces):

# Select a random position for the observed data in the figure.
# again set the random seed for preducibility:
set.seed(4007)
pos <- sample(1,1:6)

# Simulate 5 point patterns from the model
sims <- simulate(mod.polynomial.c,nsim = 5)
## Generating 5 simulations... 1, 2, 3, 4,  5.
## Done.
# Place the observed point pattern in the random position
tmp <- sims[[pos]]
sims[[pos]] <- redwood
sims[[6]] <- tmp
names(sims)[6] <- "Simulation 6"

# Compute the KDEs
densp.new <- density(sims)

Plot the results (Figure 14) - can you detect which one was the observed dataset?

par(mfrow=c(2,3),mar=c(1,1,1,1))
plot(as.listof(densp.new), zlim=range(unlist(lapply(densp.new,range))),
     main ="Simulations from fitted model")
Simulations from fitted model (with one observed)

Simulations from fitted model (with one observed)

# re-set the plot paramters 
par(.pardefault)

That is, can you see one with the same pattern as in Figure 15?

plot(density.r.1, main ="KDE surface (observed)")
KDE on observed redwood data

KDE on observed redwood data

If you cannot pick out the observed pattern (i.e. they all look the same), then the model can be assumed a reasonably good fit. Unfortunately, the observed pattern can still easily be picked out (note to account for change in the legend limits), so further experimentation with model fits is necessary…

4. Summary

In summary, this session has briefly introduced the spatstat package which contains procedures to handle, explore and fit models to point pattern data. Only an introduction to techniques for patterns generated by the points (i.e. without marks) were covered. Patterns generated by their attributes (i.e. with marks) were not covered. Point pattern data arise for many reasons, such as incidences of crime and disease outbreaks.

John Snow’s famous Cholera study provides historical context to a spatial point pattern analysis:

https://en.wikipedia.org/wiki/John_Snow

Tasks

In your own time, repeat the whole session using the paracou (Kimboto trees in Paracou, French Guiana) data instead of the redwood data (both in the spatstat package). Try to get familiar with all the options and tuning parameters for an assured and spatial point pattern analysis. Also look at fitting models to patterns generated by their attributes (i.e. with marks).

Useful resources

Useful online resources of which much of this session has been based on include:

http://spatstat.org/book.html

https://rpubs.com/hughes/295880

https://mgimond.github.io/Spatial/point-pattern-analysis-in-r.html

References

Baddeley, A.J., E. Rubak, and R. Turner. 2016. Spatial point patterns - Methodology and applications with R. Boca Raton: CRC Press.

Diggle, P.J., 2003. Statistical analysis of spatial point patterns. London: Edward Arnold.

Cressie, N., 1993. Statistics for spatial data. Wiley, New Jersey