In order to statistically interrogate spatial point data, a spatial point pattern analysis (e.g. Cressie 1993; Diggle 2003; Baddeley et al 2016) is required. This form of analysis may arise in many different contexts, for example:
Each example can be analysed for understanding, say:
Such examples are just 4 of over 40 data sets that are freely available with the spatstat package, a comprehensive package for spatial point pattern analysis, model fitting, simulation and model testing - some of whose functions will be demonstrated here.
Tools available in the spatstat package are designed to work with points stored as ppp objects and not a SpatialPointsDataFrame or sf object.
Critically, a ppp object may or may not have attribute information (referred to as marks - as it constitutes a ‘marked’ point pattern process). Knowing whether or not a function requires an attribute table to be present in the ppp object is crucial to its interpretation.
Marks relay extra information that were collected with each point. For example, male or female burglar, tree height, number of eggs in a bird’s nest, galaxy type, etc.
In this session, we will introduce and demonstrate techniques for patterns generated by the points only (i.e. without marks). Patterns generated by their attributes (i.e. with marks) are not covered.
Core to this Session is the spatstat package, which needs installing:
# library(knitr)
library(spatstat)
Once again, remember to:
Having done that you should clear your workspace for this session:
rm(list = ls())
The redwood data set considered in this session consists of the locations of 62 seedlings and saplings of California redwood trees in the USA. It is fully described using the help function as follows:
# help("redwood")
We can plot the locations of this data (Figure 1) as follows:
plot(redwood, pch=20, cols="grey70", main="Locations of 62 seedlings & saplings")
Redwood data: locations of 62 seedlings and saplings
We can confirm its object-type and look at its properties (including its bounding box or window) as follows:
class(redwood)
summary(redwood)
EDA is key to a judged point pattern analysis. It commonly involves: (a) mapping the intensity of the process, and (b) assessing whether or not the point pattern deviates from random expectations. The core R density function can be used to explore the former using a spatial Kernel Density Estimation (KDE) analysis. Here we get our first insights to whether or not the seedlings & saplings are more dense in some areas and not others.
Further more we conduct the spatial KDE, highlighting the effects of different KDE specifications: (i) Gaussian kernels but with different bandwidth selection procedures, (ii) Gaussian kernels but with different bandwidth adjustments, and (iii) different kernels. It appears specifications (ii) and (iii) play a key role in the resultant surfaces and our interpretations:
# First - all with Gaussian kernels but with different bandwidth selection procedures
density.r.1 <- density(redwood) # default with kernel="gaussian", bw = "nrd0"
density.r.2 <- density(redwood, kernel="gaussian",bw = "nrd")
density.r.3 <- density(redwood, kernel="gaussian",bw = "ucv")
density.r.4 <- density(redwood, kernel="gaussian",bw = "SJ-ste")
# Second - all with Gaussian kernels but with different bandwidth adjustments
density.r.5 <- density(redwood, kernel="gaussian",bw = "ucv",adjust=0.2) # 20%
density.r.6 <- density(redwood, kernel="gaussian",bw = "ucv",adjust=0.4) # 40%
density.r.7 <- density(redwood, kernel="gaussian",bw = "ucv",adjust=0.6) # 60%
density.r.8 <- density(redwood, kernel="gaussian",bw = "ucv",adjust=0.8) # 80%
# Third - with different kernels
density.r.9 <- density(redwood, kernel="quartic")
density.r.10 <- density(redwood, kernel="disc")
density.r.11 <- density(redwood, kernel="epanechnikov")
density.r.12 <- density(redwood, kernel="gaussian")
# help("density")) # for details
Plot results (i), as in Figure 2. Note the parameters passed to par to set the plot layout (2 columns and 2 rows) and to set the margins to 1 inch around each side of each plot. To be able to reset this at the end of the session, we need to save of the default parameters
.pardefault <- par(no.readonly = T)
par(mfrow=c(2,2),mar=c(1,1,1,1))
plot(density.r.1, main = "# 1")
plot(density.r.2, main = "# 2")
plot(density.r.3, main = "# 3")
plot(density.r.4, main = "# 4")
KDE - Gaussian kernels but with different bandwidth selection procedures
Plot results (ii) in Figure 3:
par(mfrow=c(2,2),mar=c(1,1,1,1))
plot(density.r.5, main = "20%")
plot(density.r.6, main = "40%")
plot(density.r.7, main = "60%")
plot(density.r.8, main = "80%")
KDE - Gaussian kernels but with different bandwidth adjustments
Plot results (iii) in Figure 4:
par(mfrow=c(2,2),mar=c(1,1,1,1))
plot(density.r.9, main="quartic")
plot(density.r.10, main="disc")
plot(density.r.11, main="epanechnikov")
plot(density.r.12, main="gaussian")
KDE - each with different kernels
Observe the similarities between spatial KDE and Geographically Weighted Models (Sessions 3 and 7), where both use kernel weighting functions.
It is useful to further explore for homogeneity of the seedlings & saplings data, either: (i) through simulation (i.e. randomisation) or (ii) quadrat counts and an associated chi-square test.
The first exploration consists of simulating spatially random point patterns based on the average intensity in the observed point pattern. If the density estimates of the observed and simulated point patterns are similar then there is evidence that the observed point pattern is homogenous. If the real dataset stands out among the simulated ones (which it does!) then there is evidence that an inhomogenous process generated the redwood seedlings & saplings data. It is useful to repeat this exercise a few times for clarity.
# Select a random position for the observed data in the figure.
# Set the random seed for reproducibility:
set.seed(123)
pos <- sample(1:6,1)
# Simulate 5 CSR point patterns:
simp <- rpoispp(lambda = intensity(redwood),win = Window(redwood),nsim=5)
# Replace the simulated set at the pos'th position by the observed dataset:
tmp <- simp[[pos]]
simp[[pos]] <- redwood
simp[[6]] <- tmp
names(simp)[6] <- "Simulation 6"
# Compute the KDEs:
densp <- density(simp)
Plot the results (Figure 5). Can you detect which one was the observed dataset?
par(mfrow=c(2,3),mar=c(1,1,1,1))
plot(as.listof(densp), zlim=range(unlist(lapply(densp,range))),
main ="Simulations for homogeneity")
Simulations for homegeneity (with one observed)
Remember to re-set the plot parameters after each time you change them:
par(.pardefault)
This EDA divides the sample area into quadrats and counts the number of points per quadrat (Figure 6):
Q <- quadratcount(redwood, nx= 5, ny=5)
plot(redwood, pch=20, cols="grey70", main=NULL) # Plot points
plot(Q, add=TRUE) # Add quadrat grid
Data in quadrats
Which can also be presented with the density of points within each quadrat (Figure 7), (see help for the ‘intensity’ function):
Q.d <- intensity(Q)
plot(intensity(Q, image=TRUE), main=NULL, las=1) # Plot density raster
plot(redwood, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=TRUE) # Add points
Intensity of points in quadrats
A chi-square test can then be used to infer if the redwood pattern is homogenous (p > 0.05) or inhomogenous (p < 0.05):
quadrat.test(redwood)
##
## Chi-squared test of CSR using quadrat counts
##
## data: redwood
## X2 = 64.613, df = 24, p-value = 2.774e-05
## alternative hypothesis: two.sided
##
## Quadrats: 5 by 5 grid of tiles
Thus, the null hypothesis of the redwood seedlings & saplings pattern being generated by a spatial random process is rejected - i.e. strong evidence that the point pattern is inhomogenous (or non-stationary). A strong caveat to the quadrat approach is the subjective choice for the number of quadrats (where the default is 5 by 5 = 25 quadrats).
A final illustration of a useful EDA procedure for a spatial point pattern is with Ripley’s K function. This counts the number of neighbouring points within an increasing distance from a given focal point. If the point pattern follows Complete Spatial Randomness (CSR) then there is a known relationship between this count number (K) and the distance considered (r). Ripley’s K function is a graphical procedure and is found as follows:
K.Ripley.1 <- envelope(redwood,fun=Kest,funargs=list(correction="border"),global=TRUE)
This can be plotted as in Figure 8:
plot(K.Ripley.1, main="Ripley’s K function")
Plot for Ripley’s K function.
Here a constant envelop (or ‘confidence band’ see help(“envelop”)) is calculated following procedures outlined in http://book.spatstat.org/sample-chapters/chapter07.pdf).
In this instance, the observed curve (\(\hat{K}_{obs}(r)\)) is consistently higher than the expected curve (\(K_{theo}(r)\)). This means there are more points than expected within certain distances of each other - i.e. the points are more clustered than expected and the process is not CSR. If \(\hat{K}_{obs}(r)\) were consistently lower than \(K_{theo}(r)\) then the points are more dispersed than expected.
The parameter Kest in the above assumes a homogenous (stationary) intensity function (i.e. specifically, ‘model/parameter stationarity’). However, we again have evidence that the redwood point pattern is not homogenous. This can be accounted for by using a modified version of Ripley’s K through specifying the Kinhom function as follows:
K.Ripley.2 <- envelope(redwood,fun=Kinhom,global = TRUE)
This can be plotted as in Figure 9:
plot(K.Ripley.2, main="Modified Ripley’s K function")
Plot for Modified Ripley’s K function.
In this instance, the observed curve (\(\hat{K}_{obs}(r)\)) is lower than the expected curve (\(K_{theo}(r)\)) for large distances. This means more dispersion in the redwood seedlings and saplings than expected under CSR - accounting for the inhomogeneity in the point pattern. The Kinhom function derives an intensity estimate from the data (similar to KDE) by weighting each point based on their estimated intensity.
Alternative confidence bands can also be found using varblock or lohboot functions to derive bootstrapped confidence intervals of the expected K values under CSR.
In summary, our extensive EDA has suggested that the 62 redwood seedlings and saplings display an inhomogenous spatial pattern where at large distances they are more spread out than expected.
Now we can start to think of ways to formally model the redwood process. This can be achieved using the function ppp, which fits a Poisson Point Process to the data. Formal models are key for a RICHER UNDESTANDING of the point proces where covariates such the coordinates, topography, remote sensing data can be incorporated.
For example, fit an intercept-only model:
mod.intercept <- ppm(redwood ~ 1)
mod.intercept
## Stationary Poisson process
## Intensity: 62
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## log(lambda) 4.127134 0.1270001 3.878219 4.37605 *** 32.49709
For example, fit a model, where the spatial coordinates inform the point process:
mod.polynomial <- ppm(redwood ~ polynom(x,y,2))
mod.polynomial
## Nonstationary Poisson process
##
## Log intensity: ~x + y + I(x^2) + I(x * y) + I(y^2)
##
## Fitted trend coefficients:
## (Intercept) x y I(x^2) I(x * y) I(y^2)
## 2.208281 4.491415 -3.949507 -2.443234 3.346023 -2.137482
##
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## (Intercept) 2.208281 0.9222806 0.4006437 4.0159173 * 2.394369
## x 4.491415 2.3619729 -0.1379668 9.1207970 1.901552
## y -3.949507 2.3234694 -8.5034238 0.6044089 -1.699832
## I(x^2) -2.443234 1.8576133 -6.0840887 1.1976217 -1.315254
## I(x * y) 3.346023 1.8596583 -0.2988407 6.9908858 1.799267
## I(y^2) -2.137482 1.8356763 -5.7353415 1.4603774 -1.164411
The first model fitted a stationary Poisson Process, while the second model fitted a non-stationary Poisson Process. The first model assumes the intensity (the density of the redwood seedlings and saplings) is \(exp(4.127) = 62.0\) throughout the study area (i.e. the trend is stationary and constant). Observe that the exponential arises as the models are log-linear.
The second model fitted a second-order polynomial of the coordinates of the form: \(log(lambda)=x+y+x_2+x:y+y_2\). Thus, the trend is taken as non-stationary and non-constant as depicted in Figure 10. Observe that all the polynomial terms of this model are not significant, and in this instance, the intercept model should be preferred.
As stated above, the model predictors (or covariates) could be other variables that are similarly available everywhere, not just the coordinates, such as topography (elevation, slope, aspect) or a remotely-sensed image.
plot(mod.polynomial,se=FALSE,how="image")
Surface depicting an example trend component in Poisson Point Process Model
Figure 10, maps the predicted intensity (the lambda) from the model with the observed redwood seedlings and saplings sites added. As for every model, some form of model validation is required to assess its value. Here several functions are available via diagnose.ppm. For example, the smoothed residuals from the model fit can be found using (and sticking with the polynomial model for illustration as in Figure 11):
diagnose.ppm(mod.intercept, which = "smooth")
Surface depicting the smoothed raw residuals
## Model diagnostics (raw residuals)
## Diagnostics available:
## smoothed residual field
## range of smoothed field = [-59.56, 50.85]
Here, we can identify areas where the model poorly fits the observed point patterns. We can also use the fitted intensity in the Kinhom function to see if the observed point pattern is more or less clustered than expected from the model fit, again using Ripley’s K function, as given in Figure 12 (which takes a minute or two to run…):
K.Ripley.3 <- envelope(redwood,Kinhom,funargs = list(lambda=mod.polynomial),global=TRUE)
plot(K.Ripley.3, main="Modified Ripley’s K function with respect to model fit")
Plot for Modified Ripley’s K function with respect to model fit
We see that the observed point pattern is more clustered than expected based on the model, as the observed curve (\(K_{obs}(r)\)) is consistently higher than the expected curve (\(K_{theo}(r)\)). One solution would be to use a clustered Poisson point process model via the function kppm:
mod.polynomial.c <- kppm(redwood ~ polynom(x,y,2))
mod.polynomial.c
## Inhomogeneous cluster point process model
## Fitted to point pattern dataset 'redwood'
## Fitted by minimum contrast
## Summary statistic: inhomogeneous K-function
##
## Log intensity: ~x + y + I(x^2) + I(x * y) + I(y^2)
##
## Fitted trend coefficients:
## (Intercept) x y I(x^2) I(x * y) I(y^2)
## 2.208281 4.491415 -3.949507 -2.443234 3.346023 -2.137482
##
## Cluster model: Thomas process
## Fitted cluster parameters:
## kappa scale
## 26.48878724 0.05192903
## Mean cluster size: [pixel image]
plot(mod.polynomial.c, what="statistic", pause=FALSE, main="Inhomogeneous K-function")
Plot for Inhomogeneous K-function
In Figure 13:
kppm) to the Poisson model predictions in (2)Adding a clustering process to the model, clearly improved the model fit (as the solid black line runs through the dashed red line).
Simulating point patterns from the fitted model is also straightforward. Here we can see if there are differences between the observed and the simulated point pattern (all via KDE surfaces):
# Select a random position for the observed data in the figure.
# again set the random seed for preducibility:
set.seed(4007)
pos <- sample(1,1:6)
# Simulate 5 point patterns from the model
sims <- simulate(mod.polynomial.c,nsim = 5)
## Generating 5 simulations... 1, 2, 3, 4, 5.
## Done.
# Place the observed point pattern in the random position
tmp <- sims[[pos]]
sims[[pos]] <- redwood
sims[[6]] <- tmp
names(sims)[6] <- "Simulation 6"
# Compute the KDEs
densp.new <- density(sims)
Plot the results (Figure 14) - can you detect which one was the observed dataset?
par(mfrow=c(2,3),mar=c(1,1,1,1))
plot(as.listof(densp.new), zlim=range(unlist(lapply(densp.new,range))),
main ="Simulations from fitted model")
Simulations from fitted model (with one observed)
# re-set the plot paramters
par(.pardefault)
That is, can you see one with the same pattern as in Figure 15?
plot(density.r.1, main ="KDE surface (observed)")
KDE on observed redwood data
If you cannot pick out the observed pattern (i.e. they all look the same), then the model can be assumed a reasonably good fit. Unfortunately, the observed pattern can still easily be picked out (note to account for change in the legend limits), so further experimentation with model fits is necessary…
In summary, this session has briefly introduced the spatstat package which contains procedures to handle, explore and fit models to point pattern data. Only an introduction to techniques for patterns generated by the points (i.e. without marks) were covered. Patterns generated by their attributes (i.e. with marks) were not covered. Point pattern data arise for many reasons, such as incidences of crime and disease outbreaks.
John Snow’s famous Cholera study provides historical context to a spatial point pattern analysis:
In your own time, repeat the whole session using the paracou (Kimboto trees in Paracou, French Guiana) data instead of the redwood data (both in the spatstat package). Try to get familiar with all the options and tuning parameters for an assured and spatial point pattern analysis. Also look at fitting models to patterns generated by their attributes (i.e. with marks).
Useful online resources of which much of this session has been based on include:
https://rpubs.com/hughes/295880
https://mgimond.github.io/Spatial/point-pattern-analysis-in-r.html
Baddeley, A.J., E. Rubak, and R. Turner. 2016. Spatial point patterns - Methodology and applications with R. Boca Raton: CRC Press.
Diggle, P.J., 2003. Statistical analysis of spatial point patterns. London: Edward Arnold.
Cressie, N., 1993. Statistics for spatial data. Wiley, New Jersey