Spatial data analysis, like time series analysis, is an enigma. If I have any hope of acing my final project, I need a step-by-step guide that tells me what I should be doing and how I should be doing it. That is what this notebook is meant to do; it’s to cover the primary procedures employed in the field of spatial data analysis. It will go over best practices as well as the main theoretical ideas underpinning it all. The end goal is for this to be a template that anyone may use as a starting point for their projects.

This guide heavily reerences ‘Applied Spatial Data Analysis with R’ by Roger S. Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio. The material I choose to refer to is based on the syllabus to the spatial data analysis course taught by Dana Sylvan. The material covered in class includes the following:

We will divide this article into the two sections above; one for point pattern analysis and the other for geostats.

Spatial Point Pattern Analysis

In general, a point process is a stochastic process in which we observe the locations of some events of interest within a bounded region A.

1. Perform a test for Complete Spatial Randomness (CSR)

CSR means that events are distributed independently at random and uniformly over the study area. This implies that there are no regions where the events are more (or less) likely to occur and that the presence of a given event does not modify the probability of other events appearing nearby.

This is done by 2 ways:

Informally, by plotting the point pattern and observing whether the points tend to appear in clusters or, on the contrary, they follow a regular pattern.
Formally, using tests and funcitons that calculate the degree of accomplishment of the CSR as well as the uncertainty related to the observed pattern.
1. G Function: Distance to the Nearest Event (pg.179)
  - Measures the distribution of the distances from an arbitrary event to its nearest event.
  - The compatibility with CSR of the point pattern can be assessed by plotting the empirical function ,Gˆ(d), against the theoretical expectation.
  - In addition, point-wise envelopes under CSR can be computed by repeatedly simulating a CSR point process with the same estimated intensity λˆ in the study region and check whether the empirical function is contained inside.
  - If the line falls within the grey envelope, then the data is homogeneously distributed, while a line above the envelope denotes a clusterd pattern, and a line below the envelope denotes a regular pattern.
2. F Function: Distance from a Point to the Nearest Event
  - Measures the distribution of all distances from an arbitrary point of the plane to its nearest event.
  - Point-wise envelopes under CSR can be computed by repeatedly simulating a CSR point process with the same estimated intensity λˆ in the study region and check whether the empirical function is contained inside.
  - If the line falls within the grey envelope, then the data is homogeneously distributed, while a line above the envelope denotes a regular pattern, and a line below the envelope denotes a clustered pattern.
  F-Function

# Reading in the data + package imports
library(spatstat)
library(spatstat)
library(maptools)
library(cowplot)
library(ggplot2)
library(gridExtra)
library(splancs)
library(cubature)
data(japanesepines)
data(cells)
data(redwoodfull)
data(lansing)
# as.ppp(japanesepines)
# as.data.frame(japanesepines)

# Plotting the point pattern
plot1 <- ggplot(as.data.frame(cells), aes(x = x, y = y)) + geom_point() + ggtitle("CELLS")
plot2 <- ggplot(as.data.frame(japanesepines), aes(x = x, y = y)) + geom_point() + ggtitle("JAPANESE")
plot3 <- ggplot(as.data.frame(redwoodfull), aes(x = x, y = y)) + geom_point() + ggtitle("REDWOOD")

grid.arrange(plot1, plot2, plot3, ncol=3, nrow=1)

# Evaluating CSR using the G-Function
set.seed(120109)
r <- seq(0, sqrt(2)/6, by = 0.005)
G_JAPANESE <- envelope(japanesepines, fun = Gest, r = r, nrank = 2, nsim = 99)
G_REDWOOD <- envelope(redwoodfull, fun = Gest, r = r, nrank = 2, nsim = 99)
G_CELLS <- envelope(cells, fun = Gest, r = r, nrank = 2, nsim = 99)

plot(G_CELLS)

plot(G_JAPANESE)

plot(G_REDWOOD)

The results show that only the Japanese trees seem to be homogeneously distributed, whilst the redwood seeds show a clustered pattern (values of Gˆ(r) above the envelopes) and the location of the cells shows a more regular pattern (values of Gˆ(r) below the envelopes).

# Evaluating CSR using the F-Function
set.seed(120109)
r <- seq(0, sqrt(2)/6, by = 0.001)
F_JAPANESE <- envelope(japanesepines, fun = Fest, r = r, nrank = 2, nsim = 99)
F_REDWOOD <- envelope(redwoodfull, fun = Fest, r = r, nrank = 2, nsim = 99)
F_CELLS <- envelope(cells, fun = Fest, r = r, nrank = 2, nsim = 99)

plot(F_CELLS)

plot(F_JAPANESE)

plot(F_REDWOOD)

2a. Describe the spatial distribution of events in your point pattern (pg.184)

When describing your point pattern, you’d like to measure the intensity of your point pattern which measures the spatial distribution of events in your region of study. Before you do this, you want to assess your point pattern and identify whether it is a Homogenous or Inhomogenous Poisson Process. This distinction will affect how you go about calculating your Intensity.

Homogenous Poisson Process

Assumes events are independently distributed according to a given density and assumes that intensity is constant
Characterised as representing the kind of point process in which all events are independently and uniformly distributed in the region A where the point process occurs. This means that the location of one point does not affect the probabilities of other points appearing nearby and that there are no regions where events are more likely to appear.
Stationary and isotropic. It is stationary because the in- tensity is constant and the second-order intensity depends only on the relative positions of two points (i.e. direction and distance). In addition, it is isotropic because the second-order intensity is invariant to rotation. Hence, the point process has constant intensity and its second-order intensity depends only on the distance between the two points, regardless of the relative positions of the points.
The formal definition of a point process which is CSR
In most cases assuming that a point process under study is homogeneous is not realistic.

Inhomogenous Poisson Process

Assumes events are independently distributed according to a given density and assumes that intensity varies spatially.
Allows for a non-constant intensity. The spatial variation can be more diverse, with events appearing more likely in some areas than others.

Calculating Intensity:

Homogenous Poisson Process:

Intensity = n / |A| n = number of points; A = area of region under study
Inhomogenous Poisson Process:
1. Non-parametrically by means of kernel smoothing or parametrically by proposing a specific function for the intensity whose parameters are estimated by maximising the likelihood of the point process.
Estimate G-Function

You can use either a Quartic or a Gaussain Kernel. The important part is choosing an appropriate bandwidth. One technique is to select the bandwidth that minimizes the Mean Square Error (MSE) of the kernel smoothing estimator when the underlying point process in a stationary Cox process.
1. Parametrically by fitting a log-normal model and using it to calculate the maximum likelihood of λ(x) which is the intensity of the point pattern

# Selecting the optimal bandwidth for the redwoodfull dataset
mserwq <- mse2d(as.points(coordinates(as.data.frame(redwoodfull))), as.points(list(x = c(0, + 1, 1, 0), y = c(0, 0, 1, 1))), 100, 0.15)
bwq <- mserwq$h[which.min(mserwq$mse)]
bwq

## [1] 0.039

mserw <- bw.diggle(redwoodfull)
bw <- as.numeric(mserw)
bw

## [1] 0.01981409

# Kernel smoothing using a quartic kernel
poly <- as.points(list(x = c(0, 0, 1, 1), y = c(0, 1, 1, 0)))
sG <- Sobj_SpatialGrid(as.SpatialPoints.ppp(redwoodfull), maxDim = 100)$SG
grd <- slot(sG, "grid")
summary(grd)

## Grid topology:
##   cellcentre.offset cellsize cells.dim
## x             0.005     0.01       100
## y             0.005     0.01       100

k0 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = bw, grd)
k1 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = 0.05, grd)
k2 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = 0.1, grd)
k3 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = 0.15, grd)
df <- data.frame(k0 = k0, k1 = k1, k2 = k2, k3 = k3)
kernels <- SpatialGridDataFrame(grd, data = df)
summary(kernels)

## Object of class SpatialGridDataFrame
## Coordinates:
##   min max
## x   0   1
## y   0   1
## Is projected: NA 
## proj4string : [NA]
## Grid attributes:
##   cellcentre.offset cellsize cells.dim
## x             0.005     0.01       100
## y             0.005     0.01       100
## Data attributes:
##        k0                k1                k2              k3         
##  Min.   :   0.00   Min.   :   0.00   Min.   :  0.0   Min.   :  4.128  
##  1st Qu.:   0.00   1st Qu.:  18.56   1st Qu.:111.0   1st Qu.:145.685  
##  Median :   0.00   Median : 135.28   Median :175.4   Median :187.972  
##  Mean   : 195.60   Mean   : 197.29   Mean   :201.1   Mean   :202.236  
##  3rd Qu.:  59.09   3rd Qu.: 262.41   3rd Qu.:252.9   3rd Qu.:253.279  
##  Max.   :3968.42   Max.   :1346.84   Max.   :746.9   Max.   :561.734

# Kernel smoothing using a Gaussian kernel (... I think)
cc <- coordinates(kernels)
xy <- list(x = cc[, 1], y = cc[, 2])
k4 <- density(redwoodfull, 0.5 * bw, dimyx = c(100, 100), xy = xy)
kernels$k4 <- as(k4, "SpatialGridDataFrame")$v
k5 <- density(redwoodfull, 0.5 * 0.05, dimyx = c(100, 100), xy = xy)
kernels$k5 <- as(k5, "SpatialGridDataFrame")$v
k6 <- density(redwoodfull, 0.5 * 0.1, dimyx = c(100, 100), xy = xy)
kernels$k6 <- as(k6, "SpatialGridDataFrame")$v
k7 <- density(redwoodfull, 0.5 * 0.15, dimyx = c(100, 100), xy = xy)
kernels$k7 <- as(k7, "SpatialGridDataFrame")$v
summary(kernels)

## Object of class SpatialGridDataFrame
## Coordinates:
##   min max
## x   0   1
## y   0   1
## Is projected: NA 
## proj4string : [NA]
## Grid attributes:
##   cellcentre.offset cellsize cells.dim
## x             0.005     0.01       100
## y             0.005     0.01       100
## Data attributes:
##        k0                k1                k2              k3         
##  Min.   :   0.00   Min.   :   0.00   Min.   :  0.0   Min.   :  4.128  
##  1st Qu.:   0.00   1st Qu.:  18.56   1st Qu.:111.0   1st Qu.:145.685  
##  Median :   0.00   Median : 135.28   Median :175.4   Median :187.972  
##  Mean   : 195.60   Mean   : 197.29   Mean   :201.1   Mean   :202.236  
##  3rd Qu.:  59.09   3rd Qu.: 262.41   3rd Qu.:252.9   3rd Qu.:253.279  
##  Max.   :3968.42   Max.   :1346.84   Max.   :746.9   Max.   :561.734  
##        k4                 k5                k6                 k7       
##  Min.   :   0.000   Min.   :   0.00   Min.   :  0.3477   Min.   : 12.9  
##  1st Qu.:   0.000   1st Qu.:  24.06   1st Qu.:106.6803   1st Qu.:142.5  
##  Median :   0.342   Median : 121.32   Median :170.9927   Median :182.9  
##  Mean   : 195.477   Mean   : 196.78   Mean   :199.1688   Mean   :199.7  
##  3rd Qu.: 126.977   3rd Qu.: 260.74   3rd Qu.:249.1800   3rd Qu.:245.6  
##  Max.   :4366.883   Max.   :1412.73   Max.   :785.7830   Max.   :553.8

I have absolutely no clue how to interpret the stuff above. Refer to page 184 in the textbook. Maybe you can teach me how to read it.

# Using the log-normal model to calculate intensity
loglambda <- function(x, alpha, beta) {
  l <- alpha + sum(beta * c(x, x * x, prod(x)))
  return(l)
}

L <- function(alphabeta, x) {
  l <- apply(x, 1, loglambda, alpha = alphabeta[1], beta = alphabeta[-1])
  l <- sum(l)
  intL <- adaptIntegrate(lowerLimit = c(0, 0), upperLimit = c(1, 1), fDim = 1,
                         tol = 1e-08, f = function(x, alpha = alphabeta[1],
                                                   beta = alphabeta[-1]) {
                                                    exp(loglambda(x, alpha, beta))
                                                   })
  l <- l - intL$integral
  return(l)
}

x <- as.points(lansing[lansing$marks == "maple", ])
optbeta <- optim(par = c(log(514), 0, 0, 0, 0, 0), fn = L,
                 control = list(maxit = 1000, fnscale = -1), x = x)  

optbeta

## $par
## [1]  5.5568268  5.6608564 -0.9626295 -5.1417654 -1.1562095  0.9591336
## 
## $value
## [1] 2778.262
## 
## $counts
## function gradient 
##     1001       NA 
## 
## $convergence
## [1] 1
## 
## $message
## NULL

# Calculating intensity using the ppm function
lmaple <- lansing[lansing$marks == "maple", ]
ppm(Q = lmaple, trend = ~x + y + I(x^2) + I(y^2) + I(x *y))

## Nonstationary multitype Poisson process
## 
## Possible marks: 'blackoak', 'hickory', 'maple', 'misc', 'redoak' and 
## 'whiteoak'
## 
## Log intensity:  ~x + y + I(x^2) + I(y^2) + I(x * y)
## 
## Fitted trend coefficients:
## (Intercept)           x           y      I(x^2)      I(y^2)    I(x * y) 
##   3.7310742   5.6400643  -0.7663636  -5.0115142  -1.1983209   0.6375824 
## 
##               Estimate      S.E.    CI95.lo     CI95.hi Ztest       Zval
## (Intercept)  3.7310742 0.2542004  3.2328505  4.22929795   *** 14.6776855
## x            5.6400643 0.7990009  4.0740514  7.20607727   ***  7.0588964
## y           -0.7663636 0.6990514 -2.1364792  0.60375200       -1.0962907
## I(x^2)      -5.0115142 0.7011631 -6.3857686 -3.63725974   *** -7.1474300
## I(y^2)      -1.1983209 0.6428053 -2.4581962  0.06155433       -1.8642051
## I(x * y)     0.6375824 0.6989167 -0.7322691  2.00743391        0.9122439

Again, not sure how to interpret it … I’m screwed. Let’s keep swimming.

2b. Describe the spatial distribution of events in your point pattern (pg.190)

Second-order properties measure the strength and type of the interactions between events of the point process. These are used to study clustering or competition between events. Specifically, the second-order intensity of two points x and y reflects the probability of any pair of events occurring in the vicinities of x and y, respectively.

By comparing the estimated value Kˆ(s) to the theoretical value we can assess what kind of interaction exists. Usually, we assume that these interactions occur at small scales, and so will be interested in relatively small values of s. Values of Kˆ(s) higher than πs2 are characteristic of clustered processes, whilst values smaller than that are found when there exists competition between events (regular pattern)

set.seed(30)
K_JAPANESE <- envelope(as(japanesepines, "ppp"), fun = Kest, r = r, nrank = 2, nsim = 99)
K_REDWOOD <- envelope(as(redwoodfull, "ppp"), fun = Kest, r = r, nrank = 2, nsim = 99)
K_CELLS <- envelope(as(cells, "ppp"), fun = Kest, r = r, nrank = 2, nsim = 99)

plot(K_CELLS)

plot(K_JAPANESE)

plot(K_REDWOOD)

See page 192 for an interpretation. I have obviously given up at this point.

3. Explore other measurements that describe your point pattern

Binary Regression Estimator (pg. 198) - Estimate the probability of being a case at a given location
Binary Regression Using Generalised Additive Models (pg. 199) - Logistic Regression
Accounting for Confounding and Covariates (206)

A Step-by-Step Guide to Spatial Data Analaysis

Spatial Point Pattern Analysis

Interpolation & Geostatistics