Spatial data analysis, like time series analysis, is an enigma. If I have any hope of acing my final project, I need a step-by-step guide that tells me what I should be doing and how I should be doing it. That is what this notebook is meant to do; it’s to cover the primary procedures employed in the field of spatial data analysis. It will go over best practices as well as the main theoretical ideas underpinning it all. The end goal is for this to be a template that anyone may use as a starting point for their projects.

This guide heavily reerences ‘Applied Spatial Data Analysis with R’ by Roger S. Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio. The material I choose to refer to is based on the syllabus to the spatial data analysis course taught by Dana Sylvan. The material covered in class includes the following:

There are two main focus points that one may choose to base their project on:

  1. Spatial Point Pattern Analysis
  2. Interpolation & Geostatistics

We will divide this article into the two sections above; one for point pattern analysis and the other for geostats.

Spatial Point Pattern Analysis

In general, a point process is a stochastic process in which we observe the locations of some events of interest within a bounded region A.

1. Perform a test for Complete Spatial Randomness (CSR)

CSR means that events are distributed independently at random and uniformly over the study area. This implies that there are no regions where the events are more (or less) likely to occur and that the presence of a given event does not modify the probability of other events appearing nearby.

This is done by 2 ways:

  1. Informally, by plotting the point pattern and observing whether the points tend to appear in clusters or, on the contrary, they follow a regular pattern.

  2. Formally, using tests and funcitons that calculate the degree of accomplishment of the CSR as well as the uncertainty related to the observed pattern.

    1. G Function: Distance to the Nearest Event (pg.179)

      • Measures the distribution of the distances from an arbitrary event to its nearest event.

      • The compatibility with CSR of the point pattern can be assessed by plotting the empirical function ,Gˆ(d), against the theoretical expectation.

      • In addition, point-wise envelopes under CSR can be computed by repeatedly simulating a CSR point process with the same estimated intensity λˆ in the study region and check whether the empirical function is contained inside.

      • If the line falls within the grey envelope, then the data is homogeneously distributed, while a line above the envelope denotes a clusterd pattern, and a line below the envelope denotes a regular pattern.

      Estimate G-Function G-Function under CSR

    2. F Function: Distance from a Point to the Nearest Event

      • Measures the distribution of all distances from an arbitrary point of the plane to its nearest event.

      • Point-wise envelopes under CSR can be computed by repeatedly simulating a CSR point process with the same estimated intensity λˆ in the study region and check whether the empirical function is contained inside.

      • If the line falls within the grey envelope, then the data is homogeneously distributed, while a line above the envelope denotes a regular pattern, and a line below the envelope denotes a clustered pattern.

      F-Function

      F-Function

# Reading in the data + package imports
library(spatstat)
library(spatstat)
library(maptools)
library(cowplot)
library(ggplot2)
library(gridExtra)
library(splancs)
library(cubature)
data(japanesepines)
data(cells)
data(redwoodfull)
data(lansing)
# as.ppp(japanesepines)
# as.data.frame(japanesepines)
# Plotting the point pattern
plot1 <- ggplot(as.data.frame(cells), aes(x = x, y = y)) + geom_point() + ggtitle("CELLS")
plot2 <- ggplot(as.data.frame(japanesepines), aes(x = x, y = y)) + geom_point() + ggtitle("JAPANESE")
plot3 <- ggplot(as.data.frame(redwoodfull), aes(x = x, y = y)) + geom_point() + ggtitle("REDWOOD")

grid.arrange(plot1, plot2, plot3, ncol=3, nrow=1)

# Evaluating CSR using the G-Function
set.seed(120109)
r <- seq(0, sqrt(2)/6, by = 0.005)
G_JAPANESE <- envelope(japanesepines, fun = Gest, r = r, nrank = 2, nsim = 99)
G_REDWOOD <- envelope(redwoodfull, fun = Gest, r = r, nrank = 2, nsim = 99)
G_CELLS <- envelope(cells, fun = Gest, r = r, nrank = 2, nsim = 99)
plot(G_CELLS)

plot(G_JAPANESE)

plot(G_REDWOOD)

The results show that only the Japanese trees seem to be homogeneously distributed, whilst the redwood seeds show a clustered pattern (values of Gˆ(r) above the envelopes) and the location of the cells shows a more regular pattern (values of Gˆ(r) below the envelopes).

# Evaluating CSR using the F-Function
set.seed(120109)
r <- seq(0, sqrt(2)/6, by = 0.001)
F_JAPANESE <- envelope(japanesepines, fun = Fest, r = r, nrank = 2, nsim = 99)
F_REDWOOD <- envelope(redwoodfull, fun = Fest, r = r, nrank = 2, nsim = 99)
F_CELLS <- envelope(cells, fun = Fest, r = r, nrank = 2, nsim = 99)
plot(F_CELLS)

plot(F_JAPANESE)

plot(F_REDWOOD)

2a. Describe the spatial distribution of events in your point pattern (pg.184)

When describing your point pattern, you’d like to measure the intensity of your point pattern which measures the spatial distribution of events in your region of study. Before you do this, you want to assess your point pattern and identify whether it is a Homogenous or Inhomogenous Poisson Process. This distinction will affect how you go about calculating your Intensity.

Homogenous Poisson Process

Inhomogenous Poisson Process

Calculating Intensity:

# Selecting the optimal bandwidth for the redwoodfull dataset
mserwq <- mse2d(as.points(coordinates(as.data.frame(redwoodfull))), as.points(list(x = c(0, + 1, 1, 0), y = c(0, 0, 1, 1))), 100, 0.15)
bwq <- mserwq$h[which.min(mserwq$mse)]
bwq
## [1] 0.039
mserw <- bw.diggle(redwoodfull)
bw <- as.numeric(mserw)
bw
## [1] 0.01981409
# Kernel smoothing using a quartic kernel
poly <- as.points(list(x = c(0, 0, 1, 1), y = c(0, 1, 1, 0)))
sG <- Sobj_SpatialGrid(as.SpatialPoints.ppp(redwoodfull), maxDim = 100)$SG
grd <- slot(sG, "grid")
summary(grd)
## Grid topology:
##   cellcentre.offset cellsize cells.dim
## x             0.005     0.01       100
## y             0.005     0.01       100
k0 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = bw, grd)
k1 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = 0.05, grd)
k2 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = 0.1, grd)
k3 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = 0.15, grd)
df <- data.frame(k0 = k0, k1 = k1, k2 = k2, k3 = k3)
kernels <- SpatialGridDataFrame(grd, data = df)
summary(kernels)
## Object of class SpatialGridDataFrame
## Coordinates:
##   min max
## x   0   1
## y   0   1
## Is projected: NA 
## proj4string : [NA]
## Grid attributes:
##   cellcentre.offset cellsize cells.dim
## x             0.005     0.01       100
## y             0.005     0.01       100
## Data attributes:
##        k0                k1                k2              k3         
##  Min.   :   0.00   Min.   :   0.00   Min.   :  0.0   Min.   :  4.128  
##  1st Qu.:   0.00   1st Qu.:  18.56   1st Qu.:111.0   1st Qu.:145.685  
##  Median :   0.00   Median : 135.28   Median :175.4   Median :187.972  
##  Mean   : 195.60   Mean   : 197.29   Mean   :201.1   Mean   :202.236  
##  3rd Qu.:  59.09   3rd Qu.: 262.41   3rd Qu.:252.9   3rd Qu.:253.279  
##  Max.   :3968.42   Max.   :1346.84   Max.   :746.9   Max.   :561.734
# Kernel smoothing using a Gaussian kernel (... I think)
cc <- coordinates(kernels)
xy <- list(x = cc[, 1], y = cc[, 2])
k4 <- density(redwoodfull, 0.5 * bw, dimyx = c(100, 100), xy = xy)
kernels$k4 <- as(k4, "SpatialGridDataFrame")$v
k5 <- density(redwoodfull, 0.5 * 0.05, dimyx = c(100, 100), xy = xy)
kernels$k5 <- as(k5, "SpatialGridDataFrame")$v
k6 <- density(redwoodfull, 0.5 * 0.1, dimyx = c(100, 100), xy = xy)
kernels$k6 <- as(k6, "SpatialGridDataFrame")$v
k7 <- density(redwoodfull, 0.5 * 0.15, dimyx = c(100, 100), xy = xy)
kernels$k7 <- as(k7, "SpatialGridDataFrame")$v
summary(kernels)
## Object of class SpatialGridDataFrame
## Coordinates:
##   min max
## x   0   1
## y   0   1
## Is projected: NA 
## proj4string : [NA]
## Grid attributes:
##   cellcentre.offset cellsize cells.dim
## x             0.005     0.01       100
## y             0.005     0.01       100
## Data attributes:
##        k0                k1                k2              k3         
##  Min.   :   0.00   Min.   :   0.00   Min.   :  0.0   Min.   :  4.128  
##  1st Qu.:   0.00   1st Qu.:  18.56   1st Qu.:111.0   1st Qu.:145.685  
##  Median :   0.00   Median : 135.28   Median :175.4   Median :187.972  
##  Mean   : 195.60   Mean   : 197.29   Mean   :201.1   Mean   :202.236  
##  3rd Qu.:  59.09   3rd Qu.: 262.41   3rd Qu.:252.9   3rd Qu.:253.279  
##  Max.   :3968.42   Max.   :1346.84   Max.   :746.9   Max.   :561.734  
##        k4                 k5                k6                 k7       
##  Min.   :   0.000   Min.   :   0.00   Min.   :  0.3477   Min.   : 12.9  
##  1st Qu.:   0.000   1st Qu.:  24.06   1st Qu.:106.6803   1st Qu.:142.5  
##  Median :   0.342   Median : 121.32   Median :170.9927   Median :182.9  
##  Mean   : 195.477   Mean   : 196.78   Mean   :199.1688   Mean   :199.7  
##  3rd Qu.: 126.977   3rd Qu.: 260.74   3rd Qu.:249.1800   3rd Qu.:245.6  
##  Max.   :4366.883   Max.   :1412.73   Max.   :785.7830   Max.   :553.8

I have absolutely no clue how to interpret the stuff above. Refer to page 184 in the textbook. Maybe you can teach me how to read it.

# Using the log-normal model to calculate intensity
loglambda <- function(x, alpha, beta) {
  l <- alpha + sum(beta * c(x, x * x, prod(x)))
  return(l)
}

L <- function(alphabeta, x) {
  l <- apply(x, 1, loglambda, alpha = alphabeta[1], beta = alphabeta[-1])
  l <- sum(l)
  intL <- adaptIntegrate(lowerLimit = c(0, 0), upperLimit = c(1, 1), fDim = 1,
                         tol = 1e-08, f = function(x, alpha = alphabeta[1],
                                                   beta = alphabeta[-1]) {
                                                    exp(loglambda(x, alpha, beta))
                                                   })
  l <- l - intL$integral
  return(l)
}

x <- as.points(lansing[lansing$marks == "maple", ])
optbeta <- optim(par = c(log(514), 0, 0, 0, 0, 0), fn = L,
                 control = list(maxit = 1000, fnscale = -1), x = x)  

optbeta
## $par
## [1]  5.5568268  5.6608564 -0.9626295 -5.1417654 -1.1562095  0.9591336
## 
## $value
## [1] 2778.262
## 
## $counts
## function gradient 
##     1001       NA 
## 
## $convergence
## [1] 1
## 
## $message
## NULL

OR

# Calculating intensity using the ppm function
lmaple <- lansing[lansing$marks == "maple", ]
ppm(Q = lmaple, trend = ~x + y + I(x^2) + I(y^2) + I(x *y))
## Nonstationary multitype Poisson process
## 
## Possible marks: 'blackoak', 'hickory', 'maple', 'misc', 'redoak' and 
## 'whiteoak'
## 
## Log intensity:  ~x + y + I(x^2) + I(y^2) + I(x * y)
## 
## Fitted trend coefficients:
## (Intercept)           x           y      I(x^2)      I(y^2)    I(x * y) 
##   3.7310742   5.6400643  -0.7663636  -5.0115142  -1.1983209   0.6375824 
## 
##               Estimate      S.E.    CI95.lo     CI95.hi Ztest       Zval
## (Intercept)  3.7310742 0.2542004  3.2328505  4.22929795   *** 14.6776855
## x            5.6400643 0.7990009  4.0740514  7.20607727   ***  7.0588964
## y           -0.7663636 0.6990514 -2.1364792  0.60375200       -1.0962907
## I(x^2)      -5.0115142 0.7011631 -6.3857686 -3.63725974   *** -7.1474300
## I(y^2)      -1.1983209 0.6428053 -2.4581962  0.06155433       -1.8642051
## I(x * y)     0.6375824 0.6989167 -0.7322691  2.00743391        0.9122439

Again, not sure how to interpret it … I’m screwed. Let’s keep swimming.

2b. Describe the spatial distribution of events in your point pattern (pg.190)

Second-order properties measure the strength and type of the interactions between events of the point process. These are used to study clustering or competition between events. Specifically, the second-order intensity of two points x and y reflects the probability of any pair of events occurring in the vicinities of x and y, respectively.

By comparing the estimated value Kˆ(s) to the theoretical value we can assess what kind of interaction exists. Usually, we assume that these interactions occur at small scales, and so will be interested in relatively small values of s. Values of Kˆ(s) higher than πs2 are characteristic of clustered processes, whilst values smaller than that are found when there exists competition between events (regular pattern)

set.seed(30)
K_JAPANESE <- envelope(as(japanesepines, "ppp"), fun = Kest, r = r, nrank = 2, nsim = 99)
K_REDWOOD <- envelope(as(redwoodfull, "ppp"), fun = Kest, r = r, nrank = 2, nsim = 99)
K_CELLS <- envelope(as(cells, "ppp"), fun = Kest, r = r, nrank = 2, nsim = 99)
plot(K_CELLS)

plot(K_JAPANESE)

plot(K_REDWOOD)

See page 192 for an interpretation. I have obviously given up at this point.

3. Explore other measurements that describe your point pattern

Interpolation & Geostatistics