Spatial data analysis, like time series analysis, is an enigma. If I have any hope of acing my final project, I need a step-by-step guide that tells me what I should be doing and how I should be doing it. That is what this notebook is meant to do; it’s to cover the primary procedures employed in the field of spatial data analysis. It will go over best practices as well as the main theoretical ideas underpinning it all. The end goal is for this to be a template that anyone may use as a starting point for their projects.
This guide heavily reerences ‘Applied Spatial Data Analysis with R’ by Roger S. Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio. The material I choose to refer to is based on the syllabus to the spatial data analysis course taught by Dana Sylvan. The material covered in class includes the following:
There are two main focus points that one may choose to base their project on:
We will divide this article into the two sections above; one for point pattern analysis and the other for geostats.
In general, a point process is a stochastic process in which we observe the locations of some events of interest within a bounded region A.
1. Perform a test for Complete Spatial Randomness (CSR)
CSR means that events are distributed independently at random and uniformly over the study area. This implies that there are no regions where the events are more (or less) likely to occur and that the presence of a given event does not modify the probability of other events appearing nearby.
This is done by 2 ways:
Informally, by plotting the point pattern and observing whether the points tend to appear in clusters or, on the contrary, they follow a regular pattern.
Formally, using tests and funcitons that calculate the degree of accomplishment of the CSR as well as the uncertainty related to the observed pattern.
G Function: Distance to the Nearest Event (pg.179)
Measures the distribution of the distances from an arbitrary event to its nearest event.
The compatibility with CSR of the point pattern can be assessed by plotting the empirical function ,Gˆ(d), against the theoretical expectation.
In addition, point-wise envelopes under CSR can be computed by repeatedly simulating a CSR point process with the same estimated intensity λˆ in the study region and check whether the empirical function is contained inside.
If the line falls within the grey envelope, then the data is homogeneously distributed, while a line above the envelope denotes a clusterd pattern, and a line below the envelope denotes a regular pattern.
F Function: Distance from a Point to the Nearest Event
Measures the distribution of all distances from an arbitrary point of the plane to its nearest event.
Point-wise envelopes under CSR can be computed by repeatedly simulating a CSR point process with the same estimated intensity λˆ in the study region and check whether the empirical function is contained inside.
If the line falls within the grey envelope, then the data is homogeneously distributed, while a line above the envelope denotes a regular pattern, and a line below the envelope denotes a clustered pattern.
F-Function
# Reading in the data + package imports
library(spatstat)
library(spatstat)
library(maptools)
library(cowplot)
library(ggplot2)
library(gridExtra)
library(splancs)
library(cubature)
data(japanesepines)
data(cells)
data(redwoodfull)
data(lansing)
# as.ppp(japanesepines)
# as.data.frame(japanesepines)
# Plotting the point pattern
plot1 <- ggplot(as.data.frame(cells), aes(x = x, y = y)) + geom_point() + ggtitle("CELLS")
plot2 <- ggplot(as.data.frame(japanesepines), aes(x = x, y = y)) + geom_point() + ggtitle("JAPANESE")
plot3 <- ggplot(as.data.frame(redwoodfull), aes(x = x, y = y)) + geom_point() + ggtitle("REDWOOD")
grid.arrange(plot1, plot2, plot3, ncol=3, nrow=1)
# Evaluating CSR using the G-Function
set.seed(120109)
r <- seq(0, sqrt(2)/6, by = 0.005)
G_JAPANESE <- envelope(japanesepines, fun = Gest, r = r, nrank = 2, nsim = 99)
G_REDWOOD <- envelope(redwoodfull, fun = Gest, r = r, nrank = 2, nsim = 99)
G_CELLS <- envelope(cells, fun = Gest, r = r, nrank = 2, nsim = 99)
plot(G_CELLS)
plot(G_JAPANESE)
plot(G_REDWOOD)
The results show that only the Japanese trees seem to be homogeneously distributed, whilst the redwood seeds show a clustered pattern (values of Gˆ(r) above the envelopes) and the location of the cells shows a more regular pattern (values of Gˆ(r) below the envelopes).
# Evaluating CSR using the F-Function
set.seed(120109)
r <- seq(0, sqrt(2)/6, by = 0.001)
F_JAPANESE <- envelope(japanesepines, fun = Fest, r = r, nrank = 2, nsim = 99)
F_REDWOOD <- envelope(redwoodfull, fun = Fest, r = r, nrank = 2, nsim = 99)
F_CELLS <- envelope(cells, fun = Fest, r = r, nrank = 2, nsim = 99)
plot(F_CELLS)
plot(F_JAPANESE)
plot(F_REDWOOD)
2a. Describe the spatial distribution of events in your point pattern (pg.184)
When describing your point pattern, you’d like to measure the intensity of your point pattern which measures the spatial distribution of events in your region of study. Before you do this, you want to assess your point pattern and identify whether it is a Homogenous or Inhomogenous Poisson Process. This distinction will affect how you go about calculating your Intensity.
Homogenous Poisson Process
Assumes events are independently distributed according to a given density and assumes that intensity is constant
Characterised as representing the kind of point process in which all events are independently and uniformly distributed in the region A where the point process occurs. This means that the location of one point does not affect the probabilities of other points appearing nearby and that there are no regions where events are more likely to appear.
Stationary and isotropic. It is stationary because the in- tensity is constant and the second-order intensity depends only on the relative positions of two points (i.e. direction and distance). In addition, it is isotropic because the second-order intensity is invariant to rotation. Hence, the point process has constant intensity and its second-order intensity depends only on the distance between the two points, regardless of the relative positions of the points.
The formal definition of a point process which is CSR
In most cases assuming that a point process under study is homogeneous is not realistic.
Inhomogenous Poisson Process
Assumes events are independently distributed according to a given density and assumes that intensity varies spatially.
Allows for a non-constant intensity. The spatial variation can be more diverse, with events appearing more likely in some areas than others.
Calculating Intensity:
Homogenous Poisson Process:
Intensity = n / |A| n = number of points; A = area of region under study
Inhomogenous Poisson Process:
Estimate G-Function
You can use either a Quartic or a Gaussain Kernel. The important part is choosing an appropriate bandwidth. One technique is to select the bandwidth that minimizes the Mean Square Error (MSE) of the kernel smoothing estimator when the underlying point process in a stationary Cox process.
# Selecting the optimal bandwidth for the redwoodfull dataset
mserwq <- mse2d(as.points(coordinates(as.data.frame(redwoodfull))), as.points(list(x = c(0, + 1, 1, 0), y = c(0, 0, 1, 1))), 100, 0.15)
bwq <- mserwq$h[which.min(mserwq$mse)]
bwq
## [1] 0.039
mserw <- bw.diggle(redwoodfull)
bw <- as.numeric(mserw)
bw
## [1] 0.01981409
# Kernel smoothing using a quartic kernel
poly <- as.points(list(x = c(0, 0, 1, 1), y = c(0, 1, 1, 0)))
sG <- Sobj_SpatialGrid(as.SpatialPoints.ppp(redwoodfull), maxDim = 100)$SG
grd <- slot(sG, "grid")
summary(grd)
## Grid topology:
## cellcentre.offset cellsize cells.dim
## x 0.005 0.01 100
## y 0.005 0.01 100
k0 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = bw, grd)
k1 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = 0.05, grd)
k2 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = 0.1, grd)
k3 <- spkernel2d(as.SpatialPoints.ppp(redwoodfull), poly, h0 = 0.15, grd)
df <- data.frame(k0 = k0, k1 = k1, k2 = k2, k3 = k3)
kernels <- SpatialGridDataFrame(grd, data = df)
summary(kernels)
## Object of class SpatialGridDataFrame
## Coordinates:
## min max
## x 0 1
## y 0 1
## Is projected: NA
## proj4string : [NA]
## Grid attributes:
## cellcentre.offset cellsize cells.dim
## x 0.005 0.01 100
## y 0.005 0.01 100
## Data attributes:
## k0 k1 k2 k3
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 4.128
## 1st Qu.: 0.00 1st Qu.: 18.56 1st Qu.:111.0 1st Qu.:145.685
## Median : 0.00 Median : 135.28 Median :175.4 Median :187.972
## Mean : 195.60 Mean : 197.29 Mean :201.1 Mean :202.236
## 3rd Qu.: 59.09 3rd Qu.: 262.41 3rd Qu.:252.9 3rd Qu.:253.279
## Max. :3968.42 Max. :1346.84 Max. :746.9 Max. :561.734
# Kernel smoothing using a Gaussian kernel (... I think)
cc <- coordinates(kernels)
xy <- list(x = cc[, 1], y = cc[, 2])
k4 <- density(redwoodfull, 0.5 * bw, dimyx = c(100, 100), xy = xy)
kernels$k4 <- as(k4, "SpatialGridDataFrame")$v
k5 <- density(redwoodfull, 0.5 * 0.05, dimyx = c(100, 100), xy = xy)
kernels$k5 <- as(k5, "SpatialGridDataFrame")$v
k6 <- density(redwoodfull, 0.5 * 0.1, dimyx = c(100, 100), xy = xy)
kernels$k6 <- as(k6, "SpatialGridDataFrame")$v
k7 <- density(redwoodfull, 0.5 * 0.15, dimyx = c(100, 100), xy = xy)
kernels$k7 <- as(k7, "SpatialGridDataFrame")$v
summary(kernels)
## Object of class SpatialGridDataFrame
## Coordinates:
## min max
## x 0 1
## y 0 1
## Is projected: NA
## proj4string : [NA]
## Grid attributes:
## cellcentre.offset cellsize cells.dim
## x 0.005 0.01 100
## y 0.005 0.01 100
## Data attributes:
## k0 k1 k2 k3
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 4.128
## 1st Qu.: 0.00 1st Qu.: 18.56 1st Qu.:111.0 1st Qu.:145.685
## Median : 0.00 Median : 135.28 Median :175.4 Median :187.972
## Mean : 195.60 Mean : 197.29 Mean :201.1 Mean :202.236
## 3rd Qu.: 59.09 3rd Qu.: 262.41 3rd Qu.:252.9 3rd Qu.:253.279
## Max. :3968.42 Max. :1346.84 Max. :746.9 Max. :561.734
## k4 k5 k6 k7
## Min. : 0.000 Min. : 0.00 Min. : 0.3477 Min. : 12.9
## 1st Qu.: 0.000 1st Qu.: 24.06 1st Qu.:106.6803 1st Qu.:142.5
## Median : 0.342 Median : 121.32 Median :170.9927 Median :182.9
## Mean : 195.477 Mean : 196.78 Mean :199.1688 Mean :199.7
## 3rd Qu.: 126.977 3rd Qu.: 260.74 3rd Qu.:249.1800 3rd Qu.:245.6
## Max. :4366.883 Max. :1412.73 Max. :785.7830 Max. :553.8
I have absolutely no clue how to interpret the stuff above. Refer to page 184 in the textbook. Maybe you can teach me how to read it.
# Using the log-normal model to calculate intensity
loglambda <- function(x, alpha, beta) {
l <- alpha + sum(beta * c(x, x * x, prod(x)))
return(l)
}
L <- function(alphabeta, x) {
l <- apply(x, 1, loglambda, alpha = alphabeta[1], beta = alphabeta[-1])
l <- sum(l)
intL <- adaptIntegrate(lowerLimit = c(0, 0), upperLimit = c(1, 1), fDim = 1,
tol = 1e-08, f = function(x, alpha = alphabeta[1],
beta = alphabeta[-1]) {
exp(loglambda(x, alpha, beta))
})
l <- l - intL$integral
return(l)
}
x <- as.points(lansing[lansing$marks == "maple", ])
optbeta <- optim(par = c(log(514), 0, 0, 0, 0, 0), fn = L,
control = list(maxit = 1000, fnscale = -1), x = x)
optbeta
## $par
## [1] 5.5568268 5.6608564 -0.9626295 -5.1417654 -1.1562095 0.9591336
##
## $value
## [1] 2778.262
##
## $counts
## function gradient
## 1001 NA
##
## $convergence
## [1] 1
##
## $message
## NULL
OR
# Calculating intensity using the ppm function
lmaple <- lansing[lansing$marks == "maple", ]
ppm(Q = lmaple, trend = ~x + y + I(x^2) + I(y^2) + I(x *y))
## Nonstationary multitype Poisson process
##
## Possible marks: 'blackoak', 'hickory', 'maple', 'misc', 'redoak' and
## 'whiteoak'
##
## Log intensity: ~x + y + I(x^2) + I(y^2) + I(x * y)
##
## Fitted trend coefficients:
## (Intercept) x y I(x^2) I(y^2) I(x * y)
## 3.7310742 5.6400643 -0.7663636 -5.0115142 -1.1983209 0.6375824
##
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## (Intercept) 3.7310742 0.2542004 3.2328505 4.22929795 *** 14.6776855
## x 5.6400643 0.7990009 4.0740514 7.20607727 *** 7.0588964
## y -0.7663636 0.6990514 -2.1364792 0.60375200 -1.0962907
## I(x^2) -5.0115142 0.7011631 -6.3857686 -3.63725974 *** -7.1474300
## I(y^2) -1.1983209 0.6428053 -2.4581962 0.06155433 -1.8642051
## I(x * y) 0.6375824 0.6989167 -0.7322691 2.00743391 0.9122439
Again, not sure how to interpret it … I’m screwed. Let’s keep swimming.
2b. Describe the spatial distribution of events in your point pattern (pg.190)
Second-order properties measure the strength and type of the interactions between events of the point process. These are used to study clustering or competition between events. Specifically, the second-order intensity of two points x and y reflects the probability of any pair of events occurring in the vicinities of x and y, respectively.
By comparing the estimated value Kˆ(s) to the theoretical value we can assess what kind of interaction exists. Usually, we assume that these interactions occur at small scales, and so will be interested in relatively small values of s. Values of Kˆ(s) higher than πs2 are characteristic of clustered processes, whilst values smaller than that are found when there exists competition between events (regular pattern)
set.seed(30)
K_JAPANESE <- envelope(as(japanesepines, "ppp"), fun = Kest, r = r, nrank = 2, nsim = 99)
K_REDWOOD <- envelope(as(redwoodfull, "ppp"), fun = Kest, r = r, nrank = 2, nsim = 99)
K_CELLS <- envelope(as(cells, "ppp"), fun = Kest, r = r, nrank = 2, nsim = 99)
plot(K_CELLS)
plot(K_JAPANESE)
plot(K_REDWOOD)
See page 192 for an interpretation. I have obviously given up at this point.
3. Explore other measurements that describe your point pattern