library(ggplot2)
library(spatstat)
## Warning: package 'spatstat' was built under R version 4.0.3
## Loading required package: spatstat.data
## Loading required package: nlme
## Loading required package: rpart
## 
## spatstat 1.64-1       (nickname: 'Help you I can, yes!') 
## For an introduction to spatstat, type 'beginner'
## 
## Note: spatstat version 1.64-1 is out of date by more than 7 months; we recommend upgrading to the latest version.
library(raster)
## Loading required package: sp
## 
## Attaching package: 'raster'
## The following objects are masked from 'package:spatstat':
## 
##     area, rotate, shift
## The following object is masked from 'package:nlme':
## 
##     getData

Random points

# The number of points to create
n <- 200

# Set the range
xmin <- 0
xmax <- 1
ymin <- 0
ymax <- 2

# Sample from a Uniform distribution
x <- runif(n, xmin, xmax)
y <- runif(n, ymin, ymax)


# See pre-defined variables
ls.str()
## n :  num 200
## x :  num [1:200] 0.0841 0.6909 0.7102 0.3198 0.3427 ...
## xmax :  num 1
## xmin :  num 0
## y :  num [1:200] 1.9237 1.6531 0.4721 0.0716 1.0264 ...
## ymax :  num 2
## ymin :  num 0
# Plot points and a rectangle

mapxy <- function(a = NA){
  plot(x, y, asp = a)
  rect(xmin, ymin, xmax, ymax)
}

mapxy(a=1)

# Load the spatstat package

# Create this many points, in a circle of this radius
n_points <- 300
radius <- 10

# Generate uniform random numbers up to radius-squared
r_squared <- runif(n_points, 0, radius**2)
angle <- runif(n_points, 0, 2*pi)

# Take the square root of the values to get a uniform spatial distribution
x <- sqrt(r_squared) * cos(angle)
y <- sqrt(r_squared) * sin(angle)

plot(disc(radius)); points(x, y)

Planar point pattern - PPP

Quadrat count test

The quadrat count test was one of the earliest developed spatial statistics methods. It can be used to check if points are completely spatially random; that is, they are uniformly random throughout the area of interest.

# Some variables have been pre-defined
ls.str()
## angle :  num [1:300] 3.789 0.416 5.714 2.83 1.338 ...
## mapxy : function (a = NA)  
## n :  num 200
## n_points :  num 300
## r_squared :  num [1:300] 4.27 14.38 19.46 28.63 61.43 ...
## radius :  num 10
## x :  num [1:300] -1.65 3.47 3.72 -5.09 1.81 ...
## xmax :  num 1
## xmin :  num 0
## y :  num [1:300] -1.25 1.53 -2.38 1.64 7.63 ...
## ymax :  num 2
## ymin :  num 0
# Set coordinates and window
ppxy <- ppp(x = x, y = y, window = disc(radius))

# Test the point pattern
qt <- quadrat.test(ppxy)
## Warning: Some expected counts are small; chi^2 approximation may be inaccurate
# Inspect the results
plot(qt)

print(qt)
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  ppxy
## X2 = 34.205, df = 24, p-value = 0.1622
## alternative hypothesis: two.sided
## 
## Quadrats: 25 tiles (irregular windows)

Poisson process

A Poisson point process creates events according to a Poisson distribution with an intensity parameter specifying the expected events per unit area. The total number of events generated is a single number from a Poisson distribution, so multiple realizations of the same process can easily have different numbers of events.
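As a quick, hedged illustration of that last point (using the same radius-10 disc and rate of 500 points as the chunk below), repeated realizations of the same process give different event counts:

# Five realizations of the same Poisson process give five different counts
replicate(5, npoints(rpoispp(lambda = 500 / (pi * 10^2), win = disc(10))))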

# Create a disc of radius 10
disc10 <- disc(10)

# Compute the rate as count divided by area
lambda <- 500 / area(disc10)

# Create a point pattern object
ppois <- rpoispp(lambda = lambda, win = disc10)

# Plot the Poisson point pattern
plot(ppois)

Thomas process

A Thomas process is a clustered pattern where a number of “parent” points, uniformly distributed, create a number of “child” points in their neighborhood. The child points themselves form the pattern. This is an attractive point pattern, and makes sense for modeling things like trees, since new trees will grow near the original tree. Random Thomas point patterns can be generated using rThomas(). This takes three numbers that determine the intensity and clustering of the points, and a window object.

Strauss process

Conversely, each point of a Strauss process lowers the probability of finding another point nearby. The parameters of a Strauss process can be such that it is a “hard-core” process, where no two points can be closer than a set threshold. Creating points from this process involves some clever simulation algorithms. This is a repulsive point pattern, and makes sense for modeling things like territorial animals, since the other animals of that species will avoid the territory of a given animal. Random Strauss point patterns can be generated using rStrauss(). This takes three numbers that determine the intensity and “territory” of the points, and a window object. Points generated by a Strauss process are sometimes called regularly spaced.

# Create a disc of radius 10
disc10 <- disc(10)

# Generate clustered points from a Thomas process
set.seed(123)
p_cluster <- rThomas(kappa = 0.35, scale = 1, mu = 3, win = disc10)
plot(p_cluster)

# Run a quadrat test
quadrat.test(p_cluster, alternative = "clustered")
## Warning: Some expected counts are small; chi^2 approximation may be inaccurate
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  p_cluster
## X2 = 53.387, df = 24, p-value = 0.0005142
## alternative hypothesis: clustered
## 
## Quadrats: 25 tiles (irregular windows)
# Regular points from a Strauss process
set.seed(123)
p_regular <- rStrauss(beta = 2.9, gamma = 0.025, R = .5, W = disc10)
## Warning: Simulation will be performed in the containing rectangle and clipped to
## the original window.
plot(p_regular)

# Run a quadrat test
quadrat.test(p_regular, alternative = "regular")
## Warning: Some expected counts are small; chi^2 approximation may be inaccurate
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  p_regular
## X2 = 8.57, df = 24, p-value = 0.001619
## alternative hypothesis: regular
## 
## Quadrats: 25 tiles (irregular windows)

One problem with the quadrat test is that you have to choose a set of sub-regions, and the conclusion can depend on that somewhat arbitrary choice.
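As a hedged illustration, re-testing the Poisson pattern ppois created above with two different grids shows how the p-value can shift with the choice of sub-regions:

# Same pattern, different quadrat grids, different p-values
quadrat.test(ppois, nx = 2, ny = 2)$p.value
quadrat.test(ppois, nx = 8, ny = 8)$p.value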

Nearest-neighbor distributions

Another way of assessing clustering and regularity is to consider each point, and how it relates to the other points. One simple measure is the distribution of the distances from each point to its nearest neighbor.

The nndist() function in spatstat takes a point pattern and for each point returns the distance to its nearest neighbor. You can then plot the histogram.

Instead of working with the nearest-neighbor density, as seen in the histogram, it can be easier to work with the cumulative distribution function, G(r). This is the probability of a point having a nearest neighbor within a distance r.

For a uniform Poisson process, G can be computed theoretically, and is G(r) = 1 - exp( - lambda * pi * r ^ 2). You can compute G empirically from your data using Gest() and so compare with the theoretical value.
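To connect the formula with data, here is a minimal sketch (no edge correction, using the Poisson pattern ppois from earlier; the helper names are illustrative) comparing the theoretical G(r) with the empirical cumulative distribution of nearest-neighbor distances:

# Theoretical G(r) for a Poisson process vs. an uncorrected empirical estimate
nnd <- nndist(ppois)
lambda_hat <- intensity(ppois)
r <- seq(0, max(nnd), length.out = 100)
G_theory <- 1 - exp(-lambda_hat * pi * r^2)
G_empirical <- sapply(r, function(d) mean(nnd <= d))
plot(r, G_theory, type = "l")
lines(r, G_empirical, col = "red")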

Events near the edge of the window might have their true nearest neighbor outside the window, where it goes unobserved. This makes the observed nearest-neighbor distance larger than expected, biasing the estimate of G. There are several methods for correcting this bias.

Plotting the output from Gest shows the theoretical cumulative distribution and several estimates of the cumulative distribution using different edge corrections. Often these edge corrections are almost indistinguishable, and the lines overlap. The plot can be used as a quick exploratory test of complete spatial randomness.

Other point pattern distribution functions

A number of other functions of point patterns have been developed. They are conventionally denoted by various capital letters, including F, H, J, K and L.

The K-function is defined as the expected number of points within a distance r of a typical point of the process, scaled by the intensity. Like G, this can be computed theoretically for a uniform Poisson process, where it is K(r) = pi * r ^ 2, the area of a circle of radius r. Deviation from pi * r ^ 2 can indicate clustering or point inhibition.

Computational estimates of K(r) are done using the Kest() function.

As with G calculations, K-function calculations also need edge corrections. The default edge correction in spatstat is generally the best, but can be slow, so we’ll use the “border” correction for speed here.

Uncertainties on K-function estimates can be assessed by randomly sampling points from a uniform Poisson process in the area and computing the K-function of the simulated data. Repeat this process 99 times, and take the minimum and maximum value of K over each of the distance values. This gives an envelope: if the K-function from the data goes above the top of the envelope then we have evidence for clustering. If the K-function goes below the envelope then there is evidence for an inhibitory process causing points to be spaced out. Envelopes can be computed using the envelope() function.

Ripley’s K function

K(r) is the expected number of other events within a distance r of a typical event, scaled by the intensity. To estimate it for some given distance d, visit each event in turn, count the other events within distance d of it, average these counts over all events, and divide by the intensity.
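A rough, border-uncorrected sketch of that recipe, using spatstat's pairdist() and intensity() on a freshly simulated Poisson pattern (naive_K and X_sim are illustrative names, not spatstat functions); for Poisson data the result should be close to pi * d^2:

# Naive K estimator: average count of other points within distance d,
# divided by the estimated intensity (no edge correction, so biased low near the boundary)
naive_K <- function(X, d) {
  dists <- pairdist(X)      # all pairwise distances
  diag(dists) <- Inf        # ignore each point's distance to itself
  mean(rowSums(dists <= d)) / intensity(X)
}

X_sim <- rpoispp(lambda = 500 / (pi * 10^2), win = disc(10))
naive_K(X_sim, d = 2)       # compare with pi * 2^2, approximately 12.57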

p_poisson <- rpoispp(lambda = lambda, win = disc10)
# Point patterns are pre-defined
p_poisson; p_regular

# Calc nearest-neighbor distances for Poisson point data
nnd_poisson <- nndist(p_poisson)

# Draw a histogram of nearest-neighbor distances
hist(nnd_poisson)

# Estimate G(r)
G_poisson <- Gest(p_poisson)

# Plot G(r) vs. r
plot(G_poisson)

# Repeat for regular point data
nnd_regular <- nndist(p_regular)
hist(nnd_regular)
G_regular <- Gest(p_regular)
plot(G_regular)
# Point patterns are pre-defined
p_poisson; p_cluster; p_regular

# Estimate the K-function for the Poisson points
K_poisson <- Kest(p_poisson, correction = "border")

# The default plot shows quadratic growth
plot(K_poisson, . ~ r)

# Subtract pi * r^2 from the Y-axis and plot
plot(K_poisson, . - pi * r ^ 2 ~ r)

# Compute envelopes of K under random locations
K_cluster_env <- envelope(p_cluster, Kest, correction = "border")

# Insert the full formula to plot K minus pi * r^2
plot(K_cluster_env, . - pi * r ^ 2 ~ r)

# Repeat for regular data
K_regular_env <- envelope(p_regular, Kest, correction = "border")
plot(K_regular_env, . - pi * r ^ 2 ~ r)

Bivariate point patterns

The dataset we shall use for this example consists of crimes in a 4km radius of the center of Preston, a town in north-west England. We want to look for hotspots of violent crime in the area.

preston_crime <- readRDS("spatial_data/pcrime-spatstat.rds")
preston_osm <- readRDS("spatial_data/osm_preston_gray.rds")
# Get some summary information on the dataset
summary(preston_crime)
## Marked planar point pattern:  2036 points
## Average intensity 4.053214e-05 points per square unit
## 
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 units
## 
## Multitype:
##                   frequency proportion    intensity
## Non-violent crime      1812  0.8899804 3.607281e-05
## Violent crime           224  0.1100196 4.459332e-06
## 
## Window: polygonal boundary
## single connected closed polygon with 99 vertices
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
##                      (7998 x 7999 units)
## Window area = 50231700 square units
## Fraction of frame area: 0.785
# Get a table of marks
table(marks(preston_crime))
## 
## Non-violent crime     Violent crime 
##              1812               224
# Define a function to create a map
preston_map <- function(cols = c("green", "red"), cex = c(1, 1), pch = c(1, 1)) {
  plotRGB(preston_osm) # from the raster package
  plot(preston_crime, cols = cols, pch = pch, cex = cex, add = TRUE, show.window = TRUE)
}

# Draw the map with colors, sizes and plot characters
preston_map(
  cols = c("black", "red"), 
  cex = c(0.5, 1), 
  pch = c(19, 19)
)

One method of computing a smooth intensity surface from a set of points is to use kernel smoothing. Imagine replacing each point with a dot of ink on absorbent paper. Each individual ink drop spreads out into a patch with a dark center, and multiple drops add together and make the paper even darker. With the right amount of ink in each drop, and with paper of the right absorbency, you can create a fair impression of the density of the original points. In kernel smoothing jargon, this means computing a bandwidth and using a particular kernel function.
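As a hedged sketch of those two choices, density.ppp() takes the bandwidth through its sigma argument and the kernel through its kernel argument; bw.diggle() is one of spatstat's data-driven bandwidth selectors. The violent-crime subset used here simply mirrors the split performed below, and the object names are illustrative:

# Explicit bandwidth and kernel choice for one of the point patterns
violent <- split(preston_crime)[["Violent crime"]]
bw <- bw.diggle(violent)                      # data-driven bandwidth
d_violent <- density(violent, sigma = bw, kernel = "gaussian")
plot(d_violent)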

# preston_crime has been pre-defined
preston_crime
## Marked planar point pattern: 2036 points
## Multitype, with levels = Non-violent crime, Violent crime 
## window: polygonal boundary
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
# Use the split function to show the two point patterns
crime_splits <- split(preston_crime)

# Plot the split crime
plot(crime_splits)

# Compute the densities of both sets of points
crime_densities <- density(crime_splits)

# Calc the violent density divided by the sum of both
frac_violent_crime_density <- crime_densities[["Violent crime"]] / 
  (crime_densities[["Non-violent crime"]] + crime_densities[["Violent crime"]])

# Plot the density of the fraction of violent crime
plot(frac_violent_crime_density)

Spatial segregation

Bandwidth selection

We can get a more principled measure of the violent crime ratio using a spatial segregation model. The spatialkernel package implements the theory of spatial segregation.

The first step is to compute the optimal bandwidth for kernel smoothing under the segregation model. A small bandwidth would result in a density that is mostly zero, with spikes at the event locations. A large bandwidth would flatten out any structure in the events, resulting in a large “blob” across the whole window. Somewhere between these extremes is a bandwidth that best represents an underlying density for the process.

spseg() will scan over a range of bandwidths and compute a test statistic using a cross-validation method. The bandwidth that maximizes this test statistic is the one to use. The returned value from spseg() in this case is a list, with h and cv elements giving the values of the statistic over the input h values. The spatialkernel package supplies a plotcv function to show how the test value varies. The hcv element has the value of the best bandwidth.

library(spatialkernel) # installed via github
## 
## This is spatialkernel 0.4-24
## 
## Attaching package: 'spatialkernel'
## The following object is masked from 'package:spatstat.data':
## 
##     lansing
# Scan from 500m to 1000m in steps of 50m
bw_choice <- spseg(
    preston_crime, 
    h = seq(500, 1000, by = 50),
    opt = 1)
## 
## Calculating cross-validated likelihood function
# Plot the results and highlight the best bandwidth
plotcv(bw_choice); abline(v = bw_choice$hcv, lty = 2, col = "red")

# Print the best bandwidth
print(bw_choice$hcv)
## [1] 800

Segregation probabilities

The second step is to compute the probabilities for violent and non-violent crimes as a smooth surface, as well as the p-values for a point-wise test of segregation. This is done by calling spseg() with opt = 3 and a fixed bandwidth parameter h.

Normally you would run this process for at least 100 simulations, but that would take too long here, so run only 10 simulations. (For reference, a full 1000-simulation run takes about 20 minutes to complete.)

# Set the correct bandwidth and run for 10 simulations only
seg10 <- spseg(
    pts = preston_crime, 
    h = bw_choice$hcv,
    opt = 3,
    ntest = 10, 
    proc = FALSE)

# Plot the segregation map for violent crime
plotmc(seg10, "Violent crime")

Space-time data

The sasquatch, or “bigfoot”, is a large ape-like creature reported to live in North American forests. The Bigfoot Field Researchers Organization maintains a database of sightings and allows its use for teaching and research. A cleaned subset of data in north-west USA has been created as the ppp object sasq and is loaded for you to explore the space-time pattern of sightings in the area.

sasq <- readRDS("spatial_data/sasquatch.rds")


# Get a quick summary of the dataset
summary(sasq)
## Marked planar point pattern:  423 points
## Average intensity 2.097156e-09 points per square unit
## 
## *Pattern contains duplicated points*
## 
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
## 
## Mark variables: date, year, month
## Summary:
##       date                 year         month    
##  Min.   :1990-05-03   Y2004  : 41   Sep    : 59  
##  1st Qu.:2000-04-30   Y2000  : 36   Oct    : 56  
##  Median :2003-11-05   Y2002  : 30   Aug    : 54  
##  Mean   :2003-08-11   Y2005  : 30   Jul    : 50  
##  3rd Qu.:2007-11-02   Y2001  : 26   Nov    : 43  
##  Max.   :2016-04-05   Y2008  : 26   Jun    : 41  
##                       (Other):234   (Other):120  
## 
## Window: polygonal boundary
## single connected closed polygon with 64 vertices
## enclosing rectangle: [368187.8, 764535.6] x [4644873, 5434933] units
##                      (396300 x 790100 units)
## Window area = 2.01702e+11 square units
## Fraction of frame area: 0.644
# Plot unmarked points
plot(unmark(sasq))

# Plot the points using a circle sized by date
plot(sasq, which.marks = "date")

Having established that there is some spatial clustering going on, you need to explore the temporal behavior. Is the number of sightings increasing or decreasing? Does the rate vary over the course of a year (“seasonality”)? Does the spatial pattern change much over the course of a year?

# Show the available marks
names(marks(sasq))
## [1] "date"  "year"  "month"
# Histogram the dates of the sightings, grouped by year
hist(marks(sasq)$date, "years", freq = TRUE)

# Plot and tabulate the calendar month of all the sightings
plot(table(marks(sasq)$month))

# Split on the month mark
sasq_by_month <- split(sasq, "month", un = TRUE)

# Plot monthly maps
plot(sasq_by_month)

# Plot smoothed versions of the above maps.
plot(density(sasq_by_month))

Space-time clustering

# Get a matrix of event coordinates
sasq_xy <- as.matrix(coords(sasq))

# Check the matrix has two columns
dim(sasq_xy)
## [1] 423   2
# Get a vector of event times
sasq_t <- marks(sasq)$date

# Extract a two-column matrix from the ppp object
sasq_poly <- as.matrix(as.data.frame(Window(sasq)))
dim(sasq_poly)
## [1] 64  2
# Set the time limit to 1 day before and 1 after the range of times
tlimits <- range(sasq_t) + c(-1, 1)

# Scan over 400m intervals from 100m to 20km
s <- seq(100, 20000, by = 400)

# Scan over 14 day intervals from one week to 31 weeks 
tm <- seq(7, 7 * 31, by = 14)

Monte Carlo test of space-time clustering

Everything is now ready for you to run the space-time clustering test function. You can then plot the results and compute a p-value for rejecting the null hypothesis of no space-time clustering.

Any space-time clustering in a data set will be removed if you randomly rearrange the dates of the data points. The stmctest() function computes a clustering test statistic for your data based on the space-time K-function - how many points are within a spatial and temporal window of a point of the data. It then does a number of random rearrangements of the dates among the points and computes the clustering statistic. After doing this a large number of times, you can compare the test statistic for your data with the values from the random data. If the test statistic for your data is sufficiently large or small, you can reject the null hypothesis of no space-time clustering.

The output from stmctest() is a list with a single value t0, which is the test statistic for your data, and a vector t of statistics from the simulations. By converting the output to a data frame you can feed it to ggplot functions.

Because the window area is a large number of square meters, and we have about 400 events, the numerical value of the intensity is a very small number. This makes values of the various K-functions very large numbers, since they are proportional to the inverse of the intensity. Don’t worry if you see 10^10 or higher!

The p-value of a Monte-Carlo test like this is just the proportion of test statistics that are larger than the value from the data. You can compute this from the t and t0 elements of the output.

library(spdep)
## Warning: package 'spdep' was built under R version 4.0.3
## Loading required package: spData
## Warning: package 'spData' was built under R version 4.0.3
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
## Loading required package: sf
## Warning: package 'sf' was built under R version 4.0.3
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(epitools)
## Warning: package 'epitools' was built under R version 4.0.3
library(R2BayesX)
## Warning: package 'R2BayesX' was built under R version 4.0.3
## Loading required package: BayesXsrc
## Warning: package 'BayesXsrc' was built under R version 4.0.3
## Loading required package: colorspace
## 
## Attaching package: 'colorspace'
## The following object is masked from 'package:raster':
## 
##     RGB
## The following object is masked from 'package:spatstat':
## 
##     coords
## Loading required package: mgcv
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
library(splancs)
## Warning: package 'splancs' was built under R version 4.0.3
## 
## Spatial Point Pattern Analysis Code in S-Plus
##  
##  Version 2 - Spatial and Space-Time analysis
## 
## Attaching package: 'splancs'
## The following object is masked from 'package:raster':
## 
##     zoom
library(cartogram)
## Warning: package 'cartogram' was built under R version 4.0.3
library(rgeos)
## Warning: package 'rgeos' was built under R version 4.0.3
## rgeos version: 0.5-5, (SVN revision 640)
##  GEOS runtime version: 3.8.0-CAPI-1.13.1 
##  Linking to sp version: 1.4-4 
##  Polygon checking: TRUE
# Run 999 simulations
sasq_mc <- stmctest(sasq_xy, sasq_t, sasq_poly, tlimits, s, tm, nsim = 999, quiet = TRUE)
names(sasq_mc)
## [1] "t0" "t"
# Histogram the simulated statistics and add a line at the data value
ggplot(data.frame(sasq_mc), aes(x = t)) +
  geom_histogram(binwidth = 1e13) +
  geom_vline(aes(xintercept = t0))

# Compute the p-value as the proportion of tests greater than the data
sum(sasq_mc$t > sasq_mc$t0) / 1000
## [1] 0.035

Areal statistics

Cartogram

A cartogram distorts each region until the area of each region is proportional to some quantity associated with that region.

In 2016 the UK held a public vote on whether to remain in the European Union. The results of the referendum, where people voted either “Remain” or “Leave”, are available online. The data set london_ref contains the results for the 32 boroughs of London, and includes the number and percentage of votes in each category as well as the count of spoilt votes, the population size and the electorate size.

The london_ref object is a SpatialPolygonsDataFrame, a special kind of data frame where each row also has the shape of the borough. It behaves like a data frame in many respects, but can also be used to plot a choropleth, or shaded polygon, map.

You should start with some simple data exploration and mapping. spplot() from the sp package provides a convenient way to draw a shaded map of regions.

london_ref <- readRDS("spatial_data/london_eu.rds")
# See what information we have for each borough
summary(london_ref)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##        min      max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE 
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
##      NAME             TOTAL_POP        Electorate       Votes_Cast    
##  Length:32          Min.   :157711   Min.   : 83042   Min.   : 54801  
##  Class :character   1st Qu.:237717   1st Qu.:143458   1st Qu.:104079  
##  Mode  :character   Median :272017   Median :168394   Median :116280  
##                     Mean   :270780   Mean   :169337   Mean   :118025  
##                     3rd Qu.:316911   3rd Qu.:196285   3rd Qu.:134142  
##                     Max.   :379691   Max.   :245349   Max.   :182570  
##      Remain           Leave       Rejected_Ballots   Pct_Remain   
##  Min.   : 27750   Min.   :17138   Min.   : 60.0    Min.   :30.34  
##  1st Qu.: 55973   1st Qu.:32138   1st Qu.:105.0    1st Qu.:53.69  
##  Median : 70254   Median :45263   Median :138.0    Median :61.01  
##  Mean   : 70631   Mean   :47255   Mean   :139.0    Mean   :60.46  
##  3rd Qu.: 84287   3rd Qu.:59018   3rd Qu.:164.2    3rd Qu.:69.90  
##  Max.   :118463   Max.   :96885   Max.   :267.0    Max.   :78.62  
##    Pct_Leave      Pct_Rejected      Assembly        
##  Min.   :21.38   Min.   :0.0600   Length:32         
##  1st Qu.:30.10   1st Qu.:0.0875   Class :character  
##  Median :38.99   Median :0.1100   Mode  :character  
##  Mean   :39.54   Mean   :0.1187                     
##  3rd Qu.:46.31   3rd Qu.:0.1500                     
##  Max.   :69.66   Max.   :0.2200
# Which boroughs voted to "Leave"?
london_ref$NAME[london_ref$Leave > london_ref$Remain]
## [1] "Sutton"               "Barking and Dagenham" "Bexley"              
## [4] "Havering"             "Hillingdon"
# Plot a map of the percentage that voted "Remain"
spplot(london_ref, zcol = "Pct_Remain")

Cartogram

Large areas, such as cities or countries, are often divided into smaller administrative units, often into zones of approximately equal population. But the area of those units may vary considerably. When mapping them, the large areas carry more visual “weight” than small areas, although just as many people live in the small areas.

One technique for correcting for this is the cartogram. This is a controlled distortion of the regions, expanding some and contracting others, so that the area of each region is proportional to a desired quantity, such as the population. The cartogram also tries to maintain the correct geography as much as possible, by keeping regions in roughly the same place relative to each other.

The cartogram package contains functions for creating cartograms. You give it a spatial data frame and the name of a column, and you get back a similar data frame but with regions distorted so that the region area is proportional to the column value of the regions.

You’ll also use the rgeos package for computing the areas of individual regions with the gArea() function.

# Make a scatterplot of electorate vs borough area
names(london_ref)
##  [1] "NAME"             "TOTAL_POP"        "Electorate"       "Votes_Cast"      
##  [5] "Remain"           "Leave"            "Rejected_Ballots" "Pct_Remain"      
##  [9] "Pct_Leave"        "Pct_Rejected"     "Assembly"
plot(london_ref$Electorate, gArea(london_ref, byid = TRUE))

# Make a cartogram, scaling the area to the electorate
carto_ref <- cartogram(london_ref, "Electorate")
## 
## Please use cartogram_cont() instead of cartogram().
## Mean size error for iteration 1: 1.5881743190908
## Mean size error for iteration 2: 1.30262460145333
## Mean size error for iteration 3: 1.16605703069549
## Mean size error for iteration 4: 1.09652671280571
## Mean size error for iteration 5: 1.05979301980903
## Mean size error for iteration 6: 1.03870642059676
## Mean size error for iteration 7: 1.02576724388584
## Mean size error for iteration 8: 1.01780225318077
## Mean size error for iteration 9: 1.0124614648423
## Mean size error for iteration 10: 1.00876130634916
## Mean size error for iteration 11: 1.00621346049453
## Mean size error for iteration 12: 1.00446957764083
## Mean size error for iteration 13: 1.00322569444771
## Mean size error for iteration 14: 1.00233570000671
## Mean size error for iteration 15: 1.00169692422828
plot(carto_ref)

# Check the linearity of the electorate-area plot
plot(carto_ref$Electorate, gArea(carto_ref, byid = TRUE))

# Make a fairer map of the Remain percentage
spplot(carto_ref, "Pct_Remain")

Spatial autocorrelation test

The map of “Remain” votes seems to have spatial correlation. Pick any two boroughs that are neighbors - with a shared border - and the chances are they’ll be more similar than any two random boroughs. This can be a problem when using statistical models that assume, conditional on the model, that the data points are independent.

The spdep package has functions for measures of spatial correlation, also known as spatial dependency. Computing these measures first requires you to work out which regions are neighbors via the poly2nb() function, short for “polygons to neighbors”. The result is an object of class nb. Then you can compute the test statistic and run a significance test on the null hypothesis of no spatial correlation. The significance test can either be done by Monte-Carlo or theoretical models.

In this example you’ll use the Moran “I” statistic to test the spatial correlation of the population and the percentage “Remain” vote.

\[ I = \frac{n}{\sum_{i}\sum_{j}w_{ij}} \cdot \frac{\sum_{i}\sum_{j}w_{ij}(z_{i} - \bar{z})(z_{j} - \bar{z})}{\sum_{i}(z_{i} - \bar{z})^2} \]
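A minimal sketch of this formula, assuming the neighbor list borough_nb created in the chunk below; with the default row-standardized weights from nb2listw() the result should match the statistic reported by moran.test(). The variable names are illustrative only:

# Hand-rolled Moran I for total population (illustration only)
W <- listw2mat(nb2listw(borough_nb))          # weights matrix w_ij
z <- london_ref$TOTAL_POP - mean(london_ref$TOTAL_POP)
n_boroughs <- length(z)
(n_boroughs / sum(W)) * sum(W * outer(z, z)) / sum(z^2)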

# Make neighbor list
borough_nb <- poly2nb(london_ref)

# Get center points of each borough
borough_centers <- coordinates(london_ref)

# Show the connections
plot(london_ref); plot(borough_nb, borough_centers, add = TRUE)

# Map the total pop'n
spplot(london_ref, zcol = "TOTAL_POP")

# Run a Moran I test on total pop'n
moran.test(
  london_ref$TOTAL_POP, 
  nb2listw(borough_nb)
)
## 
##  Moran I test under randomisation
## 
## data:  london_ref$TOTAL_POP  
## weights: nb2listw(borough_nb)    
## 
## Moran I statistic standard deviate = 1.2124, p-value = 0.1127
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.11549264       -0.03225806        0.01485190
# Map % Remain
spplot(london_ref, zcol = "Pct_Remain")

# Run a Moran I MC test on % Remain
moran.mc(
  london_ref$Pct_Remain, 
  nb2listw(borough_nb), 
  999
)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  london_ref$Pct_Remain 
## weights: nb2listw(borough_nb)  
## number of simulations + 1: 1000 
## 
## statistic = 0.42841, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

The UK’s National Health Service publishes weekly data for consultations at a number of “sentinel” clinics and makes this data available. A dataset for one week in February 2017 has been loaded for you to analyze. It is called london, and contains data for the 32 boroughs.

You will focus on reports of “Influenza-like illness”, or more simply “Flu”. Your first task is to map the “Standardized Morbidity Ratio”, or SMR. This is the number of cases per person, but scaled by the overall incidence so that the expected number is 1.

london <- readRDS("spatial_data/london_2017_2.rds")

# Get a summary of the data set
summary(london)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##        min      max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE 
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
##      CODE               NAME              Flu_OBS          Vom_OBS     
##  Length:32          Length:32          Min.   :  0.00   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.: 11.00   1st Qu.:14.00  
##  Mode  :character   Mode  :character   Median : 33.00   Median :40.00  
##                                        Mean   : 38.56   Mean   :37.28  
##                                        3rd Qu.: 61.00   3rd Qu.:57.50  
##                                        Max.   :112.00   Max.   :96.00  
##    Diarr_OBS        Gastro_OBS      TOTAL_POP        CCGcode         
##  Min.   :  0.00   Min.   :  0.0   Min.   :157711   Length:32         
##  1st Qu.: 22.50   1st Qu.: 48.0   1st Qu.:237717   Class :character  
##  Median : 62.00   Median :120.5   Median :272017   Mode  :character  
##  Mean   : 57.03   Mean   :113.7   Mean   :270780                     
##  3rd Qu.: 90.75   3rd Qu.:176.8   3rd Qu.:316911                     
##  Max.   :122.00   Max.   :251.0   Max.   :379691                     
##  CCG.geography.code   CCG.name         Asthma_Prevalence Obesity_Prevalence
##  Length:32          Length:32          Min.   :3.550     Min.   : 3.930    
##  Class :character   Class :character   1st Qu.:4.412     1st Qu.: 5.855    
##  Mode  :character   Mode  :character   Median :4.660     Median : 7.565    
##                                        Mean   :4.624     Mean   : 7.585    
##                                        3rd Qu.:4.925     3rd Qu.: 8.810    
##                                        Max.   :5.470     Max.   :12.130    
##  Cancer_Prevalence Diabetes_Prevalence     Income         Employment    
##  Min.   :0.870     Min.   :3.620       Min.   :0.0730   Min.   :0.0570  
##  1st Qu.:1.438     1st Qu.:5.265       1st Qu.:0.1308   1st Qu.:0.0905  
##  Median :1.605     Median :6.305       Median :0.1665   Median :0.1095  
##  Mean   :1.684     Mean   :6.245       Mean   :0.1655   Mean   :0.1092  
##  3rd Qu.:1.903     3rd Qu.:7.067       3rd Qu.:0.1985   3rd Qu.:0.1283  
##  Max.   :2.540     Max.   :9.060       Max.   :0.2530   Max.   :0.1560  
##    Education      HealthDeprivation     Crime            Services    
##  Min.   : 3.958   Min.   :-1.4100   Min.   :-0.1550   Min.   :19.63  
##  1st Qu.:10.047   1st Qu.:-0.5055   1st Qu.: 0.3745   1st Qu.:24.43  
##  Median :13.925   Median :-0.2050   Median : 0.5515   Median :30.41  
##  Mean   :14.024   Mean   :-0.2044   Mean   : 0.5379   Mean   :29.55  
##  3rd Qu.:17.480   3rd Qu.: 0.2010   3rd Qu.: 0.7823   3rd Qu.:34.74  
##  Max.   :27.182   Max.   : 0.5430   Max.   : 1.0190   Max.   :41.89  
##   Environment          i        
##  Min.   :13.37   Min.   : 0.00  
##  1st Qu.:24.03   1st Qu.: 7.75  
##  Median :28.20   Median :15.50  
##  Mean   :31.38   Mean   :15.50  
##  3rd Qu.:40.15   3rd Qu.:23.25  
##  Max.   :55.00   Max.   :31.00
# Map the OBServed number of flu reports
spplot(london, "Flu_OBS")

# Compute and print the overall incidence of flu
r <- sum(london$Flu_OBS) / sum(london$TOTAL_POP)
r
## [1] 0.0001424128
# Calculate the expected number for each borough
london$Flu_EXP <- london$TOTAL_POP * r

# Calculate the ratio of OBServed to EXPected
london$Flu_SMR <- london$Flu_OBS / london$Flu_EXP

# Map the SMR
spplot(london, "Flu_SMR")

Exceedance probabilities

Distributions and confidence intervals can be difficult things to present to non-statisticians. An alternative is to present a probability that a value is over a threshold. For example, public health teams might be interested in when an SMR has more than doubled, and as a statistician you can give a probability that this has happened. Then the public health team might decide to go to some alert level when the probability of a doubling of SMR is over 0.95.

Again, the properties of the binomial distribution let you compute this for proportional data. You can then map these exceedance probabilities for some threshold, and use a sensible color scheme to highlight probabilities close to 1.

# Probability of a binomial exceeding a multiple
binom.exceed <- function(observed, population, expected, e){
    1 - pbinom(e * expected, population, prob = observed / population)
}

# Compute P(rate > 2)
london$Flu_gt_2 <- binom.exceed(
            observed = london$Flu_OBS,
            population = london$TOTAL_POP,
            expected = london$Flu_EXP,
            e = 2)

# Use a 50-color palette that only starts changing at around 0.9
pal <- c(
  rep("#B0D0B0", 40),
  colorRampPalette(c("#B0D0B0", "orange"))(5), 
  colorRampPalette(c("orange", "red"))(5)
)

# Plot the P(rate > 2) map
spplot(london, "Flu_gt_2", col.regions = pal, at = seq(0, 1, len = 50))

Generalized linear models in space

\[ Y \sim D(\mu(X\beta)) \] Here D is some probability distribution and \(\mu\) is the inverse of the link function, which transforms the linear predictor \(X\beta\) onto the scale of the response mean.

A Poisson GLM

A Poisson generalized linear model is a way of fitting count data to explanatory variables. You get out parameter estimates and standard errors for your explanatory variables, and can get fitted values and residuals.

The glm() function fits Poisson GLMs. It works just like the lm() function, but you also specify a family argument. The formula has the usual meaning - response on the left of the ~, and explanatory variables on the right.

To cope with count data coming from populations of different sizes, you specify an offset argument. This adds a constant term for each row of the data in the model. The log of the population is used in the offset term.

# Fit a poisson GLM.
model_flu <- glm(
  Flu_OBS ~ HealthDeprivation, 
  offset = log(TOTAL_POP), 
  data = london, 
  family = poisson)

# Is HealthDeprivation significant?
summary(model_flu)
## 
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = poisson, 
##     data = london, offset = log(TOTAL_POP))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -9.5361  -4.5285  -0.0499   2.9043   8.2194  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -8.78190    0.02869 -306.043   <2e-16 ***
## HealthDeprivation  0.65689    0.06797    9.665   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 703.75  on 31  degrees of freedom
## Residual deviance: 605.03  on 30  degrees of freedom
## AIC: 762.37
## 
## Number of Fisher Scoring iterations: 5
# Put residuals into the spatial data.
london$Flu_Resid <- residuals(model_flu)

# Map the residuals using spplot
spplot(london, "Flu_Resid")

Residuals

A linear model should fit the data and leave uncorrelated residuals. This applies to non-spatial models, where, for example, fitting a straight line through points on a curve would lead to serially-correlated residuals. A model on spatial data should aim to have residuals that show no significant spatial correlation.

You can test the model fitted to the flu data using moran.mc() from the spdep package. Monte Carlo Moran tests were previously discussed in the Spatial autocorrelation test exercise earlier in the chapter.

borough_nb <- poly2nb(london)

# Test spatial correlation of the residuals.
moran.mc(london$Flu_Resid, listw = nb2listw(borough_nb), nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  london$Flu_Resid 
## weights: nb2listw(borough_nb)  
## number of simulations + 1: 1000 
## 
## statistic = 0.15059, observed rank = 932, p-value = 0.068
## alternative hypothesis: greater

Correlation in spatial GLMs

Fit a Bayesian GLM

Bayesian statistical models return samples of the parameters of interest (the “posterior” distribution) based on some “prior” distribution which is then updated by the data. The Bayesian modeling process returns a number of samples from which you can compute the mean, or an exceedance probability, or any other quantity you might compute from a distribution.

Before you fit a model with spatial correlation, you’ll first fit the same model as before, but using Bayesian inference.

# Use R2BayesX

# Fit a GLM
model_flu <- glm(Flu_OBS ~ HealthDeprivation, offset = log(TOTAL_POP),
                data = london, family = poisson)

# Summarize it                
summary(model_flu)
## 
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = poisson, 
##     data = london, offset = log(TOTAL_POP))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -9.5361  -4.5285  -0.0499   2.9043   8.2194  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -8.78190    0.02869 -306.043   <2e-16 ***
## HealthDeprivation  0.65689    0.06797    9.665   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 703.75  on 31  degrees of freedom
## Residual deviance: 605.03  on 30  degrees of freedom
## AIC: 762.37
## 
## Number of Fisher Scoring iterations: 5
# Calculate coeff confidence intervals
confint(model_flu)
## Waiting for profiling to be done...
##                       2.5 %     97.5 %
## (Intercept)       -8.838677 -8.7261843
## HealthDeprivation  0.524437  0.7908841
# Fit a Bayesian GLM
bayes_flu <- bayesx(Flu_OBS ~ HealthDeprivation, offset = log(london$TOTAL_POP),
                    family = "poisson", data = data.frame(london), 
                    control = bayesx.control(seed = 17610407))
                    
# Summarize it                    
summary(bayes_flu)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation, data = data.frame(london), 
##     offset = log(london$TOTAL_POP), control = bayesx.control(seed = 17610407), 
##     family = "poisson")
##  
## Fixed effects estimation results:
## 
## Parametric coefficients:
##                      Mean      Sd    2.5%     50%   97.5%
## (Intercept)       -8.7843  0.0281 -8.8371 -8.7848 -8.7292
## HealthDeprivation  0.6582  0.0700  0.5282  0.6568  0.7959
##  
## N = 32  burnin = 2000  method = MCMC  family = poisson  
## iterations = 12000  step = 10
# Look at the samples from the Bayesian model
plot(samples(bayes_flu))

Adding a spatially autocorrelated effect

You’ve fitted a non-spatial GLM with BayesX. You can include a spatially correlated term based on the adjacency structure by adding a term to the formula specifying a spatially correlated model.

# Compute adjacency objects
borough_nb <- poly2nb(london)
borough_gra <- nb2gra(borough_nb)

# Fit spatial model
flu_spatial <- bayesx(
  Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", map = borough_gra),
  offset = log(london$TOTAL_POP),
  family = "poisson", data = data.frame(london), 
  control = bayesx.control(seed = 17610407)
)
## Note: created new output directory 'C:/Users/DATASC~1/AppData/Local/Temp/RtmpmeSEMZ/bayesx1'!
# Summarize the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", 
##     map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP), 
##     control = bayesx.control(seed = 17610407), family = "poisson")
##  
## Fixed effects estimation results:
## 
## Parametric coefficients:
##                      Mean      Sd    2.5%     50%   97.5%
## (Intercept)       -9.2522  0.1183 -9.5107 -9.2426 -9.0462
## HealthDeprivation  0.6507  0.4918 -0.3896  0.6633  1.5985
## 
## Smooth terms variances:
##             Mean     Sd   2.5%    50%  97.5%    Min    Max
## sx(i):mrf 4.6297 1.6885 2.2601 4.2771 8.6782 1.5129 14.119
##  
## N = 32  burnin = 2000  method = MCMC  family = poisson  
## iterations = 12000  step = 10

Mapping the spatial effects

As with glm, you can get the fitted values and residuals from your model using the fitted and residuals functions. bayesx models are a bit more complex, since you have the linear predictor and terms from sx elements, such as the spatially correlated term.

The summary function will show you information for the linear model terms and the smoothing terms in two separate tables. The spatial term is called "sx(i):mrf" - standing for “Markov Random Field”.

Bayesian analysis returns samples from a distribution for our S(x) term at each of the London boroughs. The fitted function from bayesx models returns summary statistics for each borough. You’ll just look at the mean of that distribution for now.

# Summarize the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", 
##     map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP), 
##     control = bayesx.control(seed = 17610407), family = "poisson")
##  
## Fixed effects estimation results:
## 
## Parametric coefficients:
##                      Mean      Sd    2.5%     50%   97.5%
## (Intercept)       -9.2522  0.1183 -9.5107 -9.2426 -9.0462
## HealthDeprivation  0.6507  0.4918 -0.3896  0.6633  1.5985
## 
## Smooth terms variances:
##             Mean     Sd   2.5%    50%  97.5%    Min    Max
## sx(i):mrf 4.6297 1.6885 2.2601 4.2771 8.6782 1.5129 14.119
##  
## N = 32  burnin = 2000  method = MCMC  family = poisson  
## iterations = 12000  step = 10
# Map the fitted spatial term only
london$spatial <- fitted(flu_spatial, term = "sx(i):mrf")[, "Mean"]
spplot(london, zcol = "spatial")

# Map the residuals
london$spatial_resid <- residuals(flu_spatial)[, "mu"]
spplot(london, zcol = "spatial_resid")

# Test residuals for spatial correlation
moran.mc(london$spatial_resid, nb2listw(borough_nb), 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  london$spatial_resid 
## weights: nb2listw(borough_nb)  
## number of simulations + 1: 1000 
## 
## statistic = -0.21497, observed rank = 55, p-value = 0.945
## alternative hypothesis: greater