2. Introducing Spatial Statistical Ideas

December 2020

Spatial Statistics

Overview

Motivation
Waldo Tobler’s First Law of Geography
- “everything is related to everything else, but near things are more related than distant things”.
It was meant to be tongue-in-cheek but think of
- Diffusion processes
- Migration
- Crime (eg. Felson and Cohen’s routine activity theory)
- Cell kinetics (Ok, that isn’t geography!)
All support the idea well
But this isn’t encapsulated in independence between observations assumption of many statistical models

Implications

“Data of geographic units are tied together, like bunches of grapes, not separate, like balls in an urn” - Frederick F. Stephan (1934)
Similar to time series and temporal closeness

We need to adjust existing approaches to data analysis, or consider new ones accordingly
- Summary statistics
- Visualisation
- Modelling

For the different kinds of spatial data seen earlier

Plan for rest of session

Focus on
1. Area data with associated values
2. Point data with associated valures
3. A few quick examples of other types of analysis

Summary Statistics
The absence of independence

Join Count Statistic

age_cty <- age_cty %>% mutate(Age_gt_37 = Mean_Age > 37)

For binary (or categorical variables)
In LHS map do counties of the same kind clump together
Count the cross-border joins:

Combination	Count
Red-Red	8
Red-Blue	17
Blue-Blue	33

Is that what one might expect if Red/Blue were randomly distributed across counties?

A Crowdsourced Hypothesis Test

One of these was based on real data
The others by randomly assigning 16 reds and 10 blues to counties
Which is the real data one?

A Crowdsourced Hypothesis Test - Revealed

One of these was based on real data
The others by randomly assigning 16 reds and 10 blues to counties
Which is the real data one?
Panel number 2

Testing the Join Count Statistic

Let \(\textrm{BB}\) be the Blue-Blue join count, and \(J_{ij}\) be an indicator for whether counties \(i\) and \(j\) are adjacent. Under an assumption of randomisation (ie the given red and blue counts randomly permuted across the counties) we have:

\[ \textrm{E}(\textrm{BB}) = \frac{S_0}{2}\frac{n_B^{(2)}}{n^{(2)}} \]

\[ \textrm{Var}(\textrm{BB}) = \frac{S_1}{4}\left[\frac{n_B^{(2)}}{n^{(2)}} - \frac{2n_B^{(3)}}{n^{(3)}} + \frac{n_B^{(4)}}{n^{(4)}} \right] + \frac{S_2}{4}\left[\frac{n_B^{(3)}}{n^{(3)}} - \frac{n_B^{(4)}}{n^{(4)}} \right] + \frac{S_0^2n_B^{(4)}}{4n^{(4)}} - \textrm{E}(\textrm{BB})^2 \]

where: \(n^{(k)} = n(n-1) \cdots (n-k+1)\), \(S_0 = \sum_{ij} J_{ij}\)
\(S_1=\frac{1}{2}\sum_{ij}(J_{ji}+J_{ij})^2\), \(S_2=\sum_{ij}{(\sum_jJ_{ij}+\sum_iJ_{ij})^2}\)
Assume asymptotic normality and test observed \(\textrm{BB}\) count
see Spatial Processes: Models and Applications AD Cliff & JK Ord (p39) for details

Doing this in R…

library(spdep) # A spatial statistics package for R
col_cty <- ifelse(age_cty$Age_gt_37,'Red','Blue') %>% factor()
joincount.test(col_cty, 
               nb2listw(poly2nb(age_cty),style='B'))[[1]]

    Join count test under nonfree sampling

data:  col_cty 
weights: nb2listw(poly2nb(age_cty), style = "B") 

Std. deviate for Blue = 2.6138, p-value = 0.004477
alternative hypothesis: greater
sample estimates:
Same colour statistic           Expectation              Variance 
             8.000000              3.747692              2.646642

Based on Blue-Blue counts there is reasonable evidence to reject a hypothesis of no clustering of Blue counties (p < 0.005)

NOTES:
- Uses the spdep library
- col_cty is a factor variable for the colour of each county
- poly2nb extracts a neighbour list from a polygon sf
- nb2listw turns this into a listw object - a weighted neighbour list, although here all the weights are 1 (specified by style="B")
- joincount.test does the test - it returns a list with 2 results (one for each colour), we select the first item (for \(\textrm{BB}\))

A Monte Carlo alternative

set.seed(19760604) # Reproducibility!!
joincount.mc(col_cty, 
               nb2listw(poly2nb(age_cty),style='B'),nsim=10000)[[1]]

    Monte-Carlo simulation of join-count statistic

data:  col_cty 
weights: nb2listw(poly2nb(age_cty), style = "B") 
number of simulations + 1: 10001 

Join-count statistic for Blue = 8, rank of observed statistic =
9868, p-value = 0.0133
alternative hypothesis: greater
sample estimates:
    mean of simulation variance of simulation 
              3.733800               2.664204

Based in Blue-Blue counts there is reasonable evidence (but weaker) to reject a hypothesis of no clustering of Blue counties (p < 0.05)

NOTES:
- This randomly permutes the Red and Blue colours among counties several times
- Doesn’t rely on asymptotic normal approximation
- joincount.mc then compares the actual join count with the simulated distribution
- Then the experimental p-value is based on the rank of the actual count
- joincount.mc works pretty much like joincount.test with an extra nsim (number of simulations) parameter.

Moran’s-\(I\)

A similar idea, but now each geographical area has a numeric value attached
For example average age for each county

\[ I = \frac{n}{\sum_{ij}W_{ij}}\frac{\sum_{ij}W_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_i(x_i-\bar{x})^2} \]

\(W_{ij}\) is a weight attached to the join for each county pair - we could just use \(J_{ij}\)
The idea is a non-binary weight could reflect strength of join
- But if \(W_{ij}\) is \(J_{ij}\) then

\[ I = \frac{\textsf{Mean of }(x_i - \bar{x})(x_j-\bar{x}) \textsf{ if i and j adjacent}}{\textsf{Variance of x}}\\ = \textsf{Mean of }{z_jz_j} \textsf{ for z-scores of adjacent zones} \]

Visually assessing spatial dependence

The Moran plot
Works well if \(W_{ij} = J_{ij}/J_{i.}\) (ie rows sum to 1)
The plot here is of \(x_i\) vs \(\sum_j W_{ij} x_i\)
ie \(x_i\) vs. mean of neighbours \(x_j\)’s
R code is below (style='W' for \(W_{ij}\) as above)

moran.plot(age_cty$Mean_Age, 
           nb2listw(poly2nb(age_cty),style='W'))

Testing Moran’s \(I\)

moran.test(age_cty$Mean_Age, 
               nb2listw(poly2nb(age_cty),style='W'))

    Moran I test under randomisation

data:  age_cty$Mean_Age  
weights: nb2listw(poly2nb(age_cty), style = "W")    

Moran I statistic standard deviate = 4.2282, p-value = 1.178e-05
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
       0.49267372       -0.04000000        0.01587125

Based on similar analysis to join count
\(E(I) = -1/(n-1)\)
\(\textsf{Var}(I)\) very complicated - see Cliff & Ord!
Again, evidence of spatial pattern

set.seed(19981115)
moran.mc(age_cty$Mean_Age, 
               nb2listw(poly2nb(age_cty),style='W'),nsim=1000)

    Monte-Carlo simulation of Moran I

data:  age_cty$Mean_Age 
weights: nb2listw(poly2nb(age_cty), style = "W")  
number of simulations + 1: 1001 

statistic = 0.49267, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater

Monte-Carlo variation
Avoids aymptotic normality assumption
Again, evidence of spatial pattern

Point Data

Now consider point data - are points randomly distributed?
New data, thanks to City of Toronto Open Data Portal
- https://open.toronto.ca
Here reported broken water main locations are downloaded in R

library(opendatatoronto)
watermain <- search_packages("Watermain Breaks") %>% 
  list_package_resources() %>% .[2,] %>% get_resource()
neighbourhoods <- list_package_resources(
    "https://open.toronto.ca/dataset/neighbourhoods/") %>%
  get_resource() %>% st_transform(3857)
cline <- search_packages("TCL") %>% 
  list_package_resources() %>% .[1,] %>% 
  get_resource() %>%
  st_transform(3857) %>% 
  filter(FCODE_DESC %in% c("Major Arterial",
                           "Minor Arterial"))

ggplot(neighbourhoods) + geom_sf() + 
  geom_sf(data=cline,col='darkred') + 
  geom_sf(data=watermain %>% filter(BREAK_YEAR==2014))

Mean Nearest Neighbour Distance

This measures spatial clustering of the points, the lower this is, the more items are ‘clumped’
Can compute this via the RANN package in R

nnd <- watermain %>% 
  st_coordinates() %>% 
  RANN::nn2(k=2) %>% .$nn.dists 
nnd <- nnd[,2]
glue::glue(
 "Mean NND is {round(mean(nnd),2)} for {length(nnd)} points.")

Mean NND is 46.79 for 35465 points.

However we need to know how the observed value (in m) compares to a random process
For a random process on a 2D plane, the probability density of \(2\pi\lambda d_{i}^2\) is \(\chi^2_2\) Where \(d_{i}\) is the distance from \(p_i\) to its nearest neighbour and \(\lambda\) is the density of points per unit area - (Skellam, 1951)

Skellam, JG, 1951, “Random Dispersal in Theoretical Populations”, Biometrika, 38, 196-218

This needs a lower tail test
We can use some sf tools to estimate \(\lambda\)

lambda <- length(nnd)/sum(st_area(neighbourhoods))
test_stat <- 2 * pi * lambda * sum(nnd^2)
glue::glue(
 " Test statistic = {round(test_stat,1)}
  D.F. = {2*length(nnd)}
  p-value = {round(pchisq(test_stat,2*length(nnd)),2)}")

Test statistic = 37216.7
D.F. = 70930
p-value = 0

Although, note ideally we would allow for edge effects
… and think carefully about how \(\lambda\) is computed

Where might the clusters be?

Results suggest breaks DO cluster
Return to the mapping tools from before to explore

library(ggspatial)
hexgrid <- st_make_grid(neighbourhoods,square=FALSE,n=30) %>% 
  st_sf() %>% mutate(id=row_number())
hexgrid <- watermain %>% st_join(hexgrid) %>% 
  count(id) %>% st_drop_geometry() %>% 
  left_join(hexgrid,.) %>% mutate(n=ifelse(is.na(n),0,n))
ggplot(hexgrid) +  
  annotation_map_tile(zoomin=0,type='osmgrayscale') +
  geom_sf(data=hexgrid,aes(fill=n),alpha=0.5,lwd=NA) +
  scale_fill_distiller(name='Breaks',direction=1,
                       palette='Reds')

A closer look

Conclusion

The final part …

From
- The absence of spatial independence
To
- The presence of dependence