December 2020

Spatial Statistics

Overview


  • Motivation
  • Waldo Tobler’s First Law of Geography
    • “everything is related to everything else, but near things are more related than distant things”.
  • It was meant to be tongue-in-cheek but think of
    • Diffusion processes
    • Migration
    • Crime (eg. Felson and Cohen’s routine activity theory)
    • Cell kinetics (Ok, that isn’t geography!)
  • All support the idea well
  • But this isn’t encapsulated in independence between observations assumption of many statistical models

Implications


  • “Data of geographic units are tied together, like bunches of grapes, not separate, like balls in an urn” - Frederick F. Stephan (1934)
  • Similar to time series and temporal closeness

  • We need to adjust existing approaches to data analysis, or consider new ones accordingly
    • Summary statistics
    • Visualisation
    • Modelling

  • For the different kinds of spatial data seen earlier

Plan for rest of session


  • Focus on
    1. Area data with associated values
    2. Point data with associated valures
    3. A few quick examples of other types of analysis

Summary Statistics
The absence of independence

Join Count Statistic


age_cty <- age_cty %>% mutate(Age_gt_37 = Mean_Age > 37)

  • For binary (or categorical variables)
  • In LHS map do counties of the same kind clump together
  • Count the cross-border joins:
Combination Count
Red-Red 8
Red-Blue 17
Blue-Blue 33

  • Is that what one might expect if Red/Blue were randomly distributed across counties?

A Crowdsourced Hypothesis Test


  • One of these was based on real data
  • The others by randomly assigning 16 reds and 10 blues to counties
  • Which is the real data one?

A Crowdsourced Hypothesis Test - Revealed


  • One of these was based on real data
  • The others by randomly assigning 16 reds and 10 blues to counties
  • Which is the real data one?
  • Panel number 2

Testing the Join Count Statistic


  • Let \(\textrm{BB}\) be the Blue-Blue join count, and \(J_{ij}\) be an indicator for whether counties \(i\) and \(j\) are adjacent. Under an assumption of randomisation (ie the given red and blue counts randomly permuted across the counties) we have:

\[ \textrm{E}(\textrm{BB}) = \frac{S_0}{2}\frac{n_B^{(2)}}{n^{(2)}} \]

\[ \textrm{Var}(\textrm{BB}) = \frac{S_1}{4}\left[\frac{n_B^{(2)}}{n^{(2)}} - \frac{2n_B^{(3)}}{n^{(3)}} + \frac{n_B^{(4)}}{n^{(4)}} \right] + \frac{S_2}{4}\left[\frac{n_B^{(3)}}{n^{(3)}} - \frac{n_B^{(4)}}{n^{(4)}} \right] + \frac{S_0^2n_B^{(4)}}{4n^{(4)}} - \textrm{E}(\textrm{BB})^2 \]

  • where: \(n^{(k)} = n(n-1) \cdots (n-k+1)\), \(S_0 = \sum_{ij} J_{ij}\)
  • \(S_1=\frac{1}{2}\sum_{ij}(J_{ji}+J_{ij})^2\), \(S_2=\sum_{ij}{(\sum_jJ_{ij}+\sum_iJ_{ij})^2}\)
  • Assume asymptotic normality and test observed \(\textrm{BB}\) count
  • see Spatial Processes: Models and Applications AD Cliff & JK Ord (p39) for details

Doing this in R…


library(spdep) # A spatial statistics package for R
col_cty <- ifelse(age_cty$Age_gt_37,'Red','Blue') %>% factor()
joincount.test(col_cty, 
               nb2listw(poly2nb(age_cty),style='B'))[[1]]
    Join count test under nonfree sampling

data:  col_cty 
weights: nb2listw(poly2nb(age_cty), style = "B") 

Std. deviate for Blue = 2.6138, p-value = 0.004477
alternative hypothesis: greater
sample estimates:
Same colour statistic           Expectation              Variance 
             8.000000              3.747692              2.646642 
  • Based on Blue-Blue counts there is reasonable evidence to reject a hypothesis of no clustering of Blue counties (p < 0.005)
  • NOTES:
    • Uses the spdep library
    • col_cty is a factor variable for the colour of each county
    • poly2nb extracts a neighbour list from a polygon sf
    • nb2listw turns this into a listw object - a weighted neighbour list, although here all the weights are 1 (specified by style="B")
    • joincount.test does the test - it returns a list with 2 results (one for each colour), we select the first item (for \(\textrm{BB}\))

A Monte Carlo alternative


set.seed(19760604) # Reproducibility!!
joincount.mc(col_cty, 
               nb2listw(poly2nb(age_cty),style='B'),nsim=10000)[[1]]
    Monte-Carlo simulation of join-count statistic

data:  col_cty 
weights: nb2listw(poly2nb(age_cty), style = "B") 
number of simulations + 1: 10001 

Join-count statistic for Blue = 8, rank of observed statistic =
9868, p-value = 0.0133
alternative hypothesis: greater
sample estimates:
    mean of simulation variance of simulation 
              3.733800               2.664204 
  • Based in Blue-Blue counts there is reasonable evidence (but weaker) to reject a hypothesis of no clustering of Blue counties (p < 0.05)
  • NOTES:
    • This randomly permutes the Red and Blue colours among counties several times
    • Doesn’t rely on asymptotic normal approximation
    • joincount.mc then compares the actual join count with the simulated distribution
    • Then the experimental p-value is based on the rank of the actual count
    • joincount.mc works pretty much like joincount.test with an extra nsim (number of simulations) parameter.

Moran’s-\(I\)


  • A similar idea, but now each geographical area has a numeric value attached
  • For example average age for each county

\[ I = \frac{n}{\sum_{ij}W_{ij}}\frac{\sum_{ij}W_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_i(x_i-\bar{x})^2} \]

  • \(W_{ij}\) is a weight attached to the join for each county pair - we could just use \(J_{ij}\)
  • The idea is a non-binary weight could reflect strength of join
    • But if \(W_{ij}\) is \(J_{ij}\) then

\[ I = \frac{\textsf{Mean of }(x_i - \bar{x})(x_j-\bar{x}) \textsf{ if i and j adjacent}}{\textsf{Variance of x}}\\ = \textsf{Mean of }{z_jz_j} \textsf{ for z-scores of adjacent zones} \]

Visually assessing spatial dependence


  • The Moran plot
  • Works well if \(W_{ij} = J_{ij}/J_{i.}\) (ie rows sum to 1)
  • The plot here is of \(x_i\) vs \(\sum_j W_{ij} x_i\)
  • ie \(x_i\) vs. mean of neighbours \(x_j\)’s
  • R code is below (style='W' for \(W_{ij}\) as above)
moran.plot(age_cty$Mean_Age, 
           nb2listw(poly2nb(age_cty),style='W'))

Testing Moran’s \(I\)


moran.test(age_cty$Mean_Age, 
               nb2listw(poly2nb(age_cty),style='W'))
    Moran I test under randomisation

data:  age_cty$Mean_Age  
weights: nb2listw(poly2nb(age_cty), style = "W")    

Moran I statistic standard deviate = 4.2282, p-value = 1.178e-05
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
       0.49267372       -0.04000000        0.01587125 
  • Based on similar analysis to join count
  • \(E(I) = -1/(n-1)\)
  • \(\textsf{Var}(I)\) very complicated - see Cliff & Ord!
  • Again, evidence of spatial pattern
set.seed(19981115)
moran.mc(age_cty$Mean_Age, 
               nb2listw(poly2nb(age_cty),style='W'),nsim=1000)
    Monte-Carlo simulation of Moran I

data:  age_cty$Mean_Age 
weights: nb2listw(poly2nb(age_cty), style = "W")  
number of simulations + 1: 1001 

statistic = 0.49267, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater
  • Monte-Carlo variation
  • Avoids aymptotic normality assumption
  • Again, evidence of spatial pattern

Point Data


  • Now consider point data - are points randomly distributed?
  • New data, thanks to City of Toronto Open Data Portal
  • Here reported broken water main locations are downloaded in R
library(opendatatoronto)
watermain <- search_packages("Watermain Breaks") %>% 
  list_package_resources() %>% .[2,] %>% get_resource()
neighbourhoods <- list_package_resources(
    "https://open.toronto.ca/dataset/neighbourhoods/") %>%
  get_resource() %>% st_transform(3857)
cline <- search_packages("TCL") %>% 
  list_package_resources() %>% .[1,] %>% 
  get_resource() %>%
  st_transform(3857) %>% 
  filter(FCODE_DESC %in% c("Major Arterial",
                           "Minor Arterial"))

ggplot(neighbourhoods) + geom_sf() + 
  geom_sf(data=cline,col='darkred') + 
  geom_sf(data=watermain %>% filter(BREAK_YEAR==2014))

Mean Nearest Neighbour Distance


  • This measures spatial clustering of the points, the lower this is, the more items are ‘clumped’
  • Can compute this via the RANN package in R
nnd <- watermain %>% 
  st_coordinates() %>% 
  RANN::nn2(k=2) %>% .$nn.dists 
nnd <- nnd[,2]
glue::glue(
 "Mean NND is {round(mean(nnd),2)} for {length(nnd)} points.")
Mean NND is 46.79 for 35465 points.
  • However we need to know how the observed value (in m) compares to a random process
  • For a random process on a 2D plane, the probability density of \(2\pi\lambda d_{i}^2\) is \(\chi^2_2\) Where \(d_{i}\) is the distance from \(p_i\) to its nearest neighbour and \(\lambda\) is the density of points per unit area - (Skellam, 1951)

    Skellam, JG, 1951, “Random Dispersal in Theoretical Populations”, Biometrika, 38, 196-218

  • This needs a lower tail test
  • We can use some sf tools to estimate \(\lambda\)
lambda <- length(nnd)/sum(st_area(neighbourhoods))
test_stat <- 2 * pi * lambda * sum(nnd^2)
glue::glue(
 " Test statistic = {round(test_stat,1)}
  D.F. = {2*length(nnd)}
  p-value = {round(pchisq(test_stat,2*length(nnd)),2)}")
Test statistic = 37216.7
D.F. = 70930
p-value = 0
  • Although, note ideally we would allow for edge effects
  • … and think carefully about how \(\lambda\) is computed

Where might the clusters be?


  • Results suggest breaks DO cluster
  • Return to the mapping tools from before to explore
library(ggspatial)
hexgrid <- st_make_grid(neighbourhoods,square=FALSE,n=30) %>% 
  st_sf() %>% mutate(id=row_number())
hexgrid <- watermain %>% st_join(hexgrid) %>% 
  count(id) %>% st_drop_geometry() %>% 
  left_join(hexgrid,.) %>% mutate(n=ifelse(is.na(n),0,n))
ggplot(hexgrid) +  
  annotation_map_tile(zoomin=0,type='osmgrayscale') +
  geom_sf(data=hexgrid,aes(fill=n),alpha=0.5,lwd=NA) +
  scale_fill_distiller(name='Breaks',direction=1,
                       palette='Reds')

A closer look

Conclusion

The final part …


  • From
    • The absence of spatial independence
  • To
    • The presence of dependence