Analysing Spatial Data

What do we mean?

  • Some processes are essentially spatial
    • Weather patterns
    • Epidemic spreading
    • Global temperature patterns
  • Need to summarise and visualise these processes to help understand them
    • Here we can do this in R with tmap, sf and other helpers
    • Need to think about space and time

What is Special About Spatial?

Tobler’s First “Law”

“I invoke the first law of geography - everything is related to everything else, but near things are more related than distant things.” (Tobler, 1970)

Tobler, W. R. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46: 234–40.

  • Tone is tongue in cheek
    • I’m not sure Tobler meant it to be a ‘law’.
  • But frequently the observation is true
  • Certainly a hypothesis worth testing
    • Things may not happen because of Tobler’s first law
    • but many processes exhibit it
    • e.g. rainfall, levels of particulate matter

Spatial Autocorrelation - 1

  • Measures to what extent Tobler’s ‘Law’ applies
  • Similar idea to correlation seen last week
  • Consider the mean rainfall data by county from last week
  • Do neighbour county levels look similar?

Spatial Autocorrelation - 2

  • Subtract overall mean from each county
  • Colours above mean \(\rightarrow\) blue
  • Colours below mean \(\rightarrow\) red
  • Look at boundaries
    • Mainly \((+,+)\) or \((-,-)\)
    • Fewer \((+,-)\) or \((-,+)\)

Spatial Autocorrelation - 3

  • Subtract overall mean from each county
  • Colours above mean \(\rightarrow\) blue
  • Colours below mean \(\rightarrow\) red
  • Look at boundaries - multiply county values for each side
    • Mainly \((+,+)\) or \((-,-)\) \(\rightarrow\) blue
    • Fewer \((+,-)\) or \((-,+)\) \(\rightarrow\) red
    • Mean of this is 76.485
  • Finally, as with the correlation coefficient, divide by the variance (the standard deviation squared) to make the measure unit free
    • Value is 0.515
    • This is Moran’s-\(I\).
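The steps on this slide can be sketched in a few lines of base R. This is a minimal illustration on made-up data, not the course's morans_I function: the four areas, their values and the names vals, adj_toy and toy_I are all assumptions chosen for the example, and the variance is taken as the mean squared deviation.

```r
# Hypothetical toy data: four areas in a row, A - B - C - D
vals <- c(A = 1, B = 2, C = 3, D = 4)

# Each shared boundary is listed in both directions
adj_toy <- data.frame(
  Area1 = c("A", "B", "B", "C", "C", "D"),
  Area2 = c("B", "A", "C", "B", "D", "C"),
  stringsAsFactors = FALSE
)

# Step 1: subtract the overall mean from each area
z <- vals - mean(vals)

# Step 2: multiply the values on either side of each boundary
cross <- z[adj_toy$Area1] * z[adj_toy$Area2]

# Step 3: average the products, then divide by the variance
# (mean squared deviation) to make the result unit free
toy_I <- mean(cross) / mean(z^2)
toy_I
```

Because neighbouring values in this toy example rise steadily from A to D, most boundary products are positive and the result is a positive Moran's \(I\).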

What does Moran’s-\(I\) look like?

Negative Moran’s \(I\)

  • Quite a number of differences over borders: \(I\) = -0.314

Moran’s \(I\) near zero

  • Mixture of differences/similarities across borders: \(I\) = -0.03

Positive Moran’s-\(I\)

  • Mainly same sign across borders: \(I\) = 0.515

Calculating the Moran’s-\(I\) in R - 1

  • You need my function morans_I in the file morans_I.R
  • The files are in the usual folder on Teams
  • You also need the adjacency information - that’s in adjacency.csv
  • Read all of this into R…
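library(tidyverse)  # provides read_csv() and the dplyr verbs used later
library(sf)         # provides st_read() for the geojson file
source('morans_I.R')
adj <- read_csv('adjacency.csv')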
adj
## # A tibble: 7 x 2
##   County1      County2
##   <chr>        <chr>  
## 1 Armagh       Antrim 
## 2 Down         Antrim 
## 3 Londonderry  Antrim 
## 4 Tyrone       Antrim 
## 5 Antrim       Armagh 
## 6 Down         Armagh 
## 7 Louth County Armagh

Calculating the Moran’s-\(I\) in R - 2

  • You need the adjacency information and
    • A file with the areas and the quantity of interest (counties.geojson)
    • You already have that file
    • Read it in with counties <- st_read('counties.geojson')
m_I <- morans_I(counties,adj,County,mean_rain)
m_I
## [1] 0.5218756
  • format is morans_I(#1,#2,#3,#4)
    • #1 - geographical data object
    • #2 - adjacency table
    • #3 - Area name variable
    • #4 - Variable to calculate Moran’s-\(I\)

So, is 0.5-ish importantly spatially correlated?

  • If the levels of rainfall were scattered randomly among the counties, there would still be some degree of clustering by chance
  • Think of the star constellations in session 1!
  • So, are the 0.5-ish levels seen for rainfall indicative of clustering or just a random pattern?
  • To test this, start by assuming levels are just randomly scattered
  • then…

A randomisation test - 1

  1. Randomly assign each rainfall value to a county
  2. Compute the Moran’s-\(I\) for that random assignment
  3. Note the value
  4. Repeat from step 1 a large number of times (eg 1,000)
  5. Compare the actual Moran’s-\(I\) to these
  6. The proportion of random values exceeding the actual value is an estimate of the chance that the actual value, or something more extreme, could have happened if the pattern really was random.
  7. Very low values of this suggest pattern is clustered, not random

A randomisation test - 2

  • In R this is done using a loop
# make a vector for the randomised Moran's I values
# start by filling with zeroes, then fill in values in the loop
random_m_I <- rep(0,10000) 

## Loop to calculate the randomised Moran's I's
for (i in 1:10000) {
  test <- mutate(counties,mean_rain=sample(mean_rain)) # Randomise
  random_m_I[i] <- morans_I(test,adj,County,mean_rain) # calculate I
}

## Quick summary of results
fivenum(random_m_I)
## [1] -0.31701185 -0.09599609 -0.03478540  0.03124031  0.42912956
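For anyone without the course files to hand, the same randomisation idea can be tried on made-up data. The sketch below is an assumption-laden stand-in: the six areas in a chain, their values and the helper toy_morans_I are all hypothetical, and toy_morans_I is not the course's morans_I function.

```r
set.seed(1)  # make the random shuffles reproducible

# Hypothetical toy data: six areas in a chain A - B - C - D - E - F
vals <- c(A = 1, B = 2, C = 3, D = 4, E = 5, F = 6)
from <- c("A", "B", "B", "C", "C", "D", "D", "E", "E", "F")
to   <- c("B", "A", "C", "B", "D", "C", "E", "D", "F", "E")

# Hypothetical helper: mean boundary cross-product over the variance
toy_morans_I <- function(v) {
  z <- v - mean(v)
  mean(z[from] * z[to]) / mean(z^2)
}

actual <- toy_morans_I(vals)

# Shuffle the values over the areas 1,000 times and recompute each time;
# setNames() puts the shuffled values back onto the original area names
random_I <- replicate(
  1000,
  toy_morans_I(setNames(sample(vals), names(vals)))
)

fivenum(random_I)
```

replicate() plays the same role as the for loop above; which to use is a matter of taste.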

Graphical view of results - 1

hist(random_m_I,xlim=c(-0.7,0.7),
     main="Randomised Values of Moran's I",
     xlab="Moran's I")
abline(v=m_I,col='brown',lwd=2)
text(0.5,1000,'Actual Value',srt=90,col='brown')

Graphical view of results - 2

  • Note ylim, not xlim, here: with horizontal=TRUE, boxplot still sets the range of the data axis through ylim
boxplot(random_m_I, horizontal=TRUE,
     main="Randomised Values of Moran's I",
     xlab="Moran's I",ylim=c(-0.7,0.7))
abline(v=m_I,col='brown',lwd=2)
text(0.5,1,'Actual Value',srt=90,col='brown')

Graphical view of results - 3

dens <- density(random_m_I)
plot(dens,
     main="Randomised Values of Moran's I",
     xlab="Moran's I",xlim=c(-0.7,0.7))
abline(v=m_I,col='brown',lwd=2)
text(0.5,2,'Actual Value',srt=90,col='brown')

So what is the probability mentioned?

length(which(random_m_I > m_I))/10000
## [1] 0
  • Effectively zero!
  • In reality not exactly zero BUT
    • In 10,000 random simulations none exceed the actual value
    • Possibly in 10 million simulations some may have
    • Pretty strong evidence for clustering of values
    • Positive spatial autocorrelation
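Since the true probability is not exactly zero, one common convention for simulation-based tests is to count the actual value as one of the possible outcomes, giving (number at least as large + 1) / (number of simulations + 1). The sketch below illustrates that convention; it is not the calculation used on this slide, and the function name perm_p_value and the stand-in numbers are assumptions made for the example.

```r
# Hypothetical helper: simulation-based p-value with the "+1" adjustment
perm_p_value <- function(simulated, observed) {
  n_exceed <- sum(simulated >= observed)   # simulated values at least as large
  (n_exceed + 1) / (length(simulated) + 1)
}

# Stand-in values: 10,000 simulated statistics, none as large as the observed one
stand_in_sims <- rep(0, 10000)
p <- perm_p_value(stand_in_sims, 0.52)
p
```

With 10,000 simulations and no exceedances this gives 1/10001, roughly 0.0001, so the estimate can never be reported as exactly zero.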

More advanced data processing

dplyr, the tidyverse and pipelines

  • Until now, use of R has been ‘traditional’
  • Similar to code in, say, Java in terms of loops, functions etc
  • The tidyverse approach (using libraries like dplyr) is useful for a ‘flow-based’ style
  • In short
| Version \(\Rightarrow\) | Ordinary R | Tidyverse R |
|---|---|---|
| | f1(f2(f3(x))) | x %>% f3() %>% f2() %>% f1() |
| | f1(f2(f3(x,b),a)) | x %>% f3(b) %>% f2(a) %>% f1() |
| | y <- f1(f2(f3(x,b),a)) | y <- x %>% f3(b) %>% f2(a) %>% f1() |
| Or | y <- f1(f2(f3(x,b),a)) | x %>% f3(b) %>% f2(a) %>% f1() -> y |
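The equivalence of the two styles can be checked on one of R's built-in data sets. This is a small sketch assuming the dplyr package is installed; mtcars and the cut-off of 25 miles per gallon are arbitrary choices for illustration, not part of the course data.

```r
library(dplyr)

# Nested ("ordinary R") version: read from the inside out
nested <- head(arrange(filter(mtcars, mpg > 25), mpg), 3)

# Pipeline version: read from left to right, top to bottom
piped <- mtcars %>%
  filter(mpg > 25) %>%
  arrange(mpg) %>%
  head(3)

# Both routes should give exactly the same table
identical(nested, piped)
```

Each %>% passes the result on its left in as the first argument of the function on its right, which is why the two versions agree.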

Why do this?

  • which is easier to read?

Ordinary R:

high_counties <- st_drop_geometry(arrange(filter(counties,mean_rain > 80),
                                          mean_rain))

Tidyverse:

library(tidyverse)
counties %>% filter(mean_rain > 80) %>% 
  arrange(mean_rain) %>% 
  st_drop_geometry() -> high_counties
  • Generally the tidyverse is most useful for processing data tables
  • Should you use ‘->’ or ‘<-’ in tidyverse expressions?
    • I don’t mind, it’s a personal choice!
    • But try to stick to one or the other!

Conclusion

💡 New ideas

  • New general ideas

    • Spatial autocorrelation
    • tidyverse
  • New techniques

    • Computing Moran’s \(I\)
    • Seeing if Moran’s-\(I\) is significant
    • using the tidyverse
  • Practical issues

    • R commands for above
  • Next lecture - More Investigating Relationships / ggplot

  • This link may be useful for some of the ideas - follow up on the first two suggested links.