Analysing Spatial Data

What do we mean?

Some processes are essentially spatial
- Weather patterns
- Epidemic spreading
- Global temperature patterns
Need to summarise and visualise these processes to help understand them
- Here we can do this in R with tmap, sf and other helpers
- Need to think about space and time

What is Special About Spatial?

Tobler’s First “Law”

“I invoke the first law of geography - everything is related to everything else, but near things are more related than distant things.” (Tobler, 1970)

Tobler, W. R. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46: 234–40.

Tone is tongue in cheek
- I’m not sure Tobler meant it to be a ‘law’.
But frequently the observation is true
Certainly a hypothesis worth testing
- Things may not happen because of Tobler’s first law
- but many processes exhibit it
- e.g. rainfall, levels of particulate matter

Spatial Autocorrelation - 1

Measures to what extent Tobler’s ‘Law’ applies
Similar idea to correlation seen last week
Consider the mean rainfall data by county from last week
Do neighbour county levels look similar?

Spatial Autocorrelation - 2

Subtract overall mean from each county
Values above the mean \(\rightarrow\) blue
Values below the mean \(\rightarrow\) red
Look at boundaries
- Mainly \((+,+)\) or \((-,-)\)
- Fewer \((+,-)\) or \((-,+)\)

Spatial Autocorrelation - 3

Subtract overall mean from each county
Values above the mean \(\rightarrow\) blue
Values below the mean \(\rightarrow\) red
Look at boundaries - multiply county values for each side
- Mainly \((+,+)\) or \((-,-)\) \(\rightarrow\) blue
- Fewer \((+,-)\) or \((-,+)\) \(\rightarrow\) red
- Mean of this is 76.485
Finally, as with the correlation coefficient, divide by the variance (the standard deviation squared) to make the measure unit free
- Value is 0.515
- This is Moran’s-\(I\).
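
The steps above can be written as a single formula. This is a sketch under stated assumptions rather than the exact definition used for the slide: \(x_i\) is the value for county \(i\), \(\bar{x}\) is the overall mean, \(n\) is the number of counties, \(B\) is the set of neighbouring county pairs (each border counted once in each direction), and the variance is assumed to divide by \(n\).

\[
I = \frac{\dfrac{1}{|B|}\sum_{(i,j)\in B}(x_i-\bar{x})(x_j-\bar{x})}{\dfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}
\]

With the figures quoted above, the numerator is 76.485 and \(I\) = 0.515, which implies a variance of roughly \(76.485 / 0.515 \approx 148.5\).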

What does Moran’s-\(I\) look like?

Negative Moran’s \(I\)

Quite a number of differences over borders: \(I\) = -0.314

Moran’s \(I\) near zero

Mixture of differences/similarities across borders: \(I\) = -0.03

Positive Moran’s-\(I\)

Mainly same sign across borders: \(I\) = 0.515

Calculating the Moran’s-\(I\) in R - 1

You need my function morans_I in the file morans_I.R
The files are in the usual folder on Teams
Also need the adjacency information - that’s in adjacency.csv
Read all of this into R…

library(tidyverse)   # provides read_csv()
source('morans_I.R')
adj <- read_csv('adjacency.csv')
adj

## # A tibble: 7 x 2
##   County1      County2
##   <chr>        <chr>  
## 1 Armagh       Antrim 
## 2 Down         Antrim 
## 3 Londonderry  Antrim 
## 4 Tyrone       Antrim 
## 5 Antrim       Armagh 
## 6 Down         Armagh 
## 7 Louth County Armagh

Calculating the Moran’s-\(I\) in R - 2

You need the adjacency information and
- A file with the areas and the quantity of interest (counties.geojson)
- You already have that as a file:
- Enter library(sf) if it is not already loaded, then counties <- st_read('counties.geojson')

m_I <- morans_I(counties,adj,County,mean_rain)
m_I

## [1] 0.5218756

The format is morans_I(#1,#2,#3,#4)
- #1 - geographical data object
- #2 - adjacency table
- #3 - Area name variable
- #4 - Variable to calculate Moran’s-\(I\)
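
To show what a calculation of this kind involves, here is a minimal sketch in base R. It is not the morans_I function from the lecture files: the name morans_sketch, the toy values and the toy adjacency table are all hypothetical, and it assumes that each border is listed in both directions and that the variance divides by the number of areas.

```r
# Toy values for four made-up areas (NOT the lecture's rainfall data)
vals <- c(A = 10, B = 12, C = 3, D = 5)

# Toy adjacency table: a chain A-B-C-D, each border listed in both directions
adj_toy <- data.frame(
  County1 = c("B", "A", "C", "B", "D", "C"),
  County2 = c("A", "B", "B", "C", "C", "D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean cross-border product divided by the variance
morans_sketch <- function(values, adj) {
  dev   <- values - mean(values)               # subtract the overall mean
  cross <- dev[adj$County1] * dev[adj$County2] # multiply values on each side of a border
  mean(cross) / mean(dev^2)                    # divide by the variance to make it unit free
}

toy_I <- morans_sketch(vals, adj_toy)
toy_I
```

The real morans_I takes a geographical data object and column names instead of a plain named vector, so treat this only as an illustration of the arithmetic.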

So, is 0.5-ish an important level of spatial correlation?

If the levels of rainfall were scattered randomly among the counties, there would still be some degree of apparent clustering
Think of the star constellations in session 1!
So, are the 0.5-ish levels seen for rainfall indicative of clustering or just a random pattern?
To test this, start by assuming levels are just randomly scattered
then…

A randomisation test - 1

1. Randomly assign each rainfall value to a county
2. Compute the Moran’s-\(I\) for that random assignment
3. Note the value
4. Repeat from step 1 a large number of times (e.g. 1,000)
5. Compare the actual Moran’s-\(I\) to these
The proportion of random values exceeding the actual value is an estimate of the chance that the actual value, or something more extreme, could have happened if the pattern really was random.
Very low values of this proportion suggest the pattern is clustered, not random

A randomisation test - 2

In R this is done using a loop

# make a vector for the randomised Moran's I values
# start by filling with zeroes, then fill in values
random_m_I <- rep(0,10000)

## Loop to calculate the randomised Moran's I's
for (i in 1:10000) {
  test <- mutate(counties,mean_rain=sample(mean_rain)) # Randomise
  random_m_I[i] <- morans_I(test,adj,County,mean_rain) # calculate I
}

## Quick summary of results
fivenum(random_m_I)

## [1] -0.31701185 -0.09599609 -0.03478540  0.03124031  0.42912956

Graphical view of results - 1

hist(random_m_I,xlim=c(-0.7,0.7),
     main="Randomised Values of Moran's I",
     xlab="Moran's I")
abline(v=m_I,col='brown',lwd=2)
text(0.5,1000,'Actual Value',srt=90,col='brown')

Graphical view of results - 2

Note: ylim, not xlim, here! With horizontal=TRUE the value axis is still controlled by ylim.

boxplot(random_m_I, horizontal=TRUE,
     main="Randomised Values of Moran's I",
     xlab="Moran's I",ylim=c(-0.7,0.7))
abline(v=m_I,col='brown',lwd=2)
text(0.5,1,'Actual Value',srt=90,col='brown')

Graphical view of results - 3

dens <- density(random_m_I)
plot(dens,
     main="Randomised Values of Moran's I",
     xlab="Moran's I",xlim=c(-0.7,0.7))
abline(v=m_I,col='brown',lwd=2)
text(0.5,2,'Actual Value',srt=90,col='brown')

So what is the probability mentioned?

length(which(random_m_I > m_I))/10000

## [1] 0

Effectively zero!
In reality not exactly zero BUT
- In 10,000 random simulations none exceed the actual value
- Possibly in 10 million simulations some would have
- Pretty strong evidence for clustering of values
- Positive spatial autocorrelation
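
Because the true probability is not exactly zero, a common adjustment adds one to both the count and the number of simulations, in effect treating the observed arrangement as one more possible shuffle. The sketch below shows this; the helper name randomisation_p and the stand-in numbers are hypothetical, and in practice you would pass random_m_I and m_I instead.

```r
# Hypothetical helper (not from the lecture files): one-sided randomisation p-value
randomisation_p <- function(random_vals, actual) {
  n_exceed <- sum(random_vals >= actual)  # simulated values at least as large as the actual one
  c(naive    = n_exceed / length(random_vals),
    adjusted = (n_exceed + 1) / (length(random_vals) + 1))
}

# Stand-in numbers for illustration only: 10,000 evenly spaced values below 0.52
fake_random <- seq(-0.3, 0.4, length.out = 10000)
p <- randomisation_p(fake_random, 0.52)
p
```

The adjusted value can never be exactly zero, which matches the point that none of the 10,000 simulations exceeding the actual value is strong evidence, not proof, of clustering.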

More advanced data processing

`dplyr`, the `tidyverse` and pipelines

Until now, our use of R has been ‘traditional’
Similar to code in, say, Java in terms of loops, functions etc.
The tidyverse approach (using libraries like dplyr) is useful for a ‘flow-based’ style, where data passes through a sequence of steps
In short:

| Version \(\Rightarrow\) | Ordinary `R` | Tidyverse `R` |
|---|---|---|
| | `f1(f2(f3(x)))` | `x %>% f3() %>% f2() %>% f1()` |
| | `f1(f2(f3(x,b),a))` | `x %>% f3(b) %>% f2(a) %>% f1()` |
| | `y <- f1(f2(f3(x,b),a))` | `y <- x %>% f3(b) %>% f2(a) %>% f1()` |
| Or | `y <- f1(f2(f3(x,b),a))` | `x %>% f3(b) %>% f2(a) %>% f1() -> y` |
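
As a concrete check of the equivalence, here is a small sketch using ordinary base R functions in place of f1 and f2. The vector x is made up for illustration. The last line uses the native pipe, which is available in R 4.1 or later and is shown only for comparison.

```r
library(magrittr)  # provides %>%; loading dplyr or tidyverse also makes it available

x <- c(1.234, 5.678)  # made-up numbers

nested <- sum(round(x, 1))          # ordinary R: read from the inside out
piped  <- x %>% round(1) %>% sum()  # tidyverse style: read left to right
native <- x |> round(1) |> sum()    # base R pipe (R >= 4.1)

identical(nested, piped)
```

All three forms do the same work; the piped versions simply list the steps in the order they happen.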

Why do this?

Which is easier to read?

Ordinary R:

high_counties <- st_drop_geometry(arrange(filter(counties,mean_rain > 80),
                                          mean_rain))

Tidyverse:

library(tidyverse)
counties %>% filter(mean_rain > 80) %>% 
  arrange(mean_rain) %>% 
  st_drop_geometry() -> high_counties

Generally, the tidyverse is most useful for processing data tables
Should you use ‘->’ or ‘<-’ in tidyverse expressions?
- I don’t mind, it’s a personal choice!
- But try to stick to one or the other!

Conclusion

💡 New ideas

New general ideas
- Spatial autocorrelation
- tidyverse
New techniques
- Computing Moran’s \(I\)
- Seeing if Moran’s-\(I\) is significant
- Using the tidyverse
Practical issues
- R commands for above
Next lecture - More Investigating Relationships / ggplot
This link may be useful for some of the ideas - follow up on the first two suggested links.