Now that we have started working with some spatial statistics, this next unit will help us start building towards spatial regression. Before we can get there, we need to learn one foundational topic in spatial statistics: spatial autocorrelation. One assumption of any regression model is independence. Let’s say we are working with state-level data. If each state represents an independent observation, then no one state influences any other. But is this really ever true?

In Geography, Tobler’s Law states: Everything is related to everything else, but near things are more related than distant things. We might expect, therefore, that neighboring states influence each other. This tends to be the case - if I pick any two states in the South, they are likely to be more related to each other than, say, a Northern state. So my independence assumption is violated - we often call this “Spatial Autocorrelation”. Data with high spatial autocorrelation is data where neighboring geographic units are more related than distant geographic units. In today’s unit, we will practice methods to check for spatial autocorrelation. Later on, we will work on methods to control for this problem in spatial regression.

To illustrate this, let’s load in some data. Suppose I’m a researcher studying poverty in NYC. I have income data at the neighborhood level, and I have created a Poor-index based on the ratio of a household’s income to the poverty level. I then aggregated the number of households with an index value of 1 or below to create my poor count variable. Are impoverished households clustered near each other in NYC? This will help us to learn whether we should concentrate resources in certain areas more than others. Is there spatial inequality alongside economic inequality?

library(sf)     
library(dplyr)   
library(spData) 
library(ggplot2)
library(ggthemes)
#Let's load in some packages and data
setwd("~/Binghamton/DIDA 370/nyc_neighborhoods")
Warning: The working directory was changed to C:/Users/melha/OneDrive/Documents/Binghamton/DIDA 370/nyc_neighborhoods inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
nyc <- st_read("NYC_Nhood ACS2008_12.shp")
Reading layer `NYC_Nhood ACS2008_12' from data source 
  `C:\Users\melha\OneDrive\Documents\Binghamton\DIDA 370\nyc_neighborhoods\NYC_Nhood ACS2008_12.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 195 features and 98 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
Geodetic CRS:  WGS 84
#Let's plot the data first

ggplot()+
  geom_sf(data = nyc, aes(fill=poor))+
  scale_fill_steps(
    name = "Number of \nPoor Households",
    low = "lightsteelblue1",
    high = "tomato1",
    n.breaks = 4,
    show.limits = T)+
  theme_void()

What do you think, is there spatial autocorrelation in this data?

We can test for this using a common statistic called Moran’s I. This essentially tests whether our data was produced by a spatially random process (or CSR). We will need to follow a few steps to do this. First, we need to download a new packages (spdep). Then, we will need to create some information for R about the spatial relationships between the neighborhoods in the data. We will create what are called spatial weights. Similarly to the weights used to compute a weighted average, a higher weight indicates that two neighborhoods are geographically closer to eachother, while a lower weight indicates that the neighborhoods are furthe apart. Let’s see what this looks like:

nyc_list <- nyc %>% 
  #first convert polygons to a neighbor object
  #R needs to know a unique ID for each neighborhood    
  #(cartodb_id)
  #Neighbors determined using Queen weights
  poly2nb(st_geometry(nyc), row.names = nyc$cartodb_id) %>% 
  #zero.policy just tells R not to remove 0 values
  nb2listw(zero.policy = TRUE)
Error in Ops.sfc(snap, 0) : operation < not supported

Here, with a Moran’s I statistic of 0.363 and a p-value of <0.001, we can reject the null hypothesis and conclude that our data was not generated by a spatially random process. In other words, the number of poor households by NYC neighborhood are not spatially independent - neighborhoods that are close together tend to be more similar than neigborhoods that are further apart.

Let’s look at a plot of these results!

The solid line here goes through the mean value, and slope is the Moran’s I statistic. The positive slope indicates positive autocorrelation - in other words, neighborhoods with high Poor counts tend to be near other neighborhoods with high poor counts. For each point, the x-axis represents the actual Poor count for the neighborhood, while the y-axis represents an estimate of this value for that neighborhood’s neighbors. If the points in our plot are mostly concentrated in the bottom left and top right quadrants of the graph, we have positive autocorrelation. If they are concentrated in the top left and bottom right quadrants, we have negative autocorrelation. In data with no autocorrelation, we would expect to see a random cloud of points (with no pattern).

Next, we’ll take a look at Local Indicators of Spatial Autocorrelation (LISA) plots. This will help us to identify clusters of low or high values in the graph, as well as outliers. This uses the Local Moran’s I statistic. Instead of calculating one Moran’s I value for the whole plot (as is the case with the global statistic), the local version calculates a local I for each spatial unit (neighborhood)

lisaRslt <- localmoran(nyc$poor, nyc_list, 
            zero.policy = TRUE, na.action = na.omit)

Let’s talk about the values in this output for a moment. The Pr column is our P-value. A significant result (i.e. a p-value less than 0.05) tells us that a neighborhood is either a) part of a cluster or b) a spatial outlier. The Li column gives us more info about which one it is. If Li is positive, then a neighborhood is part of a spatial cluster. Otherwise, it is a spatial outlier.

The code below reproduces a LISA plot that is commonly used in GIS software. Essentially, this plot a) helps us to identify significant neighborhoods, and b) visualizes whether they are clusters or outliers. High-High (HH) and Low-Low (LL) neighborhoods are clusters - these are units that are surrounded by units with similar values. High-Low (HL) and Low-High (LH) units are spatial outliers - they are different from their surrounding units. The L and H tell us whether the number of poor households in a neighborhood is low or high relative to the mean.

# Now we can derive the cluster/outlier types (COType in ArcGIS term) for each spatial feature in the data
library(magrittr)

Attaching package: ‘magrittr’

The following object is masked from ‘package:purrr’:

    set_names

The following object is masked from ‘package:tidyr’:

    extract
library(tidyr)
significanceLevel <- 0.05; # 95% confidence
meanVal <- mean(nyc$poor);

lisaRslt %<>% as_tibble() %>%
  set_colnames(c("Ii","E.Ii","Var.Ii","Z.Ii","Pr(z > 0)")) %>%
  #Create a new categorical value
  #This variable will call insignificant units "insignificant"
  #Outliers are determined by LI <0 - units higher than the mean are HL and units lower than the mean are LH
  #Clusters are units with LI >= 0, and classified as HH or LL based on whether they are above or below the mean
  mutate(coType = case_when(
  `Pr(z > 0)` > 0.05 ~ "Insignificant",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & nyc$poor >= meanVal ~ "HH",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & nyc$poor < meanVal ~ "LL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & nyc$poor >= meanVal ~ "HL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & nyc$poor < meanVal ~ "LH"))

# Now add this coType to original sf data
nyc$coType <- lisaRslt$coType %>% replace_na("Insignificant")

#Now we'll plot it!
ggplot(nyc) +
  geom_sf(aes(fill=coType),color = 'lightgrey') +
  scale_fill_manual(values = c('red','brown','NA','blue','cyan'), name='Clusters & \nOutliers') +
  labs(title = "Number of Poor Families by Neighborhood")+
  theme_minimal()

Now, it’s your turn! Pick a new variable from the data and re-run the analysis. What do you find? Do you see different clustering patterns than the one above?
r #Analysis here

Resources ArcGIS (2022). Global Vs. Local Spatial Autocorrelation. Retrieved from: https://storymaps.arcgis.com/stories/5b26f25bb81a437b89003423505e2f71.

Sun, S. (N. D.). 4.1: Spatial Autocorrelation. Retrieved from: http://www.geo.hunter.cuny.edu/~ssun/R-Spatial/spregression.html#spatial-autocorrelation.

---
title: "R Notebook"
output: html_notebook
---
Now that we have started working with some spatial statistics, this next unit will help us start building towards spatial regression. Before we can get there, we need to learn one foundational topic in spatial statistics: spatial autocorrelation. One assumption of any regression model is independence. Let's say we are working with state-level data. If each state represents an independent observation, then no one state influences any other. But is this really ever true?

In Geography, Tobler's Law states: Everything is related to everything else, but near things are more related than distant things. We might expect, therefore, that neighboring states influence each other. This tends to be the case - if I pick any two states in the South, they are likely to be more related to each other than, say, a Northern state. So my independence assumption is violated - we often call this "Spatial Autocorrelation". Data with high spatial autocorrelation is data where neighboring geographic units are more related than distant geographic units. In today's unit, we will practice methods to check for spatial autocorrelation. Later on, we will work on methods to control for this problem in spatial regression.

To illustrate this, let's load in some data. Suppose I'm a researcher studying poverty in NYC. I have income data at the neighborhood level, and I have created a Poor-index based on the ratio of a household's income to the poverty level. I then aggregated the number of households with an index value of 1 or below to create my poor count variable. Are impoverished households clustered near each other in NYC? This will help us to learn whether we should concentrate resources in certain areas more than others. Is there spatial inequality alongside economic inequality?


```{r}
library(sf)     
library(dplyr)   
library(spData) 
library(ggplot2)
library(ggthemes)
#Let's load in some packages and data
setwd("~/Binghamton/DIDA 370/nyc_neighborhoods")
nyc <- st_read("NYC_Nhood ACS2008_12.shp")

#Let's plot the data first

ggplot()+
  geom_sf(data = nyc, aes(fill=poor))+
  scale_fill_steps(
    name = "Number of \nPoor Households",
    low = "lightsteelblue1",
    high = "tomato1",
    n.breaks = 4,
    show.limits = T)+
  theme_void()

```
What do you think, is there spatial autocorrelation in this data?

We can test for this using a common statistic called Moran's I. This essentially tests whether our data was produced by a spatially random process (or CSR). We will need to follow a few steps to do this. First, we need to download a new packages (spdep). Then, we will need to create some information for R about the spatial relationships between the neighborhoods in the data. We will create what are called spatial weights. Similarly to the weights used to compute a weighted average, a higher weight indicates that two neighborhoods are geographically closer to eachother, while a lower weight indicates that the neighborhoods are furthe apart. Let's see what this looks like:

```{r}
library(spdep)

#This code creates the weights
nyc_list <- nyc %>% 
  #first convert polygons to a neighbor object
  #R needs to know a unique ID for each neighborhood    
  #(cartodb_id)
  #Neighbors determined using Queen weights
  poly2nb(st_geometry(nyc)) %>% 
  #zero.policy just tells R not to remove 0 values
  nb2listw(zero.policy = TRUE) 

#Now we can calculate the "Global Moran's I" for the data
#start with the weights, then give R the corresponding variable
nyc_list %>% 
  moran.test(nyc$poor, ., zero.policy = TRUE)

```
Here, with a Moran's I statistic of 0.363 and a p-value of <0.001, we can reject the null hypothesis and conclude that our data was not generated by a spatially random process. In other words, the number of poor households by NYC neighborhood are not spatially independent - neighborhoods that are close together tend to be more similar than neigborhoods that are further apart. 

Let's look at a plot of these results!

```{r}
moran.plot(nyc$poor, 
           nyc_list, 
           zero.policy = TRUE, 
           labels = F,
           xlab = 'Number of Poor Households',
           ylab = 'Lagged Poor Households (of Neighbors)',
                   pch=20)
```
The solid line here goes through the mean value, and slope is the Moran's I statistic. The positive slope indicates positive autocorrelation - in other words, neighborhoods with high Poor counts tend to be near other neighborhoods with high poor counts. For each point, the x-axis represents the actual Poor count for the neighborhood, while the y-axis represents an estimate of this value for that neighborhood's neighbors. If the points in our plot are mostly concentrated in the bottom left and top right quadrants of the graph, we have positive autocorrelation. If they are concentrated in the top left and bottom right quadrants, we have negative autocorrelation. In data with no autocorrelation, we would expect to see a random cloud of points (with no pattern). 

Next, we'll take a look at Local Indicators of Spatial Autocorrelation (LISA) plots. This will help us to identify clusters of low or high values in the graph, as well as outliers. This uses the Local Moran's I statistic. Instead of calculating one Moran's I value for the whole plot (as is the case with the global statistic), the local version calculates a local I for each spatial unit (neighborhood)

```{r}
lisaRslt <- localmoran(nyc$poor, nyc_list, 
            zero.policy = TRUE, na.action = na.omit)
```

Let's talk about the values in this output for a moment. The Pr column is our P-value. A significant result (i.e. a p-value less than 0.05) tells us that a neighborhood is either a) part of a cluster or b) a spatial outlier. The Li column gives us more info about which one it is. If Li is positive, then a neighborhood is part of a spatial cluster. Otherwise, it is a spatial outlier. 

The code below reproduces a LISA plot that is commonly used in GIS software. Essentially, this plot a) helps us to identify significant neighborhoods, and b) visualizes whether they are clusters or outliers. High-High (HH) and Low-Low (LL) neighborhoods are clusters - these are units that are surrounded by units with similar values. High-Low (HL) and Low-High (LH) units are spatial outliers - they are different from their surrounding units. The L and H tell us whether the number of poor households in a neighborhood is low or high relative to the mean. 

```{r}
# Now we can derive the cluster/outlier types (COType in ArcGIS term) for each spatial feature in the data
library(magrittr)
library(tidyr)
significanceLevel <- 0.05; # 95% confidence
meanVal <- mean(nyc$poor);

lisaRslt %<>% as_tibble() %>%
  set_colnames(c("Ii","E.Ii","Var.Ii","Z.Ii","Pr(z > 0)")) %>%
  #Create a new categorical value
  #This variable will call insignificant units "insignificant"
  #Outliers are determined by LI <0 - units higher than the mean are HL and units lower than the mean are LH
  #Clusters are units with LI >= 0, and classified as HH or LL based on whether they are above or below the mean
  mutate(coType = case_when(
  `Pr(z > 0)` > 0.05 ~ "Insignificant",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & nyc$poor >= meanVal ~ "HH",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & nyc$poor < meanVal ~ "LL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & nyc$poor >= meanVal ~ "HL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & nyc$poor < meanVal ~ "LH"))

# Now add this coType to original sf data
nyc$coType <- lisaRslt$coType %>% replace_na("Insignificant")

#Now we'll plot it!
ggplot(nyc) +
  geom_sf(aes(fill=coType),color = 'lightgrey') +
  scale_fill_manual(values = c('red','brown','NA','blue','cyan'), name='Clusters & \nOutliers') +
  labs(title = "Number of Poor Families by Neighborhood")+
  theme_minimal()
```

----------------------------------------------------------------
Now, it's your turn! Pick a new variable from the data and re-run the analysis. What do you find? Do you see different clustering patterns than the one above?
```{r}
#Analysis here
```


----------------------------------------------------------------
Resources
ArcGIS (2022). Global Vs. Local Spatial Autocorrelation. Retrieved from: https://storymaps.arcgis.com/stories/5b26f25bb81a437b89003423505e2f71. 

Sun, S. (N. D.). 4.1: Spatial Autocorrelation. Retrieved from: http://www.geo.hunter.cuny.edu/~ssun/R-Spatial/spregression.html#spatial-autocorrelation. 