library(sf)     
library(dplyr)   
library(spData) 
library(ggplot2)
library(ggthemes)
#Let's load in some packages and data
setwd("C:/Users/Donny/OneDrive/R studio/nyc_neighborhoods")
Warning: The working directory was changed to C:/Users/Donny/OneDrive/R studio/nyc_neighborhoods inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
nyc <- st_read("NYC_Nhood ACS2008_12.shp")
Reading layer `NYC_Nhood ACS2008_12' from data source 
  `C:\Users\Donny\OneDrive\R studio\nyc_neighborhoods\NYC_Nhood ACS2008_12.shp' using driver `ESRI Shapefile'
Simple feature collection with 195 features and 98 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
Geodetic CRS:  WGS 84
#Let's plot the data first

ggplot()+
  geom_sf(data = nyc, aes(fill=poor))+
  scale_fill_steps(
    name = "Number of \nPoor Households",
    low = "lightsteelblue1",
    high = "tomato1",
    n.breaks = 4,
    show.limits = T)+
  theme_void()

What do you think, is there spatial autocorrelation in this data?

We can test for this using a common statistic called Moran’s I. This essentially tests whether our data was produced by a spatially random process (or CSR). We will need to follow a few steps to do this. First, we need to download a new packages (spdep). Then, we will need to create some information for R about the spatial relationships between the neighborhoods in the data. We will create what are called spatial weights. Similarly to the weights used to compute a weighted average, a higher weight indicates that two neighborhoods are geographically closer to eachother, while a lower weight indicates that the neighborhoods are furthe apart. Let’s see what this looks like:

library(spdep)
Warning: package ‘spdep’ was built under R version 4.3.3
#This code creates the weights
nyc_list <- nyc %>% 
  #first convert polygons to a neighbor object
  #R needs to know a unique ID for each neighborhood    
  #(cartodb_id)
  #Neighbors determined using Queen weights
  poly2nb(st_geometry(nyc)) %>% 
  #zero.policy just tells R not to remove 0 values
  nb2listw(zero.policy = TRUE) 

#Now we can calculate the "Global Moran's I" for the data
#start with the weights, then give R the corresponding variable
nyc_list %>% 
  moran.test(nyc$poor, ., zero.policy = TRUE)

    Moran I test under randomisation

data:  nyc$poor  
weights: .    

Moran I statistic standard deviate = 8.2642, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.362862746      -0.005154639       0.001983038 

Here, with a Moran’s I statistic of 0.363 and a p-value of <0.001, we can reject the null hypothesis and conclude that our data was not generated by a spatially random process. In other words, the number of poor households by NYC neighborhood are not spatially independent - neighborhoods that are close together tend to be more similar than neigborhoods that are further apart.

Let’s look at a plot of these results!

moran.plot(nyc$poor, 
           nyc_list, 
           zero.policy = TRUE, 
           labels = F,
           xlab = 'Number of Poor Households',
           ylab = 'Lagged Poor Households (of Neighbors)',
                   pch=20)

The solid line here goes through the mean value, and slope is the Moran’s I statistic. The positive slope indicates positive autocorrelation - in other words, neighborhoods with high Poor counts tend to be near other neighborhoods with high poor counts. For each point, the x-axis represents the actual Poor count for the neighborhood, while the y-axis represents an estimate of this value for that neighborhood’s neighbors. If the points in our plot are mostly concentrated in the bottom left and top right quadrants of the graph, we have positive autocorrelation. If they are concentrated in the top left and bottom right quadrants, we have negative autocorrelation. In data with no autocorrelation, we would expect to see a random cloud of points (with no pattern).

Next, we’ll take a look at Local Indicators of Spatial Autocorrelation (LISA) plots. This will help us to identify clusters of low or high values in the graph, as well as outliers. This uses the Local Moran’s I statistic. Instead of calculating one Moran’s I value for the whole plot (as is the case with the global statistic), the local version calculates a local I for each spatial unit (neighborhood)

lisaRslt <- localmoran(nyc$poor, nyc_list, 
            zero.policy = TRUE, na.action = na.omit)

Let’s talk about the values in this output for a moment. The Pr column is our P-value. A significant result (i.e. a p-value less than 0.05) tells us that a neighborhood is either a) part of a cluster or b) a spatial outlier. The Li column gives us more info about which one it is. If Li is positive, then a neighborhood is part of a spatial cluster. Otherwise, it is a spatial outlier.

The code below reproduces a LISA plot that is commonly used in GIS software. Essentially, this plot a) helps us to identify significant neighborhoods, and b) visualizes whether they are clusters or outliers. High-High (HH) and Low-Low (LL) neighborhoods are clusters - these are units that are surrounded by units with similar values. High-Low (HL) and Low-High (LH) units are spatial outliers - they are different from their surrounding units. The L and H tell us whether the number of poor households in a neighborhood is low or high relative to the mean.

# Now we can derive the cluster/outlier types (COType in ArcGIS term) for each spatial feature in the data
library(magrittr)

Attaching package: ‘magrittr’

The following object is masked from ‘package:tidyr’:

    extract
library(tidyr)
significanceLevel <- 0.05; # 95% confidence
meanVal <- mean(nyc$poor);

lisaRslt %<>% as_tibble() %>%
  set_colnames(c("Ii","E.Ii","Var.Ii","Z.Ii","Pr(z > 0)")) %>%
  #Create a new categorical value
  #This variable will call insignificant units "insignificant"
  #Outliers are determined by LI <0 - units higher than the mean are HL and units lower than the mean are LH
  #Clusters are units with LI >= 0, and classified as HH or LL based on whether they are above or below the mean
  mutate(coType = case_when(
  `Pr(z > 0)` > 0.05 ~ "Insignificant",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & nyc$poor >= meanVal ~ "HH",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & nyc$poor < meanVal ~ "LL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & nyc$poor >= meanVal ~ "HL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & nyc$poor < meanVal ~ "LH"))

# Now add this coType to original sf data
nyc$coType <- lisaRslt$coType %>% replace_na("Insignificant")

#Now we'll plot it!
ggplot(nyc) +
  geom_sf(aes(fill=coType),color = 'lightgrey') +
  scale_fill_manual(values = c('red','brown','NA','blue','cyan'), name='Clusters & \nOutliers') +
  labs(title = "Number of Poor Families by Neighborhood")+
  theme_minimal()

Now, it’s your turn! Pick a new variable from the data and re-run the analysis. What do you find? Do you see different clustering patterns than the one above?
```r #Analysis here
setwd(“C:/Users/Donny/OneDrive/R studio/chicago_airbnb”) chicago <- st_read(“airbnb_Chicago 2015.shp”)
chicago_list <- chicago %>% #first convert polygons to a neighbor object #R needs to know a unique ID for each neighborhood #(cartodb_id) #Neighbors determined using Queen weights poly2nb(st_geometry(chicago)) %>% #zero.policy just tells R not to remove 0 values nb2listw(zero.policy = TRUE)
#Now we can calculate the “Global Moran’s I” for the data #start with the weights, then give R the corresponding variable chicago_list %>% moran.test(chicago$crowded, ., zero.policy = TRUE) ```
r moran.plot(chicago$crowded, chicago_list, zero.policy = TRUE, labels = F, xlab = 'Percent Crowding in Chicago neighborhoods', ylab = 'Percent Crowding of neighborring neighborhoods', pch = 20)
r lisaRslt <- localmoran(chicago$crowded, chicago_list, zero.policy = TRUE, na.action = na.omit)

Resources ArcGIS (2022). Global Vs. Local Spatial Autocorrelation. Retrieved from: https://storymaps.arcgis.com/stories/5b26f25bb81a437b89003423505e2f71.

Sun, S. (N. D.). 4.1: Spatial Autocorrelation. Retrieved from: http://www.geo.hunter.cuny.edu/~ssun/R-Spatial/spregression.html#spatial-autocorrelation.

---
title: "R Notebook"
output: html_notebook
---
```{r}
library(sf)     
library(dplyr)   
library(spData) 
library(ggplot2)
library(ggthemes)
#Let's load in some packages and data
setwd("C:/Users/Donny/OneDrive/R studio/nyc_neighborhoods")
nyc <- st_read("NYC_Nhood ACS2008_12.shp")

#Let's plot the data first

ggplot()+
  geom_sf(data = nyc, aes(fill=poor))+
  scale_fill_steps(
    name = "Number of \nPoor Households",
    low = "lightsteelblue1",
    high = "tomato1",
    n.breaks = 4,
    show.limits = T)+
  theme_void()

```
What do you think, is there spatial autocorrelation in this data?

We can test for this using a common statistic called Moran's I. This essentially tests whether our data was produced by a spatially random process (or CSR). We will need to follow a few steps to do this. First, we need to download a new packages (spdep). Then, we will need to create some information for R about the spatial relationships between the neighborhoods in the data. We will create what are called spatial weights. Similarly to the weights used to compute a weighted average, a higher weight indicates that two neighborhoods are geographically closer to eachother, while a lower weight indicates that the neighborhoods are furthe apart. Let's see what this looks like:

```{r}
library(spdep)

#This code creates the weights
nyc_list <- nyc %>% 
  #first convert polygons to a neighbor object
  #R needs to know a unique ID for each neighborhood    
  #(cartodb_id)
  #Neighbors determined using Queen weights
  poly2nb(st_geometry(nyc)) %>% 
  #zero.policy just tells R not to remove 0 values
  nb2listw(zero.policy = TRUE) 

#Now we can calculate the "Global Moran's I" for the data
#start with the weights, then give R the corresponding variable
nyc_list %>% 
  moran.test(nyc$poor, ., zero.policy = TRUE)

```
Here, with a Moran's I statistic of 0.363 and a p-value of <0.001, we can reject the null hypothesis and conclude that our data was not generated by a spatially random process. In other words, the number of poor households by NYC neighborhood are not spatially independent - neighborhoods that are close together tend to be more similar than neigborhoods that are further apart. 

Let's look at a plot of these results!

```{r}
moran.plot(nyc$poor, 
           nyc_list, 
           zero.policy = TRUE, 
           labels = F,
           xlab = 'Number of Poor Households',
           ylab = 'Lagged Poor Households (of Neighbors)',
                   pch=20)
```
The solid line here goes through the mean value, and slope is the Moran's I statistic. The positive slope indicates positive autocorrelation - in other words, neighborhoods with high Poor counts tend to be near other neighborhoods with high poor counts. For each point, the x-axis represents the actual Poor count for the neighborhood, while the y-axis represents an estimate of this value for that neighborhood's neighbors. If the points in our plot are mostly concentrated in the bottom left and top right quadrants of the graph, we have positive autocorrelation. If they are concentrated in the top left and bottom right quadrants, we have negative autocorrelation. In data with no autocorrelation, we would expect to see a random cloud of points (with no pattern). 

Next, we'll take a look at Local Indicators of Spatial Autocorrelation (LISA) plots. This will help us to identify clusters of low or high values in the graph, as well as outliers. This uses the Local Moran's I statistic. Instead of calculating one Moran's I value for the whole plot (as is the case with the global statistic), the local version calculates a local I for each spatial unit (neighborhood)

```{r}
lisaRslt <- localmoran(nyc$poor, nyc_list, 
            zero.policy = TRUE, na.action = na.omit)
```

Let's talk about the values in this output for a moment. The Pr column is our P-value. A significant result (i.e. a p-value less than 0.05) tells us that a neighborhood is either a) part of a cluster or b) a spatial outlier. The Li column gives us more info about which one it is. If Li is positive, then a neighborhood is part of a spatial cluster. Otherwise, it is a spatial outlier. 

The code below reproduces a LISA plot that is commonly used in GIS software. Essentially, this plot a) helps us to identify significant neighborhoods, and b) visualizes whether they are clusters or outliers. High-High (HH) and Low-Low (LL) neighborhoods are clusters - these are units that are surrounded by units with similar values. High-Low (HL) and Low-High (LH) units are spatial outliers - they are different from their surrounding units. The L and H tell us whether the number of poor households in a neighborhood is low or high relative to the mean. 

```{r}
# Now we can derive the cluster/outlier types (COType in ArcGIS term) for each spatial feature in the data
library(magrittr)
library(tidyr)
significanceLevel <- 0.05; # 95% confidence
meanVal <- mean(nyc$poor);

lisaRslt %<>% as_tibble() %>%
  set_colnames(c("Ii","E.Ii","Var.Ii","Z.Ii","Pr(z > 0)")) %>%
  #Create a new categorical value
  #This variable will call insignificant units "insignificant"
  #Outliers are determined by LI <0 - units higher than the mean are HL and units lower than the mean are LH
  #Clusters are units with LI >= 0, and classified as HH or LL based on whether they are above or below the mean
  mutate(coType = case_when(
  `Pr(z > 0)` > 0.05 ~ "Insignificant",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & nyc$poor >= meanVal ~ "HH",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & nyc$poor < meanVal ~ "LL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & nyc$poor >= meanVal ~ "HL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & nyc$poor < meanVal ~ "LH"))

# Now add this coType to original sf data
nyc$coType <- lisaRslt$coType %>% replace_na("Insignificant")

#Now we'll plot it!
ggplot(nyc) +
  geom_sf(aes(fill=coType),color = 'lightgrey') +
  scale_fill_manual(values = c('red','brown','NA','blue','cyan'), name='Clusters & \nOutliers') +
  labs(title = "Number of Poor Families by Neighborhood")+
  theme_minimal()
```

----------------------------------------------------------------
Now, it's your turn! Pick a new variable from the data and re-run the analysis. What do you find? Do you see different clustering patterns than the one above?
```{r}
#Analysis here

setwd("C:/Users/Donny/OneDrive/R studio/chicago_airbnb")
chicago <- st_read("airbnb_Chicago 2015.shp")

chicago_list <- chicago %>% 
  #first convert polygons to a neighbor object
  #R needs to know a unique ID for each neighborhood    
  #(cartodb_id)
  #Neighbors determined using Queen weights
  poly2nb(st_geometry(chicago)) %>% 
  #zero.policy just tells R not to remove 0 values
  nb2listw(zero.policy = TRUE) 

#Now we can calculate the "Global Moran's I" for the data
#start with the weights, then give R the corresponding variable
chicago_list %>% 
  moran.test(chicago$crowded, ., zero.policy = TRUE)
```


```{r}
moran.plot(chicago$crowded, 
           chicago_list, 
           zero.policy = TRUE, 
           labels = F,
           xlab = 'Percent Crowding in Chicago neighborhoods',
           ylab = 'Percent Crowding of neighborring neighborhoods',
           pch = 20)
```


```{r}
lisaRslt <- localmoran(chicago$crowded, chicago_list, 
            zero.policy = TRUE, na.action = na.omit)
```
```{r}
significanceLevel <- 0.05; # 95% confidence
meanVal <- mean(chicago$crowded);

lisaRslt %<>% as_tibble() %>%
  set_colnames(c("Ii","E.Ii","Var.Ii","Z.Ii","Pr(z > 0)")) %>%

  mutate(coType = case_when(
  `Pr(z > 0)` > 0.05 ~ "Insignificant",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & chicago$crowded >= meanVal ~ "HH",
  `Pr(z > 0)` <= 0.05 & Ii >= 0 & chicago$crowded < meanVal ~ "LL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & chicago$crowded >= meanVal ~ "HL",
  `Pr(z > 0)` <= 0.05 & Ii < 0 & chicago$crowded < meanVal ~ "LH"))

# Now add this coType to original sf data
chicago$coType <- lisaRslt$coType %>% replace_na("Insignificant")

#Now we'll plot it!
ggplot(chicago) +
  geom_sf(aes(fill=coType),color = 'lightgrey') +
  scale_fill_manual(values = c('red','NA','blue','cyan'), name='Clusters & \nOutliers') +
  labs(title = "Clusters of Crowded Neighborhoods")+
  theme_minimal()
```


----------------------------------------------------------------
Resources
ArcGIS (2022). Global Vs. Local Spatial Autocorrelation. Retrieved from: https://storymaps.arcgis.com/stories/5b26f25bb81a437b89003423505e2f71. 

Sun, S. (N. D.). 4.1: Spatial Autocorrelation. Retrieved from: http://www.geo.hunter.cuny.edu/~ssun/R-Spatial/spregression.html#spatial-autocorrelation. 