R Tutorial: Hotspot Analysis using Getis Ord Gi

Introduction
Explore Raw Data
Check Empty Neighbor Sets
Global G Test
Local Gi Test
Disclaimer and Acknowledgements
Session Info

Introduction

This tutorial will walk you through the steps and R code to perform a hotspot analysis using the Getis Ord Gi method.

Note: Getis Ord Gi and Gi* are the same thing.

Example Scenario

As an example, we will look at tree equity score data for the city of Tucson, Arizona, USA. A tree equity score (TES) is “a metric that helps cities assess how well they are delivering equitable tree canopy cover to all residents. It is derived from tree canopy cover, climate, demographic and socioeconomic data” (American Forests 2023). TES values range from 0 - 100, with a lower score indicating lower equity.

Imagine we want to assess spatial patterns of TES across Tucson. Specifically, we want to know whether there are statistically significant clusters of high and low TES values. This information can help the City of Tucson prioritize where resources are allocated for tree planting initiatives and other climate mitigation strategies.

Spatial Autocorrelation

Before moving on, it’s important to understand the concept of spatial autocorrelation.

Spatial autocorrelation is a measure of how a set of spatial features or their associated values are related to each other in space. It describes the extent to which nearby features or values are similar (clustered) or dissimilar (dispersed).

Spatial Autocorrelation can be positive or negative. Positive spatial autocorrelation indicates that features or values are clustered in space, and nearby locations tend to have similar values or characteristics. Negative spatial autocorrelation indicates that features or values are dispersed in space, and nearby locations tend to have dissimilar values or characteristics.

Spatial autocorrelation can be evaluated at a global or local level. Global spatial autocorrelation provides an overall assessment of the spatial pattern across the entire study area, while a local indicator of spatial association (LISA) identifies hotspots and coldspots of high or low values for each feature, offering a more detailed analysis of spatial patterns.

A free eBook on Geospatial Analysis: de Smith, M. J., Goodchild, M. F., Longley, P. A. & Associates. (2018). Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools (6th ed.). See the section on Local Indicators of Spatial Association.

R Packages

You will need the following R packages:
sf for simple features
sfdep for spatial analyses
spdep for spatial analyses
dyplyr for data manipulation
tidyr for data manipulation
ggplot2 for data visualization

Explore Raw Data

As with any analysis, we begin by exploring the raw data.

I have already cleaned this dataset and removed NA values. The original TES dataset can be accessed from the City of Tucson’s open data platform

# Read in data
tes_data <- st_read("data/tree_equity_data.shp")

## Reading layer `tree_equity_data' from data source 
##   `C:\Users\hlear\Desktop\statsandviz\advanced_topic\data\tree_equity_data.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 330 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 965352.1 ymin: 388440 xmax: 1072170 ymax: 481555
## Projected CRS: NAD83(HARN) / Arizona Central (ft)

# View first six rows of the dataset
head(tes_data)

## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 979936.4 ymin: 427642.5 xmax: 1035359 ymax: 458731.4
## Projected CRS: NAD83(HARN) / Arizona Central (ft)
##   NID          nghbrhd  treEqty                       geometry
## 1   1 Longview Estates 78.37067 MULTIPOLYGON (((1033787 443...
## 2   2      Silvercroft 68.05159 MULTIPOLYGON (((982384.9 45...
## 3   3     Saguaro Hill 88.88094 MULTIPOLYGON (((1014661 449...
## 4   4   Balboa Heights 80.78016 MULTIPOLYGON (((992065.3 45...
## 5   5    Plaza Del Sol 90.92715 MULTIPOLYGON (((986341.2 42...
## 6   7        Rose Hill 91.41872 MULTIPOLYGON (((1027317 455...

This is a simple feature data type with 330 rows (Tucson neighborhood polygons), which contains the following columns:

NID = unique neighborhood identifier
nghbrhd = Tucson neighborhood name
treEqty = tree equity score
geometry = spatial polygon geometries

Let’s visualize the tree equity variable (treEqty):

# Visualize tree equity as histogram
hist(tes_data$treEqty, main = "Distribution of Tree Equity Scores", xlab = "Tree Equity Score", ylab = "Frequency")

# Visualize tree equity across neighborhoods
ggplot(tes_data) +
  geom_sf(aes(fill = treEqty), color = "black", lwd = 0.15) +
  scale_fill_gradient(name = "Tree Equity Score",
                      low = "white",
                      high = "darkgreen") +
  ggtitle("Tree Equity Scores of Tucson Neighborhoods") +
  labs(caption = "Data source: City of Tucson") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "right")

Check Empty Neighbor Sets

Create a list of neighbors for each polygon in ‘tes_data’.

# Create a neighbor list based on queen contiguity
list_nb <- poly2nb(tes_data, queen = TRUE)

The poly2nb() function creates a list of neighbors for each polygon in ‘tes_data’. In this example, the relationship will be based on a queen contiguity criterion (queen=TRUE), which considers polygons to be neighbors if they share a boundary or a vertex.

Next, we check for empty neighbor sets (polygons without neighbors).

If there are empty neighbors, then this will eventually cause an error and prevent the analysis from running.

There are a few ways to address empty neighbor sets. Here, we can simply remove them. Note that removing empty neighbor sets could impact the results of spatial analyses.

# Check for empty neighbor sets
# card() calculates number of neighbors for each polygon in the list
# which() finds polygons with 0 neighbors
empty_nb <- which(card(list_nb) == 0)
empty_nb

## [1]  65  85 234

# Remove polygons with empty neighbor sets from the data
tes_subset <- tes_data[-empty_nb, ]

If you need to know the neighborhood names for reporting, then run the following code.

# Subset 'tes_data' to extract polygons with empty neighbor sets
empty_polygons <- tes_data[empty_nb, ]
empty_polygons$nghbrhd  # print neighborhood names

## [1] "Antigua Village" "Mortimore"       "Sycamore Park"

Global G Test

Test for global spatial autocorrelation

globalG.test (or global_g_test()) computes a global test for spatial autocorrelation using a Monte Carlo simulation approach (simulated spatial datasets that have the same spatial structure as the original data but are randomly permuted). It tests the null hypothesis of no spatial autocorrelation against the alternative hypothesis of positive spatial autocorrelation.

# Now that we removed empty neighbor sets (tes_subset)
# Identify neighbors with queen contiguity (edge/vertex touching)
tes_nb <- poly2nb(tes_subset, queen = TRUE)

# Binary weighting assigns a weight of 1 to all neighboring features 
# and a weight of 0 to all other features
tes_w_binary <- nb2listw(tes_nb, style="B")

# Calculate spatial lag of TreEqty
tes_lag <- lag.listw(tes_w_binary, tes_subset$treEqty)

I used binary weighting to assess the overall spatial distribution. Binary weighting assigns a weight of 1 to all neighboring features (ignoring relative size or extent) and a weight of 0 to all other features. More about this can be learned from ESRI’s article “How High/Low Clustering (Getis-Ord General G) works” and “Spatial Weights”, though there are many resources available.

# Test for global G statistic of TreEqty
globalG.test(tes_subset$treEqty, tes_w_binary)

## 
##  Getis-Ord global G statistic
## 
## data:  tes_subset$treEqty 
## weights: tes_w_binary 
## 
## standard deviate = 2.419, p-value = 0.007783
## alternative hypothesis: greater
## sample estimates:
## Global G statistic        Expectation           Variance 
##       1.567748e-02       1.536557e-02       1.662723e-08

Global G Test Interpretation

The output shows a standard deviate of 2.419, which indicates that the observed clustering of tree equity is 2.419 standard deviations away from what would be expected under the null hypothesis of no clustering. This value is associated with a p-value of 0.007783, so observed clustering is statistically significant at the 0.05 level. The alternative hypothesis is “greater,” which means that the analysis is looking for clusters of high tree equity values.

The sample estimates of the Global G statistic, Expectation, and Variance are also reported. The Global G statistic is 1.567748e-02, which is the observed value of the clustering statistic for the entire dataset. The Expectation is 1.536557e-02, which is the expected value of the clustering statistic under the null hypothesis of no clustering. Finally, the Variance is 1.662723e-08, which is an estimate of the variance of the clustering statistic under the null hypothesis.

Overall, the output suggests that there is statistically significant clustering of high tree equity values.

Local Gi Test

Test for local spatial autocorrelation (hotspots)

# Identify neighbors, create weights, calculate spatial lag
tes_nbs <- tes_subset |> 
  mutate(
    nb = st_contiguity(geometry),        # neighbors share border/vertex
    wt = st_weights(nb),                 # row-standardized weights
    tes_lag = st_lag(treEqty, nb, wt)    # calculate spatial lag of TreEqty
  )

We can calculate the Gi statistic using local_g_perm(). The Gi is the ratio of the spatial lag of a feature to the sum of the feature’s values for its neighbors. A positive Gi value indicates that a feature and its neighbors have high values, while a negative Gi value indicates that they have low values. The magnitude of the Gi value indicates the strength of the clustering.

# Calculate the Gi using local_g_perm
tes_hot_spots <- tes_nbs |> 
  mutate(
    Gi = local_g_perm(treEqty, nb, wt, nsim = 999)
    # nsim = number of Monte Carlo simulations (999 is default)
  ) |> 
  # The new 'Gi' column itself contains a dataframe 
  # We can't work with that, so we need to 'unnest' it
  unnest(Gi)

Let’s do a cursory visualization of Gi values across Tucson.

# Cursory visualization
# Plot looks at gi values for all locations
tes_hot_spots |> 
  ggplot((aes(fill = gi))) +
  geom_sf(color = "black", lwd = 0.15) +
  scale_fill_gradient2() # makes the value 0 (random) be the middle

We’re getting an idea of spatial patterns of TES values for Tucson neighborhoods. But is it statistically significant? We will consider p-values in the next step.

Almost done!

# Create a new data frame called 'tes_hot_spots"
tes_hot_spots |> 
  # with the columns 'gi' and 'p_folded_sim"
  # 'p_folded_sim' is the p-value of a folded permutation test
  select(gi, p_folded_sim) |> 
  mutate(
    # Add a new column called "classification"
    classification = case_when(
      # Classify based on the following criteria:
      gi > 0 & p_folded_sim <= 0.01 ~ "Very hot",
      gi > 0 & p_folded_sim <= 0.05 ~ "Hot",
      gi > 0 & p_folded_sim <= 0.1 ~ "Somewhat hot",
      gi < 0 & p_folded_sim <= 0.01 ~ "Very cold",
      gi < 0 & p_folded_sim <= 0.05 ~ "Cold",
      gi < 0 & p_folded_sim <= 0.1 ~ "Somewhat cold",
      TRUE ~ "Insignificant"
    ),
    # Convert 'classification' into a factor for easier plotting
    classification = factor(
      classification,
      levels = c("Very hot", "Hot", "Somewhat hot",
                 "Insignificant",
                 "Somewhat cold", "Cold", "Very cold")
    )
  ) |> 
  # Visualize the results with ggplot2
  ggplot(aes(fill = classification)) +
  geom_sf(color = "black", lwd = 0.1) +
  scale_fill_brewer(type = "div", palette = 5) +
  theme_void() +
  labs(
    fill = "Hot Spot Classification",
    title = "Tree Equity Hot Spots in Tucson"
  )

Hotspot Analysis Interpretation

The hotspot analysis map assessing tree equity across Tucson reveals a clear pattern of spatial clustering, with hotspots in northern neighborhoods and coldspots in southern neighborhoods.

Disclaimer and Acknowledgements

Disclaimer
This tutorial was created by a non-expert for educational purposes only. While efforts have been made to ensure the accuracy of the information provided, users should verify any information before applying it to their own analysis. The author accepts no responsibility or liability for any errors or omissions, or for any damage or loss arising from the use of this tutorial.

Acknowledgements
Tree equity data was obtained from the City of Tucson’s open data platform. The code used in this tutorial is adapted and modified from Josiah Parry’s YouTube tutorial “Hotspot Analysis in R: GIS Fundamentals”, accessible from his GitHub.

Session Info

sessionInfo()

## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.4.2 tidyr_1.3.0   dplyr_1.1.1   spdep_1.2-8   spData_2.2.2 
## [6] sfdep_0.2.3   sf_1.0-12    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0   xfun_0.38          bslib_0.4.2        purrr_1.0.1       
##  [5] lattice_0.21-8     colorspace_2.0-3   vctrs_0.6.1        generics_0.1.3    
##  [9] htmltools_0.5.5    s2_1.1.2           yaml_2.3.7         utf8_1.2.2        
## [13] rlang_1.1.0        e1071_1.7-13       jquerylib_0.1.4    pillar_1.9.0      
## [17] withr_2.5.0        glue_1.6.2         DBI_1.1.3          RColorBrewer_1.1-3
## [21] sp_1.6-0           wk_0.7.2           lifecycle_1.0.3    munsell_0.5.0     
## [25] gtable_0.3.3       evaluate_0.20      labeling_0.4.2     knitr_1.42        
## [29] fastmap_1.1.1      class_7.3-21       fansi_1.0.3        highr_0.10        
## [33] Rcpp_1.0.10        KernSmooth_2.23-20 scales_1.2.1       classInt_0.4-9    
## [37] cachem_1.0.7       jsonlite_1.8.4     farver_2.1.1       deldir_1.0-6      
## [41] digest_0.6.31      grid_4.2.3         cli_3.6.0          tools_4.2.3       
## [45] magrittr_2.0.3     sass_0.4.5         proxy_0.4-27       tibble_3.2.1      
## [49] pkgconfig_2.0.3    rmarkdown_2.21     rstudioapi_0.14    R6_2.5.1          
## [53] boot_1.3-28.1      units_0.8-1        compiler_4.2.3