Note: Getis Ord Gi and Gi* are the same thing.
As an example, we will look at tree equity score data for the city of Tucson, Arizona, USA. A tree equity score (TES) is “a metric that helps cities assess how well they are delivering equitable tree canopy cover to all residents. It is derived from tree canopy cover, climate, demographic and socioeconomic data” (American Forests 2023). TES values range from 0 - 100, with a lower score indicating lower equity.
Imagine we want to assess spatial patterns of TES across Tucson. Specifically, we want to know whether there are statistically significant clusters of high and low TES values. This information can help the City of Tucson prioritize where resources are allocated for tree planting initiatives and other climate mitigation strategies.
Spatial autocorrelation is a measure of how a set of spatial features or their associated values are related to each other in space. It describes the extent to which nearby features or values are similar (clustered) or dissimilar (dispersed).
Spatial Autocorrelation can be positive or negative. Positive spatial autocorrelation indicates that features or values are clustered in space, and nearby locations tend to have similar values or characteristics. Negative spatial autocorrelation indicates that features or values are dispersed in space, and nearby locations tend to have dissimilar values or characteristics.
Spatial autocorrelation can be evaluated at a global or local level. Global spatial autocorrelation provides an overall assessment of the spatial pattern across the entire study area, while a local indicator of spatial association (LISA) identifies hotspots and coldspots of high or low values for each feature, offering a more detailed analysis of spatial patterns.
A free eBook on Geospatial Analysis: de Smith, M. J., Goodchild, M. F., Longley, P. A. & Associates. (2018). Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools (6th ed.). See the section on Local Indicators of Spatial Association.
You will need the following R packages:
sf
for simple features
sfdep
for spatial analyses
spdep
for spatial analyses
dyplyr
for data manipulation
tidyr
for data manipulation
ggplot2
for data visualization
I have already cleaned this dataset and removed NA values. The original TES dataset can be accessed from the City of Tucson’s open data platform
# Read in data
<- st_read("data/tree_equity_data.shp") tes_data
## Reading layer `tree_equity_data' from data source
## `C:\Users\hlear\Desktop\statsandviz\advanced_topic\data\tree_equity_data.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 330 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 965352.1 ymin: 388440 xmax: 1072170 ymax: 481555
## Projected CRS: NAD83(HARN) / Arizona Central (ft)
# View first six rows of the dataset
head(tes_data)
## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 979936.4 ymin: 427642.5 xmax: 1035359 ymax: 458731.4
## Projected CRS: NAD83(HARN) / Arizona Central (ft)
## NID nghbrhd treEqty geometry
## 1 1 Longview Estates 78.37067 MULTIPOLYGON (((1033787 443...
## 2 2 Silvercroft 68.05159 MULTIPOLYGON (((982384.9 45...
## 3 3 Saguaro Hill 88.88094 MULTIPOLYGON (((1014661 449...
## 4 4 Balboa Heights 80.78016 MULTIPOLYGON (((992065.3 45...
## 5 5 Plaza Del Sol 90.92715 MULTIPOLYGON (((986341.2 42...
## 6 7 Rose Hill 91.41872 MULTIPOLYGON (((1027317 455...
This is a simple feature data type with 330 rows (Tucson neighborhood polygons), which contains the following columns:
Let’s visualize the tree equity variable (treEqty):
# Visualize tree equity as histogram
hist(tes_data$treEqty, main = "Distribution of Tree Equity Scores", xlab = "Tree Equity Score", ylab = "Frequency")
# Visualize tree equity across neighborhoods
ggplot(tes_data) +
geom_sf(aes(fill = treEqty), color = "black", lwd = 0.15) +
scale_fill_gradient(name = "Tree Equity Score",
low = "white",
high = "darkgreen") +
ggtitle("Tree Equity Scores of Tucson Neighborhoods") +
labs(caption = "Data source: City of Tucson") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "right")
# Create a neighbor list based on queen contiguity
<- poly2nb(tes_data, queen = TRUE) list_nb
The poly2nb()
function creates a list of neighbors for
each polygon in ‘tes_data’. In this example, the relationship will be
based on a queen contiguity criterion (queen=TRUE
), which
considers polygons to be neighbors if they share a boundary or a
vertex.
If there are empty neighbors, then this will eventually cause an error and prevent the analysis from running.
There are a few ways to address empty neighbor sets. Here, we can simply remove them. Note that removing empty neighbor sets could impact the results of spatial analyses.
# Check for empty neighbor sets
# card() calculates number of neighbors for each polygon in the list
# which() finds polygons with 0 neighbors
<- which(card(list_nb) == 0)
empty_nb empty_nb
## [1] 65 85 234
# Remove polygons with empty neighbor sets from the data
<- tes_data[-empty_nb, ] tes_subset
If you need to know the neighborhood names for reporting, then run the following code.
# Subset 'tes_data' to extract polygons with empty neighbor sets
<- tes_data[empty_nb, ]
empty_polygons $nghbrhd # print neighborhood names empty_polygons
## [1] "Antigua Village" "Mortimore" "Sycamore Park"
globalG.test
(or global_g_test()
) computes
a global test for spatial autocorrelation using a Monte Carlo simulation
approach (simulated spatial datasets that have the same spatial
structure as the original data but are randomly permuted). It tests the
null hypothesis of no spatial autocorrelation against the alternative
hypothesis of positive spatial autocorrelation.
# Now that we removed empty neighbor sets (tes_subset)
# Identify neighbors with queen contiguity (edge/vertex touching)
<- poly2nb(tes_subset, queen = TRUE)
tes_nb
# Binary weighting assigns a weight of 1 to all neighboring features
# and a weight of 0 to all other features
<- nb2listw(tes_nb, style="B")
tes_w_binary
# Calculate spatial lag of TreEqty
<- lag.listw(tes_w_binary, tes_subset$treEqty) tes_lag
I used binary weighting to assess the overall spatial distribution. Binary weighting assigns a weight of 1 to all neighboring features (ignoring relative size or extent) and a weight of 0 to all other features. More about this can be learned from ESRI’s article “How High/Low Clustering (Getis-Ord General G) works” and “Spatial Weights”, though there are many resources available.
# Test for global G statistic of TreEqty
globalG.test(tes_subset$treEqty, tes_w_binary)
##
## Getis-Ord global G statistic
##
## data: tes_subset$treEqty
## weights: tes_w_binary
##
## standard deviate = 2.419, p-value = 0.007783
## alternative hypothesis: greater
## sample estimates:
## Global G statistic Expectation Variance
## 1.567748e-02 1.536557e-02 1.662723e-08
Global G Test Interpretation
The output shows a standard deviate of 2.419, which indicates that the observed clustering of tree equity is 2.419 standard deviations away from what would be expected under the null hypothesis of no clustering. This value is associated with a p-value of 0.007783, so observed clustering is statistically significant at the 0.05 level. The alternative hypothesis is “greater,” which means that the analysis is looking for clusters of high tree equity values.
The sample estimates of the Global G statistic, Expectation, and Variance are also reported. The Global G statistic is 1.567748e-02, which is the observed value of the clustering statistic for the entire dataset. The Expectation is 1.536557e-02, which is the expected value of the clustering statistic under the null hypothesis of no clustering. Finally, the Variance is 1.662723e-08, which is an estimate of the variance of the clustering statistic under the null hypothesis.
Overall, the output suggests that there is statistically significant clustering of high tree equity values.
# Identify neighbors, create weights, calculate spatial lag
<- tes_subset |>
tes_nbs mutate(
nb = st_contiguity(geometry), # neighbors share border/vertex
wt = st_weights(nb), # row-standardized weights
tes_lag = st_lag(treEqty, nb, wt) # calculate spatial lag of TreEqty
)
We can calculate the Gi statistic using local_g_perm()
.
The Gi is the ratio of the spatial lag of a feature to the sum of the
feature’s values for its neighbors. A positive Gi value indicates that a
feature and its neighbors have high values, while a negative Gi value
indicates that they have low values. The magnitude of the Gi value
indicates the strength of the clustering.
# Calculate the Gi using local_g_perm
<- tes_nbs |>
tes_hot_spots mutate(
Gi = local_g_perm(treEqty, nb, wt, nsim = 999)
# nsim = number of Monte Carlo simulations (999 is default)
|>
) # The new 'Gi' column itself contains a dataframe
# We can't work with that, so we need to 'unnest' it
unnest(Gi)
Let’s do a cursory visualization of Gi values across Tucson.
# Cursory visualization
# Plot looks at gi values for all locations
|>
tes_hot_spots ggplot((aes(fill = gi))) +
geom_sf(color = "black", lwd = 0.15) +
scale_fill_gradient2() # makes the value 0 (random) be the middle
We’re getting an idea of spatial patterns of TES values for Tucson neighborhoods. But is it statistically significant? We will consider p-values in the next step.
Almost done!
# Create a new data frame called 'tes_hot_spots"
|>
tes_hot_spots # with the columns 'gi' and 'p_folded_sim"
# 'p_folded_sim' is the p-value of a folded permutation test
select(gi, p_folded_sim) |>
mutate(
# Add a new column called "classification"
classification = case_when(
# Classify based on the following criteria:
> 0 & p_folded_sim <= 0.01 ~ "Very hot",
gi > 0 & p_folded_sim <= 0.05 ~ "Hot",
gi > 0 & p_folded_sim <= 0.1 ~ "Somewhat hot",
gi < 0 & p_folded_sim <= 0.01 ~ "Very cold",
gi < 0 & p_folded_sim <= 0.05 ~ "Cold",
gi < 0 & p_folded_sim <= 0.1 ~ "Somewhat cold",
gi TRUE ~ "Insignificant"
),# Convert 'classification' into a factor for easier plotting
classification = factor(
classification,levels = c("Very hot", "Hot", "Somewhat hot",
"Insignificant",
"Somewhat cold", "Cold", "Very cold")
)|>
) # Visualize the results with ggplot2
ggplot(aes(fill = classification)) +
geom_sf(color = "black", lwd = 0.1) +
scale_fill_brewer(type = "div", palette = 5) +
theme_void() +
labs(
fill = "Hot Spot Classification",
title = "Tree Equity Hot Spots in Tucson"
)
Hotspot Analysis Interpretation
The hotspot analysis map assessing tree equity across Tucson reveals a clear pattern of spatial clustering, with hotspots in northern neighborhoods and coldspots in southern neighborhoods.
Disclaimer
This tutorial was created by a non-expert for educational purposes only.
While efforts have been made to ensure the accuracy of the information
provided, users should verify any information before applying it to
their own analysis. The author accepts no responsibility or liability
for any errors or omissions, or for any damage or loss arising from the
use of this tutorial.
Acknowledgements
Tree equity data was obtained from the City
of Tucson’s open data platform. The code used in this tutorial is
adapted and modified from Josiah Parry’s YouTube tutorial “Hotspot
Analysis in R: GIS Fundamentals”, accessible from his GitHub.
sessionInfo()
## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.4.2 tidyr_1.3.0 dplyr_1.1.1 spdep_1.2-8 spData_2.2.2
## [6] sfdep_0.2.3 sf_1.0-12
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.38 bslib_0.4.2 purrr_1.0.1
## [5] lattice_0.21-8 colorspace_2.0-3 vctrs_0.6.1 generics_0.1.3
## [9] htmltools_0.5.5 s2_1.1.2 yaml_2.3.7 utf8_1.2.2
## [13] rlang_1.1.0 e1071_1.7-13 jquerylib_0.1.4 pillar_1.9.0
## [17] withr_2.5.0 glue_1.6.2 DBI_1.1.3 RColorBrewer_1.1-3
## [21] sp_1.6-0 wk_0.7.2 lifecycle_1.0.3 munsell_0.5.0
## [25] gtable_0.3.3 evaluate_0.20 labeling_0.4.2 knitr_1.42
## [29] fastmap_1.1.1 class_7.3-21 fansi_1.0.3 highr_0.10
## [33] Rcpp_1.0.10 KernSmooth_2.23-20 scales_1.2.1 classInt_0.4-9
## [37] cachem_1.0.7 jsonlite_1.8.4 farver_2.1.1 deldir_1.0-6
## [41] digest_0.6.31 grid_4.2.3 cli_3.6.0 tools_4.2.3
## [45] magrittr_2.0.3 sass_0.4.5 proxy_0.4-27 tibble_3.2.1
## [49] pkgconfig_2.0.3 rmarkdown_2.21 rstudioapi_0.14 R6_2.5.1
## [53] boot_1.3-28.1 units_0.8-1 compiler_4.2.3