Hot Spot-, Outlier Analysis

Topics that will be discussed in this R markdown file:

Libraries, setup
Hotspot Analysis

1.1 Explore raw data

1.2 Check empty neighbor sets

1.3 Global G test

1.4 Local Gi Test
Outlier analysis

1. Libraries, setup

Install missing packages

#rpackages <- c("sfdep", "spdep", "tmap", "rgdal")
#install.packages(rpackages)

Load required packages

# Hot spot analysis
library(arcgisbinding)

## *** Please call arc.check_product() to define a desktop license.

library(sf)

## Warning: Paket 'sf' wurde unter R Version 4.2.3 erstellt

## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE

library(sfdep)

## Warning: Paket 'sfdep' wurde unter R Version 4.2.3 erstellt

library(spdep)

## Warning: Paket 'spdep' wurde unter R Version 4.2.3 erstellt

## Lade nötiges Paket: spData

## Warning: Paket 'spData' wurde unter R Version 4.2.3 erstellt

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

library(dplyr)

## 
## Attache Paket: 'dplyr'

## Die folgenden Objekte sind maskiert von 'package:stats':
## 
##     filter, lag

## Die folgenden Objekte sind maskiert von 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)

## Warning: Paket 'ggplot2' wurde unter R Version 4.2.3 erstellt

library(mosaic)         # describtive statistics

## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2

## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.

## 
## Attache Paket: 'mosaic'

## Das folgende Objekt ist maskiert 'package:Matrix':
## 
##     mean

## Das folgende Objekt ist maskiert 'package:ggplot2':
## 
##     stat

## Die folgenden Objekte sind maskiert von 'package:dplyr':
## 
##     count, do, tally

## Die folgenden Objekte sind maskiert von 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var

## Die folgenden Objekte sind maskiert von 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

# Outlier analysis

library(tmap)

## Warning: Paket 'tmap' wurde unter R Version 4.2.3 erstellt

## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')

library(rgdal)

## Warning: Paket 'rgdal' wurde unter R Version 4.2.3 erstellt

## Lade nötiges Paket: sp

## Warning: Paket 'sp' wurde unter R Version 4.2.3 erstellt

## Please note that rgdal will be retired during October 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2023/05/15/evolution4.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-7, (SVN revision 1203)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.5.2, released 2022/09/02
## Path to GDAL shared files: C:/Users/EggerT/AppData/Local/Programs/R/R-4.2.2/library/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: C:/Users/EggerT/AppData/Local/Programs/R/R-4.2.2/library/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:2.0-0
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.

## 
## Attache Paket: 'rgdal'

## Das folgende Objekt ist maskiert 'package:mosaic':
## 
##     project

library(tidyverse)

## Warning: Paket 'tidyverse' wurde unter R Version 4.2.3 erstellt

## Warning: Paket 'readr' wurde unter R Version 4.2.3 erstellt

## Warning: Paket 'forcats' wurde unter R Version 4.2.3 erstellt

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ lubridate 1.9.1     ✔ stringr   1.5.0
## ✔ purrr     1.0.1     ✔ tibble    3.1.8

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ mosaic::count()  masks dplyr::count()
## ✖ purrr::cross()   masks mosaic::cross()
## ✖ mosaic::do()     masks dplyr::do()
## ✖ Matrix::expand() masks tidyr::expand()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ Matrix::pack()   masks tidyr::pack()
## ✖ mosaic::stat()   masks ggplot2::stat()
## ✖ mosaic::tally()  masks dplyr::tally()
## ✖ Matrix::unpack() masks tidyr::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

arc.check_product()

## product: ArcGIS Pro (13.1.3.41833)
## license: Advanced
## version: 1.0.1.306

arc.check_portal()

## *** Current
##   url        : https://www.arcgis.com/
##   version    : 2023.2
##   user       : Tobias_98_sds
##   organization   : Spatial Data Science:  The New Frontier in Analytics

1. Hotspot Analysis [2] [3]

1.1 Explore raw data

Read the shapefile

# Read in data
tes_data <- st_read('data/Tucson_Tree_Equity_Scores.shp')

## Reading layer `Tucson_Tree_Equity_Scores' from data source 
##   `S:\01_Master_Applied_Sciences\EsriTraining\PatternDetection_SpaceTime\data\Tucson_Tree_Equity_Scores.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 466 features and 33 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -111.0588 ymin: 31.99027 xmax: -110.7062 ymax: 32.32086
## Geodetic CRS:  WGS 84

First 6 rows of the data

head(tes_data)

There are in total 814 NA values

sum(is.na(tes_data))

## [1] 814

Rows need to be dropped for further analysis and tests

Remaining features (rows): 329 out of 466

#tes_data[!colSums(is.na(tes_data)), ]
#tes_data[!rowSums(is.na(tes_data)), ]

tes_data_clean <- na.omit(tes_data)

Select specific fields

tes_data_relevant <- select(tes_data_clean, c("OBJECTID", "NAME", "TreeEquity", "geometry"))
head(tes_data_relevant)

Visualize the tree equity variable (treEqty):

# Visualize tree equity as histogram
hist(tes_data_relevant$TreeEquity, main = "Distribution of Tree Equity Scores", xlab = "Tree Equity Score", ylab = "Frequency")

# Visualize tree equity across neighborhoods
ggplot(tes_data_relevant) +
  geom_sf(aes(fill = TreeEquity), color = "black", lwd = 0.15) +
  scale_fill_gradient(name = "Tree Equity Score",
                      low = "white",
                      high = "darkgreen") +
  ggtitle("Tree Equity Scores of Tucson Neighborhoods") +
  labs(caption = "Data source: City of Tucson") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "right")

1.2 Check empty neighbor sets

List of neighbors for each polygon in ‘tes_data’

# Create a neighbor list based on queen contiguity
list_nb <- poly2nb(tes_data_relevant, queen = TRUE)

Check for empty neighbor sets (polygons without neighbors)

# Check for empty neighbor sets
# card() calculates number of neighbors for each polygon in the list
# which() finds polygons with 0 neighbors
empty_nb <- which(card(list_nb) == 0)
empty_nb

## [1]  65  85 233

# Remove polygons with empty neighbor sets from the data
tes_subset <- tes_data_relevant[-empty_nb, ]

Neighbor names which are empty

# Subset 'tes_data' to extract polygons with empty neighbor sets
empty_polygons <- tes_data_relevant[empty_nb, ]
empty_polygons$NAME   # print neighborhood names

## [1] "Antigua Village" "Mortimore"       "Sycamore Park"

1.3 Global G test

Test for global spatial autocorrelation. Uses a monte carlo simulation approach.

Tests the null hypothesis of no spatial autocorrelation against

the alternative hypothesis of positive spatial autocorrelation.

# Now that we removed empty neighbor sets (tes_subset)
# Identify neighbors with queen contiguity (edge/vertex touching)
tes_nb <- poly2nb(tes_subset, queen = TRUE)

# Binary weighting assigns a weight of 1 to all neighboring features 
# and a weight of 0 to all other features
tes_w_binary <- nb2listw(tes_nb, style="B")

# Calculate spatial lag of TreEqty
tes_lag <- lag.listw(tes_w_binary, tes_subset$TreeEquity)

# Test for global G statistic of TreEqty
globalG.test(tes_subset$TreeEquity, tes_w_binary)

## 
##  Getis-Ord global G statistic
## 
## data:  tes_subset$TreeEquity 
## weights: tes_w_binary 
## 
## standard deviate = 3.0686, p-value = 0.001075
## alternative hypothesis: greater
## sample estimates:
## Global G statistic        Expectation           Variance 
##       1.558256e-02       1.519585e-02       1.588146e-08

1.4 Local Gi Test

Test for local spatial autocorrelation (hotspots).

# Identify neighbors, create weights, calculate spatial lag
tes_nbs <- tes_subset |> 
  mutate(
    nb = st_contiguity(geometry),        # neighbors share border/vertex
    wt = st_weights(nb),                 # row-standardized weights
    tes_lag = st_lag(TreeEquity, nb, wt)    # calculate spatial lag of TreEqty
  )

GI is the ratio of the spatial lag of a feature to the sum of the feature’s values for its neighbors.

Positive GI value: Feature an its neighbors have high values

Negative GI value: Feature an its neighbors have low values

Magnitude GI value: Indicates the strength of the clustering

# Calculate the Gi using local_g_perm
tes_hot_spots <- tes_nbs |> 
  mutate(
    Gi = local_g_perm(TreeEquity, nb, wt, nsim = 999)
    # nsim = number of Monte Carlo simulations (999 is default)
  ) |> 
  # The new 'Gi' column itself contains a dataframe 
  # We can't work with that, so we need to 'unnest' it
  unnest(Gi)

Visualization of Gi values across Tucson

# Cursory visualization
# Plot looks at gi values for all locations
tes_hot_spots |> 
  ggplot((aes(fill = gi))) +
  geom_sf(color = "black", lwd = 0.15) +
  scale_fill_gradient2() # makes the value 0 (random) be the middle

# Create a new data frame called 'tes_hot_spots"
tes_hot_spots |> 
  # with the columns 'gi' and 'p_folded_sim"
  # 'p_folded_sim' is the p-value of a folded permutation test
  select(gi, p_folded_sim) |> 
  mutate(
    # Add a new column called "classification"
    classification = case_when(
      # Classify based on the following criteria:
      gi > 0 & p_folded_sim <= 0.01 ~ "Very hot (99 %)",
      gi > 0 & p_folded_sim <= 0.05 ~ "Hot (95 %)",
      gi > 0 & p_folded_sim <= 0.1 ~ "Somewhat hot (90 %)",
      gi < 0 & p_folded_sim <= 0.01 ~ "Very cold (99 %)",
      gi < 0 & p_folded_sim <= 0.05 ~ "Cold (95 %)",
      gi < 0 & p_folded_sim <= 0.1 ~ "Somewhat cold (90 %)",
      TRUE ~ "Insignificant"
    ),
    # Convert 'classification' into a factor for easier plotting
    classification = factor(
      classification,
      levels = c("Very hot (99 %)", "Hot (95 %)", "Somewhat hot (90 %)",
                 "Insignificant",
                 "Somewhat cold (90 %)", "Cold (95 %)", "Very cold (99 %)")
    )
  ) |> 
  # Visualize the results with ggplot2
  ggplot(aes(fill = classification)) +
  geom_sf(color = "black", lwd = 0.1) +
  scale_fill_brewer(type = "div", palette = 5) +
  theme_void() +
  labs(
    fill = "Hot Spot Classification",
    title = "Tree Equity Hot Spots in Tucson"
  )

2. Outlier analysis [4]

2.1 QUEEN, weight matrix

Contiguity based neighbours and row standardised weight matrix are already created

summary(tes_nb)

## Neighbour list object:
## Number of regions: 326 
## Number of nonzero links: 1610 
## Percentage nonzero weights: 1.514923 
## Average number of links: 4.93865 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 
## 10 36 47 59 58 40 29 25 13  4  2  1  1  1 
## 10 least connected regions:
## 74 105 109 151 182 220 322 328 393 405 with 1 link
## 1 most connected region:
## 91 with 14 links

summary(tes_w_binary)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 326 
## Number of nonzero links: 1610 
## Percentage nonzero weights: 1.514923 
## Average number of links: 4.93865 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 
## 10 36 47 59 58 40 29 25 13  4  2  1  1  1 
## 10 least connected regions:
## 74 105 109 151 182 220 322 328 393 405 with 1 link
## 1 most connected region:
## 91 with 14 links
## 
## Weights style: B 
## Weights constants summary:
##     n     nn   S0   S1    S2
## B 326 106276 1610 3220 38544

2.2 Local Moran’s l

The Moran’s method computes the li values, to provide neighbor weighting information for the polygon.

fips <- order(tes_subset$OBJECTID)
localMI <- localmoran(tes_subset$TreeEquity, tes_w_binary)
head(localMI)

##           Ii          E.Ii      Var.Ii       Z.Ii Pr(z != E(Ii))
## 1  0.4192189 -8.083268e-04 0.259413016  0.8246726     0.40955749
## 2  2.9040296 -9.187799e-03 2.967616765  1.6910988     0.09081793
## 3  1.8058728 -4.339430e-03 1.400035098  1.5298894     0.12604410
## 4  0.1354307 -1.167864e-05 0.003748476  2.2122138     0.02695189
## 5 -2.5095456 -1.038802e-02 3.328459626 -1.3698468     0.17073473
## 7  3.8098967 -1.336336e-02 4.267618491  1.8507214     0.06420965

Local Moran matrix

head(data.frame(localMI[fips,]))

2.3 Mapping local Moran’s l

Before mapping the local Moran’s l map, the localMI dataframe need to be appended to the tes subset dataframe

tes_subset.localMI <- cbind(tes_subset,localMI)

The local Moran’s l values can be plotted by usingthe mapping function of the tmap package

tm_shape(tes_subset.localMI) +
  tm_fill(col = "Ii", 
          style = "pretty",
          palette = "RdBu",
          title = "local moran statistics") +
  tm_borders(alpha = 0.5)

## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

2.4 Mapping local Moran’s l p-values

Again the tmap package can be used to plot the Moran’s l p-values

tm_shape(tes_subset.localMI) +
  tm_fill(col = "Pr.z....E.Ii..", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Blues", 
          title = "local Moran's I p-values") +
  tm_borders(alpha = 0.5)

2.5 Mapping both local Moran’s l values and p-values

localMI.map <- tm_shape(tes_subset.localMI) +
  tm_fill(col = "Ii", 
          style = "pretty", 
          title = "local moran statistics") +
  tm_borders(alpha = 0.5)

pvalue.map <- tm_shape(tes_subset.localMI) +
  tm_fill(col = "Pr.z....E.Ii..", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Blues", 
          title = "local Moran's I p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(localMI.map, pvalue.map, asp=1, ncol=2)

## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

2.6 Mapping final cluster map

The LISA cluster map shows the significant locations color coded by type of spatial autocorrelation.

First the Moron scatterplot need to be plotted.

The plot is split into 4 quadrants: * Top right corner: High tree equity (high-high locations)

moran_plot <- moran.plot(tes_subset$TreeEquity, tes_w_binary, labels=as.character(tes_subset$OBJECTID), xlab="TreeEquity", ylab="Spatially TreeEquity Tucson")

Plotting moron scatterplot with standardised variable.

Center and scale the variable.

tes_subset$Z.TreeEquity <- scale(tes_subset$TreeEquity) %>% as.vector

Plot moron scatterplot.

moran_plot_norm <- moran.plot(tes_subset$Z.TreeEquity, tes_w_binary, labels=as.character(tes_subset$OBJECTID), xlab="z-TreeEquity", ylab="Spatially Lag z-TreeEquity Tucson")

Preparation for LISA cluster map: * Quadrants

quadrant <- vector(mode="numeric",length=nrow(localMI))

Center variable of interest around its mean

DV <- tes_subset$TreeEquity - mean(tes_subset$TreeEquity)

Centering local Moran’s around the mean

C_mI <- localMI[,1] - mean(localMI[,1])

Set a statistical significance level for the local Moran

signif <- 0.05

Define the cluster categories

quadrant[DV >0 & C_mI>0] <- 4      
quadrant[DV <0 & C_mI<0] <- 1      
quadrant[DV <0 & C_mI>0] <- 2
quadrant[DV >0 & C_mI<0] <- 3

Place non-significant Moran in the category 0.

quadrant[localMI[,5]>signif] <- 0

Additional tmap code for a better map, title, legen representation

# Source: https://stackoverflow.com/questions/60892033/how-do-you-position-the-title-and-legend-in-tmap
# make some bbox magic
bbox_new <- st_bbox(tes_subset.localMI) # current bounding box

xrange <- bbox_new$xmax - bbox_new$xmin # range of x values
yrange <- bbox_new$ymax - bbox_new$ymin # range of y values

# bbox_new[1] <- bbox_new[1] - (0.25 * xrange) # xmin - left
 bbox_new[3] <- bbox_new[3] + (0.25 * xrange) # xmax - right
# bbox_new[2] <- bbox_new[2] - (0.25 * yrange) # ymin - bottom
bbox_new[4] <- bbox_new[4] + (0.2 * yrange) # ymax - top

bbox_new <- bbox_new %>%  # take the bounding box ...
  st_as_sfc() # ... and make it a sf polygon

Build the LISA map

tes_subset.localMI$quadrant <- quadrant
colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c")
clusters <- c("insignificant", "low-low", "low-high", "high-low", "high-high")

tm_shape(tes_subset.localMI, bbox = bbox_new) +
  tm_fill(col = "quadrant", style = "cat", palette = colors[c(sort(unique(quadrant)))+1], labels = clusters[c(sort(unique(quadrant)))+1], popup.vars = c("Postal.Code")) +
  tm_view(set.zoom.limits = c(105,207)) +
  tm_borders(alpha=0.5) +
  tm_layout(legend.position = c("right", "center"), 
            title= 'Tree Equity Outliers in Tucson', 
            title.position = c('left', 'top')) +
  # Source: https://stackoverflow.com/questions/41292949/how-to-export-tm-object-without-chart-borders
  tm_layout(frame = FALSE)