1 Introduction

Bank branches are part of the everyday financial infrastructure of a city. Their locations reflect customer demand, employment density, retail activity, and transport access.

The aim of this project is to analyze the spatial distribution of bank branches in New York City using point pattern analysis. The central research question is:

Are bank branches in New York City distributed randomly, or do they form statistically and economically meaningful spatial clusters?

The empirical focus is on OpenStreetMap features tagged as amenity = bank. The analysis combines maps, kernel intensity estimation, nearest neighbour distances, and Ripley’s L function.

2 Literature Review

Point pattern analysis studies event locations inside a defined spatial window. Here, the events are bank branches and the window is New York City.

A useful benchmark is complete spatial randomness (CSR), where points are independently and uniformly distributed. Departures from CSR can indicate clustering or regular spacing.
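
As a rough illustration of what CSR looks like in practice, the sketch below simulates independent uniform points in a unit square and checks them with quadrat counts. The window, the number of points, and the 5 x 5 grid are illustrative assumptions, not values used elsewhere in this report.

```r
# Simulate CSR in the unit square: a fixed number of independent uniform points.
set.seed(42)
n <- 400
x <- runif(n)
y <- runif(n)

# Count points in a k x k grid of equal-area quadrats.
k <- 5
breaks <- seq(0, 1, length.out = k + 1)
ix <- cut(x, breaks = breaks, include.lowest = TRUE)
iy <- cut(y, breaks = breaks, include.lowest = TRUE)
counts <- as.vector(table(ix, iy))

# Under CSR each quadrat expects n / k^2 points, and the variance-to-mean
# ratio of the counts should be close to 1 (well above 1 suggests clustering).
expected <- n / k^2
vmr <- var(counts) / mean(counts)

# Chi-squared quadrat statistic; a small upper-tail p-value points to clustering.
chi_sq <- sum((counts - expected)^2) / expected
p_value <- pchisq(chi_sq, df = k^2 - 1, lower.tail = FALSE)
```

A clustered pattern would put many points into a few quadrats and leave others nearly empty, which inflates both the variance-to-mean ratio and the chi-squared statistic.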

Three tools are used in the quantitative part of this report:

  • Kernel intensity estimation shows where the expected number of points per unit area is higher or lower.
  • Nearest neighbour distance analysis summarizes how close each bank branch is to its closest other branch.
  • Ripley’s K/L function evaluates clustering across a range of spatial distances, instead of using only one distance threshold.

This report follows the point pattern approach described by Baddeley, Rubak, and Turner (2015). The observed pattern is compared with reference patterns simulated under CSR.

From an urban economics perspective, banks benefit from accessibility, visibility, and proximity to dense economic activity. This suggests that bank branches should be more common in central business districts, retail streets, office areas, and transport hubs.

This expectation gives the statistical analysis an economic interpretation. Clustering would suggest that financial services follow the broader structure of urban activity.

OpenStreetMap is used as a volunteered geographic information source. This makes it suitable for reproducible exploratory research, but the data should be interpreted with awareness of possible incompleteness and uneven tagging quality.

3 Data, Description And Graphs

3.1 Data Source

The data are obtained from OpenStreetMap using the osmdata package in R. The query selects all mapped objects with the tag amenity = bank inside the New York City search bounding box. Both point and polygon features are used:

  • OSM point features are treated directly as bank branch locations.
  • OSM polygon features are converted to representative interior points using st_point_on_surface().

The resulting points are clipped to the New York City boundary returned by osmdata::getbb(). All distance-based analysis uses the projected coordinate system UTM Zone 18N (EPSG:32618), whose units are metres.
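
The query and geometry steps described above could look roughly like the following sketch. The function name fetch_bank_points, the place string, and the assumption that the query returns both tagged nodes and polygons are hypothetical; the report's actual chunk is not shown, and this version omits the clipping to the city boundary.

```r
# Hypothetical helper, not the report's own code: fetch OSM banks and return
# one point per feature in a projected CRS (needs osmdata, sf, and network access).
fetch_bank_points <- function(place = "New York City, New York, USA", crs = 32618) {
  osm <- osmdata::opq(bbox = osmdata::getbb(place)) |>
    osmdata::add_osm_feature(key = "amenity", value = "bank") |>
    osmdata::osmdata_sf()

  # osm_points also holds untagged vertices of ways, so keep tagged nodes only.
  nodes <- osm$osm_points
  nodes <- nodes[!is.na(nodes$amenity) & nodes$amenity == "bank", ]

  # Project first, then reduce each polygon to a point guaranteed to lie inside it.
  node_geom <- sf::st_geometry(sf::st_transform(nodes, crs))
  poly_geom <- sf::st_point_on_surface(
    sf::st_geometry(sf::st_transform(osm$osm_polygons, crs))
  )

  # Record where each point came from, mirroring the source_geometry column.
  sf::st_sf(
    source_geometry = c(
      rep("point", length(node_geom)),
      rep("polygon", length(poly_geom))
    ),
    geometry = c(node_geom, poly_geom)
  )
}
```

Projecting before calling st_point_on_surface() avoids computing interior points on unprojected longitude and latitude coordinates.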

The report uses a local cache after the first successful data collection, so repeated rendering does not require a new OpenStreetMap query.
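
A minimal sketch of such a cache is shown here. The helper name with_cache and the stand-in slow_query function are assumptions for illustration; the report's own caching code may differ.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise run the computation once and store its result.
with_cache <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Stand-in for an expensive OpenStreetMap query; counts how often it runs.
calls <- 0
slow_query <- function() {
  calls <<- calls + 1
  data.frame(id = 1:3)
}

cache_path <- tempfile(fileext = ".rds")
first <- with_cache(cache_path, slow_query)   # runs the query and writes the file
second <- with_cache(cache_path, slow_query)  # reads the file, no new query
```

Deleting the cache file forces a fresh query, which is worth doing when the underlying OpenStreetMap data may have changed.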

3.2 Descriptive Overview

bank_summary <- data.frame(
  statistic = c(
    "Number of bank features",
    "Features with an available name",
    "Features with an available brand",
    "Point features from OSM",
    "Polygon features converted to points"
  ),
  value = c(
    nrow(banks),
    sum(!is.na(banks$name) & banks$name != ""),
    sum(!is.na(banks$brand) & banks$brand != ""),
    sum(banks$source_geometry == "point", na.rm = TRUE),
    sum(banks$source_geometry == "polygon", na.rm = TRUE)
  ),
  stringsAsFactors = FALSE
)

knitr::kable(bank_summary, caption = "Summary of OSM bank branch data.")

Summary of OSM bank branch data.

statistic                               value
Number of bank features                  2467
Features with an available name          1137
Features with an available brand          888
Point features from OSM                  2320
Polygon features converted to points      147

top_bank_names <- banks |>
  sf::st_drop_geometry() |>
  dplyr::mutate(display_name = dplyr::coalesce(brand, name, operator)) |>
  dplyr::filter(!is.na(display_name), display_name != "") |>
  dplyr::count(display_name, sort = TRUE) |>
  dplyr::slice_head(n = 10)

knitr::kable(top_bank_names, caption = "Most frequent bank names or brands in the OSM data.")

Most frequent bank names or brands in the OSM data.

display_name         n
Chase              243
TD Bank            116
Bank of America    114
Citibank           105
Santander           48
Capital One         47
Citizens Bank       43
Wells Fargo         29
Apple Bank          28
Flagstar Bank       25

The dataset is volunteered geographic information, not an official bank registry. Some branches may be missing or inconsistently described, but the data are suitable for exploratory spatial analysis.

The descriptive tables check sample size and attribute completeness. Even when names are missing, the coordinates remain useful for the clustering analysis.
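
The attribute completeness can be expressed as shares of all features. The short calculation below uses only the counts reported in the summary table; the variable names are illustrative.

```r
# Counts copied from the summary table of OSM bank branch data.
n_features <- 2467
counts <- c(name = 1137, brand = 888, point = 2320, polygon = 147)

# Share of all features, in percent, rounded to one decimal place.
share_pct <- round(100 * counts / n_features, 1)
share_pct

# Point and polygon features should together account for every record.
geometry_total_ok <- counts[["point"]] + counts[["polygon"]] == n_features
```

Fewer than half of the features carry a name and about a third carry a brand, which is why the clustering analysis relies on coordinates alone.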

3.3 Maps And Visual Exploration

3.3.1 Bank Branch Locations

ggplot() +
  geom_sf(data = nyc_boundary, fill = "grey96", color = "grey45", linewidth = 0.3) +
  geom_sf(data = banks, color = "#1f78b4", alpha = 0.7, size = 0.9) +
  coord_sf(datum = NA) +
  labs(
    title = "Bank Branches Mapped in OpenStreetMap",
    subtitle = "New York City, amenity = bank",
    x = NULL,
    y = NULL
  )

The initial map gives a first visual check of the research question, not a formal test. Dense groups of points in Manhattan or Downtown Brooklyn would suggest clustering in major commercial areas.

3.3.2 Kernel Intensity Surface

The following map estimates the spatial intensity of bank branches. It should be read as a smoothed surface of relative concentration, not as a precise count at each location.

nyc_window <- spatstat.geom::as.owin(nyc_boundary)
bank_coordinates <- sf::st_coordinates(banks)

bank_ppp <- spatstat.geom::ppp(
  x = bank_coordinates[, "X"],
  y = bank_coordinates[, "Y"],
  window = nyc_window,
  checkdup = FALSE
)
bank_density <- spatstat.explore::density.ppp(
  bank_ppp,
  sigma = 1500,
  edge = TRUE,
  at = "pixels"
)

plot(bank_density, main = "Kernel Intensity of Bank Branches")
plot(bank_ppp, add = TRUE, pch = 20, cex = 0.35, col = "white")

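The surface is computed with a Gaussian kernel of bandwidth sigma = 1500 metres and with edge correction, so its values are expected branches per square metre. A larger bandwidth would merge neighbouring hot spots, while a smaller one would fragment them. The map helps identify the areas where bank branches are most densely concentrated.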
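
To make the idea of a kernel intensity estimate concrete, the sketch below evaluates an uncorrected Gaussian kernel estimate at a single location in base R. The function name and the toy coordinates are assumptions for illustration, and unlike the density.ppp() call above it applies no edge correction.

```r
# Uncorrected Gaussian kernel intensity estimate at one location (ux, uy):
# the sum of isotropic Gaussian densities centred on each observed point.
kernel_intensity <- function(px, py, ux, uy, sigma) {
  d2 <- (px - ux)^2 + (py - uy)^2
  sum(exp(-d2 / (2 * sigma^2)) / (2 * pi * sigma^2))
}

# Toy coordinates in metres (illustrative, not taken from the OSM data).
px <- c(0, 100, 200, 5000)
py <- c(0, 50, -50, 5000)

near_cluster <- kernel_intensity(px, py, ux = 100, uy = 0, sigma = 1500)
far_away <- kernel_intensity(px, py, ux = 20000, uy = 20000, sigma = 1500)

# Convert from points per square metre to points per square kilometre.
near_cluster_km2 <- near_cluster * 1e6
```

Each point contributes a total mass of one spread over its neighbourhood, so the estimate is highest where many points lie within roughly one bandwidth of the evaluation location.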

4 Quantitative Point Pattern Analysis

This section provides the main statistical evidence. The null hypothesis is CSR, while the alternative is spatial clustering.

The analysis uses kernel intensity, nearest neighbour distances, and Ripley’s L function. Together, they describe clustering locally and across larger distances.

4.1 Nearest Neighbour Distances

Nearest neighbour distance measures how close each bank branch is to its closest other branch. Short distances indicate local concentration.
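
For a small set of points the same quantity can be computed directly from the pairwise distance matrix. The sketch below is a base R illustration with made-up coordinates; the function name is an assumption, and this brute-force approach is only practical for small samples.

```r
# Brute-force nearest neighbour distances from the full distance matrix.
nearest_neighbour_distance <- function(x, y) {
  d <- as.matrix(dist(cbind(x, y)))  # all pairwise Euclidean distances
  diag(d) <- Inf                     # a point is not its own neighbour
  apply(d, 1, min)
}

# Four illustrative points: two close together, one nearby, one isolated.
x <- c(0, 3, 0, 10)
y <- c(0, 4, 1, 10)
nn <- nearest_neighbour_distance(x, y)
nn
```

The isolated fourth point has by far the largest distance, which mirrors how a few remote branches can pull the mean well above the median.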

nnd_m <- spatstat.geom::nndist(bank_ppp)
nnd_summary <- data.frame(
  statistic = c("Minimum", "First quartile", "Median", "Mean", "Third quartile", "Maximum"),
  distance_m = as.numeric(summary(nnd_m)[c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")]),
  stringsAsFactors = FALSE
)

knitr::kable(
  nnd_summary,
  digits = 1,
  caption = "Nearest neighbour distance summary, in metres."
)

Nearest neighbour distance summary, in metres.

statistic         distance_m
Minimum                  0.1
First quartile           3.0
Median                  11.8
Mean                    81.9
Third quartile          68.5
Maximum               6755.1

ggplot(data.frame(distance_m = nnd_m), aes(x = distance_m)) +
  geom_histogram(bins = 35, fill = "#2b8cbe", color = "white") +
  labs(
    title = "Distribution of Nearest Neighbour Distances",
    x = "Distance to nearest bank branch, metres",
    y = "Number of branches"
  )

nn_test <- spatstat.explore::clarkevans.test(bank_ppp, correction = "cdf")
nn_test
## 
##  Clark-Evans test
##  CDF correction
##  Z-test
## 
## data:  bank_ppp
## R = 0.19606, p-value < 2.2e-16
## alternative hypothesis: two-sided

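The Clark-Evans test compares the observed mean nearest neighbour distance with the mean expected under CSR at the same intensity. The CDF correction is used because the New York City window is an irregular polygon. The reported test is two-sided, so the direction of the departure is read from the ratio: R = 0.196 is far below 1, and together with a p-value below 2.2e-16 this supports clustering rather than regular spacing.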
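
The ratio itself is simple to compute by hand. The sketch below is an uncorrected version in base R: it ignores edge effects, so it will not reproduce the CDF-corrected value above, and the function name and toy patterns are assumptions for illustration.

```r
# Uncorrected Clark-Evans index: observed mean nearest neighbour distance
# divided by its expectation under CSR, 1 / (2 * sqrt(lambda)).
clark_evans_index <- function(x, y, area) {
  d <- as.matrix(dist(cbind(x, y)))
  diag(d) <- Inf
  observed <- mean(apply(d, 1, min))
  lambda <- length(x) / area           # points per unit area
  expected <- 1 / (2 * sqrt(lambda))
  observed / expected
}

# Regular pattern: a 10 x 10 grid in the unit square.
g <- (seq_len(10) - 0.5) / 10
grid <- expand.grid(x = g, y = g)
r_regular <- clark_evans_index(grid$x, grid$y, area = 1)

# Clustered pattern: 100 points packed into a small central square.
set.seed(1)
r_clustered <- clark_evans_index(
  runif(100, 0.45, 0.55),
  runif(100, 0.45, 0.55),
  area = 1
)
```

The grid gives a ratio of 2, well above 1, while the packed pattern gives a ratio far below 1, which is the same direction as the result reported for the bank branches.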

4.2 Ripley’s L Function

Ripley’s K function, commonly transformed into the L function, evaluates clustering at multiple spatial scales. If the observed L curve lies above the CSR simulation envelope, the point pattern shows more clustering than expected under spatial randomness at that distance range.
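
For reference, the standard definitions behind this comparison can be written as follows, where \lambda denotes the intensity (expected number of points per unit area). The notation is the conventional one and is not taken from the report's code.

```latex
\begin{align}
K(r) &= \lambda^{-1}\,
  \mathbb{E}\left[\text{number of further points within distance } r
  \text{ of a typical point}\right], \\
K_{\mathrm{CSR}}(r) &= \pi r^{2}, \\
L(r) &= \sqrt{K(r)/\pi}, \qquad L_{\mathrm{CSR}}(r) = r .
\end{align}
```

The transformation makes the CSR benchmark a straight line, so values of L(r) above r indicate more neighbours within distance r than randomness would produce.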

max_radius <- min(10000, spatstat.geom::diameter.owin(nyc_window) / 4)
r_values <- seq(0, max_radius, length.out = 60)

if (file.exists(ripley_cache_path)) {
  l_envelope <- readRDS(ripley_cache_path)
} else {
  l_envelope <- spatstat.explore::envelope(
    bank_ppp,
    fun = spatstat.explore::Lest,
    nsim = 29,
    correction = "border",
    r = r_values,
    global = FALSE,
    verbose = FALSE
  )

  saveRDS(l_envelope, ripley_cache_path)
}

plot(
  l_envelope,
  main = "Ripley's L Function With CSR Simulation Envelope",
  xlab = "Distance r, metres"
)

The L function gives evidence about clustering at several distances. This report uses 29 CSR simulations (nsim = 29) with pointwise envelopes and the border correction to keep the calculation practical; the result is cached after the first successful run.
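
The logic of a simulation envelope can be sketched in base R. The example below uses a naive L estimate without edge correction on a unit square, with a made-up clustered pattern, so it illustrates the mechanism only and is not a substitute for the envelope() call above; all names are assumptions.

```r
# Naive L estimate on a window of given area (no edge correction).
l_naive <- function(x, y, r, area = 1) {
  n <- length(x)
  d <- as.vector(dist(cbind(x, y)))  # each unordered pair once
  k <- sapply(r, function(ri) area * 2 * sum(d <= ri) / (n * (n - 1)))
  sqrt(k / pi)
}

set.seed(2024)
n <- 50
r <- seq(0.01, 0.1, by = 0.01)
nsim <- 29  # same number of simulations as in the envelope() call

# Observed pattern: all points packed into a small square, so strongly clustered.
obs_x <- runif(n, 0.50, 0.51)
obs_y <- runif(n, 0.50, 0.51)
l_obs <- l_naive(obs_x, obs_y, r)

# Simulate CSR patterns with the same number of points; keep min and max per r.
sims <- replicate(nsim, l_naive(runif(n), runif(n), r))
lo <- apply(sims, 1, min)
hi <- apply(sims, 1, max)

# Pointwise two-sided significance level implied by min/max envelopes.
alpha <- 2 / (nsim + 1)
above <- l_obs > hi
```

With 29 simulations the min/max envelope corresponds to a pointwise two-sided level of 2/30, about 6.7 percent at each distance, so it should be read as a diagnostic band and not as a single simultaneous test over all distances.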

Taken together, the results answer the research question. A clustered nearest neighbour result and an L curve above the CSR envelope would indicate that bank branches are not randomly distributed.

4.3 Economic Interpretation

The expected spatial pattern is not random. Bank branches serve residents, workers, firms, shoppers, and visitors, so they should concentrate in central and commercial areas.

If the statistical analysis identifies clustering, this should therefore be understood as the spatial expression of several economic mechanisms:

  • banks seek locations with high customer density and high transaction potential;
  • commercial streets and office districts generate repeated demand for financial services;
  • transit accessibility increases foot traffic and makes branches easier to reach;
  • competing banks often locate near each other in high value areas rather than spreading evenly across space.

This means clustering has both a statistical and economic meaning. It shows how financial services are tied to urban activity.

5 Conclusions

This project analyzed 2467 bank branch locations in New York City using OpenStreetMap and point pattern methods. The maps show a clear concentration in Manhattan, with smaller clusters in Brooklyn and Queens.

The statistical results support the same conclusion. The median nearest neighbour distance is only 11.8 metres, while the Clark-Evans test gives R = 0.196 and a p-value below 2.2e-16. Ripley’s L curve also lies above the CSR envelope across the plotted distances, which indicates clustering at several spatial scales.

The answer to the research question is therefore clear: bank branches in New York City are clustered rather than randomly distributed. This clustering matches the geography of demand for financial services, especially around major commercial and business areas.

Several limitations remain. OpenStreetMap is not an official registry, and the analysis does not include branch size, opening hours, customer volumes, or socioeconomic variables. Future work could compare bank locations with population, income, employment, subway stations, or commercial land use.

6 References

Baddeley, A., Rubak, E., & Turner, R. (2015). Spatial Point Patterns: Methodology and Applications with R. Chapman and Hall/CRC.

OpenStreetMap contributors. (2026). OpenStreetMap. https://www.openstreetmap.org/

O’Sullivan, D., & Unwin, D. J. (2010). Geographic Information Analysis. Wiley.

7 Reproducibility Notes

sessionInfo()
## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=Polish_Poland.utf8  LC_CTYPE=Polish_Poland.utf8   
## [3] LC_MONETARY=Polish_Poland.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=Polish_Poland.utf8    
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.51             spatstat.explore_3.8-0 nlme_3.1-168          
##  [4] spatstat.random_3.4-5  spatstat.geom_3.7-3    spatstat.univar_3.2-0 
##  [7] spatstat.data_3.1-9    ggplot2_4.0.3          dplyr_1.2.1           
## [10] osmdata_0.3.0          sf_1.1-1              
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10           generics_0.1.4        class_7.3-23         
##  [4] tensor_1.5.1          KernSmooth_2.23-26    lattice_0.22-6       
##  [7] digest_0.6.39         magrittr_2.0.5        spatstat.utils_3.2-3 
## [10] evaluate_1.0.5        grid_4.5.0            RColorBrewer_1.1-3   
## [13] fastmap_1.2.0         jsonlite_2.0.0        Matrix_1.7-3         
## [16] spatstat.sparse_3.1-0 e1071_1.7-17          DBI_1.3.0            
## [19] scales_1.4.0          jquerylib_0.1.4       abind_1.4-8          
## [22] cli_3.6.6             rlang_1.2.0           units_1.0-1          
## [25] polyclip_1.10-7       withr_3.0.2           cachem_1.1.0         
## [28] yaml_2.3.12           tools_4.5.0           deldir_2.0-4         
## [31] vctrs_0.7.3           R6_2.6.1              proxy_0.4-29         
## [34] lifecycle_1.0.5       classInt_0.4-11       pkgconfig_2.0.3      
## [37] pillar_1.11.1         bslib_0.11.0          gtable_0.3.6         
## [40] glue_1.8.1            Rcpp_1.1.1-1.1        xfun_0.57            
## [43] tibble_3.3.1          tidyselect_1.2.1      goftest_1.2-3        
## [46] farver_2.1.2          htmltools_0.5.9       labeling_0.4.3       
## [49] rmarkdown_2.31        compiler_4.5.0        S7_0.2.2