Bank branches are part of the everyday financial infrastructure of a city. Their locations reflect customer demand, employment density, retail activity, and transport access.
The aim of this project is to analyze the spatial distribution of bank branches in New York City using point pattern analysis. The central research question is:
Are bank branches in New York City distributed randomly, or do they form statistically and economically meaningful spatial clusters?
The empirical focus is on OpenStreetMap features tagged as amenity = bank. The analysis combines maps, kernel intensity estimation, nearest neighbour distances, and Ripley’s L function.
Point pattern analysis studies event locations inside a defined spatial window. Here, the events are bank branches and the window is New York City.
A useful benchmark is complete spatial randomness (CSR), where points are independently and uniformly distributed. Departures from CSR can indicate clustering or regular spacing.
Three tools are used in the quantitative part of this report: kernel intensity estimation, nearest neighbour distances, and Ripley’s L function.
This report follows the point pattern approach described by Baddeley, Rubak, and Turner (2015). The observed pattern is compared with a random reference pattern.
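As a rough illustration of what a random reference pattern looks like, the sketch below places a fixed number of points independently and uniformly in a square window and computes the implied intensity. The window size, the point count, and the seed are illustrative assumptions, not values from this report.

```r
library(spatstat.geom)
library(spatstat.random)

set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

# Hypothetical 10 km x 10 km window in metres; the report uses the NYC boundary instead.
toy_window <- owin(xrange = c(0, 10000), yrange = c(0, 10000))

# Fixed-count version of CSR: points placed independently and uniformly in the window.
csr_pattern <- runifpoint(n = 500, win = toy_window)

# Estimated intensity: number of points divided by window area (points per square metre).
lambda_hat <- npoints(csr_pattern) / area(toy_window)
lambda_hat
```

Any summary statistic computed on the observed pattern can then be compared with the same statistic computed on patterns like this one.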
From an urban economics perspective, banks benefit from accessibility, visibility, and proximity to dense economic activity. This suggests that bank branches should be more common in central business districts, retail streets, office areas, and transport hubs.
This expectation gives the statistical analysis an economic interpretation. Clustering would suggest that financial services follow the broader structure of urban activity.
OpenStreetMap is used as a volunteered geographic information source. This makes it suitable for reproducible exploratory research, but the data should be interpreted with awareness of possible incompleteness and uneven tagging quality.
The data are obtained from OpenStreetMap using the osmdata package in R. The query selects all mapped objects with the tag amenity = bank inside the New York City search bounding box. Both point and polygon features are used: point features are kept as mapped, and polygon features are converted to representative points with st_point_on_surface(). The resulting points are clipped to the New York City boundary returned by osmdata::getbb(). All distance analysis uses UTM Zone 18N (EPSG:32618), which uses metres as units.
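The conversion of a polygon feature to a single point can be sketched on a toy geometry. The rectangle below is a made-up building outline, not a real bank, and projecting before calling st_point_on_surface() is an assumption of this sketch (it keeps the computation in planar metres); the report's own order of operations is not shown.

```r
library(sf)

# Toy polygon standing in for a bank mapped as a building outline (WGS84 lon/lat).
toy_building <- st_sfc(
  st_polygon(list(rbind(
    c(-73.9860, 40.7480), c(-73.9850, 40.7480),
    c(-73.9850, 40.7490), c(-73.9860, 40.7490),
    c(-73.9860, 40.7480)
  ))),
  crs = 4326
)

# Project to UTM Zone 18N so that coordinates are in metres.
toy_utm <- st_transform(toy_building, 32618)

# Representative point that is guaranteed to lie on the polygon surface.
toy_point <- st_point_on_surface(toy_utm)

# Check that the point falls inside the original outline.
st_within(toy_point, toy_utm, sparse = FALSE)
```

Unlike a centroid, the point returned by st_point_on_surface() cannot fall outside an irregularly shaped building outline.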
The report uses a local cache after the first successful data collection, so repeated rendering does not require a new OpenStreetMap query.
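A minimal sketch of such a cache is given below. The helper names read_or_build() and query_osm_banks(), the place string, and the temporary file are hypothetical and introduced only for illustration; the query function is defined but not run, because it needs network access.

```r
# Hypothetical helper: return the cached object if it exists, otherwise build and store it.
read_or_build <- function(cache_path, build) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))
  }
  value <- build()
  saveRDS(value, cache_path)
  value
}

# Hypothetical query function (defined only, not called here).
# The place name is an assumption; the calls are standard osmdata functions.
query_osm_banks <- function(place = "New York City") {
  bbox <- osmdata::getbb(place)
  query <- osmdata::add_osm_feature(
    osmdata::opq(bbox = bbox),
    key = "amenity",
    value = "bank"
  )
  osmdata::osmdata_sf(query)
}

# Offline demonstration with a stand-in builder instead of a real query.
demo_path <- tempfile(fileext = ".rds")
first <- read_or_build(demo_path, function() data.frame(id = 1:3))
second <- read_or_build(demo_path, function() stop("not called when the cache exists"))
```

In a report, the same helper could wrap the real query, for example read_or_build("banks_osm.rds", query_osm_banks), where the file name is again only a placeholder.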
bank_summary <- data.frame(
statistic = c(
"Number of bank features",
"Features with an available name",
"Features with an available brand",
"Point features from OSM",
"Polygon features converted to points"
),
value = c(
nrow(banks),
sum(!is.na(banks$name) & banks$name != ""),
sum(!is.na(banks$brand) & banks$brand != ""),
sum(banks$source_geometry == "point", na.rm = TRUE),
sum(banks$source_geometry == "polygon", na.rm = TRUE)
),
stringsAsFactors = FALSE
)
knitr::kable(bank_summary, caption = "Summary of OSM bank branch data.")
| statistic | value |
|---|---|
| Number of bank features | 2467 |
| Features with an available name | 1137 |
| Features with an available brand | 888 |
| Point features from OSM | 2320 |
| Polygon features converted to points | 147 |
top_bank_names <- banks |>
sf::st_drop_geometry() |>
dplyr::mutate(display_name = dplyr::coalesce(brand, name, operator)) |>
dplyr::filter(!is.na(display_name), display_name != "") |>
dplyr::count(display_name, sort = TRUE) |>
dplyr::slice_head(n = 10)
knitr::kable(top_bank_names, caption = "Most frequent bank names or brands in the OSM data.")
| display_name | n |
|---|---|
| Chase | 243 |
| TD Bank | 116 |
| Bank of America | 114 |
| Citibank | 105 |
| Santander | 48 |
| Capital One | 47 |
| Citizens Bank | 43 |
| Wells Fargo | 29 |
| Apple Bank | 28 |
| Flagstar Bank | 25 |
The dataset is volunteered geographic information, not an official bank registry. Some branches may be missing or inconsistently described, but the data are suitable for exploratory spatial analysis.
The descriptive tables check sample size and attribute completeness. Even when names are missing, the coordinates remain useful for the clustering analysis.
ggplot() +
geom_sf(data = nyc_boundary, fill = "grey96", color = "grey45", linewidth = 0.3) +
geom_sf(data = banks, color = "#1f78b4", alpha = 0.7, size = 0.9) +
coord_sf(datum = NA) +
labs(
title = "Bank Branches Mapped in OpenStreetMap",
subtitle = "New York City, amenity = bank",
x = NULL,
y = NULL
)
The initial map gives a first visual check on the research question. Dense groups of points in Manhattan or Downtown Brooklyn would suggest clustering in major commercial areas.
The following map estimates the spatial intensity of bank branches. It should be read as a smoothed surface of relative concentration, not as a precise count at each location.
nyc_window <- spatstat.geom::as.owin(nyc_boundary)
bank_coordinates <- sf::st_coordinates(banks)
bank_ppp <- spatstat.geom::ppp(
x = bank_coordinates[, "X"],
y = bank_coordinates[, "Y"],
window = nyc_window,
checkdup = FALSE
)
bank_density <- spatstat.explore::density.ppp(
bank_ppp,
sigma = 1500,
edge = TRUE,
at = "pixels"
)
plot(bank_density, main = "Kernel Intensity of Bank Branches")
plot(bank_ppp, add = TRUE, pch = 20, cex = 0.35, col = "white")
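The intensity surface is estimated with a fixed bandwidth of sigma = 1500 metres and edge correction, using the Gaussian kernel that density.ppp applies by default. A larger bandwidth would give a smoother surface, and a smaller one would emphasise individual streets and blocks. The map helps identify areas where bank branches are spatially dense.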
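The effect of the bandwidth can be sketched on simulated data. The toy window, point count, seed, and the narrower comparison bandwidth of 500 metres are assumptions made for illustration; only sigma = 1500 comes from the report.

```r
library(spatstat.geom)
library(spatstat.random)
library(spatstat.explore)

set.seed(1)  # arbitrary seed for reproducibility of the sketch

# Hypothetical 10 km x 10 km window with 300 uniformly placed points.
toy_window <- owin(xrange = c(0, 10000), yrange = c(0, 10000))
toy_pattern <- runifpoint(n = 300, win = toy_window)

# Same pattern smoothed with a narrow and a wide Gaussian kernel.
density_narrow <- density(toy_pattern, sigma = 500, edge = TRUE)
density_wide <- density(toy_pattern, sigma = 1500, edge = TRUE)

# Range of estimated intensity values across pixels for each bandwidth.
spread <- function(surface) diff(range(surface))
c(narrow = spread(density_narrow), wide = spread(density_wide))
```

The narrow bandwidth produces a more variable surface from the same points, which is why peaks on an intensity map should be read relative to the chosen sigma.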
This section provides the main statistical evidence. The null hypothesis is CSR, while the alternative is spatial clustering.
The analysis uses kernel intensity, nearest neighbour distances, and Ripley’s L function. Together, they describe clustering locally and across larger distances.
Nearest neighbour distance measures how close each bank branch is to its closest other branch. Short distances indicate local concentration.
nnd_m <- spatstat.geom::nndist(bank_ppp)
nnd_summary <- data.frame(
statistic = c("Minimum", "First quartile", "Median", "Mean", "Third quartile", "Maximum"),
distance_m = as.numeric(summary(nnd_m)[c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")]),
stringsAsFactors = FALSE
)
knitr::kable(
nnd_summary,
digits = 1,
caption = "Nearest neighbour distance summary, in metres."
)
| statistic | distance_m |
|---|---|
| Minimum | 0.1 |
| First quartile | 3.0 |
| Median | 11.8 |
| Mean | 81.9 |
| Third quartile | 68.5 |
| Maximum | 6755.1 |
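Very short nearest neighbour distances can reflect genuine co-location, but in volunteered data they can also arise when one branch is mapped more than once. That possibility is not verified in this report. The sketch below shows one way to measure the share of points with a neighbour closer than a chosen threshold; the toy coordinates and the 5 metre threshold are assumptions.

```r
library(spatstat.geom)

# Toy pattern in metres: two near-coincident pairs and one isolated point.
toy_pattern <- ppp(
  x = c(0, 1, 500, 1000, 1002),
  y = c(0, 0, 500, 1000, 1000),
  window = owin(xrange = c(-10, 1100), yrange = c(-10, 1100))
)

# Distance from each point to its nearest other point.
toy_nnd <- nndist(toy_pattern)

# Share of points with a neighbour closer than the (assumed) 5 m threshold.
close_threshold_m <- 5
share_close <- mean(toy_nnd < close_threshold_m)
share_close
```

Applied to the observed pattern, a large share would suggest rerunning the analysis on a de-duplicated pattern as a sensitivity check.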
ggplot(data.frame(distance_m = nnd_m), aes(x = distance_m)) +
geom_histogram(bins = 35, fill = "#2b8cbe", color = "white") +
labs(
title = "Distribution of Nearest Neighbour Distances",
x = "Distance to nearest bank branch, metres",
y = "Number of branches"
)
nn_test <- spatstat.explore::clarkevans.test(bank_ppp, correction = "cdf")
nn_test
##
## Clark-Evans test
## CDF correction
## Z-test
##
## data: bank_ppp
## R = 0.19606, p-value < 2.2e-16
## alternative hypothesis: two-sided
The Clark-Evans test compares the observed mean nearest neighbour distance with the mean distance expected under CSR. The CDF correction is used because the New York City window is an irregular polygon. A ratio R below 1 together with a small p-value supports clustering; here R is about 0.196, far below 1.
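The ratio itself is simple to compute by hand. Under CSR with intensity lambda, the expected mean nearest neighbour distance is 1 / (2 * sqrt(lambda)), and R is the observed mean divided by that value. The sketch below checks this against the uncorrected statistic from spatstat on simulated data; the toy window, point count, and seed are assumptions, and the report itself uses the CDF-corrected version.

```r
library(spatstat.geom)
library(spatstat.random)
library(spatstat.explore)

set.seed(7)  # arbitrary seed for reproducibility of the sketch

# Hypothetical 10 km x 10 km window with 400 uniformly placed points.
toy_window <- owin(xrange = c(0, 10000), yrange = c(0, 10000))
toy_pattern <- runifpoint(n = 400, win = toy_window)

# Expected mean nearest neighbour distance under CSR.
lambda_hat <- npoints(toy_pattern) / area(toy_window)
expected_nnd <- 1 / (2 * sqrt(lambda_hat))

# Observed mean nearest neighbour distance and the naive ratio.
observed_nnd <- mean(nndist(toy_pattern))
r_manual <- observed_nnd / expected_nnd

# Uncorrected Clark-Evans index from spatstat, for comparison.
r_spatstat <- unname(clarkevans(toy_pattern, correction = "none"))
c(manual = r_manual, spatstat = r_spatstat)
```

For a pattern generated under CSR the ratio should be close to 1, apart from edge effects that the corrections are designed to reduce.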
Ripley’s K function, commonly transformed into the L function, evaluates clustering at multiple spatial scales. If the observed L curve lies above the CSR simulation envelope, the point pattern shows more clustering than expected under spatial randomness at that distance range.
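The transformation is L(r) = sqrt(K(r) / pi). Under CSR, K(r) = pi * r^2, so the theoretical L function is simply L(r) = r. The sketch below illustrates this on a simulated pattern; the toy window, point count, and seed are assumptions for illustration.

```r
library(spatstat.geom)
library(spatstat.random)
library(spatstat.explore)

set.seed(3)  # arbitrary seed for reproducibility of the sketch

# Hypothetical 10 km x 10 km window with 400 uniformly placed points.
toy_window <- owin(xrange = c(0, 10000), yrange = c(0, 10000))
toy_pattern <- runifpoint(n = 400, win = toy_window)

# Estimate L with the border edge correction.
l_hat <- Lest(toy_pattern, correction = "border")

# The theoretical CSR column equals r, because sqrt(pi * r^2 / pi) = r.
theo_matches_r <- isTRUE(all.equal(l_hat$theo, l_hat$r))

# Positive values of L(r) - r mean more neighbours within r than CSR predicts.
excess <- l_hat$border - l_hat$r
summary(excess)
```

For a clustered pattern the excess would be clearly positive over the distances at which clustering operates.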
max_radius <- min(10000, spatstat.geom::diameter.owin(nyc_window) / 4)
r_values <- seq(0, max_radius, length.out = 60)
if (file.exists(ripley_cache_path)) {
l_envelope <- readRDS(ripley_cache_path)
} else {
l_envelope <- spatstat.explore::envelope(
bank_ppp,
fun = spatstat.explore::Lest,
nsim = 29,
correction = "border",
r = r_values,
global = FALSE,
verbose = FALSE
)
saveRDS(l_envelope, ripley_cache_path)
}
plot(
l_envelope,
main = "Ripley's L Function With CSR Simulation Envelope",
xlab = "Distance r, metres"
)
The L function gives evidence about clustering at several distances. This report uses 29 CSR simulations (nsim = 29) and the border correction to keep the calculation practical; the result is cached after the first successful run.
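The number of simulations determines how strict a pointwise envelope is. For a two-sided pointwise envelope built from the minimum and maximum of the simulated curves, the spatstat documentation gives the significance level at any single fixed distance as 2 * nrank / (nsim + 1). The small helper below, whose name is hypothetical, evaluates this for the nsim = 29 used in the code above and for the commonly used nsim = 39.

```r
# Hypothetical helper: pointwise significance level of a two-sided simulation envelope.
envelope_alpha <- function(nsim, nrank = 1) {
  2 * nrank / (nsim + 1)
}

# Level implied by the nsim value in the envelope call of this report.
alpha_29 <- envelope_alpha(29)

# Level for the conventional choice that yields 5 percent.
alpha_39 <- envelope_alpha(39)

round(c(nsim_29 = alpha_29, nsim_39 = alpha_39), 4)
```

This level applies to one distance chosen in advance; reading the envelope across all distances at once is an informal diagnostic rather than a formal test.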
Taken together, the results answer the research question. A clustered nearest neighbour result and an L curve above the CSR envelope would indicate that bank branches are not randomly distributed.
The expected spatial pattern is not random. Bank branches serve residents, workers, firms, shoppers, and visitors, so they should concentrate in central and commercial areas.
If the statistical analysis identifies clustering, this should therefore be understood as the spatial expression of the economic mechanisms outlined above: accessibility, visibility, and proximity to dense economic activity.
This means clustering has both a statistical and economic meaning. It shows how financial services are tied to urban activity.
This project analyzed 2467 bank branch locations in New York City using OpenStreetMap and point pattern methods. The maps show a clear concentration in Manhattan, with smaller clusters in Brooklyn and Queens.
The statistical results support the same conclusion. The median nearest neighbour distance is only 11.8 metres, while the Clark-Evans test gives R = 0.196 with a p-value below 2.2e-16. Ripley’s L curve also lies above the CSR envelope across the plotted distances, which indicates clustering at several spatial scales.
The answer to the research question is therefore clear: bank branches in New York City are clustered rather than randomly distributed. This clustering matches the geography of demand for financial services, especially around major commercial and business areas.
Several limitations remain. OpenStreetMap is not an official registry, and the analysis does not include branch size, opening hours, customer volumes, or socioeconomic variables. Future work could compare bank locations with population, income, employment, subway stations, or commercial land use.
Baddeley, A., Rubak, E., & Turner, R. (2015). Spatial Point Patterns: Methodology and Applications with R. Chapman and Hall/CRC.
OpenStreetMap contributors. (2026). OpenStreetMap. https://www.openstreetmap.org/
O’Sullivan, D., & Unwin, D. J. (2010). Geographic Information Analysis. Wiley.
sessionInfo()
## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=Polish_Poland.utf8 LC_CTYPE=Polish_Poland.utf8
## [3] LC_MONETARY=Polish_Poland.utf8 LC_NUMERIC=C
## [5] LC_TIME=Polish_Poland.utf8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.51 spatstat.explore_3.8-0 nlme_3.1-168
## [4] spatstat.random_3.4-5 spatstat.geom_3.7-3 spatstat.univar_3.2-0
## [7] spatstat.data_3.1-9 ggplot2_4.0.3 dplyr_1.2.1
## [10] osmdata_0.3.0 sf_1.1-1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 class_7.3-23
## [4] tensor_1.5.1 KernSmooth_2.23-26 lattice_0.22-6
## [7] digest_0.6.39 magrittr_2.0.5 spatstat.utils_3.2-3
## [10] evaluate_1.0.5 grid_4.5.0 RColorBrewer_1.1-3
## [13] fastmap_1.2.0 jsonlite_2.0.0 Matrix_1.7-3
## [16] spatstat.sparse_3.1-0 e1071_1.7-17 DBI_1.3.0
## [19] scales_1.4.0 jquerylib_0.1.4 abind_1.4-8
## [22] cli_3.6.6 rlang_1.2.0 units_1.0-1
## [25] polyclip_1.10-7 withr_3.0.2 cachem_1.1.0
## [28] yaml_2.3.12 tools_4.5.0 deldir_2.0-4
## [31] vctrs_0.7.3 R6_2.6.1 proxy_0.4-29
## [34] lifecycle_1.0.5 classInt_0.4-11 pkgconfig_2.0.3
## [37] pillar_1.11.1 bslib_0.11.0 gtable_0.3.6
## [40] glue_1.8.1 Rcpp_1.1.1-1.1 xfun_0.57
## [43] tibble_3.3.1 tidyselect_1.2.1 goftest_1.2-3
## [46] farver_2.1.2 htmltools_0.5.9 labeling_0.4.3
## [49] rmarkdown_2.31 compiler_4.5.0 S7_0.2.2