All the libraries used in this essay are listed below, together with the code for loading the data and a glossary explaining the columns used in the analysis:
library(readr)
library(tidyverse)
library(readxl)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(osmdata)
library(showtext)
library(ggmap)
library(rvest)
library(knitr)
library(kableExtra)
library(lattice)
library(viridisLite)
library(ggpubr)
library(mclust)
library(dbscan)
setwd("G:/DataScience/I_semester/UnsupervisedLearning/ClusteringPaper")
NYPD_2020 <- read_csv("NYPD_2020.csv")
NYPD_data_dic <-
read_excel(
"NYPD_Complaint_YTD_DataDictionary.xlsx",
col_types = c("text", "text", "skip", "skip"),
skip = 1
)
For the visualisation of the topographic map of New York City, the OpenStreetMap API was used. This API allows us to download current map data from around the world, giving full control over which features to download and visualise. The approach to creating the topographic map of New York was adapted from two online tutorials, available at OpenStreetMap Tutorial 1 and OpenStreetMap Tutorial 2.
getbb("New York") %>%
opq() %>%
add_osm_feature(
key = "highway",
value = c("motorway", "primary", "motorway_link", "primary_link")
) %>%
osmdata_sf() -> big_streets
getbb("New York") %>%
opq() %>%
add_osm_feature(
key = "highway",
value = c("secondary", "tertiary", "secondary_link", "tertiary_link")
) %>%
osmdata_sf() -> med_streets
getbb("New York") %>%
opq() %>%
add_osm_feature(key = "waterway", value = "river") %>%
osmdata_sf() -> river
getbb("New York") %>%
opq() %>%
add_osm_feature(key = "railway", value = "rail") %>%
osmdata_sf() -> railway
showtext_auto()
NYPD_2020 %>%
ggplot(mapping = aes(x = Longitude, y = Latitude, color = BORO_NM)) +
geom_point() +
geom_sf(
data = river$osm_lines,
inherit.aes = FALSE,
color = "deepskyblue",
size = .8,
alpha = .3
) +
geom_sf(
data = railway$osm_lines,
inherit.aes = FALSE,
color = "#ffbe7f",
size = .2,
linetype = "dotdash",
alpha = .5
) +
geom_sf(
data = med_streets$osm_lines,
inherit.aes = FALSE,
color = "#ffbe7f",
size = .3,
alpha = .5
) +
geom_sf(
data = big_streets$osm_lines,
inherit.aes = FALSE,
color = "#ffbe7f",
size = .5,
alpha = .6
) +
coord_sf(
xlim = c(-74.25,-73.65),
ylim = c(40.47, 40.96),
expand = FALSE
) +
theme_classic() +
theme(
axis.title = element_text(colour = "white"),
axis.text = element_text(colour = "white"),
legend.background = element_rect(fill = "#282828"),
legend.text = element_text(colour = "white"),
legend.title = element_text(colour = "white"),
panel.background = element_rect(fill = "#282828"),
plot.title = element_text(
size = 20,
face = "bold",
hjust = .5,
color = "white"
),
plot.subtitle = element_text(
size = 8,
hjust = .5,
color = "white",
margin = margin(2, 0, 5, 0)
),
plot.background = element_rect(fill = "#282828")
) +
labs(title = "NEW YORK CITY",
subtitle = "40.43°N / 74.00°W",
color = 'Name of Borough')
A criminal offence is recognised by law as a prohibited act that is generally regarded as socially dangerous or harmful. The harm caused may affect not only one person, but also a group of people or society as a whole. Specific offences are fully defined, and a penalty is set for each situation under criminal law. According to the degree of damage to a person or to public property, three degrees of seriousness can be distinguished: felony, misdemeanor and violation.
The crime problem is not analysed only by the police or a country's security services. Many organisations and individual analysts want to understand the motives for crime and to find interesting relationships that could, in the future, contribute to the skilful prediction of potential crime.
The following study specifically aims to find suitable, mutually distinct sectors of the city in which potential sites for police stations could be designated as part of urban planning. The analysis uses a range of clustering algorithms, which by design try to minimise the distances of points from their cluster centre. Each cluster centre would be a police station, so that the police could arrive on site as quickly as possible when action is required.
Let us now turn to the literature for a deeper understanding of the problem and to explore crime analysis techniques.
In their study, Hajela, Chawla, and Rasool (2020) show how different types of crime can be effectively detected using clustering based on a hotspot identification approach. The study used crime data from the city of San Francisco, including the exact date of each crime, its category, the city district, and the exact location on the map (latitude and longitude). The authors divided their analysis into three phases: Crime Hotspot Identification, Crime History Dataset Preparation and Crime Prediction Approach. The Naive Bayes estimator and decision tree method, used together with KMeans clustering, produced very good results: the test observations were, for the most part, correctly classified into the appropriate hotspot region.
In their study, Ansari and Prakash (2018) conducted an analysis of spatio-temporal crime data for Montgomery County using KMeans and Fuzzy Clustering. The dataset contained information about the offenders, the address where the crime took place, the type and category of crime, and the exact location. From a preliminary analysis of the Sum of Squared Errors (SSE), they settled on 20 clusters. The authors showed that their proposed Fuzzy C-Means algorithm can separate the observations into the correct clusters by temporal and spatial variables with very high quality.
In Kiani, Mahdavi, and Keshavarzi (2015) we can read in detail about the authors' proposed ready-made framework for the analysis and prediction of potential crimes. The study used a set of police-recorded data for England and Wales covering the years 1990 to 2011. The authors performed thorough data cleaning, filled in missing data using an aggregation function, detected outliers, applied the KMeans algorithm, and classified test observations using decision trees. After appropriate optimisation, their framework achieved almost 92% accuracy. It is a useful article for structuring one's own analysis.
The last paper, Alves et al. (2015), takes a very different approach from the others in the literature. Using all reported homicides in all cities in Brazil, the authors analysed the dynamics of change in homicides per person, differentiated by space and time. They showed that the number of murders decreases exponentially with increasing distance from the city. On the other hand, over the years a significant increase in homicide coverage was observed not only in urban centres but also in villages adjacent to cities. They used a percolation-like analysis for spatial clustering. Thanks to their research, we can understand more precisely the dynamics of change in homicides per person.
The dataset to be analysed is available to everyone at the following address: NYPD Complaint Data. NYC Open Data is a site that aims to disseminate information about the City of New York in order to inform New Yorkers themselves. It covers all kinds of consumer and producer relations, the current economic situation of the city, reports on crime, the number of graduates, and applications for new jobs. As I have narrowed my analysis to crime in New York only, I will use only the "Manhattan" distance measure; the name itself is closely tied to this city, hence the decision.
The "Manhattan" distance is calculated using the following formula: \[d(X, Y) = \sum_{i=1}^{n}|x_{i} - y_{i}|\]
where \[X = (x_{1}, x_{2}, \ldots, x_{n}) \quad \textrm{and} \quad Y = (y_{1}, y_{2}, \ldots, y_{n})\]
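As a quick check, the same value can be obtained by hand and with base R's dist(); a minimal sketch on two made-up points:
x <- c(-73.95, 40.75)                     # made-up, longitude/latitude-like values
y <- c(-73.90, 40.70)
sum(abs(x - y))                           # Manhattan distance from the formula above
## [1] 0.1
dist(rbind(x, y), method = "manhattan")   # the same value via dist()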
The full dataset contains crime reports going back as far as the 1960s, right up to the present day. That is approximately 324,000 observations, which, if clustered all at once, could strongly affect the time efficiency of the algorithms. In the following study we have limited ourselves to observations from 2020 only, as this is the most recent complete data (data for 2021 is not yet fully collated). After extracting only the 2020 reports, approximately 7,000 observations were obtained. Each observation contains detailed information on when the crime was committed, the characteristics of the offender and the victim, the borough of New York City where the crime took place, and the exact latitude and longitude. It is also possible to read off whether the incident was a felony, a misdemeanor or merely a violation.
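For reference, the 2020 subset could have been produced along these lines; this is only a sketch, assuming the full extract is loaded as NYPD_full (a hypothetical name) and that CMPLNT_FR_DT is stored as month/day/year text:
NYPD_full %>%   # hypothetical data frame holding the complete extract
mutate(CMPLNT_FR_DT = as.Date(CMPLNT_FR_DT, format = "%m/%d/%Y")) %>%   # assumed date format
filter(format(CMPLNT_FR_DT, "%Y") == "2020") -> NYPD_2020
Details of the dataset are shown in the table below: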
NYPD_data_dic %>%
kable(caption = "Description of the variables") %>%
kable_styling(font_size = 12)
| Column Name | Column Description |
|---|---|
| CMPLNT_NUM | Randomly generated persistent ID for each complaint |
| ADDR_PCT_CD | The precinct in which the incident occurred |
| BORO | The name of the borough in which the incident occurred |
| CMPLNT_FR_DT | Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) |
| CMPLNT_FR_TM | Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) |
| CMPLNT_TO_DT | Ending date of occurrence for the reported event, if exact time of occurrence is unknown |
| CMPLNT_TO_TM | Ending time of occurrence for the reported event, if exact time of occurrence is unknown |
| CRM_ATPT_CPTD_CD | Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely |
| HADEVELOPT | Name of NYCHA housing development of occurrence, if applicable |
| HOUSING_PSA | Development Level Code |
| JURISDICTION_CODE | Jurisdiction responsible for incident. Either internal, like Police(0), Transit(1), and Housing(2); or external(3), like Correction, Port Authority, etc. |
| JURIS_DESC | Description of the jurisdiction code |
| KY_CD | Three digit offense classification code |
| LAW_CAT_CD | Level of offense: felony, misdemeanor, violation |
| LOC_OF_OCCUR_DESC | Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of |
| OFNS_DESC | Description of offense corresponding with key code |
| PARKS_NM | Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included) |
| PATROL_BORO | The name of the patrol borough in which the incident occurred |
| PD_CD | Three digit internal classification code (more granular than Key Code) |
| PD_DESC | Description of internal classification corresponding with PD code (more granular than Offense Description) |
| PREM_TYP_DESC | Specific description of premises; grocery store, residence, street, etc. |
| RPT_DT | Date event was reported to police |
| STATION_NAME | Transit station name |
| SUSP_AGE_GROUP | Suspect’s Age Group |
| SUSP_RACE | Suspect’s Race Description |
| SUSP_SEX | Suspect’s Sex Description |
| TRANSIT_DISTRICT | Transit district in which the offense occurred. |
| VIC_AGE_GROUP | Victim’s Age Group |
| VIC_RACE | Victim’s Race Description |
| VIC_SEX | Victim’s Sex Description |
| X_COORD_CD | X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |
| Y_COORD_CD | Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104) |
| Latitude | Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |
| Longitude | Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) |
The following map of crime incidence in New York City shows that crimes occur most frequently in the heart of Manhattan, with its district of skyscrapers housing well-known corporations, banks and opulent shops, and its cramped streets that can give rise to frequent violations of the law and of personal integrity. In addition, Brooklyn and Queens are worth mentioning as boroughs where crime is more frequent than on the outskirts of New York.
NYPD_2020 %>%
select(Latitude, Longitude) %>%
mutate(across(1:2, ~ round(.x, 2))) %>%
count(Latitude, Longitude) -> NYPD_heatmap
levelplot(
n ~ Longitude + Latitude ,
data = NYPD_heatmap,
col.regions = magma(16),
main = "Crime density map in NYC"
)
NYPD_2020 %>%
group_by(BORO_NM) %>%
summarise(n = n()) %>%
arrange(n) %>%
kable(caption = "Number of crimes in each borough") %>%
kable_styling(font_size = 12)
| BORO_NM | n |
|---|---|
| NA | 10 |
| STATEN ISLAND | 390 |
| BRONX | 1370 |
| MANHATTAN | 1566 |
| QUEENS | 1574 |
| BROOKLYN | 2054 |
The vast majority of crimes are committed by men, and women are the most frequent victims. The 10 most common crimes in 2020 include grand and petty larceny, harassment, fraud, assault, and sexual offences.
NYPD_2020 %>%
select(SUSP_SEX, VIC_SEX) %>%
filter((SUSP_SEX == "M" |
SUSP_SEX == "F") & (VIC_SEX == "M" | VIC_SEX == "F")) %>%
table() %>%
mosaicplot(
xlab = "Suspect's Sex",
ylab = "Victim's Sex",
main = "Dependencies between Suspect's and Victim's Sex",
color = c("#ce79f9", "#FDF7F7")
)
NYPD_2020 %>% count(OFNS_DESC) %>% arrange(desc(n)) %>% head(10) %>%
ggplot(mapping = aes(x = reorder(OFNS_DESC, n), y = n)) +
geom_bar(stat = 'identity', fill = '#ce79f9') +
coord_flip() +
ggtitle("TOP 10 highest types of crimes in New York City in 2020") +
ylab("Number of occurences") +
xlab("Type of crime")
This analysis does not aim to find the optimal number of clusters suggested by the Silhouette, WSS or Gap statistics. The key issue in this study is to look for potential micro-regions across the whole city for which it is worth designating a point (a police station). Through advanced clustering algorithms, and their basic principle of minimizing the distance between the cluster center and the observations, it is possible to deduce where police stations would be worth building. Let us now go one by one through all the methods used in the study to identify potential micro-regions.
The KMeans algorithm is one of the simplest and most efficient algorithms used for clustering observations. It tries to divide a group of N observations into k separate, non-overlapping clusters, with the aim of minimizing a criterion such as inertia, the within-cluster sum of squares. To use this algorithm, we only need to specify how many clusters we want to divide our observations into. It is not possible to determine unequivocally how many clusters the set should be divided into, but statistics such as Silhouette, WSS or Gap can suggest roughly how many clusters there should be.
The within-cluster sum of squares is calculated as follows: \[\sum_{i=1}^{n}\min_{\mu_{j} \in C}\left(\|x_{i} - \mu_{j}\|^{2}\right)\]
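To make the criterion concrete, here is a minimal sketch on synthetic data; after convergence, this quantity is essentially what kmeans() reports as tot.withinss:
set.seed(123)
toy <- matrix(rnorm(200), ncol = 2)            # 100 synthetic 2D points
km <- kmeans(toy, centers = 3)
# Within-cluster sum of squares computed by hand from the formula above
sum((toy - km$centers[km$cluster, ])^2)
km$tot.withinss                                # the same value, reported by kmeans()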
The PAM algorithm is very similar to KMeans, with one key change: where KMeans computes artificial points called centroids, PAM uses the most central points among the observations themselves. These are referred to hereafter as medoids. A second difference is that PAM typically uses the Manhattan distance metric rather than the Euclidean one.
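Both properties (the medoids and the choice of metric) are exposed directly by cluster::pam(); a minimal sketch on synthetic data:
set.seed(123)
toy <- matrix(rnorm(100), ncol = 2)            # 50 synthetic 2D points
pm <- pam(toy, k = 3, metric = "manhattan")    # medoids are actual rows of the data
pm$medoids                                     # coordinates of the three medoids
pm$id.med                                      # row indices of the medoids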
The CLARA (Clustering LARge Applications) algorithm is an extended version of PAM designed for larger numbers of observations (more than roughly 1,000). Like PAM, it centres clusters on medoids (actual observations), but it uses a sampling approach: it selects a subset of the whole set and runs PAM on it, determining the average dissimilarity between each object and its medoid. It then calculates a cost function, which is minimized over a number of iterations by repeated sampling.
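The sampling behaviour can be controlled explicitly in cluster::clara(); a sketch with arbitrarily chosen settings:
set.seed(123)
big_toy <- matrix(rnorm(10000), ncol = 2)      # 5,000 synthetic points
# clara() draws `samples` subsets of size `sampsize`, runs PAM on each and
# keeps the medoids giving the lowest average dissimilarity overall
cl <- clara(big_toy, k = 4, metric = "manhattan", samples = 50, sampsize = 200)
cl$medoids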
Model Based Clustering takes a completely different approach from all the previously mentioned methods. It assumes that the data come from a mixture of two or more clusters and, unlike KMeans, calculates for each observation the probability of belonging to a particular cluster. Each cluster is modelled as coming from a normal distribution with some mean vector and variance-covariance matrix. On the plot, elliptical isolines are drawn for each potential cluster, which can indicate possible clusters. From all the candidate covariance parameterisations it fits, Model Based Clustering selects the one that maximises the BIC statistic.
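A minimal sketch of this model selection on synthetic data (mclust is loaded above; the chosen parameterisation will of course vary with the data):
set.seed(123)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
matrix(rnorm(100, mean = 3), ncol = 2))        # two synthetic Gaussian groups
mb <- Mclust(toy, verbose = FALSE)
mb$modelName                                   # covariance parameterisation maximising BIC
mb$G                                           # number of clusters chosen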
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an algorithm able to separate clusters of arbitrary, visible shape from outliers and noise. DBSCAN works on the principle of human eyesight, which at first glance can recognize certain shapes depicted in a graph. The only inputs the user has to provide are the epsilon value eps, the radius of the neighbourhood around each point, and the minimum number of points minPts that such a neighbourhood must contain to form a cluster.
Sometimes it can be hard to figure out what the value of epsilon should be. Here the plot of the average distance to the k nearest neighbors comes in handy.
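As a quick illustration, a sketch of that diagnostic on synthetic data (kNNdistplot() from the dbscan package is also used later on the real data):
set.seed(123)
toy <- matrix(rnorm(600), ncol = 2)            # 300 synthetic 2D points
kNNdistplot(toy, k = 5)                        # sorted distances to the 5th nearest neighbour
abline(h = 0.5, lty = 2)                       # illustrative eps candidate near the "knee", not a rule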
Let’s move on to the analytical part and check how the clustering algorithms are doing for such observations.
Two visualizations of KMeans clustering with 10 clusters, produced by two different implementations, stats::kmeans and factoextra::eclust("kmeans"), are presented below:
NYPD_2020 %>%
select(Longitude, Latitude) %>%
kmeans(10) -> clust1
fviz_cluster(
clust1,
data = NYPD_2020[, c("Longitude", "Latitude")],
geom = c("point") ,
main = "Kmeans Clustering",
palette = "paired",
show.clust.cent = TRUE,
alpha = 0.2,
shape = 19,
ggtheme = theme_minimal()
)
NYPD_2020 %>%
select(Longitude, Latitude) %>%
eclust("kmeans",
hc_metric = "manhattan",
k = 10,
graph = FALSE) -> km1
fviz_cluster(
km1,
data = NYPD_2020[, c("Longitude", "Latitude")],
geom = c("point") ,
main = "Kmeans Clustering",
palette = "paired",
show.clust.cent = TRUE,
alpha = 0.2,
shape = 19,
ggtheme = theme_minimal()
)
At first glance there are differences in the clustering of the observations; this is due not to the two different implementations but to the different points chosen as starting points. The micro-regions proposed by the two runs differ significantly, yet looking at an actual map of New York there are no inaccuracies (no cluster crosses a river or a bridge to the other side of the city).
sil <-
silhouette(clust1$cluster, dist(NYPD_2020[, c("Longitude", "Latitude")]))
fviz_silhouette(sil, print.summary = FALSE)
silk <-
silhouette(km1$cluster, dist(NYPD_2020[, c("Longitude", "Latitude")]))
fviz_silhouette(silk, print.summary = FALSE)
Analysing the Silhouette plots for both implementations, it can be concluded that the observations are moderately well matched to their clusters. Very few observations have negative values, and the average Silhouette value across all 10 clusters is around 0.4.
NYPD_2020 %>%
select(Longitude, Latitude) %>%
fviz_nbclust(kmeans, method = "silhouette", k.max = 10)
NYPD_2020 %>%
select(Longitude, Latitude) %>%
fviz_nbclust(kmeans, method = "wss", k.max = 10)
NYPD_2020 %>%
select(Longitude, Latitude) %>%
fviz_nbclust(kmeans, method = "gap_stat", k.max = 8)
As a next step, it is worth checking whether the original choice of 10 clusters is appropriate and in line with the available Silhouette, WSS and Gap statistics. The Silhouette chart suggests only 4 clusters (as a reminder, we are analysing 5 boroughs here) with a value of 0.49, not much higher than the roughly 0.4 obtained for the adopted 10 clusters; and if there were only 4 police stations in New York, there could be many more crimes. The WSS and Gap statistics suggest only one cluster among the observations, which does not satisfy the goal of providing security across the whole city.
NYPD_2020 %>%
select(Longitude, Latitude) %>%
cclust(10, dist = "manhattan") %>%
stripes(main = "Stripes for k-means")
NYPD_2020 %>%
select(Longitude, Latitude) %>%
Optimal_Clusters_KMeans(max_clusters = 10, plot_clusters = TRUE)
## [1] 1.00000000 0.57809815 0.37646214 0.24608377 0.18235549 0.15820689
## [7] 0.12622774 0.11654129 0.09639013 0.08823139
## attr(,"class")
## [1] "k-means clustering"
NYPD_2020 %>%
select(Longitude, Latitude) %>%
Optimal_Clusters_KMeans(
max_clusters = 10,
plot_clusters = TRUE,
criterion = "silhouette"
)
## [1] 0.0000000 0.3905489 0.4209726 0.4927528 0.4619101 0.4262570 0.4207013
## [8] 0.3927703 0.4039236 0.3965628
## attr(,"class")
## [1] "k-means clustering"
At the very end, it is still worth looking at the stripes plot above, which shows how far the observations inside each cluster lie from their centroid. Most clusters are very similar and their spreads are comparable (observations within a cluster are, on average, the same distance from their centre).
NYPD_2020 %>%
select(Longitude, Latitude) %>%
pam(10) -> clust2
fviz_cluster(
clust2,
data = NYPD_2020,
geom = c("point") ,
main = "Pam Clustering",
palette = "paired",
show.clust.cent = TRUE,
alpha = 0.2,
shape = 19,
ggtheme = theme_minimal()
)
sil2 <-
silhouette(clust2$clustering, dist(NYPD_2020[, c("Longitude", "Latitude")]))
fviz_silhouette(sil2, print.summary = FALSE)
Clustering with the PAM algorithm also reveals potential micro-regions that would be suitable as control areas for individual police stations. The extent of each cluster would determine the region and operational range of a particular station.
xyplot(Longitude ~ Latitude | factor(clust2$clustering), data = NYPD_2020)
NYPD_2020 %>%
select(Longitude, Latitude) %>%
eclust("clara",
hc_metric = "manhattan",
k = 10,
graph = FALSE) -> clara
fviz_cluster(
clara,
data = NYPD_2020[, c("Longitude", "Latitude")],
geom = c("point") ,
main = "CLARA algorithms",
palette = "paired",
show.clust.cent = TRUE,
alpha = 0.2,
shape = 19,
ggtheme = theme_minimal()
)
sil3 <-
silhouette(clara$cluster, dist(NYPD_2020[, c("Longitude", "Latitude")]))
fviz_silhouette(sil3, print.summary = FALSE)
While the CLARA algorithm behaves much like PAM, it is worth noting that the average Silhouette value has deteriorated from 0.39 to 0.36, and the Silhouette plot itself shows many observations that do not fit their cluster.
NYPD_2020 %>%
select(Longitude, Latitude) %>%
ggscatter(x = "Longitude", y = "Latitude") +
geom_density2d()
NYPD_2020 %>%
select(Longitude, Latitude) %>%
Mclust(verbose = FALSE) -> ModelBased
fviz_mclust(ModelBased, "BIC", palette = "paired")
# Classification: plot showing the clustering
fviz_mclust(
ModelBased,
"classification",
geom = "point",
pointsize = 1.5,
palette = "paired"
)
# Classification uncertainty
fviz_mclust(ModelBased, "uncertainty", palette = "paired")
An interesting, different approach is that of Model Based Clustering. It differs from the outset in that it first reveals a plot of density isolines, determined by the mean vectors and variance-covariance matrices; in that figure roughly 7 clusters can be made out. After estimating all the models, it can be seen that the BIC is maximised for 9 clusters. The best model is VVV, which means that volume, shape and orientation are all allowed to vary across clusters.
The next two plots show the breakdown into the 9 clusters proposed by the model, while the uncertainty plot shows which points belong to a given cluster with high probability and which less so, or not at all. The smaller the point, the higher this probability; the larger the point, the lower it is.
NYPD_2020 %>%
select(Longitude, Latitude) %>%
dbscan(eps = 0.0098, minPts = 70) -> dbscanModel
NYPD_2020 %>%
select(Longitude, Latitude) %>%
fviz_cluster(
dbscanModel,
data = .,
stand = FALSE,
ellipse = FALSE,
show.clust.cent = FALSE,
geom = "point",
palette = "paired",
alpha = 0.2,
shape = 19,
ggtheme = theme_classic()
)
NYPD_2020 %>%
select(Longitude, Latitude) %>%
kNNdistplot(k = 5)
abline(h = 0.0098, lty = 2)
The worst performance came from the DBSCAN algorithm, as the presented dataset contains no distinct shapes recognisable to the human eye. What can be said in this method's favour is that it managed to capture the densest areas of New York: as mentioned before, the highest numbers of crimes were recorded in the boroughs of Manhattan, Brooklyn and Queens, and it is these boroughs that the DBSCAN algorithm singled out. Additionally, a plot of the distance to the nearest neighbours is presented; it is used to determine the epsilon value, whose best choice lies where the curve starts to grow very quickly.
A study was conducted to identify potential micro-regions whose designated cluster centres would be potential police stations. According to the analysis, Model Based Clustering performed best, followed by KMeans together with the PAM and CLARA algorithms, with the DBSCAN method in last place. Thanks to its unique approach, its assumption of normally distributed clusters and its estimation of membership probabilities, Model Based Clustering has the greatest potential for putting such an idea into practice.
Below we can see a computer-generated map of New York with the actual police stations marked; judging by their number, this dataset could well be clustered into at least 20 clusters.
showtext_auto()
police <- getbb("New York") %>%
opq() %>%
add_osm_feature(key = "amenity", value = "police") %>%
osmdata_sf()
ggplot() +
geom_sf(data = police$osm_points,
inherit.aes = FALSE,
color = "firebrick2") +
geom_sf(
data = big_streets$osm_lines,
inherit.aes = FALSE,
color = "#ffbe7f",
size = .5,
alpha = .6
) +
geom_sf(
data = med_streets$osm_lines,
inherit.aes = FALSE,
color = "#ffbe7f",
size = .3,
alpha = .5
) +
geom_sf(
data = river$osm_lines,
inherit.aes = FALSE,
color = "deepskyblue",
size = .8,
alpha = .3
) +
coord_sf(
xlim = c(-74.25,-73.65),
ylim = c(40.47, 40.96),
expand = FALSE
) +
theme_classic() +
theme(
axis.title = element_text(colour = "white"),
axis.text = element_text(colour = "white"),
panel.background = element_rect(fill = "#282828"),
plot.title = element_text(
size = 20,
face = "bold",
hjust = .5,
color = "white"
),
plot.subtitle = element_text(
size = 8,
hjust = .5,
color = "white",
margin = margin(2, 0, 5, 0)
),
plot.background = element_rect(fill = "#282828")
) +
labs(title = "POLICE STATION", subtitle = "40.43°N / 74.00°W")
OpenStreetMap Tutorial 1
OpenStreetMap Tutorial 2
DataNovia