Big data analytics is becoming more popular every year and is used in many fields of science. One of its main advantages is that it can evaluate huge quantities of data much faster than people can, and it can spot trends that humans would likely miss. From a crime-solving perspective, data analysis can therefore help catch criminals, prevent crime from occurring in the first place, and increase the security of entire cities.
Examples of big data analysis in forensics include license plate readers that record the plate number of every vehicle entering or leaving a given location. Another example is storing and analyzing a huge volume and variety of data in real time to spot patterns, especially in human interactions and behavior, and to predict which areas are at increased risk so that more police patrols can be directed there.
Interested in the big data applications mentioned above, I decided to explore them in more depth for a paper assigned during the Unsupervised Learning classes of my Master's in Data Science at the Faculty of Economics, University of Warsaw. For the analysis I used the San Francisco Police database, as the city is famous for being an attractive place for tourists and a popular migration destination, but also for high wealth inequality, a heavy burden on health care, and homelessness. For these reasons, I decided that exploring the San Francisco data could lead to interesting conclusions. The analyzed dataset contains information from 2018-2021 on registered crimes, divided into categories, place and time of occurrence, and resolution status.
I started by loading the necessary packages and the dataset, followed by a multi-stage preparation of the data for further analysis, such as removing records with a duplicated Incident.ID or empty values.
library(ggplot2)
library(ggthemes)
library(dplyr)
library(viridis)
library(tidyr)
library(cluster)
library(clusterSim)
library(ggmap)
library(maps)
library(knitr)
library(mice)
library(VIM)
library(corrplot)
library(GGally)
library(ggfortify)
library(factoextra)
library(gridExtra)
library(psych)
library(NbClust)
library(clustree)
SanFrancisco <- read.csv("~/Documents/UL/Projekt Clustering/sanfrancisco.csv", sep=";")
SanFrancisco <- subset(SanFrancisco, !duplicated(SanFrancisco$Incident.ID))
dim(SanFrancisco)
## [1] 346384 37
We can see that our dataset contains more than 300,000 observations and 37 variables. Next, I displayed the unique values of the year in which the crimes took place and chose the year with the most crimes for further analysis.
unique(SanFrancisco$Incident.Year)
## [1] 2018 2020 NA 2019 2021
year_group <- group_by(SanFrancisco, Incident.Year)
crime_by_year <- summarise(year_group,n = n())
crime_by_year
## # A tibble: 5 x 2
## Incident.Year n
## <int> <int>
## 1 2018 126807
## 2 2019 123142
## 3 2020 95781
## 4 2021 653
## 5 NA 1
As we can see, most of the crimes occurred in 2018, so this year will be analyzed further. The last line shows one missing value, which will be removed along with the columns that will not be used any further.
keeps <- c("Incident.Year", "Incident.Day.of.Week","Incident.Category","Incident.Subcategory","Analysis.Neighborhood","Incident.Time","Incident.Month","Latitude","Longitude")
SanFrancisco<-SanFrancisco[keeps]
SanFrancisco<-SanFrancisco[SanFrancisco$Incident.Year == 2018,]
SanFrancisco<-SanFrancisco[complete.cases(SanFrancisco), ]
First, it is important to deepen our knowledge of the dataset, and for this purpose several graphs are presented below. The first one shows which types of crime were reported and which occurred most frequently in 2018. The majority of reported incidents were larceny thefts, non-criminal incidents and assaults. The non-criminal category seems unusual, but as described on the San Francisco Police website, these incidents usually concern mental health detentions and stay-away or court orders.
indicator_group <- group_by(SanFrancisco, Incident.Category)
crime_by_indicator <- summarise(indicator_group, n=n())
crime_by_indicator <- crime_by_indicator[order(crime_by_indicator$n, decreasing = TRUE),]
crime_by_indicator <- crime_by_indicator[-c(46), ]
ggplot(aes(x = reorder(Incident.Category, n), y=n), data = crime_by_indicator) +
geom_bar(stat = 'identity', width = 0.7) +
geom_text(aes(label = n), stat = 'identity', data = crime_by_indicator, hjust = -0.1, size = 2.8) +
coord_flip() +
xlab('Crime types') +
ylab('Number of occurrences') +
ggtitle('Crime by types in San Francisco 2018') +
theme_bw() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"))
We can also create another plot to see into which subcategories the larceny thefts are divided. The most frequent theft subcategory is theft from vehicles.
larceny_theft <- SanFrancisco[SanFrancisco$Incident.Category == 'Larceny Theft', ]
larceny_theft_group <- group_by(larceny_theft, Incident.Subcategory)
larceny_theft_by_subcategory <- summarise(larceny_theft_group, n=n())
larceny_theft_by_subcategory <- larceny_theft_by_subcategory[order(larceny_theft_by_subcategory$n, decreasing = TRUE), ]
ggplot(aes(x = reorder(Incident.Subcategory, n), y = n), data = larceny_theft_by_subcategory) +
geom_bar(stat = 'identity', width = 0.6) +
geom_text(aes(label = n), stat = 'identity', data = larceny_theft_by_subcategory, hjust = -0.1, size = 3) +
coord_flip() +
xlab('Types of larceny thefts') +
ylab('Number of occurrences') +
ggtitle('Larceny theft crimes in San Francisco 2018') +
theme_bw() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"))
After having a look at the categories and structure of the crime cases, it is also worth investigating in which months the most crime occurs. As expected, incidents occur most often in the holiday months, when there are many tourists in the city and, consequently, many thefts.
month_group <- group_by(SanFrancisco, Incident.Month)
crime_month <- summarise(month_group, n=n())
crime_month$Incident.Month <-ordered(crime_month$Incident.Month,levels = c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'))
ggplot(aes(x=Incident.Month, y=n), data = crime_month) + geom_line(size = 1.5, alpha = 0.7, color = "black", group=1) +
geom_point(size = 0.5) +
ggtitle('Total Crimes by Month in San Francisco 2018') +
ylab('Number of Occurrences') +
xlab('Months') +
theme_bw() +
theme(plot.title = element_text(size = 14),
axis.title = element_text(size = 10, face = "bold"))
It also seems interesting to check, based on the registered incident time, at which hours the most crimes take place. In line with the previous observations, such results were expected: the plot below shows two roughly equal peaks, the first around 11 am and the second around 6 pm, hours at which the greatest number of tourists can be expected around the city.
hour_group <- group_by(SanFrancisco, Incident.Time)
crime_hour <- summarise(hour_group, n=n())
ggplot(aes(x=Incident.Time, y=n), data = crime_hour) + geom_line(size = 1.5, alpha = 0.7, color = "black", group=1) +
geom_point(size = 0.5) +
ggtitle('Total Crimes by Hour of Day in San Francisco 2018') +
ylab('Number of Occurrences') +
xlab('Hour(24-hour clock)') +
theme_bw() +
theme(plot.title = element_text(size = 14),
axis.title = element_text(size = 10, face = "bold"))
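As a side note, if Incident.Time in this export is stored as an "HH:MM" character string (an assumption on my part), the grouping above produces one point per recorded minute; a short sketch of aggregating to whole hours first could look like this:
# assumption: Incident.Time is an "HH:MM" character field in this export
SanFrancisco$Incident.Hour <- substr(SanFrancisco$Incident.Time, 1, 2)  # keep only the hour
crime_hour_agg <- summarise(group_by(SanFrancisco, Incident.Hour), n = n())
head(crime_hour_agg)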
The last variable worth analyzing is the neighbourhood in which the crime incidents were reported. The most dangerous areas are listed below.
location_group <- group_by(SanFrancisco, Analysis.Neighborhood)
crime_by_location <- summarise(location_group, n=n())
crime_by_location <- crime_by_location[order(crime_by_location$n, decreasing = TRUE), ]
crime_by_location_top10 <- head(crime_by_location, 10)
ggplot(aes(x = reorder(Analysis.Neighborhood, n), y = n), data = crime_by_location_top10) +
geom_bar(stat = 'identity', width = 0.6) +
geom_text(aes(label = n), stat = 'identity', data = crime_by_location_top10, hjust = -0.1, size = 3) +
coord_flip() +
xlab('Neighbourhoods') +
ylab('Number of Occurrences') +
ggtitle('Neighbourhoods with Most Crimes - Top 10') +
theme_bw() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"))
The following districts, in turn, are the safest.
tail(crime_by_location, 5)
## # A tibble: 5 x 2
## Analysis.Neighborhood n
## <chr> <int>
## 1 Treasure Island 414
## 2 Presidio 236
## 3 Lincoln Park 132
## 4 Seacliff 130
## 5 McLaren Park 91
The spatial distribution behind both of the above charts is also reflected on the map below.
lat <- SanFrancisco$Latitude
lon <- SanFrancisco$Longitude
crimes <- SanFrancisco$Incident.Category
to_map <- data.frame(crimes, lat, lon)
colnames(to_map) <- c('crimes', 'lat', 'lon')
sbbox <- make_bbox(lon = SanFrancisco$Longitude, lat = SanFrancisco$Latitude, f = 0.01)
my_map <- get_map(location = sbbox, maptype = "roadmap", scale = 2, color="bw", zoom = 11)
ggmap(my_map) +
geom_point(data=to_map, aes(x = lon, y = lat, color = "black"),
size = 0.5, alpha = 0.03) +
xlab('Longitude') +
ylab('Latitude') +
ggtitle('Location of crimes in SanFrancisco 2018') +
guides(color=FALSE)
To prepare the data for further analysis, it has to be grouped and reshaped into wide form, the qualitative neighbourhood column and the unnamed category column have to be dropped, and the missing values (neighbourhood-category combinations with no incidents) have to be replaced with zeros. In the next step the data must be scaled so that the variables are comparable.
by_groups <- group_by(SanFrancisco, Incident.Category, Analysis.Neighborhood)
groups <- summarise(by_groups, n=n())
groups <- groups[c("Analysis.Neighborhood", "Incident.Category", "n")]
groups_wide <- spread(groups, key = Incident.Category, value = n)
kable(groups_wide[1:10,1:10])
| Analysis.Neighborhood | V1 | Arson | Assault | Burglary | Case Closure | Civil Sidewalks | Courtesy Report | Disorderly Conduct | Drug Offense |
|---|---|---|---|---|---|---|---|---|---|
| Bayview Hunters Point | 2 | 43 | 699 | 377 | 1 | NA | 8 | 117 | 48 |
| Bernal Heights | NA | 5 | 147 | 120 | NA | NA | 1 | 26 | 14 |
| Castro/Upper Market | NA | 19 | 296 | 252 | 1 | 115 | NA | 45 | 52 |
| Chinatown | NA | 7 | 145 | 101 | NA | NA | 53 | 28 | 22 |
| Excelsior | NA | 4 | 163 | 133 | 2 | NA | 3 | 35 | 32 |
| Financial District/South Beach | NA | 25 | 664 | 832 | 8 | 1 | 4 | 136 | 70 |
| Glen Park | NA | NA | 36 | 57 | NA | NA | NA | 8 | 3 |
| Golden Gate Park | NA | 11 | 56 | 43 | NA | 5 | 4 | 6 | 14 |
| Haight Ashbury | 2 | 2 | 120 | 103 | 1 | 260 | 1 | 17 | 34 |
| Hayes Valley | NA | 8 | 135 | 203 | 2 | 6 | 2 | 17 | 18 |
z <- groups_wide[, -c(1, 2)]   # drop the neighbourhood name and the unnamed category column
z[is.na(z)] <- 0               # combinations with no incidents become zero counts
m <- apply(z, 2, mean)
s <- apply(z, 2, sd)
z <- scale(z, m, s)            # standardize every crime category (mean 0, sd 1)
As indicated before, over 50 types of crime were registered in the San Francisco Police's database, which complicates further analyses. Therefore, I decided to check whether there are correlations between the variables; as you can see in the plot below, many of them are strongly correlated. For this reason, it seems reasonable to reduce the dimensionality using a statistical method such as PCA (Principal Component Analysis) to create a simpler representation of the data.
corr_z = cor(z,method='pearson')
corrplot(corr_z, tl.cex=0.6)
First, I used the prcomp() function to compute the PCA and obtain the basic statistics. One common approach is to select the number of components that together explain approximately 90% of the variance. Based on this rule, we should choose 8 components.
z.pca1<-prcomp(z, center=TRUE, scale.=TRUE)
summary(z.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 5.4286 1.88162 1.74203 1.58567 1.53029 1.46005 1.07361
## Proportion of Variance 0.5894 0.07081 0.06069 0.05029 0.04684 0.04264 0.02305
## Cumulative Proportion 0.5894 0.66021 0.72090 0.77119 0.81802 0.86066 0.88371
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.03978 0.88939 0.86238 0.82317 0.7071 0.61621 0.55492
## Proportion of Variance 0.02162 0.01582 0.01487 0.01355 0.0100 0.00759 0.00616
## Cumulative Proportion 0.90533 0.92115 0.93603 0.94958 0.9596 0.96718 0.97333
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.54474 0.43716 0.3937 0.37471 0.35994 0.27978 0.26089
## Proportion of Variance 0.00593 0.00382 0.0031 0.00281 0.00259 0.00157 0.00136
## Cumulative Proportion 0.97927 0.98309 0.9862 0.98900 0.99159 0.99316 0.99452
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.23059 0.20148 0.19064 0.16649 0.15653 0.15141 0.13002
## Proportion of Variance 0.00106 0.00081 0.00073 0.00055 0.00049 0.00046 0.00034
## Cumulative Proportion 0.99558 0.99639 0.99712 0.99767 0.99816 0.99862 0.99896
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.11139 0.10235 0.09200 0.08274 0.06765 0.05547 0.05118
## Proportion of Variance 0.00025 0.00021 0.00017 0.00014 0.00009 0.00006 0.00005
## Cumulative Proportion 0.99921 0.99942 0.99959 0.99972 0.99982 0.99988 0.99993
## PC36 PC37 PC38 PC39 PC40 PC41
## Standard deviation 0.04142 0.02853 0.02143 0.01745 0.01477 4.015e-16
## Proportion of Variance 0.00003 0.00002 0.00001 0.00001 0.00000 0.000e+00
## Cumulative Proportion 0.99996 0.99998 0.99999 1.00000 1.00000 1.000e+00
a<-summary(z.pca1)
plot(a$importance[3,],type="l")
abline(v=8,col="blue")
abline(h= 0.90533,col="red")
To choose the number of components we can also use the Kaiser rule, which says that we should keep the components with eigenvalues greater than 1. According to this rule, we should again take 8 components for further analysis. The visualizations below (scree plots) show the eigenvalues and the percentage of variance explained by each principal component.
fviz_eig(z.pca1, choice='eigenvalue', addlabels=TRUE)
fviz_eig(z.pca1, addlabels=TRUE)
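The Kaiser rule can also be checked numerically from the prcomp object created above (a short sketch; the eigenvalues are simply the squared standard deviations reported by prcomp):
eigenvalues <- z.pca1$sdev^2   # eigenvalues of the correlation matrix
sum(eigenvalues > 1)           # number of components satisfying the Kaiser rule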
Positively correlated variables point to the same side of the plot. The presentation of the variables on the chart below confirms the earlier conclusions about strong correlations: all of them are located in the second and third quadrants of the variable correlation plot. This seems quite understandable, as the first group contains more serious crimes involving weapons, kidnapping and servitude, while the second group, located in the third quadrant, includes incidents related to gambling, thefts and family affairs.
fviz_pca_var(z.pca1,col.var="steelblue", repel=TRUE)
The bar plots below show the contribution of the ten most important variables to each of the eight selected dimensions, with the expected average contribution of about 2% marked with the red line.
fviz_contrib(z.pca1, choice = "var", axes =1 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =2 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =3 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =4 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =5 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =6 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =7 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =8 , top = 10)
Further analysis will be performed on the data transformed with PCA, keeping eight principal components (rotated with varimax via the principal() function from the psych package). The output below shows, for each component, only the loadings whose absolute value exceeds the 0.4 cutoff.
z.pca2<-principal(z, nfactors=8, rotate="varimax")
print(loadings(z.pca2), digits=8, cutoff=0.4, sort=TRUE)
##
## Loadings:
## RC5 RC1
## Assault 0.66594528 0.50866770
## Drug Offense 0.96314309
## Forgery And Counterfeiting 0.52303496
## Homicide 0.79978419
## Non-Criminal 0.72447477
## Offences Against The Family And Children 0.70441703 0.55806196
## Other 0.87835650
## Other Miscellaneous 0.58973741 0.45581080
## Other Offenses 0.67532827 0.48061861
## Robbery 0.63395095 0.42819047
## Stolen Property 0.60793643
## Suspicious Occ 0.63116680 0.50949626
## Warrant 0.81718010
## Weapons Carrying Etc 0.64750811 0.60083549
## Arson 0.52439290 0.69566320
## Disorderly Conduct 0.53106292 0.53696974
## Family Offense 0.93161654
## Fire Report 0.80772651
## Gambling 0.59735269
## Miscellaneous Investigation 0.75847790
## Motor Vehicle Theft 0.80094989
## Recovered Vehicle 0.82207675
## Traffic Collision 0.59313790
## Vehicle Impounded 0.83927502
## Weapons Offense 0.52816014 0.71311483
## Burglary 0.40011208
## Case Closure
## Embezzlement 0.48779802
## Fraud 0.41096591
## Human Trafficking (A), Commercial Sex Acts
## Larceny Theft
## Lost Property 0.52696166
## Malicious Mischief 0.42263711 0.53664302
## Weapons Offence 0.61600722
## Liquor Laws
## Prostitution
## Traffic Violation Arrest 0.43909659
## Vandalism 0.47612815
## Human Trafficking (B), Involuntary Servitude
## Rape
## Sex Offense
## Vehicle Misplaced
## Drug Violation
## Human Trafficking, Commercial Sex Acts
## Suspicious
## Courtesy Report
## Missing Person 0.48306421
## Civil Sidewalks
## Motor Vehicle Theft? 0.48496570
## Suicide 0.49545106
## RC2 RC4
## Assault
## Drug Offense
## Forgery And Counterfeiting 0.46590251 0.49337676
## Homicide
## Non-Criminal 0.40908011
## Offences Against The Family And Children
## Other
## Other Miscellaneous 0.49131423
## Other Offenses
## Robbery 0.54641309
## Stolen Property 0.52499688
## Suspicious Occ 0.45838187
## Warrant
## Weapons Carrying Etc
## Arson
## Disorderly Conduct 0.50227284
## Family Offense
## Fire Report
## Gambling
## Miscellaneous Investigation
## Motor Vehicle Theft
## Recovered Vehicle
## Traffic Collision
## Vehicle Impounded
## Weapons Offense
## Burglary 0.76908454
## Case Closure 0.52328982
## Embezzlement 0.73967010
## Fraud 0.71947899
## Human Trafficking (A), Commercial Sex Acts 0.91389926
## Larceny Theft 0.75458813
## Lost Property 0.76284916
## Malicious Mischief 0.53828967
## Weapons Offence 0.68762354
## Liquor Laws 0.91213644
## Prostitution 0.88928818
## Traffic Violation Arrest 0.71355739
## Vandalism 0.56046281
## Human Trafficking (B), Involuntary Servitude
## Rape
## Sex Offense
## Vehicle Misplaced
## Drug Violation
## Human Trafficking, Commercial Sex Acts 0.47729583
## Suspicious
## Courtesy Report
## Missing Person
## Civil Sidewalks
## Motor Vehicle Theft?
## Suicide 0.43635760
## RC3 RC6
## Assault
## Drug Offense
## Forgery And Counterfeiting
## Homicide
## Non-Criminal
## Offences Against The Family And Children
## Other
## Other Miscellaneous
## Other Offenses
## Robbery
## Stolen Property
## Suspicious Occ
## Warrant
## Weapons Carrying Etc
## Arson
## Disorderly Conduct
## Family Offense
## Fire Report
## Gambling
## Miscellaneous Investigation
## Motor Vehicle Theft
## Recovered Vehicle
## Traffic Collision
## Vehicle Impounded
## Weapons Offense
## Burglary
## Case Closure 0.47032124
## Embezzlement
## Fraud
## Human Trafficking (A), Commercial Sex Acts
## Larceny Theft
## Lost Property
## Malicious Mischief
## Weapons Offence
## Liquor Laws
## Prostitution
## Traffic Violation Arrest
## Vandalism
## Human Trafficking (B), Involuntary Servitude 0.95888233
## Rape 0.65361654
## Sex Offense 0.91018119
## Vehicle Misplaced 0.55692003
## Drug Violation 0.96999442
## Human Trafficking, Commercial Sex Acts 0.49513934 0.55814254
## Suspicious 0.90807109
## Courtesy Report 0.46136403
## Missing Person
## Civil Sidewalks
## Motor Vehicle Theft?
## Suicide
## RC8 RC7
## Assault
## Drug Offense
## Forgery And Counterfeiting
## Homicide
## Non-Criminal
## Offences Against The Family And Children
## Other
## Other Miscellaneous
## Other Offenses
## Robbery
## Stolen Property
## Suspicious Occ
## Warrant
## Weapons Carrying Etc
## Arson
## Disorderly Conduct
## Family Offense
## Fire Report
## Gambling
## Miscellaneous Investigation
## Motor Vehicle Theft
## Recovered Vehicle
## Traffic Collision
## Vehicle Impounded
## Weapons Offense
## Burglary
## Case Closure
## Embezzlement
## Fraud
## Human Trafficking (A), Commercial Sex Acts
## Larceny Theft
## Lost Property
## Malicious Mischief
## Weapons Offence
## Liquor Laws
## Prostitution
## Traffic Violation Arrest
## Vandalism
## Human Trafficking (B), Involuntary Servitude
## Rape
## Sex Offense
## Vehicle Misplaced 0.51373551
## Drug Violation
## Human Trafficking, Commercial Sex Acts
## Suspicious
## Courtesy Report 0.53219304
## Missing Person 0.54757277
## Civil Sidewalks 0.97165841
## Motor Vehicle Theft?
## Suicide
##
## RC5 RC1 RC2 RC4 RC3 RC6
## SS loadings 11.3789378 10.5697421 7.9769317 5.7276928 4.41204710 2.52853456
## Proportion Var 0.2275788 0.2113948 0.1595386 0.1145539 0.08824094 0.05057069
## Cumulative Var 0.2275788 0.4389736 0.5985122 0.7130661 0.80130703 0.85187772
## RC8 RC7
## SS loadings 1.46952576 1.20329706
## Proportion Var 0.02939052 0.02406594
## Cumulative Var 0.88126824 0.90533418
The next part of the work concerns clustering, which is the process of grouping a set of data points with similar characteristics that have not been explicitly labeled. Objects in the same group (called a cluster) are more similar to each other than to objects in other groups (clusters). Cluster analysis can be performed with numerous algorithms; in this case, for our PCA results, I will use probably the best-known one, K-Means, together with a cluster dendrogram for hierarchical clustering.
After selecting the algorithm, we should understand how it works. K-Means starts with a randomly selected set of centroids and then performs iterative calculations to optimize their positions, minimizing the Euclidean distance between each point and its group center, until the centroids barely change between iterations, i.e. they have stabilized. It is therefore important to indicate the optimal number of clusters, which can be determined using the methods presented further below; the vast majority of them indicate 2 as the optimal number of clusters.
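To make the mechanics concrete, here is a minimal toy sketch of the two alternating K-Means steps (illustration only, run on randomly generated data rather than the project data; the actual clustering later uses the built-in kmeans() function):
set.seed(123)
X <- matrix(rnorm(200), ncol = 2)     # toy data standing in for the PCA scores
k <- 2
centroids <- X[sample(nrow(X), k), ]  # start from k randomly chosen points
for (iter in 1:100) {
  # assignment step: each point joins its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  # update step: each centroid moves to the mean of the points assigned to it
  new_centroids <- t(sapply(1:k, function(j) colMeans(X[assignment == j, , drop = FALSE])))
  # stop once the centroids barely move (the sketch ignores the empty-cluster edge case)
  if (max(abs(new_centroids - centroids)) < 1e-8) break
  centroids <- new_centroids
}
table(assignment)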
The sum of squares method involves minimizing the within-cluster sum of squares (a measure of how tight each cluster is) and maximizing the between-cluster sum of squares (a measure of how well separated the clusters are). As we can see in the plot below, a strong elbow is visible, indicating that the optimal number of clusters should not be greater than 2.
wss <- (nrow(z.pca2[["scores"]])-1) * sum(apply(z.pca2[["scores"]], 2, var))
for (i in 2:20) wss[i] <- sum(kmeans(z.pca2[["scores"]], centers=i)$withinss)
plot(1:20, wss, type='b', xlab='Number of Clusters', ylab='Within groups sum of squares')
Another method computes the average silhouette of the observations for different values of k. The higher the value, the better, so in this case we should again choose 2 clusters.
fviz_nbclust(z.pca2[["scores"]], FUNcluster = kmeans)
The gap statistic indicates the optimal number of clusters by comparing the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The value of k that maximizes the gap statistic estimates the optimal number of clusters. In contrast to the previously mentioned methods, this one indicates that 5 clusters would be optimal.
fviz_nbclust(z.pca2[["scores"]], FUNcluster=kmeans, method="gap_stat")+ theme_classic()
According to the package description, NbClust provides 30 indices for determining the appropriate number of clusters and suggests to the user the best clustering scheme based on the results obtained by varying all combinations of cluster numbers, distance measures and clustering methods. Based on the obtained results, the optimal number of clusters equals 2.
res.nbclust <- NbClust(z.pca2[["scores"]], distance = "euclidean",
min.nc = 2, max.nc = 9,
method = "complete", index ="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 4 proposed 9 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
factoextra::fviz_nbclust(res.nbclust) + theme_minimal() + ggtitle("NbClust's optimal number of clusters")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 9 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 4 proposed 9 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
The statistical methods above each return a single result and consider only one set of clusters at a time. The clustree R package takes an alternative approach and examines how samples change group membership as the number of clusters increases. This is useful for showing which clusters are distinct and which are unstable. It does not explicitly indicate the optimal number of clusters, but it is useful for exploring the possible choices.
tmp <- list()
for (k in 1:11){
  # keep only the cluster assignments for each value of k
  tmp[[k]] <- kmeans(z.pca2[["scores"]], k, nstart = 30)$cluster
}
df <- data.frame(tmp)
colnames(df) <- seq(1:11)
colnames(df) <- paste0("k",colnames(df))
df.pca <- prcomp(df, center = TRUE, scale. = FALSE)
ind.coord <- df.pca$x
ind.coord <- ind.coord[,1:2]
df <- bind_cols(as.data.frame(df), as.data.frame(ind.coord))
clustree(df, prefix = "k")
Based on the previously obtained results, I chose 2 clusters for further analysis. As we can see, the first cluster contains 36 districts while the second has only 5 neighbourhoods. Looking at the cluster means, negative values are found only in the first cluster; values below zero mean "lower than average", so cluster 1 contains neighbourhoods with a low crime rate. The positive values are grouped in the second cluster and, analogously, mean "higher than average", so this cluster contains the neighbourhoods with a high crime rate. The clustering vector shows that the 1st, 6th, 19th, 34th and 36th districts belong to the second cluster, while the others belong to the first.
kc<- kmeans(z.pca2[["scores"]],2)
kc
## K-means clustering with 2 clusters of sizes 36, 5
##
## Cluster means:
## RC5 RC1 RC2 RC4 RC3 RC6 RC8
## 1 -5.798587 -5.445599 -4.50944 -3.243109 -2.452153 -0.6533268 -1.012384
## 2 41.749825 39.208310 32.46797 23.350383 17.655505 4.7039526 7.289162
## RC7
## 1 -0.05291218
## 2 0.38096772
##
## Clustering vector:
## [1] 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1
## [39] 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 3082.020 3288.727
## (between_SS / total_SS = 82.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
z1 <- data.frame(z.pca2[["scores"]], kc$cluster)
clusplot(z1, kc$cluster, color=TRUE, shade=F, labels=0, lines=0, main='k-Means Cluster Analysis')
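To attach neighbourhood names to those indices, the cluster vector can be mapped back onto the neighbourhood column (a short sketch; it relies on the rows of groups_wide being in the same order as the clustered scores, which holds here because z was built directly from it):
# list the neighbourhoods belonging to each of the two clusters
split(as.character(groups_wide$Analysis.Neighborhood), kc$cluster)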
For the hierarchical clustering I used a dendrogram to gain some insight into the cluster solution. Looking from the top, we can see two distinct groups: the first one contains the large majority of the neighbourhoods, while the other consists of only a few neighbourhoods, which are the most dangerous districts.
z2 <- data.frame(z.pca2[["scores"]])
distance <- dist(z2)
hc <- hclust(distance)
plot(hc, labels = groups_wide$Analysis.Neighborhood, main='Cluster Dendrogram', cex=0.65)
Additionally, we can plot the silhouette and see that the vast majority of neighbourhoods belong to the first cluster.
plot(silhouette(cutree(hc, 2), distance))
To sum up, the popular unsupervised learning methods used here, K-Means for clustering, dendrogram plots for hierarchical clustering and PCA for dimensionality reduction, do seem to group similar items together, so we obtained the expected results. Additionally, for a better understanding of the dataset, I also used the geographic locations and dates to explore its spatial and temporal dimensions.