Big data analytics is becoming more popular every year and is used in many fields of science. One of its main advantages is that it can evaluate huge quantities of data much faster than people can, and it can spot trends that humans would likely miss. From a crime-solving perspective, data analysis can therefore help catch criminals, prevent crime from occurring in the first place, and increase the security of entire cities.
Examples of big data analysis in forensics include license plate readers that record the plate number of every vehicle entering or leaving a given location. Another example is storing and analyzing a huge volume and variety of data in real time to spot patterns, especially in human interactions and behavior, and to predict which areas are at increased risk so that more police patrols can be directed there.
Interested in the big data applications mentioned above, I decided to explore them in more depth for a paper assigned during the Unsupervised Learning classes of my Master's in Data Science at the Faculty of Economics, University of Warsaw. For the analysis I used the San Francisco Police database, as the city is famous for being an attractive place for tourists and a popular migration destination, but also for high wealth inequality, a heavy burden on health care, and homelessness. For these reasons, I decided that exploring the San Francisco data could lead to interesting conclusions. The analyzed dataset contains information from 2018-2021 on registered crimes, divided into categories, place and time of occurrence, and resolution status.
I started by loading the necessary packages and the dataset, followed by a multi-stage preparation of the data for further analysis, such as removing records with a duplicated Incident.ID or empty values.
library(ggplot2)
library(ggthemes)
library(dplyr)
library(viridis)
library(tidyr)
library(cluster)
library(clusterSim)
library(ggmap)
library(maps)
library(knitr)
library(mice)
library(VIM)
library(corrplot)
library(GGally)
library(ggfortify)
library(factoextra)
library(gridExtra)
library(psych)
library(NbClust)
library(clustree)
SanFrancisco <- read.csv("~/Documents/UL/Projekt Clustering/sanfrancisco.csv", sep=";")
SanFrancisco <- subset(SanFrancisco, !duplicated(SanFrancisco$Incident.ID))
dim(SanFrancisco)
## [1] 346384 37
We can see that our dataset contains more than 300,000 observations and 37 variables. Next, I displayed the unique values of the year in which the crimes took place and chose the year with the most crimes for further analysis.
unique(SanFrancisco$Incident.Year)
## [1] 2018 2020 NA 2019 2021
year_group <- group_by(SanFrancisco, Incident.Year)
crime_by_year <- summarise(year_group,n = n())
crime_by_year
## # A tibble: 5 x 2
## Incident.Year n
## <int> <int>
## 1 2018 126807
## 2 2019 123142
## 3 2020 95781
## 4 2021 653
## 5 NA 1
As we can see, most of the crimes occurred in 2018, so this year will be analyzed further. The last line shows one missing value, which will be removed along with the columns that will not be used any further.
keeps <- c("Incident.Year", "Incident.Day.of.Week","Incident.Category","Incident.Subcategory","Analysis.Neighborhood","Incident.Time","Incident.Month","Latitude","Longitude")
SanFrancisco<-SanFrancisco[keeps]
SanFrancisco<-SanFrancisco[SanFrancisco$Incident.Year == 2018,]
SanFrancisco<-SanFrancisco[complete.cases(SanFrancisco), ]
First, it is important to deepen our knowledge of the dataset, and for this purpose several graphs are presented below. The first one shows which types of crime were reported and which occurred most frequently in 2018. The majority of reported incidents were larceny thefts, non-criminal incidents and assaults. The non-criminal category seems unusual, but as described on the San Francisco Police website, these incidents usually concern mental health detentions and stay-away or court orders.
indicator_group <- group_by(SanFrancisco, Incident.Category)
crime_by_indicator <- summarise(indicator_group, n=n())
crime_by_indicator <- crime_by_indicator[order(crime_by_indicator$n, decreasing = TRUE),]
crime_by_indicator <- crime_by_indicator[-c(46), ]
ggplot(aes(x = reorder(Incident.Category, n), y=n), data = crime_by_indicator) +
geom_bar(stat = 'identity', width = 0.7) +
geom_text(aes(label = n), stat = 'identity', data = crime_by_indicator, hjust = -0.1, size = 2.8) +
coord_flip() +
xlab('Crime types') +
ylab('Number of occurrences') +
ggtitle('Crime by types in San Francisco 2018') +
theme_bw() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"))
We can also create another plot to see into which subcategories the larceny thefts are divided. The most frequent theft subcategory is theft from vehicles.
larceny_theft <- SanFrancisco[SanFrancisco$Incident.Category == 'Larceny Theft', ]
larceny_theft_group <- group_by(larceny_theft, Incident.Subcategory)
larceny_theft_by_subcategory <- summarise(larceny_theft_group, n=n())
larceny_theft_by_subcategory <- larceny_theft_by_subcategory[order(larceny_theft_by_subcategory$n, decreasing = TRUE), ]
ggplot(aes(x = reorder(Incident.Subcategory, n), y = n), data = larceny_theft_by_subcategory) +
geom_bar(stat = 'identity', width = 0.6) +
geom_text(aes(label = n), stat = 'identity', data = larceny_theft_by_subcategory, hjust = -0.1, size = 3) +
coord_flip() +
xlab('Types of larceny thefts') +
ylab('Number of occurrences') +
ggtitle('Larceny theft crimes in San Francisco 2018') +
theme_bw() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"))
After having a look at the categories and structure of the crime cases, it is also worth investigating in which months the most crime occurs. As expected, incidents occur most often in the holiday months, when there are many tourists in the city and, consequently, many thefts.
month_group <- group_by(SanFrancisco, Incident.Month)
crime_month <- summarise(month_group, n=n())
crime_month$Incident.Month <-ordered(crime_month$Incident.Month,levels = c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'))
ggplot(aes(x=Incident.Month, y=n), data = crime_month) + geom_line(size = 1.5, alpha = 0.7, color = "black", group=1) +
geom_point(size = 0.5) +
ggtitle('Total Crimes by Month in San Francisco 2018') +
ylab('Number of Occurrences') +
xlab('Months') +
theme_bw() +
theme(plot.title = element_text(size = 14),
axis.title = element_text(size = 10, face = "bold"))
It also seems interesting to check, based on the registered incident time, at which hours the most crimes take place. In line with the previous observations, such results were expected: the plot below shows two roughly equal peaks, the first around 11 am and the second around 6 pm, hours at which the greatest number of tourists can be expected around the city.
hour_group <- group_by(SanFrancisco, Incident.Time)
crime_hour <- summarise(hour_group, n=n())
ggplot(aes(x=Incident.Time, y=n), data = crime_hour) + geom_line(size = 1.5, alpha = 0.7, color = "black", group=1) +
geom_point(size = 0.5) +
ggtitle('Total Crimes by Hour of Day in San Francisco 2018') +
ylab('Number of Occurrences') +
xlab('Hour(24-hour clock)') +
theme_bw() +
theme(plot.title = element_text(size = 14),
axis.title = element_text(size = 10, face = "bold"))
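As a side note, if Incident.Time in this export is stored as an "HH:MM" character string (an assumption on my part), the grouping above produces one point per recorded minute; a short sketch of aggregating to whole hours first could look like this:
# assumption: Incident.Time is an "HH:MM" character field in this export
SanFrancisco$Incident.Hour <- substr(SanFrancisco$Incident.Time, 1, 2)  # keep only the hour
crime_hour_agg <- summarise(group_by(SanFrancisco, Incident.Hour), n = n())
head(crime_hour_agg)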
The last variable worth analyzing is the neighbourhood in which the crime incidents were reported. The most dangerous areas are listed below.
location_group <- group_by(SanFrancisco, Analysis.Neighborhood)
crime_by_location <- summarise(location_group, n=n())
crime_by_location <- crime_by_location[order(crime_by_location$n, decreasing = TRUE), ]
crime_by_location_top10 <- head(crime_by_location, 10)
ggplot(aes(x = reorder(Analysis.Neighborhood, n), y = n), data = crime_by_location_top10) +
geom_bar(stat = 'identity', width = 0.6) +
geom_text(aes(label = n), stat = 'identity', data = crime_by_location_top10, hjust = -0.1, size = 3) +
coord_flip() +
xlab('Neighbourhoods') +
ylab('Number of Occurrences') +
ggtitle('Neighbourhoods with Most Crimes - Top 10') +
theme_bw() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"))
The following districts, in turn, are the safest.
tail(crime_by_location, 5)
## # A tibble: 5 x 2
## Analysis.Neighborhood n
## <chr> <int>
## 1 Treasure Island 414
## 2 Presidio 236
## 3 Lincoln Park 132
## 4 Seacliff 130
## 5 McLaren Park 91
The spatial distribution behind both of the above charts is also reflected on the map below.
lat <- SanFrancisco$Latitude
lon <- SanFrancisco$Longitude
crimes <- SanFrancisco$Incident.Category
to_map <- data.frame(crimes, lat, lon)
colnames(to_map) <- c('crimes', 'lat', 'lon')
sbbox <- make_bbox(lon = SanFrancisco$Longitude, lat = SanFrancisco$Latitude, f = 0.01)
my_map <- get_map(location = sbbox, maptype = "roadmap", scale = 2, color="bw", zoom = 11)
ggmap(my_map) +
geom_point(data=to_map, aes(x = lon, y = lat, color = "black"),
size = 0.5, alpha = 0.03) +
xlab('Longitude') +
ylab('Latitude') +
ggtitle('Location of crimes in SanFrancisco 2018') +
guides(color=FALSE)
To prepare the data for further analysis, it has to be grouped and reshaped into wide form, the qualitative neighbourhood column and the unnamed category column have to be dropped, and the missing values (neighbourhood-category combinations with no incidents) have to be replaced with zeros. In the next step the data must be scaled so that the variables are comparable.
by_groups <- group_by(SanFrancisco, Incident.Category, Analysis.Neighborhood)
groups <- summarise(by_groups, n=n())
groups <- groups[c("Analysis.Neighborhood", "Incident.Category", "n")]
groups_wide <- spread(groups, key = Incident.Category, value = n)
kable(groups_wide[1:10,1:10])
| Analysis.Neighborhood | V1 | Arson | Assault | Burglary | Case Closure | Civil Sidewalks | Courtesy Report | Disorderly Conduct | Drug Offense |
|---|---|---|---|---|---|---|---|---|---|
| Bayview Hunters Point | 2 | 43 | 699 | 377 | 1 | NA | 8 | 117 | 48 |
| Bernal Heights | NA | 5 | 147 | 120 | NA | NA | 1 | 26 | 14 |
| Castro/Upper Market | NA | 19 | 296 | 252 | 1 | 115 | NA | 45 | 52 |
| Chinatown | NA | 7 | 145 | 101 | NA | NA | 53 | 28 | 22 |
| Excelsior | NA | 4 | 163 | 133 | 2 | NA | 3 | 35 | 32 |
| Financial District/South Beach | NA | 25 | 664 | 832 | 8 | 1 | 4 | 136 | 70 |
| Glen Park | NA | NA | 36 | 57 | NA | NA | NA | 8 | 3 |
| Golden Gate Park | NA | 11 | 56 | 43 | NA | 5 | 4 | 6 | 14 |
| Haight Ashbury | 2 | 2 | 120 | 103 | 1 | 260 | 1 | 17 | 34 |
| Hayes Valley | NA | 8 | 135 | 203 | 2 | 6 | 2 | 17 | 18 |
z <- groups_wide[, -c(1, 2)]   # drop the neighbourhood name and the unnamed category column
z[is.na(z)] <- 0               # combinations with no incidents become zero counts
m <- apply(z, 2, mean)
s <- apply(z, 2, sd)
z <- scale(z, m, s)            # standardize every crime category (mean 0, sd 1)
As indicated before, over 50 types of crime were registered in the San Francisco Police's database, which complicates further analyses. Therefore, I decided to check whether there are correlations between the variables; as you can see in the plot below, many of them are strongly correlated. For this reason, it seems reasonable to reduce the dimensionality using a statistical method such as PCA (Principal Component Analysis) to create a simpler representation of the data.
corr_z = cor(z,method='pearson')
corrplot(corr_z, tl.cex=0.6)
First, I used the prcomp() function to compute the PCA and obtain the basic statistics. One common approach is to select the number of components that together explain approximately 90% of the variance. Based on this rule, we should choose 8 components.
z.pca1<-prcomp(z, center=TRUE, scale.=TRUE)
summary(z.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 5.4286 1.88162 1.74203 1.58567 1.53029 1.46005 1.07361
## Proportion of Variance 0.5894 0.07081 0.06069 0.05029 0.04684 0.04264 0.02305
## Cumulative Proportion 0.5894 0.66021 0.72090 0.77119 0.81802 0.86066 0.88371
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.03978 0.88939 0.86238 0.82317 0.7071 0.61621 0.55492
## Proportion of Variance 0.02162 0.01582 0.01487 0.01355 0.0100 0.00759 0.00616
## Cumulative Proportion 0.90533 0.92115 0.93603 0.94958 0.9596 0.96718 0.97333
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.54474 0.43716 0.3937 0.37471 0.35994 0.27978 0.26089
## Proportion of Variance 0.00593 0.00382 0.0031 0.00281 0.00259 0.00157 0.00136
## Cumulative Proportion 0.97927 0.98309 0.9862 0.98900 0.99159 0.99316 0.99452
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.23059 0.20148 0.19064 0.16649 0.15653 0.15141 0.13002
## Proportion of Variance 0.00106 0.00081 0.00073 0.00055 0.00049 0.00046 0.00034
## Cumulative Proportion 0.99558 0.99639 0.99712 0.99767 0.99816 0.99862 0.99896
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.11139 0.10235 0.09200 0.08274 0.06765 0.05547 0.05118
## Proportion of Variance 0.00025 0.00021 0.00017 0.00014 0.00009 0.00006 0.00005
## Cumulative Proportion 0.99921 0.99942 0.99959 0.99972 0.99982 0.99988 0.99993
## PC36 PC37 PC38 PC39 PC40 PC41
## Standard deviation 0.04142 0.02853 0.02143 0.01745 0.01477 4.015e-16
## Proportion of Variance 0.00003 0.00002 0.00001 0.00001 0.00000 0.000e+00
## Cumulative Proportion 0.99996 0.99998 0.99999 1.00000 1.00000 1.000e+00
a<-summary(z.pca1)
plot(a$importance[3,],type="l")
abline(v=8,col="blue")
abline(h= 0.90533,col="red")
To choose the number of components we can also use the Kaiser rule, which says that we should keep the components with eigenvalues greater than 1. According to this rule, we should again take 8 components for further analysis. The visualizations below (scree plots) show the eigenvalues and the percentage of variance explained by each principal component.
fviz_eig(z.pca1, choice='eigenvalue', addlabels=TRUE)
fviz_eig(z.pca1, addlabels=TRUE)
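The Kaiser rule can also be checked numerically from the prcomp object created above (a short sketch; the eigenvalues are simply the squared standard deviations reported by prcomp):
eigenvalues <- z.pca1$sdev^2   # eigenvalues of the correlation matrix
sum(eigenvalues > 1)           # number of components satisfying the Kaiser rule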
Positively correlated variables point to the same side of the plot. The presentation of the variables on the chart below confirms the earlier conclusions about strong correlations: all of them are located in the second and third quadrants of the variable correlation plot. This seems quite understandable, as the first group contains more serious crimes involving weapons, kidnapping and servitude, while the second group, located in the third quadrant, includes incidents related to gambling, thefts and family affairs.
fviz_pca_var(z.pca1,col.var="steelblue", repel=TRUE)
The bar plots below show the contribution of the ten most important variables to each of the eight selected dimensions, with the expected average contribution of about 2% marked with the red line.
fviz_contrib(z.pca1, choice = "var", axes =1 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =2 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =3 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =4 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =5 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =6 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =7 , top = 10)
fviz_contrib(z.pca1, choice = "var", axes =8 , top = 10)
Further analysis will be performed on the data transformed with PCA, keeping eight principal components (rotated with varimax via the principal() function from the psych package). The output below shows, for each component, only the loadings whose absolute value exceeds the 0.4 cutoff.
z.pca2<-principal(z, nfactors=8, rotate="varimax")
print(loadings(z.pca2), digits=8, cutoff=0.4, sort=TRUE)
##
## Loadings:
## RC5 RC1
## Assault 0.66594528 0.50866770
## Drug Offense 0.96314309
## Forgery And Counterfeiting 0.52303496
## Homicide 0.79978419
## Non-Criminal 0.72447477
## Offences Against The Family And Children 0.70441703 0.55806196
## Other 0.87835650
## Other Miscellaneous 0.58973741 0.45581080
## Other Offenses 0.67532827 0.48061861
## Robbery 0.63395095 0.42819047
## Stolen Property 0.60793643
## Suspicious Occ 0.63116680 0.50949626
## Warrant 0.81718010
## Weapons Carrying Etc 0.64750811 0.60083549
## Arson 0.52439290 0.69566320
## Disorderly Conduct 0.53106292 0.53696974
## Family Offense 0.93161654
## Fire Report 0.80772651
## Gambling 0.59735269
## Miscellaneous Investigation 0.75847790
## Motor Vehicle Theft 0.80094989
## Recovered Vehicle 0.82207675
## Traffic Collision 0.59313790
## Vehicle Impounded 0.83927502
## Weapons Offense 0.52816014 0.71311483
## Burglary 0.40011208
## Case Closure
## Embezzlement 0.48779802
## Fraud 0.41096591
## Human Trafficking (A), Commercial Sex Acts
## Larceny Theft
## Lost Property 0.52696166
## Malicious Mischief 0.42263711 0.53664302
## Weapons Offence 0.61600722
## Liquor Laws
## Prostitution
## Traffic Violation Arrest 0.43909659
## Vandalism 0.47612815
## Human Trafficking (B), Involuntary Servitude
## Rape
## Sex Offense
## Vehicle Misplaced
## Drug Violation
## Human Trafficking, Commercial Sex Acts
## Suspicious
## Courtesy Report
## Missing Person 0.48306421
## Civil Sidewalks
## Motor Vehicle Theft? 0.48496570
## Suicide 0.49545106
## RC2 RC4
## Assault
## Drug Offense
## Forgery And Counterfeiting 0.46590251 0.49337676
## Homicide
## Non-Criminal 0.40908011
## Offences Against The Family And Children
## Other
## Other Miscellaneous 0.49131423
## Other Offenses
## Robbery 0.54641309
## Stolen Property 0.52499688
## Suspicious Occ 0.45838187
## Warrant
## Weapons Carrying Etc
## Arson
## Disorderly Conduct 0.50227284
## Family Offense
## Fire Report
## Gambling
## Miscellaneous Investigation
## Motor Vehicle Theft
## Recovered Vehicle
## Traffic Collision
## Vehicle Impounded
## Weapons Offense
## Burglary 0.76908454
## Case Closure 0.52328982
## Embezzlement 0.73967010
## Fraud 0.71947899
## Human Trafficking (A), Commercial Sex Acts 0.91389926
## Larceny Theft 0.75458813
## Lost Property 0.76284916
## Malicious Mischief 0.53828967
## Weapons Offence 0.68762354
## Liquor Laws 0.91213644
## Prostitution 0.88928818
## Traffic Violation Arrest 0.71355739
## Vandalism 0.56046281
## Human Trafficking (B), Involuntary Servitude
## Rape
## Sex Offense
## Vehicle Misplaced
## Drug Violation
## Human Trafficking, Commercial Sex Acts 0.47729583
## Suspicious
## Courtesy Report
## Missing Person
## Civil Sidewalks
## Motor Vehicle Theft?
## Suicide 0.43635760
## RC3 RC6
## Assault
## Drug Offense
## Forgery And Counterfeiting
## Homicide
## Non-Criminal
## Offences Against The Family And Children
## Other
## Other Miscellaneous
## Other Offenses
## Robbery
## Stolen Property
## Suspicious Occ
## Warrant
## Weapons Carrying Etc
## Arson
## Disorderly Conduct
## Family Offense
## Fire Report
## Gambling
## Miscellaneous Investigation
## Motor Vehicle Theft
## Recovered Vehicle
## Traffic Collision
## Vehicle Impounded
## Weapons Offense
## Burglary
## Case Closure 0.47032124
## Embezzlement
## Fraud
## Human Trafficking (A), Commercial Sex Acts
## Larceny Theft
## Lost Property
## Malicious Mischief
## Weapons Offence
## Liquor Laws
## Prostitution
## Traffic Violation Arrest
## Vandalism
## Human Trafficking (B), Involuntary Servitude 0.95888233
## Rape 0.65361654
## Sex Offense 0.91018119
## Vehicle Misplaced 0.55692003
## Drug Violation 0.96999442
## Human Trafficking, Commercial Sex Acts 0.49513934 0.55814254
## Suspicious 0.90807109
## Courtesy Report 0.46136403
## Missing Person
## Civil Sidewalks
## Motor Vehicle Theft?
## Suicide
## RC8 RC7
## Assault
## Drug Offense
## Forgery And Counterfeiting
## Homicide
## Non-Criminal
## Offences Against The Family And Children
## Other
## Other Miscellaneous
## Other Offenses
## Robbery
## Stolen Property
## Suspicious Occ
## Warrant
## Weapons Carrying Etc
## Arson
## Disorderly Conduct
## Family Offense
## Fire Report
## Gambling
## Miscellaneous Investigation
## Motor Vehicle Theft
## Recovered Vehicle
## Traffic Collision
## Vehicle Impounded
## Weapons Offense
## Burglary
## Case Closure
## Embezzlement
## Fraud
## Human Trafficking (A), Commercial Sex Acts
## Larceny Theft
## Lost Property
## Malicious Mischief
## Weapons Offence
## Liquor Laws
## Prostitution
## Traffic Violation Arrest
## Vandalism
## Human Trafficking (B), Involuntary Servitude
## Rape
## Sex Offense
## Vehicle Misplaced 0.51373551
## Drug Violation
## Human Trafficking, Commercial Sex Acts
## Suspicious
## Courtesy Report 0.53219304
## Missing Person 0.54757277
## Civil Sidewalks 0.97165841
## Motor Vehicle Theft?
## Suicide
##
## RC5 RC1 RC2 RC4 RC3 RC6
## SS loadings 11.3789378 10.5697421 7.9769317 5.7276928 4.41204710 2.52853456
## Proportion Var 0.2275788 0.2113948 0.1595386 0.1145539 0.08824094 0.05057069
## Cumulative Var 0.2275788 0.4389736 0.5985122 0.7130661 0.80130703 0.85187772
## RC8 RC7
## SS loadings 1.46952576 1.20329706
## Proportion Var 0.02939052 0.02406594
## Cumulative Var 0.88126824 0.90533418
The next part of the work concerns clustering, which is the process of grouping a set of data points with similar characteristics that have not been explicitly labeled. Objects in the same group (called a cluster) are more similar to each other than to objects in other groups (clusters). Cluster analysis can be performed with numerous algorithms; in this case, for our PCA results, I will use probably the best-known one, K-Means, together with a cluster dendrogram for hierarchical clustering.
After selecting the algorithm, we should understand how it works. K-Means starts with a randomly selected set of centroids and then performs iterative calculations to optimize their positions, minimizing the Euclidean distance between each point and its group center, until the centroids barely change between iterations, i.e. they have stabilized. It is therefore important to indicate the optimal number of clusters, which can be determined using the methods presented further below; the vast majority of them indicate 2 as the optimal number of clusters.
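To make the mechanics concrete, here is a minimal toy sketch of the two alternating K-Means steps (illustration only, run on randomly generated data rather than the project data; the actual clustering later uses the built-in kmeans() function):
set.seed(123)
X <- matrix(rnorm(200), ncol = 2)     # toy data standing in for the PCA scores
k <- 2
centroids <- X[sample(nrow(X), k), ]  # start from k randomly chosen points
for (iter in 1:100) {
  # assignment step: each point joins its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  # update step: each centroid moves to the mean of the points assigned to it
  new_centroids <- t(sapply(1:k, function(j) colMeans(X[assignment == j, , drop = FALSE])))
  # stop once the centroids barely move (the sketch ignores the empty-cluster edge case)
  if (max(abs(new_centroids - centroids)) < 1e-8) break
  centroids <- new_centroids
}
table(assignment)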
The sum of squares method involves minimizing the within-cluster sum of squares (a measure of how tight each cluster is) and maximizing the between-cluster sum of squares (a measure of how well separated the clusters are). As we can see in the plot below, a strong elbow is visible, indicating that the optimal number of clusters should not be greater than 2.
wss <- (nrow(z.pca2[["scores"]])-1) * sum(apply(z.pca2[["scores"]], 2, var))
for (i in 2:20) wss[i] <- sum(kmeans(z.pca2[["scores"]], centers=i)$withinss)
plot(1:20, wss, type='b', xlab='Number of Clusters', ylab='Within groups sum of squares')
Another method computes the average silhouette of the observations for different values of k. The higher the value, the better, so in this case we should again choose 2 clusters.
fviz_nbclust(z.pca2[["scores"]], FUNcluster = kmeans)
The gap statistic indicates the optimal number of clusters by comparing the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The value of k that maximizes the gap statistic estimates the optimal number of clusters. In contrast to the previously mentioned methods, this one indicates that 5 clusters would be optimal.
fviz_nbclust(z.pca2[["scores"]], FUNcluster=kmeans, method="gap_stat")+ theme_classic()
According to the package description, NbClust provides 30 indices for determining the appropriate number of clusters and suggests to the user the best clustering scheme based on the results obtained by varying all combinations of cluster numbers, distance measures and clustering methods. Based on the obtained results, the optimal number of clusters equals 2.
res.nbclust <- NbClust(z.pca2[["scores"]], distance = "euclidean",
min.nc = 2, max.nc = 9,
method = "complete", index ="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 4 proposed 9 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
factoextra::fviz_nbclust(res.nbclust) + theme_minimal() + ggtitle("NbClust's optimal number of clusters")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 9 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 4 proposed 9 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
The statistical methods above each return a single result and consider only one set of clusters at a time. The clustree R package takes an alternative approach and examines how samples change group membership as the number of clusters increases. This is useful for showing which clusters are distinct and which are unstable. It does not explicitly indicate the optimal number of clusters, but it is useful for exploring the possible choices.
tmp <- list()
for (k in 1:11){
  # keep only the cluster assignments for each value of k
  tmp[[k]] <- kmeans(z.pca2[["scores"]], k, nstart = 30)$cluster
}
df <- data.frame(tmp)
colnames(df) <- seq(1:11)
colnames(df) <- paste0("k",colnames(df))
df.pca <- prcomp(df, center = TRUE, scale. = FALSE)
ind.coord <- df.pca$x
ind.coord <- ind.coord[,1:2]
df <- bind_cols(as.data.frame(df), as.data.frame(ind.coord))
clustree(df, prefix = "k")
Based on the previously obtained results, I chose 2 clusters for further analysis. As we can see, the first cluster contains 36 districts while the second has only 5 neighbourhoods. Looking at the cluster means, negative values are found only in the first cluster; values below zero mean "lower than average", so cluster 1 contains neighbourhoods with a low crime rate. The positive values are grouped in the second cluster and, analogously, mean "higher than average", so this cluster contains the neighbourhoods with a high crime rate. The clustering vector shows that the 1st, 6th, 19th, 34th and 36th districts belong to the second cluster, while the others belong to the first.
kc<- kmeans(z.pca2[["scores"]],2)
kc
## K-means clustering with 2 clusters of sizes 36, 5
##
## Cluster means:
## RC5 RC1 RC2 RC4 RC3 RC6 RC8
## 1 -5.798587 -5.445599 -4.50944 -3.243109 -2.452153 -0.6533268 -1.012384
## 2 41.749825 39.208310 32.46797 23.350383 17.655505 4.7039526 7.289162
## RC7
## 1 -0.05291218
## 2 0.38096772
##
## Clustering vector:
## [1] 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1
## [39] 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 3082.020 3288.727
## (between_SS / total_SS = 82.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
z1 <- data.frame(z.pca2[["scores"]], kc$cluster)
clusplot(z1, kc$cluster, color=TRUE, shade=F, labels=0, lines=0, main='k-Means Cluster Analysis')
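To attach neighbourhood names to those indices, the cluster vector can be mapped back onto the neighbourhood column (a short sketch; it relies on the rows of groups_wide being in the same order as the clustered scores, which holds here because z was built directly from it):
# list the neighbourhoods belonging to each of the two clusters
split(as.character(groups_wide$Analysis.Neighborhood), kc$cluster)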
For the hierarchical clustering I used a dendrogram to gain some insight into the cluster solution. Looking from the top, we can see two distinct groups: the first one contains the large majority of the neighbourhoods, while the other consists of only a few neighbourhoods, which are the most dangerous districts.
z2 <- data.frame(z.pca2[["scores"]])
distance <- dist(z2)
hc <- hclust(distance)
plot(hc, labels = groups_wide$Analysis.Neighborhood, main='Cluster Dendrogram', cex=0.65)
Additionally, we can plot the silhouette and see that the vast majority of neighbourhoods belong to the first cluster.
plot(silhouette(cutree(hc, 2), distance))
To sum up, the popular unsupervised learning methods used here, K-Means for clustering, dendrogram plots for hierarchical clustering and PCA for dimensionality reduction, do seem to group similar items together, so we obtained the expected results. Additionally, for a better understanding of the dataset, I also used the geographic locations and dates to explore its spatial and temporal dimensions.