This project is dedicated to the Crime in Baltimore data, which provides ongoing information on crimes committed in Baltimore City. The dataset is accessible online as part of the Open Baltimore project (https://data.baltimorecity.gov/).
The dataset covers crimes committed in the city in the years 2016-2020. The report consists of 6 parts:
The following libraries were used during the analysis:
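The exact list is a plausible reconstruction, inferred from the functions called throughout the report (the author's actual set may differ):
library(data.table)   # data[, .(counts = .N), .(...)] aggregation
library(dplyr)        # pipes, group_by(), summarize(), mutate()
library(tidyr)        # replace_na()
library(ggplot2)      # all of the plots
library(lubridate)    # year(), month()
library(hms)          # as_hms()
library(cluster)      # clara()
library(factoextra)   # fviz_cluster(), fviz_silhouette()
library(fpc)          # calinhara(), dudahart2()
library(ClusterR)     # Optimal_Clusters_KMeans()
library(formattable)  # percent() (also called as formattable::percent)
library(hopkins)      # hopkins() (also called as hopkins::hopkins)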
# let's manage the date and time
data$date <- as.Date(substr(data$CrimeDateTime, 1, 10))
data$time <- as_hms(substr(data$CrimeDateTime, 12, 19))
# for additional analysis purposes - let's extract year and month from date
data$year <- year(data$date)
data$month <- month(data$date)
#let's additionally create a column flagging whether a crime was committed at night (we assume night = 11pm - 4am)
data$is_night <- ifelse(substr(data$time,1,2) %in% c('23','00','01','02','03'),'crime at night','crime during the day')
#the last datetime-derived column - weekdays, to analyze how crimes are distributed across the week:
data$weekday <- weekdays(data$date)
data$weekday <- factor(data$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
#finally let's adjust colnames to make the coding more convenient
colnames(data) <- c('date_time','crime','weapon','district','lat','lon','date','time',
'year','month','is_night','weekday')
#let's read the data as data table
data <- data.table(data)
Originally, the dataset consists of 19 columns and 266 414 observations. Among the features, I selected 6 that are interesting from an exploratory point of view, and added features derived from date and time, so eventually there are 12 features in the final dataset:
Let’s take a look at the dataset.
# inspect the dataset
head(data)
## date_time crime weapon district lat lon
## 1: 2020/12/31 16:04:00+00 LARCENY <NA> NORTHWEST 39.3493 -76.6905
## 2: 2020/12/31 22:30:00+00 COMMON ASSAULT <NA> NORTHEAST 39.3353 -76.5372
## 3: 2020/12/31 14:30:00+00 LARCENY FROM AUTO <NA> NORTHWEST 39.3452 -76.6750
## 4: 2020/12/31 04:03:00+00 BURGLARY <NA> EASTERN 39.2989 -76.5828
## 5: 2020/12/31 13:00:00+00 BURGLARY <NA> SOUTHWEST 39.3114 -76.6724
## 6: 2020/12/31 09:00:00+00 BURGLARY <NA> WESTERN 39.3095 -76.6528
## date time year month is_night weekday
## 1: 2020-12-31 16:04:00 2020 12 crime during the day Thursday
## 2: 2020-12-31 22:30:00 2020 12 crime during the day Thursday
## 3: 2020-12-31 14:30:00 2020 12 crime during the day Thursday
## 4: 2020-12-31 04:03:00 2020 12 crime during the day Thursday
## 5: 2020-12-31 13:00:00 2020 12 crime during the day Thursday
## 6: 2020-12-31 09:00:00 2020 12 crime during the day Thursday
#str(data)
#describe(data)
#summary(data)
#table(data$crime)
#table(data$weapon)
#table(data$district)
#table(data$year)
#table(data$month)
#table(data$is_night)
Although the data is reported officially for Baltimore City, there are some issues concerning missing data. One column, weapon, has numerous rows with NA: in 179969 rows there is no information about the weapon wielded by the criminal, and they constitute 77% of the whole dataset. I therefore treat the missing weapon information as meaning that (officially) no weapon was used during the crime. At the same time, this column is treated as an addition to the analysis, and this assumption might not be valid. Keeping this in mind, the NAs in the weapon column are recoded into 'no weapon' values.
The second issue concerns rows with missing or empty records for latitude/longitude/district. These are removed from the dataset. Finally, there are 232 203 rows ready for the analysis (552 rows with missing data removed).
#do we have any missing values?
#sum(is.na(data))
#sum(is.na(data$weapon))
#sum(is.na(data$district))
#sum(is.na(data$lat))
#sum(is.na(data$lon))
#sum(is.na(data$date_time))
#recode NA in weapon column into 'no weapon'
data <- data %>% replace_na(list(weapon = 'NO WEAPON'))
#remove data with NAs/ empty rows for cols district,lat,lon
#nrow(data[data$lat=='' | data$lon=='',])
data <- data[data$lat!=0 & data$lon!=0 & data$lat!='' & data$lon!='' & data$district!='',]
#table(data$district)
#sum(is.na(data))
First of all, let's take a look at the dataset presented as a scatterplot, which effectively draws a map out of the data points.
#Let's present the data as a scatterplot
ggplot(data, aes(lon, lat, color = crime)) +
geom_point(show.legend = FALSE) +
theme_classic() +
ggtitle('Crimes in Baltimore City - 2016-2020:') +
xlab('')+
ylab('')
This minimal map looks a bit like a chaotic mosaic. The goal of the analysis is to differentiate the data and find patterns hidden behind the data points.
This is what the crime breakdown by category looks like:
data %>%
group_by(crime) %>%
summarize(counts = n()) %>%
ggplot(aes(y=counts, x = reorder(crime, counts), fill = counts)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=7)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City 2016-2020:") +
geom_text(aes(label = counts), hjust = -.1, size = 4, color = "black") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
scale_y_continuous(labels = NULL) +
coord_flip() +
expand_limits(y = 60000)
As we can see, the most common crime is larceny. Robberies are divided into four different categories; summing up all of the robbery cases, robbery becomes as frequent as aggravated assault and larceny from auto.
There are slightly more reported homicides than rapes, which surprised me at first glance. Thinking about it more, I suppose there may be many unreported rapes, which would explain these statistics.
Let’s move on to date-time analysis of the data.
First of all, the counts of committed crimes by year:
data %>%
group_by(year) %>%
summarize(counts = n()) %>%
ggplot(aes(y=counts, x = year)) +
geom_line(size = 1.5, colour ='darkblue') +
theme_minimal() +
scale_x_continuous(breaks = seq(2016,2020,1)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City in each year 2016-2020:") +
theme(legend.position = "off")
As we can see, crime in Baltimore peaked in 2017 and has been declining steadily since.
There is a significant drop between 2019 and 2020. This may be a COVID-19 effect on crime; we are going to check that later.
The next part of the date-time analysis is a view by month - are there any months in which criminals are more active?
dm <- data[order(month),.(counts = .N),(month)]
ggplot(data = dm, aes(y=counts, x = month)) +
geom_bar(stat="identity", fill = 'darkred', colour ='darkred') +
scale_x_continuous(breaks = seq(1,12,1)) +
theme_minimal() +
labs(x = "", y = "", title = "Crimes committed in Baltimore City in each month:") +
theme(legend.position = "off")
There is a decline in February, which may be the effect of two possible reasons:
Additionally, we can see that during the spring months (March-May) the criminals are "waking up": their activity rises and reaches a high level in May.
The peak of criminal activity (not much different from the other months) seems to be October (before creating the chart, my intuition was in fact December).
Now let’s take a look at the division of crimes among the weekdays:
dw1 <- unique(data$date)
weekday <- weekdays(dw1)
dw1 <- data.table(dw1,weekday)
dw1 <- dw1[,.(w_counts = .N),.(weekday)]
#in years 2016-2020 there were equal counts of weekdays - 7x261 = 1827
dw <- data[,.(counts = .N),.(weekday)]
ggplot(data = dw, aes(y=counts, x=weekday)) +
geom_bar(stat="identity", fill = 'darkgreen') +
theme_minimal() +
theme(axis.text.x = element_text(size=12)) +
labs(x = "", y = "", title = "Number of Crimes committed in Baltimore City broken down by weekdays:") +
geom_text(aes(label = counts), vjust = 1.5, size = 4, color = "white") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
scale_y_continuous(labels = NULL)
The first thing worth mentioning is that in the years 2016-2020 every weekday occurred an equal number of times - 7 x 261 = 1827 days in total.
The most popular day for crime is Friday. Saturday is slightly less popular, and there is a significant decline on Sunday. From Monday to Thursday criminal activity decreases a little, only to reach its peak on Friday.
The most peaceful and safe day is therefore Sunday - no surprise there. Besides that, Saturday is a little safer than Monday-Friday. It seems that even criminals would like to have some rest at weekends. :)
The next step is to check frequency of crimes given the time:
data %>%
group_by(time) %>%
summarize(counts = n()) %>%
ggplot(aes(y=counts, x = time)) +
geom_line(size = 1, colour ='darkgrey') +
theme_minimal() +
labs(x = "", y = "", title = "Number of Crimes committed in Baltimore City by time:") +
theme(legend.position = "off")
There is a huge peak around 00:00. Besides that, there are some minor peaks, mainly in the evening. Below are the exact times for which more than 4000 cases were recorded:
dt <- data[,.(counts=.N),.(time)]
dt <- dt[counts > 4000]
dt <- dt[order(-counts)]
dt
## time counts
## 1: 00:00:00 5343
## 2: 18:00:00 5020
## 3: 17:00:00 4948
## 4: 12:00:00 4543
## 5: 20:00:00 4505
## 6: 22:00:00 4462
## 7: 16:00:00 4378
## 8: 21:00:00 4375
## 9: 19:00:00 4306
## 10: 15:00:00 4163
Notably, all of these are full hours, which likely reflects reporting granularity (times rounded to the nearest hour) rather than genuine spikes. In order to differentiate between weekdays and times of crimes, let's take a look at the crime breakdown by these two variables:
data %>%
group_by(time, weekday) %>%
summarize(counts = n()) %>%
ggplot(aes(y=counts, x = time, colour =weekday)) +
geom_line(size = 1) +
theme_minimal() +
labs(x = "", y = "", title = "Number of crimes in Baltimore City by Time ~ Weekday") +
theme(legend.position = "off") +
facet_wrap(~weekday)
## `summarise()` has grouped output by 'time'. You can override using the `.groups` argument.
When it comes to the frequency distribution of crimes over time, there are no major differences between the weekdays. The only change worth mentioning is the excess criminal activity on Friday afternoon. It seems that on Friday afternoons the citizens of Baltimore City are especially exposed to criminal activity.
For the last part of the date-time analysis of the criminal data, let's take a look at the proportion between crimes committed during the night (11pm - 4am) and during the day.
dn <- data.frame(table(data$is_night))
colnames(dn) <- c("group", "frequency")
dn$label <- c("crimes at night", "crimes during the day")
dn$label2 <- round(100*dn$frequency/sum(dn$frequency), 1)
dn$label2 <- paste0(dn$label2, "%")
# pie chart of the night vs. day proportions
ggplot(data = dn, aes(x = "", y = frequency, fill = label))+
geom_bar(stat = "identity")+coord_polar("y", start = 0) +
geom_text(aes(label = paste(label2)),
position = position_stack(vjust = 0.5)) +
ggtitle("Proportions of crimes committed during the night vs the day time:")+
theme_void()
I must admit that the results are surprising. "Nighttime" as defined above covers 20.83% of the 24-hour day (the 11pm-4am window is 5 of 24 hours, and 5/24 ≈ 20.8%). At the same time, crimes committed during the night constitute only 19.6% of all crimes, so the crime density is actually higher during the day. My intuition initially gravitated towards the reverse trend. Once again the data proves to behave counterintuitively.
The last view presented in the preliminary data exploration is the breakdown of weapons used during crimes by weekday. Let's take a look at whether there are any significant differences in Monday-Sunday criminal activity:
data %>%
group_by(weapon, weekday) %>%
summarize(counts = n()) %>%
ggplot(aes(y=counts, x = weapon, fill =weekday)) +
geom_bar(stat="identity") +
theme_minimal() +
labs(x = "", y = "", title = "Number of crimes where used different weapons by weekdays") +
theme(axis.text=element_text(size=7)) +
theme(legend.position = "off") +
facet_wrap(~weekday, nrow = 4, ncol = 2)
## `summarise()` has grouped output by 'weapon'. You can override using the `.groups` argument.
As we can see, the criminals use the same methods / weapons in similar proportions no matter the weekday.
Before the clustering, it could be useful to find the data point where, historically, the most crimes were committed - let's extract the coordinates of that place and present them visually on the map (via Google Maps):
coords <- paste0('[', data$lat, ' ; ', data$lon, ']')
cc <- table(coords)
names(sort(-table(coords)))[1]
## [1] "[39.2741 ; -76.6276]"
#sort(-table(coords))[1]
#data[lat=='39.2741' & lon=='-76.6276',.(counts=.N),.(crime)]
The coordinates of the place are [39.2741 ; -76.6276]. The point is located near the Horseshoe Casino Baltimore.
512 crimes were committed there. The two most popular categories are:
This place and the crimes committed there do not surprise me; they fit the intuition concerning casinos perfectly.
Let's move on to the actual clustering part. The first approach is clustering applied to the whole dataset. There are plenty of records (over 230 thousand), so in this case I am going to use CLARA, which scales PAM to large datasets by running it on repeated samples and keeping the best set of medoids. The clusters are going to be assigned according to latitude and longitude. Let's take a look at the data first (no coloring, just plain data points):
#Let's present the data as a scatterplot
ggplot(data, aes(lon, lat)) +
geom_point() +
theme_classic() +
ggtitle('Crimes in Baltimore City - 2016-2020') +
xlab('')+
ylab('')
At first glance we can observe some outliers. In general that is not good for clustering, but CLARA uses the Partitioning Around Medoids (PAM) algorithm, which, fortunately, is not sensitive to outliers.
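As a toy illustration of why medoid-based methods tolerate outliers (not part of the analysis itself), compare how far a single extreme value drags the mean versus a rank-based centre:
# a single extreme value drags the mean far more than a rank-based centre;
# PAM minimises dissimilarities to actual data points (medoids), so it behaves
# like the robust case
x <- c(1, 2, 3, 100)
mean(x)   # 26.5 - pulled heavily towards the outlier
median(x) # 2.5  - stays with the bulk of the data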
My first guess is that there will be roughly 8-10 clusters. We are going to check that hypothesis below.
The appropriate number of clusters is going to be determined by running several iterations of clara (assigning 5-12 clusters) and then checking the quality of the resulting clusterings with appropriate quality measures.
Before starting the clustering procedure, I standardize the longitude and latitude variables. For the clustering computations, I set the number of samples = 10 and the sample size = 1000, and the chosen distance metric is manhattan.
xy_z <- as.data.frame(lapply(data[,5:6], scale))
#opt<-Optimal_Clusters_KMeans(xy_z, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
set.seed(111)
# the same CLARA settings are reused for every k, so wrap them in a helper
clara_fit <- function(dat, k) {
  clara(dat, k, metric="manhattan", stand=FALSE, samples=10,
        sampsize=1000, trace=0, medoids.x=TRUE,
        rngR=FALSE, pamLike=FALSE, correct.d=TRUE)
}
clara_clust_5  <- clara_fit(xy_z, 5)
clara_clust_6  <- clara_fit(xy_z, 6)
clara_clust_7  <- clara_fit(xy_z, 7)
clara_clust_8  <- clara_fit(xy_z, 8)
clara_clust_9  <- clara_fit(xy_z, 9)
clara_clust_10 <- clara_fit(xy_z, 10)
clara_clust_11 <- clara_fit(xy_z, 11)
clara_clust_12 <- clara_fit(xy_z, 12)
# plot the output
#fviz_cluster(clara_clust_5, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_6, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_7, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_8, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_9, geom="point", pointsize=1, ggtheme=theme_classic())
fviz_cluster(clara_clust_10, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_11, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_12, geom="point", pointsize=1, ggtheme=theme_classic())
Let’s take a look at silhouette plots for calculated clustering variations:
#fviz_silhouette(clara_clust_5)
#fviz_silhouette(clara_clust_6)
#fviz_silhouette(clara_clust_7)
#fviz_silhouette(clara_clust_8)
#fviz_silhouette(clara_clust_9)
fviz_silhouette(clara_clust_10)
## cluster size ave.sil.width
## 1 1 102 0.55
## 2 2 62 0.44
## 3 3 151 0.40
## 4 4 85 0.35
## 5 5 110 0.31
## 6 6 46 0.58
## 7 7 152 0.43
## 8 8 127 0.33
## 9 9 100 0.31
## 10 10 65 0.45
#fviz_silhouette(clara_clust_11)
#fviz_silhouette(clara_clust_12)
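For reference, the silhouette of a single observation $i$ compares its average distance $a(i)$ to the other members of its own cluster with its average distance $b(i)$ to the nearest other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

Values close to 1 indicate a well-separated observation and values near 0 a borderline one; the table above reports the average of $s(i)$ within each cluster.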
As we can see, the highest average score is attained by the ten-cluster solution (average silhouette width = 0.40). Let's validate this with another metric.
Let's calculate the Calinski-Harabasz measure for each iteration and check whether the silhouette results hold up:
# Calinski-Harabasz
ch_5 <- calinhara(xy_z, clara_clust_5$clustering)
ch_6 <- calinhara(xy_z, clara_clust_6$clustering)
ch_7 <- calinhara(xy_z, clara_clust_7$clustering)
ch_8 <- calinhara(xy_z, clara_clust_8$clustering)
ch_9 <- calinhara(xy_z, clara_clust_9$clustering)
ch_10 <- calinhara(xy_z, clara_clust_10$clustering)
ch_11 <- calinhara(xy_z, clara_clust_11$clustering)
ch_12 <- calinhara(xy_z, clara_clust_12$clustering)
ch_table <- data.frame(ch_5, ch_6, ch_7, ch_8, ch_9, ch_10, ch_11, ch_12)
colnames(ch_table) <- c("5 clusters", "6 clusters", "7 clusters", "8 clusters", "9 clusters", "10 clusters",
"11 clusters", "12 clusters")
#ch_table
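For reference, for $k$ clusters over $n$ observations the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion,

$$CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}$$

where $B(k)$ and $W(k)$ are the between- and within-cluster sums of squares; higher values indicate better-separated clusters.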
Considering the Calinski-Harabasz measure, the best result is obtained for k = 12 clusters. However, the difference between the results for ten and twelve clusters is very small (202557.4 vs. 200928.4), so I am going to stick to ten clusters.
Ten clusters therefore seems to be the best option (it also maximises the average silhouette width). We could ask one more question before evaluating our clusters: should we carry out clustering on this data at all?
Let's examine that by carrying out the Duda-Hart test for splitting. Its null hypothesis is that the data form a single homogeneous cluster; rejecting it indicates that the dataset should be split into (at least) two clusters.
clara_clust_2 <- clara_fit(xy_z, 2)
dudahart2(xy_z, clara_clust_2$clustering) #fpc::
## $p.value
## [1] 0
##
## $dh
## [1] 0.6573870847
##
## $compare
## [1] 0.6767445961
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232306
We can reject the null hypothesis of cluster homogeneity (cluster1 = FALSE at alpha = 0.001). Therefore we can state that our data can be meaningfully split into clusters. Let's take a look at the medoids of the clusters:
#clara_clust_10$medoids
#$i.med
data <- cbind(data,xy_z,clara_clust_10$clustering)
colnames(data) <- c('date_time','crime','weapon','district','lat','lon','date','time',
'year','month','is_night','weekday', "lat_z", "lon_z", "cluster")
data$cluster <- as.factor(data$cluster)
medoids <- data[clara_clust_10$i.med, ]
We will examine whole clusters by these characteristics later. For now, let's look at the medoids' locations on the map.
Let's plot the medoids with their cluster assignment on the map:
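A minimal plain-ggplot2 sketch of such a plot (the original report presumably used a Google Maps background; `medoids` comes from the chunk above):
#sketch: all crimes in grey, the ten medoids coloured by cluster
ggplot(data, aes(lon, lat)) +
  geom_point(colour = 'grey80', alpha = 0.3) +
  geom_point(data = medoids, aes(colour = cluster), size = 4) +
  geom_text(data = medoids, aes(label = cluster), vjust = -1, size = 3) +
  theme_classic() +
  labs(title = 'CLARA medoids of the ten clusters', x = '', y = '')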
Let’s take a look how many crimes were committed within each cluster:
dc <- data[,.(counts = .N),(cluster)]
ggplot(data = dc, aes(y=counts, x = cluster)) +
geom_bar(stat="identity", fill = 'darkgrey', colour ='black') +
#scale_x_continuous(breaks = seq(1,10,1)) +
theme_minimal() +
labs(x = "", y = "", title = "Crimes committed in Baltimore City within each cluster:") +
theme(legend.position = "off")
The most criminally active clusters are #3 - Highlandtown and #7 - the Downtown cluster. The least active is #6 - the Brooklyn cluster.
The clusters fall into tiers of similar crime counts:
a similar level for clusters 2, 3 and 10, and then for clusters 1, 5, 8 and 9.
Let's check what the proportions of different types of committed crimes look like within the clusters:
data %>%
group_by(cluster, crime) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = reorder(crime, counts), fill = freq)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=6)) +
theme(axis.text.y = element_text(size=6)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City 2016-2020 within clusters (%)") +
geom_text(aes(label = freq), hjust = -.1, size = 2, color = "black") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
#scale_y_discrete(labels = NULL) +
coord_flip() +
facet_wrap(~cluster)
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
In terms of standard larceny, cluster #7 - Downtown - is where it is most prominent (27% of all crimes committed there).
Common assaults are most popular in cluster #4 - Edmondson (an area with low-to-medium income) - 20% of all crimes there are common assaults.
When it comes to burglaries, no single cluster stands out as the most active one. However, there is one cluster where burglaries are quite rare - 8% in cluster #7 (again, Downtown). Aggravated assaults are quite common in cluster #6 (Brooklyn - the industrial cluster), where they constitute 15% of all committed crimes.
Street robberies are least popular in cluster #4 - again, the Edmondson cluster. Shootings are most popular in clusters #5 and #9 (the Druid Heights neighbourhood and Carrollton Ridge). Luckily, they constitute merely 2% of all committed crimes.
Now let's check how the clusters differ when it comes to crimes committed on different weekdays and during the night vs. the day:
data %>%
group_by(cluster, weekday) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = weekday, fill = freq)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=5)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City 2016-2020 within clusters by weekday (%)") +
geom_text(aes(label = freq), vjust = 1, size = 2.5, color = "white") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
#scale_y_discrete(labels = NULL) +
facet_wrap(~cluster, nrow = 5, ncol = 2)
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
data %>%
group_by(cluster, is_night) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = reorder(is_night, counts), fill = freq)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=7)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City 2016-2020 within clusters by daytime (%)") +
geom_text(aes(label = freq), vjust = 1, size = 3.5, color = "white") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
#scale_y_discrete(labels = NULL) +
facet_wrap(~cluster, nrow = 5, ncol = 2)
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
Actually, there are no dramatic differences between the assigned clusters in terms of weekday or nighttime crimes.
It is worth mentioning that we could call cluster #6 (Brooklyn - the industrial one) a "Sunday crime" cluster: 15% of all crimes within this cluster were committed on Sunday.
When it comes to the night/day division, it is quite stable across clusters (the proportions are about 20/80). The only cluster where nighttime crimes constitute over 21% of all crimes is the Downtown cluster, #7.
Now let's conduct another clustering exercise on our dataset.
The next part of the report is an analysis of possible crime clusters in Baltimore City for 2019 and 2020, taking into consideration a possible COVID-19 effect.
To begin, let's check the number of crimes in each year by month (the possible difference for March-June 2019 vs. 2020 seems especially interesting):
data_1920 <- data[year %in% c(2019,2020),1:12]
# month abbreviations (jan-dec); month.abb replaces the long nested ifelse chain
data_1920$mm <- tolower(month.abb[data_1920$month])
data_1920 %>%
group_by(mm, year) %>%
summarize(counts = n())%>%
mutate(mm = factor(mm, levels = c("jan", "feb", "mar","apr", "may", "jun", "jul", "aug", "sep", "oct",
"nov", "dec"))) %>%
ggplot(aes(y=counts, x = mm, fill = as.factor(year))) +
geom_bar(position="dodge", stat="identity", colour ='black') +
theme_minimal() +
labs(x = "", y = "", title = "Crimes committed in Baltimore City in each month in years 2019-2020:")
## `summarise()` has grouped output by 'mm'. You can override using the `.groups` argument.
As we can see above, a significant difference between the number of crimes in 2019 vs. 2020 first appears in April, which can be taken as the start of the "covid effect". The difference grows over the following months, peaks in August, and then declines towards the end of the year. However, the gap between the two years is still visible in December, so the "covid effect" changed the criminal life of Baltimore even in the longer run.
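To quantify this gap month by month, a small sketch (assuming the data_1920 table defined above; dcast spreads the yearly counts into columns named `2019` and `2020`):
# year-by-year monthly counts, cast side by side, with the percentage drop
ym  <- data_1920[, .(n = .N), .(year, month)]
gap <- dcast(ym, month ~ year, value.var = 'n')
gap[, drop_pct := round(100 * (`2019` - `2020`) / `2019`, 1)]
gap[order(month)]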
The other feature worth checking while constructing this comparison is the crime structure. Let’s take a look at the proportions of crimes for each year:
data_1920 %>%
group_by(year, crime) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = reorder(crime, counts), fill = as.factor(year))) +
geom_bar(position = 'dodge', stat="identity") +
theme_minimal() +
expand_limits(y = 0.35) +
theme(axis.text.x = element_text(size=5)) +
theme(axis.text.y = element_text(size=7)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City in years 2019-20 by crime type (%):") +
geom_text(aes(label = freq), position = position_dodge(0.8), hjust = -0.3,
size = 2.5, color = "black") +
#theme(panel.grid.major.y = element_blank(), legend.position = "off") +
#scale_y_discrete(labels = NULL) +
coord_flip()
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
#+
#expand_limits(y = 60000)
#scale_x_discrete(breaks = c("2015-01","2016-01","2017-01","2018-01","2019-01","2020-01"))
The results are quite interesting. Common larceny is more popular in 2019 than in 2020. On the other hand, the share of assaults is higher in 2020. Possibly due to reduced social activity in 2020, we observe fewer burglaries, larcenies from auto and street robberies. The share of heavy crimes (shootings, rapes, arsons, homicides) is slightly higher in 2020.
Let's conduct the clustering for each year. Since there are 80k rows, I am again using CLARA. First iteration - year 2019:
#2019
data_19 <- data_1920[year==2019,]
xy_z <- as.data.frame(lapply(data_19[,5:6], scale))
#opt<-Optimal_Clusters_KMeans(xy_z, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
set.seed(111)
# reuse the clara_fit helper defined in the first clustering section
clara_clust_5  <- clara_fit(xy_z, 5)
clara_clust_6  <- clara_fit(xy_z, 6)
clara_clust_7  <- clara_fit(xy_z, 7)
clara_clust_8  <- clara_fit(xy_z, 8)
clara_clust_9  <- clara_fit(xy_z, 9)
clara_clust_10 <- clara_fit(xy_z, 10)
clara_clust_11 <- clara_fit(xy_z, 11)
clara_clust_12 <- clara_fit(xy_z, 12)
# plot the output
#fviz_cluster(clara_clust_5, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_6, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_7, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_8, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_9, geom="point", pointsize=1, ggtheme=theme_classic())
fviz_cluster(clara_clust_10, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_11, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_12, geom="point", pointsize=1, ggtheme=theme_classic())
Let’s take a look at silhouette plots for calculated clustering variations and also Calinski-Harabasz measure:
#fviz_silhouette(clara_clust_5)
#fviz_silhouette(clara_clust_6)
#fviz_silhouette(clara_clust_7)
#fviz_silhouette(clara_clust_8)
#fviz_silhouette(clara_clust_9)
fviz_silhouette(clara_clust_10)
## cluster size ave.sil.width
## 1 1 118 0.44
## 2 2 132 0.49
## 3 3 165 0.36
## 4 4 34 0.59
## 5 5 59 0.48
## 6 6 131 0.35
## 7 7 73 0.45
## 8 8 73 0.39
## 9 9 93 0.33
## 10 10 122 0.36
#fviz_silhouette(clara_clust_11)
#fviz_silhouette(clara_clust_12)
# Calinski-Harabasz
ch_5 <- calinhara(xy_z, clara_clust_5$clustering)
ch_6 <- calinhara(xy_z, clara_clust_6$clustering)
ch_7 <- calinhara(xy_z, clara_clust_7$clustering)
ch_8 <- calinhara(xy_z, clara_clust_8$clustering)
ch_9 <- calinhara(xy_z, clara_clust_9$clustering)
ch_10 <- calinhara(xy_z, clara_clust_10$clustering)
ch_11 <- calinhara(xy_z, clara_clust_11$clustering)
ch_12 <- calinhara(xy_z, clara_clust_12$clustering)
ch_table <- data.frame(ch_5, ch_6, ch_7, ch_8, ch_9, ch_10, ch_11, ch_12)
colnames(ch_table) <- c("5 clusters", "6 clusters", "7 clusters", "8 clusters", "9 clusters", "10 clusters",
"11 clusters", "12 clusters")
#ch_table
Both the average silhouette width and the Calinski-Harabasz measure indicate that we should again go with ten clusters for year 2019.
#clara_clust_10$medoids
#clara_clust_10$i.med
data_19 <- cbind(data_19,xy_z,clara_clust_10$clustering)
colnames(data_19) <- c('date_time','crime','weapon','district','lat','lon','date','time',
'year','month','is_night','weekday','mm', "lat_z", "lon_z", "cluster")
data_19$cluster <- as.factor(data_19$cluster)
medoids_19 <- data_19[clara_clust_10$i.med, ]
Among the ten medoids: 4 of the underlying crimes were committed in September, 4 at night, and 3 on Monday.
Second iteration - year 2020:
#2020
data_20 <- data_1920[year==2020,]
xy_z <- as.data.frame(lapply(data_20[,5:6], scale))
#opt<-Optimal_Clusters_KMeans(xy_z, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
set.seed(111)
# reuse the clara_fit helper defined in the first clustering section
clara_clust_5  <- clara_fit(xy_z, 5)
clara_clust_6  <- clara_fit(xy_z, 6)
clara_clust_7  <- clara_fit(xy_z, 7)
clara_clust_8  <- clara_fit(xy_z, 8)
clara_clust_9  <- clara_fit(xy_z, 9)
clara_clust_10 <- clara_fit(xy_z, 10)
clara_clust_11 <- clara_fit(xy_z, 11)
clara_clust_12 <- clara_fit(xy_z, 12)
# plot the output
#fviz_cluster(clara_clust_5, geom="point", pointsize=1, ggtheme=theme_classic())
fviz_cluster(clara_clust_6, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_7, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_8, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_9, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_10, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_11, geom="point", pointsize=1, ggtheme=theme_classic())
#fviz_cluster(clara_clust_12, geom="point", pointsize=1, ggtheme=theme_classic())
Let’s take a look at silhouette plots for calculated clustering variations and also Calinski-Harabasz measure:
#fviz_silhouette(clara_clust_5)
fviz_silhouette(clara_clust_6)
## cluster size ave.sil.width
## 1 1 149 0.51
## 2 2 144 0.35
## 3 3 192 0.43
## 4 4 222 0.34
## 5 5 250 0.32
## 6 6 43 0.69
#fviz_silhouette(clara_clust_7)
#fviz_silhouette(clara_clust_8)
#fviz_silhouette(clara_clust_9)
#fviz_silhouette(clara_clust_10)
#fviz_silhouette(clara_clust_11)
#fviz_silhouette(clara_clust_12)
# Calinski-Harabasz
ch_5 <- calinhara(xy_z, clara_clust_5$clustering)
ch_6 <- calinhara(xy_z, clara_clust_6$clustering)
ch_7 <- calinhara(xy_z, clara_clust_7$clustering)
ch_8 <- calinhara(xy_z, clara_clust_8$clustering)
ch_9 <- calinhara(xy_z, clara_clust_9$clustering)
ch_10 <- calinhara(xy_z, clara_clust_10$clustering)
ch_11 <- calinhara(xy_z, clara_clust_11$clustering)
ch_12 <- calinhara(xy_z, clara_clust_12$clustering)
ch_table <- data.frame(ch_5, ch_6, ch_7, ch_8, ch_9, ch_10, ch_11, ch_12)
colnames(ch_table) <- c("5 clusters", "6 clusters", "7 clusters", "8 clusters", "9 clusters", "10 clusters",
"11 clusters", "12 clusters")
#ch_table
For year 2020, the average silhouette width is best (= 0.39) for:
The Calinski-Harabasz measure is the highest for 9 clusters, but I decided to go with 6 clusters in order to analyze the 2020 data in a more concentrated way (the difference in the C-H measure between 9 and 6 clusters is slight).
#clara_clust_6$medoids
#clara_clust_6$i.med
data_20 <- cbind(data_20,xy_z,clara_clust_6$clustering)
colnames(data_20) <- c('date_time','crime','weapon','district','lat','lon','date','time',
'year','month','is_night','weekday','mm', "lat_z", "lon_z", "cluster")
data_20$cluster <- as.factor(data_20$cluster)
medoids_20 <- data_20[clara_clust_6$i.med, ]
Let's plot the medoids with their cluster assignment on the map.
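A minimal plain-ggplot2 sketch of those maps (assuming the medoids_19 and medoids_20 objects from the chunks above; again no map background):
#sketch: 2019 vs 2020 medoids side by side
m19 <- medoids_19[, .(lon, lat, cluster, yr = '2019')]
m20 <- medoids_20[, .(lon, lat, cluster, yr = '2020')]
ggplot(rbind(m19, m20), aes(lon, lat, colour = cluster)) +
  geom_point(size = 4) +
  facet_wrap(~yr) +
  theme_classic() +
  labs(title = 'CLARA medoids: 2019 (10 clusters) vs 2020 (6 clusters)', x = '', y = '')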
There is a structural transition of the clusters from 2019 to 2020:
Let’s additionally take a look at proportions of crimes committed in each cluster for years 2019 and 2020:
data_19 %>%
group_by(cluster) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = as.integer(cluster), fill = cluster)) +
geom_bar(stat="identity") +
scale_x_continuous(breaks = seq(1,10,1)) +
theme_minimal() +
theme(axis.text.x = element_text(size=7)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City 2019 within clusters (%)")+
geom_text(aes(label = freq), vjust = -0.3,
size = 4, color = "black")
data_20 %>%
group_by(cluster) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = as.integer(cluster), fill = cluster)) +
geom_bar(stat="identity") +
scale_x_continuous(breaks = seq(1,10,1)) +
theme_minimal() +
theme(axis.text.x = element_text(size=7)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City 2020 within clusters (%)")+
geom_text(aes(label = freq), vjust = -0.3,
size = 4, color = "black")
Actually, for 2019 the clustering pattern is very similar to the first, whole-dataset attempt. What is interesting is that 2020 brought more "crime concentration" in Baltimore City: most of the criminal activity happens in the central cluster (25% of all crimes), while the other clusters (1-4) remain significant. In terms of the share of criminal activity, the Brooklyn part is negligible in 2020.
Let's dig into the 2020 clustering. We are going to check the crime typology, weekday and day/nighttime breakdown within the clusters.
Firstly, let's check the shares of different criminal activities within the clusters:
data_20 %>%
group_by(cluster, crime) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = reorder(crime, counts), fill = freq)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=6.5)) +
theme(axis.text.y = element_text(size=7)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City in 2020 within clusters (%)") +
geom_text(aes(label = freq), hjust = -.1, size = 2.5, color = "black") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
#scale_y_discrete(labels = NULL) +
coord_flip() +
facet_wrap(~cluster, nrow = 3, ncol = 2)
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
The proportion of larceny is quite stable across the clusters.
Common assault is most popular in cluster #4 - the western one.
Cluster #4 also has the biggest share of shootings. It seems that in 2020 it was the most dangerous area (the shares of homicides and rapes are also the highest there).
Now let's check how the clusters differ when it comes to crimes committed on different weekdays and during the night vs. the day:
data_20 %>%
group_by(cluster, weekday) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = weekday, fill = freq)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=6)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City in year 2020 within clusters by weekday (%)") +
geom_text(aes(label = freq), vjust = 1, size = 2.5, color = "white") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
#scale_y_discrete(labels = NULL) +
facet_wrap(~cluster, nrow = 3, ncol = 2)
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
data_20 %>%
group_by(cluster, is_night) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = reorder(is_night, counts), fill = freq)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=7)) +
labs(x = "", y = "", title = "Crimes committed in Baltimore City in year 2020 within clusters by daytime (%)") +
geom_text(aes(label = freq), vjust = 1, size = 2.5, color = "white") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
#scale_y_discrete(labels = NULL) +
facet_wrap(~cluster)
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
The only fact worth mentioning concerning weekdays is that cluster #6 (the Brooklyn-industrial one) shows the highest difference between the proportions of crimes committed on Friday and on Thursday (4 p.p.).
When it comes to the night/day division, the only slight deviation is that for cluster #1 (the north-eastern one) the proportion of nighttime crimes is below 20%.
Let's finish the analysis of 2019 vs. 2020 crimes at this point and proceed to the last clustering part - clustering of murders committed in Baltimore using k-means.
The last clustering exercise is an analysis of murders (in this part I assume homicide = murder). There were 1639 murders committed in Baltimore City in the years 2016-2020.
It would be interesting to analyze the murders by the date-time variables before the actual clustering.
Firstly, let's check the distribution of murders across the months:
data_m <- data[crime =='HOMICIDE',1:12]
dm <- data_m[order(month),.(counts = .N),(month)]
ggplot(data = dm, aes(y=counts, x = month)) +
geom_bar(stat="identity", fill = 'darkred', colour ='darkred') +
scale_x_continuous(breaks = seq(1,12,1)) +
theme_minimal() +
labs(x = "", y = "", title = "Murders committed in Baltimore City in each month:") +
theme(legend.position = "off")
February and March are the months with the fewest murders in Baltimore. The peak month is July (possibly some rough endings of summer romances? - just hazarding a hypothesis there ;)). The first quarter is actually quite calm in terms of murders; the rest of the year has its peaks and declines.
Now let’s take a look at the division of murder counts among the weekdays:
dw <- data_m[,.(counts = .N),.(weekday)]
ggplot(data = dw, aes(y=counts, x=weekday, fill = counts)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=12)) +
labs(x = "", y = "", title = "Murders committed in Baltimore City broken down by weekdays:") +
geom_text(aes(label = counts), vjust = 1.5, size = 4, color = "white") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
scale_y_continuous(labels = NULL)
This is quite interesting - before running this plot, my intuition told me that weekends should be the most active murder days (murders need some planning and time, etc.). Surprisingly, the most popular weekday for committing murder is Wednesday.
For the last part of the date-time analysis, let's take a look at the proportion of homicides committed during the night (11pm - 4am) vs. during the day. I presume that the share of nighttime murders should be significantly higher than the baseline (~20%):
dn <- data.frame(table(data_m$is_night))
colnames(dn) <- c("group", "frequency")
dn$label <- c("crimes at night", "crimes during the day")
dn$label2 <- round(100*dn$frequency/sum(dn$frequency), 1)
dn$label2 <- paste0(dn$label2, "%")
# pie chart of the night vs. day proportions
ggplot(data = dn, aes(x = "", y = frequency, fill = label))+
geom_bar(stat = "identity")+coord_polar("y", start = 0) +
geom_text(aes(label = paste(label2)),
position = position_stack(vjust = 0.5)) +
ggtitle("Proportions of murders committed during the night vs the day time:")+
theme_void()
Another surprise: only 23% of the homicides were committed during the nighttime. I was expecting something different (proportions even close to 50/50). That finishes the date-time analysis of Baltimore's homicides - let's move on to the clustering part!
Let’s take a look at the map in the first step:
#Let's present the data as a scatterplot
ggplot(data_m, aes(lon, lat)) +
geom_point(color = 'darkred') +
theme_classic() +
ggtitle('Murders in Baltimore City - 2016-2020') +
xlab('')+
ylab('')
Actually, it seems that there are quite a lot of outliers. I presume there could be 3-4 murder clusters, but it would be appropriate to check whether clustering is informative for this dataset at all. To determine that, we conduct the Duda-Hart test again:
xy_z <- as.data.frame(lapply(data_m[,5:6], scale))
km2<-kmeans(xy_z,2)
dudahart2(xy_z, km2$cluster) #fpc::
## $p.value
## [1] 0.9105829497
##
## $dh
## [1] 0.7072982842
##
## $compare
## [1] 0.6228252431
##
## $cluster1
## [1] TRUE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232306
Additionally, I am going to calculate the Hopkins Statistic:
hopkins::hopkins(xy_z)
## [1] 0.9749344601
When it comes to the murders data, the test suggests that clustering is not the best option here (we fail to reject the null hypothesis that the data is homogeneous). However, we can see some possible clusters on the plot presented above, so I decided to also calculate the Hopkins statistic. It equals 0.975 - a value close to 1, which indicates that the dataset is clusterable and some clusters should be discoverable, which appeals to my intuition. Therefore, let's move on with the clustering procedure.
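For reference, one common formulation of the Hopkins statistic draws $m$ sampling origins uniformly at random over the data space and $m$ points from the data itself, and compares their nearest-neighbour distances to the data ($u_i$ from the random origins, $w_i$ from the sampled points):

$$H = \frac{\sum_{i=1}^{m} u_i^{\,d}}{\sum_{i=1}^{m} u_i^{\,d} + \sum_{i=1}^{m} w_i^{\,d}}$$

with $d$ the number of dimensions. $H \approx 0.5$ suggests uniformly spread (unclusterable) data, while values close to 1 indicate a strong clustering tendency.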
Keeping in mind that there are quite a lot of outliers in the data, we are going to perform k-means clustering. This is feasible given the moderate amount of data (1639 murder records).
I am going to determine the number of clusters using the average silhouette width. Let's construct the silhouette plot for k = 1-10 clusters:
Optimal_Clusters_KMeans(xy_z, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
## [1] 0.0000000000 0.3409608496 0.4568656962 0.5367780589 0.5070223026
## [6] 0.4462018216 0.4661110111 0.4782200750 0.4710206532 0.4720163059
The highest average silhouette value is observed for k = 4 clusters.
cluster_km_4<-kmeans(xy_z, 4)
cc <- as.data.frame(cluster_km_4$centers)
data_m <- cbind(data_m,xy_z,cluster_km_4$cluster)
colnames(data_m) <- c('date_time','crime','weapon','district','lat','lon','date','time',
'year','month','is_night','weekday', "lat_z", "lon_z", "cluster")
data_m$cluster <- as.factor(data_m$cluster)
#unscale the centroids to put them on the map
dm <- data_m[,c(5,6)]
col_mean <- apply(dm, 2, mean)
col_sd <- apply(dm, 2, sd)
#unscale column-wise: x = z * sd + mean
centers <- t(t(cluster_km_4$centers) * col_sd + col_mean)
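The unscaling simply inverts the standardization: if $z = (x - \mu)/\sigma$ column-wise, then $x = z\sigma + \mu$, which is exactly what the transposed multiplication applies to each coordinate of the centroids.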
Let's plot the centroids with their cluster assignment on the map.
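The map itself was produced externally; a minimal plain-ggplot2 sketch (assuming the centers matrix computed above and a hypothetical centers_df helper frame) could be:
#sketch: all murders in grey, the four unscaled centroids coloured by cluster
centers_df <- as.data.frame(centers)
centers_df$cluster <- factor(seq_len(nrow(centers_df)))
ggplot(data_m, aes(lon, lat)) +
  geom_point(colour = 'grey80', alpha = 0.4) +
  geom_point(data = centers_df, aes(colour = cluster), size = 5) +
  theme_classic() +
  labs(title = 'k-means centroids of the murder clusters', x = '', y = '')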
As we can see, two of the four centroids are located in parks, which are natural places where homicides may be committed.
Let’s check the murders counts for each cluster:
dc <- data_m[,.(counts = .N),(cluster)]
ggplot(data = dc, aes(y=counts, x = cluster, fill = cluster)) +
geom_bar(stat="identity", colour ='black') +
#scale_x_continuous(breaks = seq(1,10,1)) +
theme_minimal() +
labs(x = "", y = "", title = "Murders committed in Baltimore City within each cluster:") +
theme(legend.position = "off") +
geom_text(aes(label = counts), vjust = 2, size = 8, color = "white")
The most numerous clusters are #2 and #3, and they also cover the biggest part of the Baltimore area. At the same time, as we can see on the plot above, the murder concentration is quite high for cluster #3.
Let's analyze the clusters in terms of the date and time of the murders committed within them. Firstly, let's check the shares of murders by weekday within the clusters:
data_m %>%
group_by(cluster, weekday) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = weekday, fill = freq)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=6)) +
theme(axis.text.y = element_text(size=7)) +
labs(x = "", y = "", title = "Murders committed in Baltimore City 2016-2020 within clusters by weekday (%)") +
geom_text(aes(label = freq), vjust = 2, size = 2.5, color = "white") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
#scale_y_discrete(labels = NULL) +
facet_wrap(~cluster)
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
data_m %>%
group_by(cluster, is_night) %>%
summarize(counts = n()) %>%
mutate(freq = formattable::percent(counts / sum(counts))) %>%
ggplot(aes(y=freq, x = reorder(is_night, counts), fill = freq)) +
geom_bar(stat="identity") +
theme_minimal() +
theme(axis.text.x = element_text(size=7)) +
labs(x = "", y = "", title = "Murders committed in Baltimore City 2016-2020 within clusters by daytime (%)") +
geom_text(aes(label = freq), vjust = 2, size = 3, color = "white") +
theme(panel.grid.major.y = element_blank(), legend.position = "off") +
#scale_y_discrete(labels = NULL) +
facet_wrap(~cluster)
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
The only fact worth mentioning regarding weekdays is that the share of murders is significantly higher on Mondays within cluster #4 (Reed Bird Island Park) - 21% of its murders were committed on Monday. As far as the night/day division is concerned, it is interesting that for clusters 2 and 4 (the park clusters) a quarter of all murders were committed at night. This fits the "park environment" assumption quite well.
The crime data of Baltimore City was quite a challenge to deal with. It was an opportunity to explore the data and perform clustering using CLARA (built on the PAM algorithm) and k-means.
The peak of criminal activity in Baltimore City falls on Fridays, while at the weekend the criminals seem to take some rest.
Surprisingly, the vast majority of crimes are committed during the daytime. This also applies to the heaviest crimes (homicides).
The most popular point in the criminal world of Baltimore is the area by the Horseshoe Casino Baltimore.
The whole dataset could be assigned to ten clusters, which largely correspond to the neighbourhoods of Baltimore.
The Highlandtown and Downtown areas seem to be the most dangerous ones in terms of counts of committed crimes.
We were able to single out a “Sunday Crime Cluster” (Brooklyn - industrial area) - 15% of all crimes were committed on Sunday there.
We could also observe a "COVID-19 effect" in 2020. The global pandemic reduced social activity, including criminal activity. The effect is visible from April and continued until the end of the year. Common, light crimes dropped, but at the same time the share of heavy crimes (rapes, assaults, homicides) went up. It seems that the "recreational / light" criminals ceased their activity last year, while the heavy and ruthless ones did not stop because of the COVID-19 pandemic. Additionally, criminal activity in 2020 was more concentrated than in 2019: the number of clusters went down (10 -> 6).
Interestingly, most murders were committed during the daytime (77%), and the most common day for them was Wednesday.
The murders data could be divided into 4 clusters; two of them are centred on parks, which constitute natural places where homicides may be committed.
It was a great pleasure to prepare this project. I hope it is also interesting for the Viewer! :)