Association rules in crime data
1.Introduction
Motivation
Big data analytics is becoming more popular every year and is used in a growing number of scientific fields. One of its main advantages is that it can evaluate huge quantities of data much faster than people can, and spot trends that they would likely miss. From a crime-solving perspective, data analysis can therefore help catch criminals, or even prevent crime from occurring, and increase the security of entire cities.
Examples of big data analysis in forensics include license plate readers that record the plate number of every vehicle entering or leaving a given place. Another example is storing and analyzing a large volume and variety of data in real time to predict and spot patterns, especially in human interactions and behavior, in order to determine which areas are at increased risk and should receive more police patrols. It therefore seems reasonable to analyze crime data with a data mining method called association rules, which finds rules describing the relationship or co-occurrence of two or more elements, in our case crime incidents.
Interested in these applications, I decided to explore them further in a paper assigned in the Unsupervised Learning classes of my Master's of Data Science studies at the Faculty of Economics, University of Warsaw. For the analysis I used the Toronto Police database, as the city is known for attracting tourists, students and people looking for career development. Toronto has long been considered one of the safest cities in North America, but according to the Economist it has become more dangerous in recent years, which will be visible later when analyzing the dataset.
Association rules
1.1.Definition
Association rule mining is a method for discovering interesting relations between variables in large databases. Using measures of interestingness, which are discussed below, it is intended to identify strong rules. For example, the rule {bread, butter} => {ham} indicates that people who bought bread and butter are likely to buy ham as well. Such information helps to identify new opportunities for cross-selling products to customers, personalizing marketing promotions, managing inventory more efficiently or planning product placement in stores. In general, the concept is most often and most successfully used in e-commerce, where recommender systems built on it enhance the customer's shopping experience.
1.2.Measures
Support
Support is the ratio of the number of transactions that contain both X and Y to the total number of transactions. The higher the support, the more frequently the combination of items appears in the dataset.
\[Support \ Rule(X =>Y) = \frac{Number \ of \ transaction \ containing \ X \ and \ Y} {Number \ of \ all \ transaction}\]
Confidence
Confidence is the probability that a transaction contains item Y given that it contains item X, i.e. how often a customer who buys X also buys Y.
\[Confidence \ Rule(X=>Y) = \frac{Number \ of \ transaction \ containing \ X \ and \ Y} {Number \ of \ transaction \ containing \ X}\]
Lift
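Lift measures how much the presence of item X increases the likelihood that people will buy item Y, compared with how often Y is bought overall. A lift above 1 indicates that X and Y appear together more often than expected if they were independent, a lift of 1 indicates no association, and a lift below 1 indicates that they appear together less often than expected.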
\[Lift \ Rule(X =>Y) = \frac{Confidence \ Rule(X => Y)} {Support (Y)}\]
Where:
\[Support(Y) = \frac {Number \ of \ transaction \ containing \ Y} {Number \ of \ all \ transaction}\]
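The three measures can be checked by hand on a small example. The sketch below uses base R only; the baskets and the helper names `support()`, `confidence()` and `lift()` are made up for illustration and are not part of the arules package or of the Toronto data.

```r
# Toy baskets (hypothetical data, not taken from the Toronto dataset)
baskets <- list(
  c("bread", "butter", "ham"),
  c("bread", "butter"),
  c("bread", "ham"),
  c("butter", "ham"),
  c("bread", "butter", "ham")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of X => Y: support of X and Y together divided by support of X
confidence <- function(x, y, baskets) {
  support(c(x, y), baskets) / support(x, baskets)
}

# Lift of X => Y: confidence divided by the support of Y alone
lift <- function(x, y, baskets) {
  confidence(x, y, baskets) / support(y, baskets)
}

x <- c("bread", "butter")
y <- "ham"
supp_xy <- support(c(x, y), baskets)   # 2 of 5 baskets
conf_xy <- confidence(x, y, baskets)   # 2 of the 3 baskets with bread and butter
lift_xy <- lift(x, y, baskets)         # confidence relative to support of ham (4/5)
```

In this toy example the lift is below 1: ham is bought in 4 of 5 baskets overall but only in 2 of the 3 baskets that contain bread and butter, so the rule performs worse than the baseline.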
Dataset and preprocessing
I started by loading the necessary packages, then loaded the dataset and prepared it in several stages for further analysis, for example by removing rows with a duplicated incident ID or empty values.
library(tidyverse)
library(networkD3)
library(arules)
library(arulesViz)
library(igraph)
library(visNetwork)
library(openxlsx)
library(data.table)
library(rmdformats)
library(rmarkdown)
As mentioned before, we are going to apply and analyze association rules on crime data available on the Toronto Police public data portal.
trans <- read.csv("toronto_crime.csv")
trans <- subset(trans, !duplicated(trans$event_unique_id))
paged_table(trans)
## [1] 2014 2013 2012 2010 2011 2015 2008 NA 2005 2006 2000 2001 2009 2002 2004
## [16] 2007 2003 2016 2017 2018 2019
As presented above, the dataset covers criminal incidents that occurred in Toronto over the period 2000-2021. Based on the summary below of how many incidents occurred in each year, I decided to analyze the data for 2019, the year with the most incidents.
year_group <- group_by(trans, occurrenceyear)
crime_by_year <- summarise(year_group,
n = n())
paged_table(crime_by_year)
After importing and preprocessing the dataset, I transformed it into transactional form, because association rules are typically used to find associations between goods bought in a supermarket. What does this mean? We want the data in a form that resembles customers' purchase transactions. In our case, the Toronto district (Neighbourhood) of the reported incident plays the role of the transaction ID (e.g. the invoice or receipt number), and the type of crime (MCI, Major Crime Indicator) corresponds to the purchased product. In the further analysis we do not care how many times a type of incident occurred in a given district, only whether it occurred at all.
df <- trans %>% select(Neighbourhood, MCI) %>% mutate(MCI = str_trim(MCI,
side = "both")) %>% mutate(Neighbourhood = factor(Neighbourhood), MCI = str_replace_all(MCI,
"[']", replacement = "")) %>% mutate(MCI = tolower(str_replace_all(MCI,
pattern = "[ ]", replacement = "_")))
paged_table(df)
The following function allows us to get the data into market basket format (transactions).
# Reshape the long (Neighbourhood, MCI) table into one row per neighbourhood,
# with the reported crime types spread across columns item_1, item_2, ...
prep_data <- function(df) {
  neighbourhoods <- levels(df$Neighbourhood)
  y <- data.frame()
  for (i in seq_along(neighbourhoods)) {
    x <- df %>% filter(Neighbourhood == neighbourhoods[i]) %>% t() %>% as.data.frame() %>%
      slice(2) %>% mutate(Neighbourhood = neighbourhoods[i]) %>% select(Neighbourhood,
      everything())
    colnames(x) <- c("Neighbourhood", paste0("item_", 1:(ncol(x) - 1)))
    # append this neighbourhood's row; fill = TRUE pads shorter rows with NA
    y <- list(y, x) %>% rbindlist(fill = TRUE)
  }
  return(y)
}
df_prep <- prep_data(df)
Now we can save the data in .csv format and use the read.transactions() function from the arules package to read it as a transactions object.
trans1 <- read.transactions("/Users/andrea/Documents/UL/Association rules/transaction.csv", sep = ",",
header = T)
LIST(head(trans1, 5))
## [[1]]
## [1] "assault" "auto_theft" "break_and_enter" "robbery"
## [5] "theft_over"
##
## [[2]]
## [1] "assault" "auto_theft" "break_and_enter" "robbery"
## [5] "theft_over"
##
## [[3]]
## [1] "assault" "auto_theft" "break_and_enter" "robbery"
## [5] "theft_over"
##
## [[4]]
## [1] "assault" "auto_theft" "break_and_enter" "robbery"
## [5] "theft_over"
##
## [[5]]
## [1] "assault" "auto_theft" "break_and_enter" "robbery"
## [5] "theft_over"
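The market basket structure shown above can also be built without an intermediate CSV file. The sketch below is an alternative to the approach used in this report, not a description of it: it assumes a long table with the same two columns as `df`, and the `toy` data frame and its values are invented for illustration.

```r
library(arules)

# Hypothetical long-format table with the same two columns as `df`
toy <- data.frame(
  Neighbourhood = c("A", "A", "A", "B", "B", "C"),
  MCI = c("assault", "robbery", "assault", "auto_theft", "assault", "robbery"),
  stringsAsFactors = FALSE
)

# One character vector of distinct crime types per neighbourhood;
# unique() drops repeats because only presence matters, not frequency
baskets <- lapply(split(toy$MCI, toy$Neighbourhood), unique)

# Coerce the named list to an arules transactions object
toy_trans <- as(baskets, "transactions")
inspect(toy_trans)
```

Each list element becomes one transaction, so the neighbourhood name serves as the transaction ID and the crime types as the items.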
2.Data analysis
2.1.Creating rules
Now we can mine the dataset created this way for association rules. We are looking for the most frequent itemsets with a minimum length of 1 and a maximum length of 5, limited to a minimum support of 0.5. For this purpose I used the Eclat algorithm, an alternative to Apriori for mining frequent itemsets that is often faster.
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.5 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 70
##
## create itemset ...
## set transactions ...[5 item(s), 140 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating bit matrix ... [5 row(s), 140 column(s)] done [0.00s].
## writing ... [31 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
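The absolute minimum support count of 70 reported above is the minimum support of 0.5 multiplied by the 140 transactions (districts): an itemset is kept only if it occurs in at least 70 districts. All 31 possible non-empty combinations of the 5 crime types pass this threshold. The first five itemsets are listed below.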
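The Eclat call itself is not shown in this report. A minimal sketch consistent with the printed parameter specification might look as follows. The object `toy_trans` is a synthetic stand-in for `trans1`, rebuilt under the assumption (taken from the counts printed in this analysis) that 137 of the 140 districts report all five crime types and the remaining 3 report all but theft_over.

```r
library(arules)

# Synthetic stand-in for `trans1` (assumption): 137 districts with all five
# crime types, 3 districts with every type except theft_over
four <- c("assault", "auto_theft", "break_and_enter", "robbery")
baskets <- c(rep(list(c(four, "theft_over")), 137), rep(list(four), 3))
toy_trans <- as(baskets, "transactions")

# Parameters copied from the printed Eclat specification
itemsets <- eclat(toy_trans,
                  parameter = list(support = 0.5, minlen = 1, maxlen = 5),
                  control = list(verbose = FALSE))

# Minimum support expressed as a number of transactions: 0.5 * 140
min_count <- ceiling(0.5 * length(toy_trans))

# Number of frequent itemsets found; with 5 items there are at most 2^5 - 1
n_itemsets <- length(itemsets)
```

Because every combination of the five crime types appears in at least 137 districts, all non-empty subsets clear the threshold of 70.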
## items support transIdenticalToItemsets count
## [1] {assault,
## auto_theft,
## break_and_enter,
## robbery,
## theft_over} 0.9785714 137 137
## [2] {assault,
## break_and_enter,
## robbery,
## theft_over} 0.9785714 137 137
## [3] {auto_theft,
## break_and_enter,
## robbery,
## theft_over} 0.9785714 137 137
## [4] {assault,
## auto_theft,
## robbery,
## theft_over} 0.9785714 137 137
## [5] {assault,
## robbery,
## theft_over} 0.9785714 137 137
freq_rules<-ruleInduction(rules, trans1, confidence=0.95)
freq_rules<-sort(freq_rules, by="lift", decreasing=TRUE)
inspect(freq_rules[1:5,])
## lhs rhs support confidence lift itemset
## [1] {auto_theft,
## break_and_enter,
## robbery,
## theft_over} => {assault} 0.9785714 1.0000000 1 1
## [2] {assault,
## break_and_enter,
## robbery,
## theft_over} => {auto_theft} 0.9785714 1.0000000 1 1
## [3] {assault,
## auto_theft,
## robbery,
## theft_over} => {break_and_enter} 0.9785714 1.0000000 1 1
## [4] {assault,
## auto_theft,
## break_and_enter,
## theft_over} => {robbery} 0.9785714 1.0000000 1 1
## [5] {assault,
## auto_theft,
## break_and_enter,
## robbery} => {theft_over} 0.9785714 0.9785714 1 1
Based on the above results, auto thefts, break and enters, robberies and thefts over are very likely to be reported in the same districts as assaults: the rules have a support of about 0.98 and a confidence of 1 or close to it. The lift of every rule is 1, however, which means that the crimes on the left-hand side do not make the crime on the right-hand side any more likely than it already is. The high support and confidence therefore reflect the fact that nearly every district reported every type of crime, rather than a specific dependence between crime types.
It is also important to reflect on how the conclusions are formulated. When we say that some crime incidents tend to occur together, we do not mean that they take place at the same time. We mean that if the types of crime on the left-hand side (lhs) were reported in a given district, the type of crime on the right-hand side (rhs) was likely to be reported there as well.
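The lift values in the table can be verified from the formula given in the Measures section. The supports of the right-hand sides used here (1 for assault, 0.9785714 for theft_over) are the ones reported for the rules with an empty lhs in the per-crime tables of this report. For rules [1] and [5]:

\[Lift \ Rule(\{auto\_theft, break\_and\_enter, robbery, theft\_over\} => \{assault\}) = \frac{1.0000000}{1.0000000} = 1\]

\[Lift \ Rule(\{assault, auto\_theft, break\_and\_enter, robbery\} => \{theft\_over\}) = \frac{0.9785714}{0.9785714} = 1\]

In both cases the confidence of the rule equals the support of its right-hand side, so knowing the left-hand side does not change the chance of observing the right-hand side.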
After this general analysis, it is worth taking a closer look at each type of crime separately: assaults, auto thefts, robberies, thefts over and break and enters, as shown below.
Assault
Based on the values of support, confidence and lift for the first 10 rules shown, we can assume that the occurrence of theft over, auto theft, break and enter and robbery indicates the prevalence of assaults in the analyzed Toronto districts in 2019. It is also worth noting, here and for every crime type analyzed below, that the first rule has an empty lhs: it gives the baseline share of districts in which the crime was reported, regardless of any other crime.
rules_assault<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="lhs", rhs="assault"), control=list(verbose=F))
rules_assault_byconf<-sort(rules_assault, by="confidence", decreasing=TRUE)
inspect(rules_assault_byconf[1:10])
## lhs rhs support confidence coverage
## [1] {} => {assault} 1.0000000 1 1.0000000
## [2] {theft_over} => {assault} 0.9785714 1 0.9785714
## [3] {auto_theft} => {assault} 1.0000000 1 1.0000000
## [4] {break_and_enter} => {assault} 1.0000000 1 1.0000000
## [5] {robbery} => {assault} 1.0000000 1 1.0000000
## [6] {auto_theft,theft_over} => {assault} 0.9785714 1 0.9785714
## [7] {break_and_enter,theft_over} => {assault} 0.9785714 1 0.9785714
## [8] {robbery,theft_over} => {assault} 0.9785714 1 0.9785714
## [9] {auto_theft,break_and_enter} => {assault} 1.0000000 1 1.0000000
## [10] {auto_theft,robbery} => {assault} 1.0000000 1 1.0000000
## lift count
## [1] 1 140
## [2] 1 137
## [3] 1 140
## [4] 1 140
## [5] 1 140
## [6] 1 137
## [7] 1 137
## [8] 1 137
## [9] 1 140
## [10] 1 140
Auto theft
Based on the values of support, confidence and lift, we can draw similar conclusions as for assaults. The occurrence of robbery, theft over, break and enter and assault indicates the prevalence of auto thefts in the analyzed Toronto districts in 2019.
rules_auto_theft<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="lhs", rhs="auto_theft"), control=list(verbose=F))
rules_auto_theft_byconf<-sort(rules_auto_theft, by="confidence", decreasing=TRUE)
inspect(rules_auto_theft_byconf[1:10])
## lhs rhs support confidence
## [1] {} => {auto_theft} 1.0000000 1
## [2] {theft_over} => {auto_theft} 0.9785714 1
## [3] {assault} => {auto_theft} 1.0000000 1
## [4] {break_and_enter} => {auto_theft} 1.0000000 1
## [5] {robbery} => {auto_theft} 1.0000000 1
## [6] {assault,theft_over} => {auto_theft} 0.9785714 1
## [7] {break_and_enter,theft_over} => {auto_theft} 0.9785714 1
## [8] {robbery,theft_over} => {auto_theft} 0.9785714 1
## [9] {assault,break_and_enter} => {auto_theft} 1.0000000 1
## [10] {assault,robbery} => {auto_theft} 1.0000000 1
## coverage lift count
## [1] 1.0000000 1 140
## [2] 0.9785714 1 137
## [3] 1.0000000 1 140
## [4] 1.0000000 1 140
## [5] 1.0000000 1 140
## [6] 0.9785714 1 137
## [7] 0.9785714 1 137
## [8] 0.9785714 1 137
## [9] 1.0000000 1 140
## [10] 1.0000000 1 140
Robbery
In this case too, support, confidence and lift take similar values, so the conclusions for robbery are analogous. The occurrence of assault, theft over, break and enter and auto theft indicates the prevalence of robbery in the analyzed Toronto districts in 2019.
rules_robbery<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="lhs", rhs="robbery"), control=list(verbose=F))
rules_robbery_byconf<-sort(rules_robbery, by="confidence", decreasing=TRUE)
inspect(rules_robbery_byconf[1:10])
## lhs rhs support confidence coverage
## [1] {} => {robbery} 1.0000000 1 1.0000000
## [2] {theft_over} => {robbery} 0.9785714 1 0.9785714
## [3] {assault} => {robbery} 1.0000000 1 1.0000000
## [4] {auto_theft} => {robbery} 1.0000000 1 1.0000000
## [5] {break_and_enter} => {robbery} 1.0000000 1 1.0000000
## [6] {assault,theft_over} => {robbery} 0.9785714 1 0.9785714
## [7] {auto_theft,theft_over} => {robbery} 0.9785714 1 0.9785714
## [8] {break_and_enter,theft_over} => {robbery} 0.9785714 1 0.9785714
## [9] {assault,auto_theft} => {robbery} 1.0000000 1 1.0000000
## [10] {assault,break_and_enter} => {robbery} 1.0000000 1 1.0000000
## lift count
## [1] 1 140
## [2] 1 137
## [3] 1 140
## [4] 1 140
## [5] 1 140
## [6] 1 137
## [7] 1 137
## [8] 1 137
## [9] 1 140
## [10] 1 140
Theft over
In this case we can see slightly lower values of support and confidence (about 0.98 for every rule), while lift remains at the same level. Lower values indicate that the rules are weaker. We can still conclude that assault, auto theft, robbery and break and enter indicate the prevalence of theft over in the analyzed Toronto districts in 2019.
rules_theft_over<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="lhs", rhs="theft_over"), control=list(verbose=F))
rules_theft_over_byconf<-sort(rules_theft_over, by="confidence", decreasing=TRUE)
inspect(rules_theft_over_byconf[1:10])
## lhs rhs support confidence coverage
## [1] {} => {theft_over} 0.9785714 0.9785714 1
## [2] {assault} => {theft_over} 0.9785714 0.9785714 1
## [3] {auto_theft} => {theft_over} 0.9785714 0.9785714 1
## [4] {break_and_enter} => {theft_over} 0.9785714 0.9785714 1
## [5] {robbery} => {theft_over} 0.9785714 0.9785714 1
## [6] {assault,auto_theft} => {theft_over} 0.9785714 0.9785714 1
## [7] {assault,break_and_enter} => {theft_over} 0.9785714 0.9785714 1
## [8] {assault,robbery} => {theft_over} 0.9785714 0.9785714 1
## [9] {auto_theft,break_and_enter} => {theft_over} 0.9785714 0.9785714 1
## [10] {auto_theft,robbery} => {theft_over} 0.9785714 0.9785714 1
## lift count
## [1] 1 137
## [2] 1 137
## [3] 1 137
## [4] 1 137
## [5] 1 137
## [6] 1 137
## [7] 1 137
## [8] 1 137
## [9] 1 137
## [10] 1 137
Break and enter
Again, support and confidence take high values, while lift stays at 1. This means that assault, theft over, auto theft and robbery indicate the prevalence of break and enter in the analyzed Toronto districts in 2019.
rules_break_and_enter<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="lhs", rhs="break_and_enter"), control=list(verbose=F))
rules_break_and_enter_byconf<-sort(rules_break_and_enter, by="confidence", decreasing=TRUE)
inspect(rules_break_and_enter_byconf[1:10])
## lhs rhs support confidence
## [1] {} => {break_and_enter} 1.0000000 1
## [2] {theft_over} => {break_and_enter} 0.9785714 1
## [3] {assault} => {break_and_enter} 1.0000000 1
## [4] {auto_theft} => {break_and_enter} 1.0000000 1
## [5] {robbery} => {break_and_enter} 1.0000000 1
## [6] {assault,theft_over} => {break_and_enter} 0.9785714 1
## [7] {auto_theft,theft_over} => {break_and_enter} 0.9785714 1
## [8] {robbery,theft_over} => {break_and_enter} 0.9785714 1
## [9] {assault,auto_theft} => {break_and_enter} 1.0000000 1
## [10] {assault,robbery} => {break_and_enter} 1.0000000 1
## coverage lift count
## [1] 1.0000000 1 140
## [2] 0.9785714 1 137
## [3] 1.0000000 1 140
## [4] 1.0000000 1 140
## [5] 1.0000000 1 140
## [6] 0.9785714 1 137
## [7] 0.9785714 1 137
## [8] 0.9785714 1 137
## [9] 1.0000000 1 140
## [10] 1.0000000 1 140
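The five per-crime blocks above repeat the same apriori() call with a different right-hand side. A sketch of how the repetition could be folded into one helper is given below. The helper name `rules_for()` is hypothetical, and `toy_trans` is a synthetic stand-in for `trans1`, built on the assumption (from the printed counts) that 137 districts report all five crime types and 3 report all but theft_over.

```r
library(arules)

# Synthetic stand-in for `trans1` (assumption based on the printed counts)
four <- c("assault", "auto_theft", "break_and_enter", "robbery")
baskets <- c(rep(list(c(four, "theft_over")), 137), rep(list(four), 3))
toy_trans <- as(baskets, "transactions")

# Hypothetical helper: mine rules for one right-hand side, sorted by confidence,
# with the same thresholds as in the per-crime blocks
rules_for <- function(rhs_item, trans) {
  rules <- apriori(data = trans,
                   parameter = list(supp = 0.001, conf = 0.08),
                   appearance = list(default = "lhs", rhs = rhs_item),
                   control = list(verbose = FALSE))
  sort(rules, by = "confidence", decreasing = TRUE)
}

# One sorted rule set per crime type, named after the crime type
crime_types <- itemLabels(toy_trans)
rule_sets <- lapply(setNames(crime_types, crime_types), rules_for, trans = toy_trans)

inspect(head(rule_sets$theft_over, 3))
```

Keeping the thresholds in one place makes it harder for the five analyses to drift apart if the parameters are changed later.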
2.2. Conclusions
The conclusions drawn from the association rules are intuitive. For each type of crime, the rule with an empty lhs shows that it is reported in nearly every district regardless of the other types. The remaining rules have support and confidence high enough to show that different types of incidents are reported together in the same districts. It means that various types of criminal incidents often occur together in an area with an increased level of danger.
2.3. Visualizations
To understand association rules better, it helps to visualize them. First, a basic dendrogram of the items confirms our observation about theft over, for which we obtained lower values of support and confidence. Looking from the top, we can see two distinct groups: theft over clearly differs from the remaining types of crime.
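The separation of theft over in the dendrogram can be traced back to the counts reported earlier. Assuming that dissimilarity() is used with its default Jaccard measure (which the object name djac_item suggests), the distance between two items is one minus the share of districts reporting both among the districts reporting at least one of them. With theft over reported in 137 districts and the other crime types in all 140:

\[d(theft\_over, assault) = 1 - \frac{137}{140} \approx 0.0214\]

\[d(assault, robbery) = 1 - \frac{140}{140} = 0\]

The four crime types reported in every district are at distance 0 from each other, so theft over is the only item that can form a separate branch.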
trans_item<-trans1[,itemFrequency(trans1)>0.70]
djac_item<-dissimilarity(trans_item, which="items")
plot(hclust(djac_item, method = "ward.D2"), main = "Dendrogram for items")
Graph-based visualizations of association rules show items connected to itemsets or rules by arrows. Vertices represent rules, with the color and size of a vertex showing its lift and support, respectively. Arrows from items to a rule indicate left-hand-side items, and arrows from a rule to items indicate right-hand-side items. In the first graph, to keep it readable, the number of rules has been limited to the 15 most frequent. We can then compare it with a graph that includes all 75 defined rules.
Another way to visualize association rules is a parallel coordinates plot, in which items are displayed on the y-axis as nominal values and the x-axis represents their position in a rule. The width of the arrows represents support and the intensity of the color represents confidence. The support and confidence values obtained earlier are reflected in the parallel coordinates plot.
However, a large number of rules can be hard to analyze and draw conclusions from. In such cases simple visualizations and tables of indicators become cumbersome, and we have to use more advanced tools, such as interactive charts showing the relationships between rules and the items they contain. For this purpose, we can draw a network of our association rules.
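The plotting code for the graph, the parallel coordinates plot and the interactive network is not included in this report. A sketch of how such plots are commonly produced with arulesViz is given below. It uses the Groceries dataset shipped with arules as a stand-in for the crime rules, and the thresholds are illustrative assumptions rather than the values used in the analysis.

```r
library(arules)
library(arulesViz)

# Stand-in data: the Groceries transactions shipped with arules
data("Groceries")
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8),
                 control = list(verbose = FALSE))

# Keep the 15 rules with the highest support so the graph stays readable
top_rules <- head(sort(rules, by = "support", decreasing = TRUE), 15)

# Static graph: rules and items as vertices connected by arrows
plot(top_rules, method = "graph")

# Parallel coordinates plot: item positions within each rule
plot(top_rules, method = "paracoord")

# Interactive network rendered as an HTML widget
network <- plot(top_rules, method = "graph", engine = "htmlwidget")
```

Restricting the plot to a small subset first, and only then moving to the full rule set, mirrors the comparison between the 15-rule and the complete graph described above.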
3.Summary
To sum up, association rules were introduced to discover relationships between purchased products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. However, they can also be applied to datasets of recorded crime incidents to discover relationships between them and to support the fight against crime. The analysis of the 2019 Toronto data showed that each type of crime is strongly associated with the remaining ones in terms of support and confidence: where one type is reported, the others are very likely to be reported as well.