In the last part, we extracted the content of the web pages for Unresolved Homicides into a data frame, which you can find at this link. In this part, we want to extract knowledge from these data using descriptive text analytics and visualization. The R code for this part can be found at this link.
Since the website has no separate column indicating the time of the incident, in this part we extract the time from the description of each case. First we load the stringr library and import the dataset, then extract the time with the str_extract() function. In the pattern, \d{1,2}:\d{1,2} matches one or two digits, a ":", and one or two more digits, such as 2:30 or 4:50. \s{0,1}\w{2} then allows zero or one space followed by two word characters (e.g. "am"). To also capture forms such as "12:30 a.m", which have a "." between the two letters, a second alternative is added with "|".
library(stringr)
data2 <- read.csv("D:/Crime.csv")
# Extract times such as "2:30 am" or "12:30 a.m" from the case descriptions
lis1 <- str_extract(data2$description,'\\d{1,2}:\\d{1,2}(\\s{0,1}\\w{2})|\\d{1,2}:\\d{1,2}(\\s{0,1}\\w{1}.\\w{1})')
lis1 <- tolower(lis1)
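Before applying the pattern to the whole column, it can help to try it on a couple of made-up strings (the examples below are illustrative, not taken from the dataset); the first two should return the time portion and the last should return NA.
# Quick check of the pattern on made-up strings (not from the dataset)
str_extract(c("at approximately 2:30 am the victim",
              "at 12:30 a.m the victim was found",
              "no time mentioned"),
            '\\d{1,2}:\\d{1,2}(\\s{0,1}\\w{2})|\\d{1,2}:\\d{1,2}(\\s{0,1}\\w{1}.\\w{1})')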
We also want to place "AM" and "PM" in a separate column, which we do with the sub() function and a couple of conversions that pull out the character part of the time. In the end, the time_am variable contains the AM/PM part and the time variable contains the time in 12-hour format. With hist() we can then see the distribution of incident times in the dataset.
lis2 <- sub("a","AM",str_match(lis1,"a"))  # "AM" where the time contains "a", else NA
lis3 <- sub("p","PM",str_match(lis1,"p"))  # "PM" where the time contains "p", else NA
lis4 <- gsub("NA","",paste0(lis2,lis3))    # combine and drop the "NA" placeholders
data2$time_am <- lis4
data2$time <- str_extract(lis1,'\\d{1,2}:\\d{1,2}')
lis6 <- as.character(paste0(data2$time," ",data2$time_am))
hist(strptime(lis6, "%I:%M %p"), breaks = "hours", freq = TRUE, col = c("Blue","Red"),
     xlab = "Time of the Day", ylab = "Frequency")
According to the histogram, most of the incidents seem to have happened around midnight or around noon.
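If the exact counts behind the histogram are of interest, a quick cross-check (a small sketch, not in the original code) is to tabulate the parsed times by hour.
# Count incidents per hour of the day (unparsed times become NA and are dropped by table())
table(format(strptime(lis6, "%I:%M %p"), "%H"))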
Using the ggmap package you can visualize the cases on a map. First we need the longitude and latitude of each case, which we obtain from the address with the geocode() function. This function takes an address and returns its longitude and latitude through the Google Maps API (it processes roughly one address per second; with 320 records here, this takes approximately five minutes or more). We then add these two new variables to the dataset. If an address cannot be found on Google Maps, NA is assigned to its longitude and latitude.
library(ggmap)
data2$location <- as.character(data2$location)
long <- geocode(data2$location)
data2$lon <- long$lon
data2$lat <- long$lat
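Before moving on, it is worth checking how many addresses could not be geocoded (a small sanity check, assuming failed lookups come back as NA):
# Number of cases with no coordinates returned by geocode()
sum(is.na(data2$lon))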
In the next step, some cleanup is done on the race variable. Then the data frame dmap is created, which has no missing values for the longitude and latitude variables and keeps only the races "B", "H", and "W". In this package, get_map() downloads a map of the area of interest from Google Maps; we need Orlando's map, so we request the area around lon = -81.40 and lat = 28.5, with zoom = 10 controlling how closely the map is zoomed in. After downloading the map, the ggmap() function overlays the location of each case. The first assignment to gg1 produces the basic map with the points; the second adds labels, a title, and theme decoration to the plot.
dmap <- data2[!is.na(data2$lon),]
dmap <- dmap[which(dmap$race %in% c("B","H","W")),]
map <- get_map(location =c(lon = -81.40, lat =28.5 ),zoom=10,scale="auto",color="bw")
gg1 <- ggmap(map, legend = "none") +
  geom_point(aes(x = lon, y = lat, color = race, shape = sex), size = 2,
             show.legend = TRUE, data = dmap, alpha = .6) +
  scale_size(range = c(1,3)) +
  scale_colour_manual(values = c("blue","red","green"))
gg1 <- gg1 +
  labs(x = "Longitude", y = "Latitude", color = "Race", shape = "Sex") +
  ggtitle("Map of crime indicated by Race and Sex (Orlando)") +
theme(axis.text.x = element_text(hjust=1,vjust=0.5,size = 12,face = "bold",color = "Black"),
plot.title = element_text(hjust=0.5,size = 16,face = "bold",color = "Black"),
axis.title.x =element_text(hjust=0.5,size = 16,face = "bold",color = "Black"),
axis.title.y =element_text(hjust=0.5,size = 16,face = "bold",color = "Black"),
legend.title = element_text(hjust=0.5,size = 14,face = "bold",color = "Black"),
legend.text = element_text(hjust=0.5,size = 10,face = "bold",color = "Black"))
gg1
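As an optional extension (a sketch, not part of the original walkthrough), the same plot can be split into one panel per race group, which makes the spatial patterns easier to compare; this assumes the aes() call above maps the race column directly.
# Optional: draw a separate map panel for each race group
gg1 + facet_wrap(~race)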
In this section, we first transform all descriptions into a corpus and do some text visualization on them. For text mining in R, the tm package is essential. Using the Corpus() function we convert all descriptions into corpus format. Then with tm_map() we apply several text-preparation steps: stripping extra white space, removing punctuation, removing numbers, converting all characters to lower case, removing stop words, and finally stemming the words. At the end, the prepared text of the first record is shown with the writeLines() function.
#Text mining and text preparation
library(tm)
## Loading required package: NLP
library(SnowballC)
corpus = Corpus(VectorSource(data2$description))
inspect(corpus[4])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 175
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
# Custom stop words ("orladno" appears to cover a misspelling of Orlando in the data)
st1 <- c(stopwords("english"),"orlando","victim","wound","shot","deputies","died","orladno","result","die","found","florida")
corpus = tm_map(corpus, removeWords,st1 )
corpus = tm_map(corpus, stemDocument)
writeLines(as.character(corpus[[1]]))
## januari th approxim hour check day inn hotel locat mccoy road fli indianapoli indiana regist indic two individu parti approxim hour insid hotel room deceas gunshot
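To see what the cleaning steps actually removed, we can print the raw description of the first record for comparison with the prepared version shown above (a small check, not in the original write-up).
# Raw description of the first case, for comparison with the cleaned, stemmed text
writeLines(as.character(data2$description[1]))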
To analyze this kind of data it is always interesting to look at the term-document matrix. Its rows are the words extracted from all descriptions and its columns are the individual descriptions (the document-term matrix is simply its transpose). After computing the frequency of each word across all descriptions, a bar plot shows the most frequent words in these reports: gunshot, location, respond, approximate, unknown, and male.
# Term-document matrix creation
tdm <- TermDocumentMatrix(corpus)    # terms in rows, descriptions in columns
tdm1 <- DocumentTermMatrix(corpus)   # the transpose; not used below
m1 = as.matrix(tdm)
v <- sort(rowSums(m1),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
col = heat.colors(10), main ="Most frequent words",
ylab = "Word frequencies")
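tm also provides findFreqTerms() as a quicker way to list high-frequency terms without building the frequency data frame by hand; the threshold of 25 below is an arbitrary choice.
# Every term that appears at least 25 times across all descriptions
findFreqTerms(tdm, lowfreq = 25)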
For a better view of the word frequencies we can use the wordcloud package to depict the 50 most frequent words as follows.
#word cloud
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 10,
max.words=50, random.order=FALSE, rot.per=0.1,
colors=brewer.pal(8, "Dark2"))
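Beyond raw frequency, tm's findAssocs() shows which terms tend to co-occur with a given word across the descriptions; the query term "gunshot" and the correlation cut-off of 0.2 are illustrative choices.
# Terms whose occurrence is correlated (>= 0.2) with "gunshot"
findAssocs(tdm, "gunshot", 0.2)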
Here, I use a graph representation to show the most frequent words and cluster them based on their co-occurrence in the descriptions. For more detail you can see this link; I used similar code for this case.
library(sna)
library(igraph)
#Clustering words
#word counts
wc = rowSums(m1)
#keep the words above the 97.5th percentile of total frequency
lim = quantile(wc, probs=0.975)
good = m1[wc > lim,]
good = good[,colSums(good)!=0]
#adjacency matrix
M = good %*% t(good)
#set zeroes in diagonal
diag(M) = 0
#Create graph
g = graph.adjacency(M, weighted=TRUE, mode="undirected")
#layout
glay = layout.fruchterman.reingold(g)
#Apply K-means with 8 clusters
kmg = kmeans(M, centers=8)
gk = kmg$cluster
#create colors for each cluster
gbrew = c("red", brewer.pal(8, "Dark2"))
gpal = rgb2hsv(col2rgb(gbrew))
gcols = rep("", length(gk))
for (k in 1:8) {
gcols[gk == k] = hsv(gpal[1,k], gpal[2,k], gpal[3,k], alpha=0.5)
}
#prepare vertices
V(g)$size = 15
V(g)$label = V(g)$name
V(g)$color = gcols
#plot
plot(g, layout=glay)
title("Graph of Description of cases",
col.main="gray40", cex.main=1.5, family="serif")
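To read the cluster memberships directly from the k-means result, rather than from the colours in the plot, we can split the words by their assigned cluster.
# List the words assigned to each of the 8 clusters
split(names(gk), gk)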
Based on the output, one of the clusters contains "male" and "unknown", from which we can infer that some descriptions discuss a male suspect whose identity is unknown. The remaining clusters can be interpreted in the same way.
In this part, we are interested in extracting the entities from the text, including organizations, person names, and locations. First, we generate annotators that compute word and sentence token annotations using the Apache OpenNLP Maxent tokenizer. Then, with the annotate() function, annotations are computed by iteratively calling the given annotators on the text and the current annotations. For more information, you can refer to this link.
#Entity Extraction
library(NLP)
library(openNLP)
bio <- as.String(data2$description)
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
#Compute annotation
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
head(bio_annotations)
## id type start end features
## 1 sentence 1 178 constituents=<<integer,36>>
## 2 sentence 180 259 constituents=<<integer,14>>
## 3 sentence 261 362 constituents=<<integer,19>>
## 4 sentence 364 503 constituents=<<integer,28>>
## 5 sentence 505 554 constituents=<<integer,11>>
## 6 sentence 556 652 constituents=<<integer,20>>
#Create annotated plain text documents
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
sents(bio_doc) %>% head(2)
## [[1]]
## [1] "On" "January" "4th" ","
## [5] "1980" "," "at" "approximately"
## [9] "0140" "hours" "," "the"
## [13] "victim" "checked" "into" "the"
## [17] "Days" "Inn" "Hotel" ","
## [21] "located" "on" "McCoy" "Road"
## [25] "," "Orlando" "," "Florida"
## [29] "after" "flying" "in" "from"
## [33] "Indianapolis" "," "Indiana" "."
##
## [[2]]
## [1] "While" "registering" "," "the" "victim"
## [6] "indicated" "there" "were" "two" "individuals"
## [11] "in" "his" "party" "."
words(bio_doc) %>% head(10)
## [1] "On" "January" "4th" ","
## [5] "1980" "," "at" "approximately"
## [9] "0140" "hours"
Next, the types of entities to extract are specified with the Maxent_Entity_Annotator() function. Then we create a pipeline list that holds our annotators in the order we want to apply them.
person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
pipeline <- list(sent_ann,
word_ann,
person_ann,
location_ann,
organization_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
Unfortunately, there is no comparably easy way to extract named entities from the documents, but the function below does the trick.
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotations(doc)[[1]]
  if (hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}
Finally, using the calls below we can extract all the different entities. Some entities are not recognized well, but it is always interesting to search for specific names.
pr1 <- table(entities(bio_doc, kind = "person"))
tail(sort(pr1),10)
##
## Jean Michaud John Young Pkwy Lee Road Michelle Bass
## 1 1 1 1
## On Orlando.The Tree Terrace Victoria Wills
## 1 1 1 1
## Taylor Creek Road Blossom Trail
## 2 6
lc1 <- table(entities(bio_doc, kind = "location"))
tail(sort(lc1),10)
##
## Taylor Creek Road West Westmoreland Drive
## 2 2 2
## Orange Apopka Lee Road
## 3 4 4
## Orange County West Colonial Drive Florida
## 6 6 54
## Orlando
## 98
or1 <- table(entities(bio_doc, kind = "organization"))
tail(sort(or1),10)
##
## US Airmotive Inc,located Valiant Circle
## 1 1
## Victoria Circle Waterway Court
## 1 1
## Fire Department Orlando Regional Medical Center
## 2 2
## Shell Gas Chevrolet Impala
## 2 3
## Dodge Florida Hospital
## 3 4