Below are all the R packages and options needed to reproduce this analysis.
#setwd("~/hiring_tests/")
setwd("/media/maxsop/78e28b90-7fe7-4dec-9642-1e034da5aed3/home/msop/hiring_tests/")
options(warn=-1)
options("scipen"=100, "digits"=15)
suppressMessages(library(readr))
suppressMessages(library(geohash))
suppressMessages(library(maps))
suppressMessages(library(ggmap))
suppressMessages(library(Hmisc))
suppressMessages(library(dplyr))
suppressMessages(library(igraph))
suppressMessages(library(RColorBrewer))
knitr::opts_chunk$set(cache=TRUE)
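The plots below use a custom fte_theme(), a ggplot2 theme based on the FiveThirtyEight style that is not displayed here. A minimal sketch of such a theme is given below; the exact colours, font sizes, and spacing are assumptions and may differ from the theme actually used.
suppressMessages(library(ggplot2))
fte_theme <- function() {
  # FiveThirtyEight-style sketch: light grey background, muted grid, no axis ticks
  palette <- brewer.pal(n = 9, name = "Greys") # RColorBrewer is loaded above
  color.background <- palette[2]
  color.grid.major <- palette[3]
  color.axis.text <- palette[6]
  color.axis.title <- palette[7]
  color.title <- palette[9]
  theme_bw(base_size = 9) +
    theme(panel.background = element_rect(fill = color.background, color = color.background),
          plot.background = element_rect(fill = color.background, color = color.background),
          panel.border = element_blank(),
          panel.grid.major = element_line(color = color.grid.major, size = 0.25),
          panel.grid.minor = element_blank(),
          axis.ticks = element_blank(),
          plot.title = element_text(color = color.title, size = 10, vjust = 1.25),
          axis.text.x = element_text(size = 7, color = color.axis.text),
          axis.text.y = element_text(size = 7, color = color.axis.text),
          axis.title.x = element_text(size = 8, color = color.axis.title, vjust = 0),
          axis.title.y = element_text(size = 8, color = color.axis.title, vjust = 1.25))
}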
crime_data <- read_csv("data/CreditCardCrime.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Incident ID` = col_integer(),
## `CR Number` = col_integer(),
## Class = col_integer(),
## `Zip Code` = col_integer(),
## PRA = col_integer(),
## Latitude = col_double(),
## Longitude = col_double()
## )
## See spec(...) for full column specifications.
#describe(crime_data) #There's a unique crime number for all crimes even though the crime class is the same.
crime_data <- crime_data %>% filter(State == "MD") # only working in the state of Maryland
We will use the Location variable instead of the provided Longitude and Latitude variables, because the latter are stored with fewer decimal digits. This slightly improves the accuracy of the geohashing step, but is otherwise not critical. We first do some sanitization of the date variables and a few others.
names(crime_data) <- make.names(names(crime_data)) # make syntactic correct R names
crime_data$Start.Date...Time <- strptime(crime_data$Start.Date...Time, "%m/%d/%Y %H:%M")
crime_data$End.Date...Time <- strptime(crime_data$End.Date...Time, "%m/%d/%Y %H:%M")
crime_data$Dispatch.Date...Time <- strptime(crime_data$Dispatch.Date...Time, "%m/%d/%Y %H:%M")
#crime_data <- crime_data %>% arrange(Start.Date...Time)
crime_data <- crime_data[order(crime_data$Start.Date...Time, decreasing = F), ]
t <- gsub(pattern = "\\(|\\)", replacement = "", x = crime_data$Location)
t <- data.frame(do.call('rbind', strsplit(t,',',fixed=TRUE)))
colnames(t) <- c("lon", "lat")
t <- sapply(t, function(x) as.numeric(as.character(x)))
t <- apply(t, MARGIN = 1, function(x) gh_encode(x[1], x[2], 10)) #geo_hashing
crime_data$geohash <- t
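As a side note on the precision argument: a 10-character geohash corresponds to a cell of roughly 1.2 m by 0.6 m, so two crimes only share a hash if they happened at essentially the same spot, while truncating the hash groups them into coarser cells (the prefix of a geohash is the geohash of the containing cell). The short sketch below, using base substr(), is purely illustrative and not part of the original analysis.
crime_geo6 <- substr(crime_data$geohash, 1, 6) # truncate to ~1.2 km x 0.6 km cells
head(sort(table(crime_geo6), decreasing = TRUE)) # coarse cells with the most incidents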
The larger the node, the more important it is. In our case, however, it is hard to see the difference visually, because most nodes have similar degrees, as the cumulative frequency distribution graph below shows. The top 3 crime locations by degree importance are Gaithersburg, Silver Spring, and Olney.
network <- graph(na.omit(t), directed = T) # named vertices
deg <- degree(network, mode="all")
names(sort(deg))[1:5]
## [1] "dqcn6mtk7d" "dqcq34wt88" "dqcnqhgke2" "dqcntuvh2z" "dqcp72bc5t"
#crime_data <- crime_data %>% filter(geohash %in% deg) %>% select(City)
top_deg <- names(sort(deg, decreasing = TRUE))[1:5] # geohashes of the highest-degree locations
t <- unique(crime_data$City[crime_data$geohash %in% top_deg]) # cities those locations fall in
# When the edge list has vertex names, the number of nodes is not needed
plot(network, vertex.label=NA, edge.arrow.size=0.2, edge.arrow.width=0.2, vertex.size=betweenness(network)/(max(betweenness(network)) * 0.1), main="Credit Card Crime Locations Network")
deg.dist <- degree_distribution(network, cumulative=T, mode="all")
plot(x=0:max(deg), y=1-deg.dist, pch=19, cex=1.2, col="orange",
xlab="Degree", ylab="Cumulative Frequency", main="Degree Distribution")
Closeness
The closeness in this case is centrality based on distances to all other locations in the graph. The graph-level centralization value is 0.00017140663577 and the theoretical maximum is 871.249928263989.
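These graph-level values can be obtained with igraph's centr_clo(); a minimal sketch, assuming mode = "all" to match the degree computation above:
clo <- centr_clo(network, mode = "all") # closeness centralization of the location network
clo$centralization # graph-level centralization value
clo$theoretical_max # theoretical maximum for a graph of this size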
Eigenvector
This is centrality proportional to the sum of the centralities of the locations a node is connected to. Based on this formula, the centrality measure is 0.988225668946706.
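A minimal sketch of how such a value can be computed with igraph's centr_eigen(); whether the reported number is the graph-level centralization or an individual score is an assumption:
eig <- centr_eigen(network, directed = TRUE) # eigenvector centrality scores plus centralization
eig$centralization # graph-level eigenvector centralization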
Betweenness
This is centrality based on a broker position connecting other locations, that is, the number of geodesics (shortest paths) that pass through a node or an edge on the way to all other locations. Based on this definition, the top 5 locations are dqcnnds9q8, dqcnqr6rmx, dqcjz570kr, dqcq04cp7z, dqcjz4hp8h. In the same manner, we can compute edge betweenness or the graph-level betweenness centralization.
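A minimal sketch of how these quantities can be computed with igraph:
btw <- betweenness(network) # vertex betweenness for every location
names(sort(btw, decreasing = TRUE))[1:5] # the five strongest broker locations
eb <- edge_betweenness(network) # the analogous measure for edges
centr_betw(network)$centralization # graph-level betweenness centralization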
There are many communities/clusters in the network, as shown below, but the one in the center of the network is the most striking: it forms a connected sub-network with significantly more credit card fraud than the others (4 incidents in total). One may infer that the same or similar groups of individuals are perpetrating these crimes. There are 93 locations that make up the main (central) cluster. There are many other community detection algorithms we could try, but this one gives a modularity (how distinct the groups are) of ~98%! We display only 10 of the locations below.
wc <- cluster_edge_betweenness(network)
modularity(wc)
## [1] 0.976821539862043
plot(wc, network, vertex.label=NA, vertex.size=2, edge.arrow.size=0.2, edge.arrow.width=0.2, main= "Crime Communities")
groups <- membership(wc) # community assignment for each location
ordered.groups <- names(sort(table(groups), decreasing = TRUE)) # community ids, largest first
center <- names(groups[which(groups == as.integer(ordered.groups[1]))]) # locations in the largest (central) community
length(center) # number of locations in the main cluster
center[1:10] # display only the first 10 locations
As you can see from the heatmap below, fewer crimes are committed in the mornings and on Sundays.
hours <- as.integer(format(crime_data$Start.Date...Time, "%H"))
weekdays <- weekdays(crime_data$Start.Date...Time)
#months <- months(crime_data$Start.Date...Time)
#hour_count <- as.data.frame(table(hours))
day_hour_count <- as.data.frame(table(weekdays, hours))
day_hour_count$hours <- as.integer(as.character(day_hour_count$hours))
day_hour_count$weekdays <- factor(day_hour_count$weekdays, ordered = TRUE, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
ggplot(data = day_hour_count, aes(x = hours, y = weekdays, fill=Freq)) +
geom_tile() +
fte_theme() +
labs(x = "Hour of Crime", y = "Day of Week", title = "# of Credit Card Fraud Incidents in Montgomery County, 2004–2016") +
scale_fill_gradient(low = "white", high = "#27AE60")
lonlat <- as.data.frame(table(round(crime_data$Longitude, 2), round(crime_data$Latitude, 2)))
lonlat$Var1 <- as.numeric(as.character(lonlat$Var1))
lonlat$Var2 <- as.numeric(as.character(lonlat$Var2))
colnames(lonlat) <- c("lon", "lat", "count")
lonlat <- lonlat %>% arrange(desc(count))
The two plots below convey the same information: one uses scatter points, while the other (right) uses a heat map. Regions like Silver Spring are hit hardest by credit card crime, while areas like Frederick are hit least. One of the reasons might be the size of these cities: according to the 2010 census, Silver Spring has a population of close to 1M (48% white and 28% black), while Frederick has about a quarter of that figure (82% white, 8% black). This makes Silver Spring much more cosmopolitan than Frederick.
mtg <- geocode("montgomery county")
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=montgomery%20county&sensor=false
Montgomery <- qmap("montgomery county", maprange = TRUE, base_layer = ggplot(aes(x = lon, y = lat), data = mtg))
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=montgomery+county&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=montgomery%20county&sensor=false
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
Montgomery + geom_point(aes(x = lon, y = lat, size=count, color=count), data = lonlat) + scale_color_gradient(low = "white", high = "red") #+ theme(legend.position="left")
Montgomery + geom_tile(aes(x = lon, y = lat, fill=count), data = lonlat) +
scale_fill_gradient(low = "white", high = "#27AE60")
#grid.arrange(g1, g2, ncol=2)
Machine learning approaches to predicting crime location
My guess is that most GIS software packages already have such prediction methods readily available, by the way.
Measure of prediction accuracy
We need to keep in mind that a predicted area might be correct, but so large that it provides no practical benefit. Error metrics should therefore account for both whether crimes actually fall inside the predicted area and how large that area is.
Based on the exploratory data analysis above, prediction success will be most influenced by repeat victimization of a location, the average distance between crimes, and the number of incidents at that location.
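To make the "correct but too large" concern concrete, a common option in the crime-forecasting literature is a prediction accuracy index: the share of crimes captured by the flagged area divided by the share of the study area that was flagged. A minimal sketch; the predicted_cells below are hypothetical and purely for illustration:
predicted_cells <- c("dqcnnds9q8", "dqcnqr6rmx", "dqcjz570kr") # hypothetical hotspot cells
hit_rate <- mean(crime_data$geohash %in% predicted_cells, na.rm = TRUE) # share of crimes captured
area_fraction <- length(predicted_cells) / length(unique(na.omit(crime_data$geohash))) # share of cells flagged
pai <- hit_rate / area_fraction # higher is better: many crimes captured with little flagged area
pai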