Below are all the R packages and option settings needed to reproduce this analysis

#setwd("~/hiring_tests/")
setwd("/media/maxsop/78e28b90-7fe7-4dec-9642-1e034da5aed3/home/msop/hiring_tests/")
options(warn=-1)
options("scipen"=100, "digits"=15)
suppressMessages(library(readr))
suppressMessages(library(geohash))
suppressMessages(library(maps))
suppressMessages(library(ggmap))
suppressMessages(library(Hmisc))
suppressMessages(library(dplyr))
suppressMessages(library(igraph))
suppressMessages(library(RColorBrewer))
knitr::opts_chunk$set(cache=TRUE)
The theme function fte_theme, based on the FiveThirtyEight style, is not displayed here.
Load data into R and quick data inspection
crime_data <- read_csv("data/CreditCardCrime.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Incident ID` = col_integer(),
##   `CR Number` = col_integer(),
##   Class = col_integer(),
##   `Zip Code` = col_integer(),
##   PRA = col_integer(),
##   Latitude = col_double(),
##   Longitude = col_double()
## )
## See spec(...) for full column specifications.
#describe(crime_data) #There's a unique crime number for all crimes even though the crime class is the same.

crime_data <- crime_data %>% filter(State == "MD") # only working in the state of Maryland
Data cleaning and geohashing

We will use the Location variable instead of the provided Longitude and Latitude variables because the latter carry fewer decimal places. This lets us increase the precision of the geohashing step, though it is otherwise not critical. We first do some sanitization of the date variables and a few others.

names(crime_data) <- make.names(names(crime_data)) # make syntactic correct R names
crime_data$Start.Date...Time <- strptime(crime_data$Start.Date...Time, "%m/%d/%Y %H:%M")
crime_data$End.Date...Time   <- strptime(crime_data$End.Date...Time, "%m/%d/%Y %H:%M")
crime_data$Dispatch.Date...Time <- strptime(crime_data$Dispatch.Date...Time, "%m/%d/%Y %H:%M")

#crime_data <- crime_data %>% arrange(Start.Date...Time)
crime_data  <- crime_data[order(crime_data$Start.Date...Time, decreasing = F), ]

t <- gsub(pattern = "\\(|\\)", replacement = "", x = crime_data$Location) # strip parentheses from "(lat, lon)"
t <- data.frame(do.call('rbind', strsplit(t,',',fixed=TRUE)))
colnames(t) <- c("lat", "lon") # the Location field lists latitude first
t <- sapply(t, function(x) as.numeric(as.character(x)))
t <- apply(t, MARGIN = 1, function(x) gh_encode(x[1], x[2], 10)) # geohash each location (lat, lon, precision = 10)
crime_data$geohash <- t
Network and network property visualizations

The larger the node, the more important it is. In our case, however, it is hard to see the difference visually because most nodes have similar degrees, as shown in the cumulative frequency distribution graph below. The top 3 crime locations by degree importance are Gaithersburg, Silver Spring, and Olney.

network <- graph(na.omit(t), directed = T) # named vertices
deg <- degree(network, mode="all")
names(sort(deg))[1:5]
## [1] "dqcn6mtk7d" "dqcq34wt88" "dqcnqhgke2" "dqcntuvh2z" "dqcp72bc5t"
#crime_data <- crime_data %>% filter(geohash %in% deg) %>% select(City)
t <- unique(crime_data[crime_data$geohash %in% names(sort(deg, decreasing = TRUE))[1:5], "City"]) # cities of the five highest-degree locations
# When the edge list has vertex names, the number of nodes is not needed
plot(network, vertex.label=NA, vertex.size=betweenness(network)/(max(betweenness(network)) * 0.1), edge.arrow.size=0.2, edge.arrow.width=0.2, main="Credit Card Crime Locations Network")

deg.dist <- degree_distribution(network, cumulative=T, mode="all")
plot(x=0:max(deg), y=1-deg.dist, pch=19, cex=1.2, col="orange", 
      xlab="Degree", ylab="Cumulative Frequency", main="Degree Distribution")

Measures of node importance
Centrality

Closeness
Closeness here measures centrality based on distances to all other locations in the graph. The centralization value is 0.00017140663577 and the theoretical maximum is 871.249928263989.
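One way to obtain these figures is igraph's graph-level centralization helper; a minimal sketch, assuming the network object built above:

clo <- centr_clo(network, mode = "all") # graph-level closeness centralization
clo$centralization  # centralization value reported above
clo$theoretical_max # maximum possible value for a graph of this size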
Eigenvector
This is centrality proportional to the sum of the centralities of a location's connections. Based on this formula, the centrality measure is 0.988225668946706.
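A minimal sketch of how such a graph-level eigenvector figure can be computed, assuming it refers to the centralization score of the same network (edges treated as undirected for a stable score):

eig <- centr_eigen(network, directed = FALSE) # eigenvector centralization; $vector holds per-location scores
eig$centralization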
Betweenness
This is centrality based on a broker position connecting other locations, i.e. the number of geodesics (shortest paths) to all other locations that pass through the node or edge. Based on this definition, the top 5 locations are dqcnnds9q8, dqcnqr6rmx, dqcjz570kr, dqcq04cp7z, dqcjz4hp8h. In the same manner, we can compute edge betweenness or the betweenness centralization of the whole graph.
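A minimal sketch of how these rankings can be computed, for both the vertex-level top-5 list and the edge-level variant mentioned above:

names(sort(betweenness(network), decreasing = TRUE))[1:5] # top 5 locations by vertex betweenness
head(sort(edge_betweenness(network), decreasing = TRUE))  # shortest paths running along each edge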

Hubs - Authority - PageRank

Hubs are expected to contain catalogs with a large number of outgoing links, while authorities would get many incoming links from hubs, presumably because of their high-quality, relevant information. PageRank, on the other hand, works by counting the number and quality of links to a page (a location in our case) to determine a rough estimate of how important the website (region) is. The underlying assumption is that more important regions are likely to receive more links from other locations. As you can see, the location dqcjurxx2z, which is in the city of Bethesda, appears strongest across all 3 metrics overall.

df <- data.frame(top_hub=character(5), top_authority=character(5), top_page_rank=character(5))
df$top_hub     <- names(sort(hub_score(network, weights=NA)$vector, decreasing = T))[1:5]
df$top_authority <- names(sort(authority_score(network, weights=NA)$vector, decreasing = T))[1:5]
df$top_page_rank <- names(sort(page_rank(network, weights=NA)$vector, decreasing = T))[1:5]
df
##      top_hub top_authority top_page_rank
## 1 dqcjurxx2z    dqcjurxx2z    dqcnnds9q8
## 2 dqcjz4hp8h    dqcnnds9q8    dqcjv8k8ed
## 3 dqcnnds9q8    dqcjz4hp8h    dqcjurxx2z
## 4 dqcnj82qjt    dqcnepjtpp    dqcq0vk6ky
## 5 dqbywv0d93    dqcnhcj1vr    dqcjz4hp8h
Community analysis

There are many communities/clusters in the network, as shown below, but the one in the center of the network is the most striking, as it forms a connected network with significantly more credit card fraud (4 in total). One may infer that the same or similar groups of individuals are perpetrating these crimes. There are 93 locations in the main (center) cluster. There are many other community detection algorithms we could try, but this one gives a modularity (group distinctness) of ~98%! We display only 10 of its locations below.

wc <- cluster_edge_betweenness(network)
modularity(wc)
## [1] 0.976821539862043
plot(wc, network, vertex.label=NA, vertex.size=2, edge.arrow.size=0.2, edge.arrow.width=0.2, main= "Crime Communities")

groups <- membership(wc)
ordered.groups <- names(sort(table(groups), decreasing = T))
center <- names(groups[which(groups == as.integer(ordered.groups[1]))]) # locations in the largest (center) community
center[1:10]
Crime prediction - Space & Time
Visualizing credit card fraud based on different attributes

As you can see from the heatmap below, fewer crimes are committed in the mornings and on Sundays.

hours <- as.integer(format(crime_data$Start.Date...Time, "%H"))
weekdays <- weekdays(crime_data$Start.Date...Time)
#months <- months(crime_data$Start.Date...Time)
#hour_count <- as.data.frame(table(hours))
day_hour_count <- as.data.frame(table(weekdays, hours))
day_hour_count$hours <- as.integer(as.character(day_hour_count$hours))
day_hour_count$weekdays <- factor(day_hour_count$weekdays, ordered = TRUE, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
ggplot(data = day_hour_count, aes(x = hours, y = weekdays, fill=Freq)) + 
   geom_tile() +
    fte_theme() +
    labs(x = "Hour of Crime", y = "Day of Week", title = "# of Credit Cards Fraud in Montgomery County from 2004 – 2016") +
    scale_fill_gradient(low = "white", high = "#27AE60")

lonlat <- as.data.frame(table(round(crime_data$Longitude, 2), round(crime_data$Latitude, 2))) # count crimes per grid cell (coordinates rounded to 2 decimals, roughly 1 km)
lonlat$Var1 <- as.numeric(as.character(lonlat$Var1))
lonlat$Var2 <- as.numeric(as.character(lonlat$Var2))
colnames(lonlat) <- c("lon", "lat", "count")
lonlat <- lonlat %>% arrange(desc(count))

The two plots below convey the same information: one uses scatter points and the other (right) uses a heatmap. Regions like Silver Spring are hit hardest by credit card crime, while Frederick is among the least affected. One reason might be the size of these cities: according to the 2010 census, Silver Spring has a population of close to 1M (48% white and 28% black) while Frederick has about a quarter of that figure (82% white, 8% black). This makes Silver Spring much more cosmopolitan than Frederick.

mtg <- geocode("montgomery county")
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=montgomery%20county&sensor=false
Montgomery <- qmap("montgomery county", maprange = TRUE, base_layer = ggplot(aes(x = lon, y = lat), data = mtg)) 
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=montgomery+county&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=montgomery%20county&sensor=false
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
Montgomery + geom_point(aes(x = lon, y = lat, size=count, color=count), data = lonlat) +  scale_color_gradient(low = "white", high = "red") #+  theme(legend.position="left")

Montgomery + geom_tile(aes(x = lon, y = lat, fill=count), data = lonlat) +
scale_fill_gradient(low = "white", high = "#27AE60")

#grid.arrange(g1, g2, ncol=2)

Machine learning approaches to predicting crime location

My guess is that most GIS software has these readily available, by the way.

  1. Time-weighted kernel density interpolation - kernel density interpolation of crime incident locations, using time as a weighting variable  
  2. Minimum convex hull polygon - creates a minimum bounding polygon around all of the incident locations in the series (see the sketch after this list)  
  3. Modified center of minimum distance - uses the center of minimum distance as a seed point and creates a search area by drawing a circle with a radius equal to the average distance between crime events in the series
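As a rough, illustrative sketch of items 2 and 3 (not the author's implementation), using base R on the coordinates already loaded above; the search-circle center is approximated here by the centroid rather than the true center of minimum distance, and the radius is left in degrees for simplicity:

coords <- na.omit(data.frame(lon = crime_data$Longitude, lat = crime_data$Latitude))

hull <- coords[chull(coords$lon, coords$lat), ] # item 2: minimum convex hull around incidents
center <- colMeans(coords)                      # item 3 (approximate): seed point
radius <- mean(dist(coords))                    # mean pairwise distance between incidents

plot(coords, pch = 20, col = "grey40", asp = 1)
polygon(hull$lon, hull$lat, border = "red")     # convex hull polygon
symbols(center["lon"], center["lat"], circles = radius,
        inches = FALSE, add = TRUE, fg = "blue") # search circle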

Measure of prediction accuracy

We need to keep in mind that a predicted area might be correct but so large that it provides no practical benefit.

Error metrics should include:

  1. Correctness: was the final event location within the predicted area?
  2. Search Area: average size of the predicted area.
  3. Search Cost: percent of the base search area covered by the final predicted area.
  4. Accuracy Precision: percent of correct forecasts divided by the average predicted area (a toy scoring sketch follows the list).
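A toy illustration of how these four metrics could be scored. The helper and its inputs are hypothetical: a logical hit vector (was the event inside the predicted area), the predicted areas, and the base search area, all in the same units:

score_forecasts <- function(hit, predicted_area, base_area) {
  data.frame(
    correctness        = mean(hit),                         # share of correct forecasts
    search_area        = mean(predicted_area),              # average predicted area
    search_cost        = mean(predicted_area / base_area),  # share of base area searched
    accuracy_precision = mean(hit) / mean(predicted_area)   # hits per unit of predicted area
  )
}

score_forecasts(hit = c(TRUE, FALSE, TRUE),       # made-up example values (km^2)
                predicted_area = c(4.2, 6.8, 3.1),
                base_area = 1300)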

Based on the exploratory data analysis we did above, success will be most influenced by repeat location victimizations, the average distance between crimes, and the number of incidents at that location.
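As a minimal sketch (not part of the original analysis), these three candidate predictors could be derived per geohashed location roughly as follows, assuming the geosphere package is available for Haversine distances:

library(geosphere) # assumed available; provides distm() for pairwise Haversine distances

loc <- crime_data %>%
  filter(!is.na(Longitude), !is.na(Latitude)) %>%
  group_by(geohash) %>%
  summarise(n_incidents = n(),          # number of incidents at the location
            lon = first(Longitude),
            lat = first(Latitude)) %>%
  mutate(repeat_victimization = n_incidents > 1) # repeat-victimization flag

# Average distance (km) from each location to every other crime location
loc$mean_dist_km <- rowMeans(distm(cbind(loc$lon, loc$lat))) / 1000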