The Analytical Policeman

The Analytical Policeman

  • The explosion of computerized data affects all parts of society, including law and order

  • In the past, human judgement and experience was the only tool in identifying patterns in criminal behavior

  • Police forces around the US and the world are augmenting human judgement with analytics - sometimes describes as “predictive”

Example: Los Angeles Police Dept.

“I’m not going to get more money. I’m not going to get more cops. I have to be better at using what I have, and that’s what predictive policing is about… If this old street cop can change the way that he thinks about this stuff, then I know that my [officers] can do the same.”

The Role of Analytics

  • The analytical tools you have learned in this class can be used to make these “predictive policing” models

  • However,\(\color{red}{\text{communicating}}\) the results of these models is essential - a \(\color{blue}{\text{linear regression}}\) output while table will not be of use to a \(\color{red}{\text{policewoman on patrol}}\)

  • Visualization bridges the gap between \(\color{blue}{\text{the data and mathematics}}\) and the \(\color{red}{\text{end user}}\)

Understanding the Past

  • Before we even consider a predictive model, we should try to understand the historical data

  • Many cities in the US and around the world would provide logs of reported crimes, usually including the time, locations, and nature of the event

  • We will use data from Chicago about motor vehicle thefts

Crime Over Time

  • Suppose we wanted to communicate crime patterns over the course of an average week

  • We would display daily averages using a line graph, but this does not seem like it would be too useful

  • We can replace our x-axis with the hour of the day, and have different line for every day of the week, but this would be a jumbled mess with 7 lines!

  • We could use no visualization at all, and instead present the information in a table

  • This is valid but how can we make the table more interesting

Heatmaps

  • Heatmaps are a way of visualizing data using three attributes. The x-axis and y-axis are typically displayed horizontally and vertically. The third attribute is represented by shades of color.
  • For example, a low number, might be \(\color{blue}{\text{blue}}\), and a high number might be \(\color{red}{\text{red}}\)

  • We can pick different color schemes based on the type of data to convey different messages

  • The x-axis and y-axis don’t need to be continuous - they can be categorical

  • We could even combine a heatmap with a geographical map

A Chicao Crime Heatmap

  • We will use Chicago motor vehicle theft data to explore patterns of crime
    • Over days of the week
    • Over hours of the day
  • We’re interested in the total number of car thefts that occur in any particular hour of a day of the week: over the whole data set.

Eye on Crime

  • Criminal activity-related data often has both components of time and location

  • Sometimes all that is required is a line chart, but heatmaps can visualize data that would be too big for a table

  • Plotting data on maps is much more effective than a table for location based data, and is eye-catching

Predicting Policing

  • Many police forces are exploiting their databases to focus finite resources on problem areas

  • Not only do analytics help improve policework, the outputs are also good communication tools to decision makers in government and to the wider public

  • The application of analytics to data like this is new and growing, with companies like PredPol and Palantir leading the effort.

The Analytical Policeman in R

A Basic Line Plot

# Load our data:
mvt = read.csv("mvt.csv", stringsAsFactors=FALSE)
# Output structure
str(mvt)
## 'data.frame':    191641 obs. of  3 variables:
##  $ Date     : chr  "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
##  $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...
# Convert the Date variable to a format that R will recognize:
mvt$Date = strptime(mvt$Date, format="%m/%d/%y %H:%M")
# Extract the hour and the day of the week:
mvt$Weekday = weekdays(mvt$Date)
mvt$Hour = mvt$Date$hour
# Let's take a look at the structure of our data again:
str(mvt)
## 'data.frame':    191641 obs. of  5 variables:
##  $ Date     : POSIXlt, format: "2012-12-31 23:15:00" "2012-12-31 22:00:00" "2012-12-31 22:00:00" "2012-12-31 22:00:00" ...
##  $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...
##  $ Weekday  : chr  "Monday" "Monday" "Monday" "Monday" ...
##  $ Hour     : int  23 22 22 22 21 20 20 20 19 18 ...
# Create a simple line plot - need the total number of crimes on each day of the week. We can get this information by creating a table:
z = table(mvt$Weekday)
kable(z)
Var1 Freq
Friday 29284
Monday 27397
Saturday 27118
Sunday 26316
Thursday 27319
Tuesday 26791
Wednesday 27416
# Save this table as a data frame:
WeekdayCounts = as.data.frame(table(mvt$Weekday))
str(WeekdayCounts) 
## 'data.frame':    7 obs. of  2 variables:
##  $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7
##  $ Freq: int  29284 27397 27118 26316 27319 26791 27416
# Load the ggplot2 library:
library(ggplot2)
# Create our plot
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1))  

# Make the "Var1" variable an ORDERED factor variable
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))
# Try again:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1))

# Change our x and y labels:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1)) + xlab("Day of the Week") + ylab("Total Motor Vehicle Thefts")

Adding the Hour of the Day

# Create a counts table for the weekday and hour:
z = table(mvt$Weekday, mvt$Hour)
kable(z)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Friday 1873 932 743 560 473 602 839 1203 1268 1286 938 822 1207 857 937 1140 1165 1318 1623 1652 1736 1881 2308 1921
Monday 1900 825 712 527 415 542 772 1123 1323 1235 971 737 1129 824 958 1059 1136 1252 1518 1503 1622 1815 2009 1490
Saturday 2050 1267 985 836 652 508 541 650 858 1039 946 789 1204 767 963 1086 1055 1084 1348 1390 1570 1702 2078 1750
Sunday 2028 1236 1019 838 607 461 478 483 615 864 884 787 1192 789 959 1037 1083 1160 1389 1342 1706 1696 2079 1584
Thursday 1856 816 696 508 400 534 799 1135 1298 1301 932 731 1093 752 831 1044 1131 1258 1510 1537 1668 1776 2134 1579
Tuesday 1691 777 603 464 414 520 845 1118 1175 1174 948 786 1108 762 908 1071 1090 1274 1553 1496 1696 1816 2044 1458
Wednesday 1814 790 619 469 396 561 862 1140 1329 1237 947 763 1225 804 863 1075 1076 1289 1580 1507 1718 1748 2093 1511
# Save this to a data frame:
DayHourCounts = as.data.frame(table(mvt$Weekday, mvt$Hour))
# Output structure
str(DayHourCounts)
## 'data.frame':    168 obs. of  3 variables:
##  $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ Var2: Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Freq: int  1873 1900 2050 2028 1856 1691 1814 932 825 1267 ...
# Convert the second variable, Var2, to numbers and call it Hour:
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
# Create out plot:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1))

# Change the colors
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Var1), size=2)

# Separate the weekends from the weekdays:
DayHourCounts$Type = ifelse((DayHourCounts$Var1 == "Sunday") | (DayHourCounts$Var1 == "Saturday"), "Weekend", "Weekday")
# Redo our plot, this time coloring by Type:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2) 

# Make the lines a little transparent:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2, alpha=0.5) 

# Fix the order of the days:
DayHourCounts$Var1 = factor(DayHourCounts$Var1, ordered=TRUE, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# Make a heatmap:
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq))

# Change the label on the legend, and get rid of the y-label:
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts") + theme(axis.title.y = element_blank())

# Change the color scheme
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="red") + theme(axis.title.y = element_blank())

VIDEO 5 - Maps

# Install and load two new packages:
library(maps)
library(devtools)  
library(ggmap)
register_google(key = "AIzaSyBlCZXGDK9dN3Vf_N1qdI6mPfFFCA34ubs")
# Load a map of Chicago into R:
chicago = get_map(location = "chicago", zoom = 11)
# Look at the map
ggmap(chicago)

# Plot the first 100 motor vehicle thefts:
ggmap(chicago) + geom_point(data = mvt[1:100,], aes(x = Longitude, y = Latitude))

# Round our latitude and longitude to 2 digits of accuracy, and create a crime counts data frame for each area:
LatLonCounts = as.data.frame(table(round(mvt$Longitude,2), round(mvt$Latitude,2)))
# Subset for Freq > 0
LatLonCounts2 = subset(LatLonCounts, Freq > 0)
#Output structure
str(LatLonCounts)
## 'data.frame':    1638 obs. of  3 variables:
##  $ Var1: Factor w/ 42 levels "-87.93","-87.92",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Var2: Factor w/ 39 levels "41.64","41.65",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  0 0 0 0 0 0 0 0 0 0 ...
str(LatLonCounts2)
## 'data.frame':    686 obs. of  3 variables:
##  $ Var1: Factor w/ 42 levels "-87.93","-87.92",..: 33 40 32 33 34 35 38 39 40 41 ...
##  $ Var2: Factor w/ 39 levels "41.64","41.65",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ Freq: int  3 1 54 60 28 1 8 63 81 8 ...
# Convert our Longitude and Latitude variable to numbers:
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat = as.numeric(as.character(LatLonCounts$Var2))
LatLonCounts2$Long = as.numeric(as.character(LatLonCounts2$Var1))
LatLonCounts2$Lat = as.numeric(as.character(LatLonCounts2$Var2))
# Plot these points on our map:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq))

# Change the color scheme:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq)) + scale_colour_gradient(low="yellow", high="red")

# We can also use the geom_tile geometry
ggmap(chicago) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill="red")

#freq > 0
ggmap(chicago) + geom_tile(data=LatLonCounts2, aes(x = Long, y = Lat, alpha=Freq), fill="red")

Geographical Map on US

# Load our data:
murders = read.csv("murders.csv")
# Output structure
str(murders)
## 'data.frame':    51 obs. of  6 variables:
##  $ State            : Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Population       : int  4779736 710231 6392017 2915918 37253956 5029196 3574097 897934 601723 19687653 ...
##  $ PopulationDensity: num  94.65 1.26 57.05 56.43 244.2 ...
##  $ Murders          : int  199 31 352 130 1811 117 131 48 131 987 ...
##  $ GunMurders       : int  135 19 232 93 1257 65 97 38 99 669 ...
##  $ GunOwnership     : num  0.517 0.578 0.311 0.553 0.213 0.347 0.167 0.255 0.036 0.245 ...
# Load the map of the US
statesMap = map_data("state")
# Output structure
str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...
# Plot the map:
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black") 

# Create a new variable called region with the lowercase names to match the statesMap:
murders$region = tolower(murders$State)
# Join the statesMap data and the murders data into one dataframe:
murderMap = merge(statesMap, murders, by="region")
str(murderMap)
## 'data.frame':    15537 obs. of  12 variables:
##  $ region           : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ long             : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat              : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ subregion        : chr  NA NA NA NA ...
##  $ State            : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population       : int  4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 ...
##  $ PopulationDensity: num  94.7 94.7 94.7 94.7 94.7 ...
##  $ Murders          : int  199 199 199 199 199 199 199 199 199 199 ...
##  $ GunMurders       : int  135 135 135 135 135 135 135 135 135 135 ...
##  $ GunOwnership     : num  0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 ...
# Plot the number of murder on our map of the United States:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Murders)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

# Plot a map of the population:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Population)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

# Create a new variable that is the number of murders per 100,000 population:
murderMap$MurderRate = murderMap$Murders / murderMap$Population * 100000
# Redo our plot with murder rate:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

# Redo the plot, removing any states with murder rates above 10:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend", limits = c(0,10))