The explosion of computerized data affects all parts of society, including law and order
In the past, human judgement and experience was the only tool in identifying patterns in criminal behavior
Police forces around the US and the world are augmenting human judgement with analytics - sometimes describes as “predictive”
“I’m not going to get more money. I’m not going to get more cops. I have to be better at using what I have, and that’s what predictive policing is about… If this old street cop can change the way that he thinks about this stuff, then I know that my [officers] can do the same.”
The analytical tools you have learned in this class can be used to make these “predictive policing” models
However,\(\color{red}{\text{communicating}}\) the results of these models is essential - a \(\color{blue}{\text{linear regression}}\) output while table will not be of use to a \(\color{red}{\text{policewoman on patrol}}\)
Visualization bridges the gap between \(\color{blue}{\text{the data and mathematics}}\) and the \(\color{red}{\text{end user}}\)
Before we even consider a predictive model, we should try to understand the historical data
Many cities in the US and around the world would provide logs of reported crimes, usually including the time, locations, and nature of the event
We will use data from Chicago about motor vehicle thefts
Suppose we wanted to communicate crime patterns over the course of an average week
We would display daily averages using a line graph, but this does not seem like it would be too useful
We can replace our x-axis with the hour of the day, and have different line for every day of the week, but this would be a jumbled mess with 7 lines!
We could use no visualization at all, and instead present the information in a table
For example, a low number, might be \(\color{blue}{\text{blue}}\), and a high number might be \(\color{red}{\text{red}}\)
We can pick different color schemes based on the type of data to convey different messages
The x-axis and y-axis don’t need to be continuous - they can be categorical
We could even combine a heatmap with a geographical map
Criminal activity-related data often has both components of time and location
Sometimes all that is required is a line chart, but heatmaps can visualize data that would be too big for a table
Plotting data on maps is much more effective than a table for location based data, and is eye-catching
Many police forces are exploiting their databases to focus finite resources on problem areas
Not only do analytics help improve policework, the outputs are also good communication tools to decision makers in government and to the wider public
The application of analytics to data like this is new and growing, with companies like PredPol and Palantir leading the effort.
# Load our data:
mvt = read.csv("mvt.csv", stringsAsFactors=FALSE)
# Output structure
str(mvt)
## 'data.frame': 191641 obs. of 3 variables:
## $ Date : chr "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude: num -87.6 -87.7 -87.8 -87.7 -87.6 ...
# Convert the Date variable to a format that R will recognize:
mvt$Date = strptime(mvt$Date, format="%m/%d/%y %H:%M")
# Extract the hour and the day of the week:
mvt$Weekday = weekdays(mvt$Date)
mvt$Hour = mvt$Date$hour
# Let's take a look at the structure of our data again:
str(mvt)
## 'data.frame': 191641 obs. of 5 variables:
## $ Date : POSIXlt, format: "2012-12-31 23:15:00" "2012-12-31 22:00:00" "2012-12-31 22:00:00" "2012-12-31 22:00:00" ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude: num -87.6 -87.7 -87.8 -87.7 -87.6 ...
## $ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
## $ Hour : int 23 22 22 22 21 20 20 20 19 18 ...
# Create a simple line plot - need the total number of crimes on each day of the week. We can get this information by creating a table:
z = table(mvt$Weekday)
kable(z)| Var1 | Freq |
|---|---|
| Friday | 29284 |
| Monday | 27397 |
| Saturday | 27118 |
| Sunday | 26316 |
| Thursday | 27319 |
| Tuesday | 26791 |
| Wednesday | 27416 |
# Save this table as a data frame:
WeekdayCounts = as.data.frame(table(mvt$Weekday))
str(WeekdayCounts)
## 'data.frame': 7 obs. of 2 variables:
## $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7
## $ Freq: int 29284 27397 27118 26316 27319 26791 27416
# Load the ggplot2 library:
library(ggplot2)
# Create our plot
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1)) # Make the "Var1" variable an ORDERED factor variable
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))
# Try again:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1))# Change our x and y labels:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1)) + xlab("Day of the Week") + ylab("Total Motor Vehicle Thefts")# Create a counts table for the weekday and hour:
z = table(mvt$Weekday, mvt$Hour)
kable(z)| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Friday | 1873 | 932 | 743 | 560 | 473 | 602 | 839 | 1203 | 1268 | 1286 | 938 | 822 | 1207 | 857 | 937 | 1140 | 1165 | 1318 | 1623 | 1652 | 1736 | 1881 | 2308 | 1921 |
| Monday | 1900 | 825 | 712 | 527 | 415 | 542 | 772 | 1123 | 1323 | 1235 | 971 | 737 | 1129 | 824 | 958 | 1059 | 1136 | 1252 | 1518 | 1503 | 1622 | 1815 | 2009 | 1490 |
| Saturday | 2050 | 1267 | 985 | 836 | 652 | 508 | 541 | 650 | 858 | 1039 | 946 | 789 | 1204 | 767 | 963 | 1086 | 1055 | 1084 | 1348 | 1390 | 1570 | 1702 | 2078 | 1750 |
| Sunday | 2028 | 1236 | 1019 | 838 | 607 | 461 | 478 | 483 | 615 | 864 | 884 | 787 | 1192 | 789 | 959 | 1037 | 1083 | 1160 | 1389 | 1342 | 1706 | 1696 | 2079 | 1584 |
| Thursday | 1856 | 816 | 696 | 508 | 400 | 534 | 799 | 1135 | 1298 | 1301 | 932 | 731 | 1093 | 752 | 831 | 1044 | 1131 | 1258 | 1510 | 1537 | 1668 | 1776 | 2134 | 1579 |
| Tuesday | 1691 | 777 | 603 | 464 | 414 | 520 | 845 | 1118 | 1175 | 1174 | 948 | 786 | 1108 | 762 | 908 | 1071 | 1090 | 1274 | 1553 | 1496 | 1696 | 1816 | 2044 | 1458 |
| Wednesday | 1814 | 790 | 619 | 469 | 396 | 561 | 862 | 1140 | 1329 | 1237 | 947 | 763 | 1225 | 804 | 863 | 1075 | 1076 | 1289 | 1580 | 1507 | 1718 | 1748 | 2093 | 1511 |
# Save this to a data frame:
DayHourCounts = as.data.frame(table(mvt$Weekday, mvt$Hour))
# Output structure
str(DayHourCounts)
## 'data.frame': 168 obs. of 3 variables:
## $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7 1 2 3 ...
## $ Var2: Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Freq: int 1873 1900 2050 2028 1856 1691 1814 932 825 1267 ...
# Convert the second variable, Var2, to numbers and call it Hour:
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
# Create out plot:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1))# Change the colors
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Var1), size=2)# Separate the weekends from the weekdays:
DayHourCounts$Type = ifelse((DayHourCounts$Var1 == "Sunday") | (DayHourCounts$Var1 == "Saturday"), "Weekend", "Weekday")
# Redo our plot, this time coloring by Type:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2) # Make the lines a little transparent:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2, alpha=0.5) # Fix the order of the days:
DayHourCounts$Var1 = factor(DayHourCounts$Var1, ordered=TRUE, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# Make a heatmap:
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq))# Change the label on the legend, and get rid of the y-label:
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts") + theme(axis.title.y = element_blank())# Change the color scheme
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="red") + theme(axis.title.y = element_blank())# Install and load two new packages:
library(maps)
library(devtools)
library(ggmap)
register_google(key = "AIzaSyBlCZXGDK9dN3Vf_N1qdI6mPfFFCA34ubs")
# Load a map of Chicago into R:
chicago = get_map(location = "chicago", zoom = 11)
# Look at the map
ggmap(chicago)# Plot the first 100 motor vehicle thefts:
ggmap(chicago) + geom_point(data = mvt[1:100,], aes(x = Longitude, y = Latitude))# Round our latitude and longitude to 2 digits of accuracy, and create a crime counts data frame for each area:
LatLonCounts = as.data.frame(table(round(mvt$Longitude,2), round(mvt$Latitude,2)))
# Subset for Freq > 0
LatLonCounts2 = subset(LatLonCounts, Freq > 0)
#Output structure
str(LatLonCounts)
## 'data.frame': 1638 obs. of 3 variables:
## $ Var1: Factor w/ 42 levels "-87.93","-87.92",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Var2: Factor w/ 39 levels "41.64","41.65",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 0 0 0 0 0 0 0 0 0 0 ...
str(LatLonCounts2)
## 'data.frame': 686 obs. of 3 variables:
## $ Var1: Factor w/ 42 levels "-87.93","-87.92",..: 33 40 32 33 34 35 38 39 40 41 ...
## $ Var2: Factor w/ 39 levels "41.64","41.65",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ Freq: int 3 1 54 60 28 1 8 63 81 8 ...
# Convert our Longitude and Latitude variable to numbers:
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat = as.numeric(as.character(LatLonCounts$Var2))
LatLonCounts2$Long = as.numeric(as.character(LatLonCounts2$Var1))
LatLonCounts2$Lat = as.numeric(as.character(LatLonCounts2$Var2))
# Plot these points on our map:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq))# Change the color scheme:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq)) + scale_colour_gradient(low="yellow", high="red")# We can also use the geom_tile geometry
ggmap(chicago) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill="red")#freq > 0
ggmap(chicago) + geom_tile(data=LatLonCounts2, aes(x = Long, y = Lat, alpha=Freq), fill="red")# Load our data:
murders = read.csv("murders.csv")
# Output structure
str(murders)
## 'data.frame': 51 obs. of 6 variables:
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Population : int 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934 601723 19687653 ...
## $ PopulationDensity: num 94.65 1.26 57.05 56.43 244.2 ...
## $ Murders : int 199 31 352 130 1811 117 131 48 131 987 ...
## $ GunMurders : int 135 19 232 93 1257 65 97 38 99 669 ...
## $ GunOwnership : num 0.517 0.578 0.311 0.553 0.213 0.347 0.167 0.255 0.036 0.245 ...
# Load the map of the US
statesMap = map_data("state")
# Output structure
str(statesMap)
## 'data.frame': 15537 obs. of 6 variables:
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ subregion: chr NA NA NA NA ...
# Plot the map:
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black") # Create a new variable called region with the lowercase names to match the statesMap:
murders$region = tolower(murders$State)
# Join the statesMap data and the murders data into one dataframe:
murderMap = merge(statesMap, murders, by="region")
str(murderMap)
## 'data.frame': 15537 obs. of 12 variables:
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ subregion : chr NA NA NA NA ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Population : int 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 ...
## $ PopulationDensity: num 94.7 94.7 94.7 94.7 94.7 ...
## $ Murders : int 199 199 199 199 199 199 199 199 199 199 ...
## $ GunMurders : int 135 135 135 135 135 135 135 135 135 135 ...
## $ GunOwnership : num 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 ...
# Plot the number of murder on our map of the United States:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Murders)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")# Plot a map of the population:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Population)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")# Create a new variable that is the number of murders per 100,000 population:
murderMap$MurderRate = murderMap$Murders / murderMap$Population * 100000
# Redo our plot with murder rate:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")# Redo the plot, removing any states with murder rates above 10:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend", limits = c(0,10))