Analytics Edge: Unit 7 - Visualization for Law and Order

The Analytical Policeman

The explosion of computerized data affects all parts of society, including law and order
In the past, human judgement and experience was the only tool in identifying patterns in criminal behavior
Police forces around the US and the world are augmenting human judgement with analytics - sometimes describes as “predictive”

Example: Los Angeles Police Dept.

“I’m not going to get more money. I’m not going to get more cops. I have to be better at using what I have, and that’s what predictive policing is about… If this old street cop can change the way that he thinks about this stuff, then I know that my [officers] can do the same.”

The Role of Analytics

The analytical tools you have learned in this class can be used to make these “predictive policing” models
However,\(\color{red}{\text{communicating}}\) the results of these models is essential - a \(\color{blue}{\text{linear regression}}\) output while table will not be of use to a \(\color{red}{\text{policewoman on patrol}}\)
Visualization bridges the gap between \(\color{blue}{\text{the data and mathematics}}\) and the \(\color{red}{\text{end user}}\)

Understanding the Past

Before we even consider a predictive model, we should try to understand the historical data
Many cities in the US and around the world would provide logs of reported crimes, usually including the time, locations, and nature of the event
We will use data from Chicago about motor vehicle thefts

Crime Over Time

Suppose we wanted to communicate crime patterns over the course of an average week
We would display daily averages using a line graph, but this does not seem like it would be too useful

We can replace our x-axis with the hour of the day, and have different line for every day of the week, but this would be a jumbled mess with 7 lines!
We could use no visualization at all, and instead present the information in a table

This is valid but how can we make the table more interesting

Heatmaps

Heatmaps are a way of visualizing data using three attributes. The x-axis and y-axis are typically displayed horizontally and vertically. The third attribute is represented by shades of color.

For example, a low number, might be \(\color{blue}{\text{blue}}\), and a high number might be \(\color{red}{\text{red}}\)
We can pick different color schemes based on the type of data to convey different messages
The x-axis and y-axis don’t need to be continuous - they can be categorical
We could even combine a heatmap with a geographical map

A Chicao Crime Heatmap

We will use Chicago motor vehicle theft data to explore patterns of crime
- Over days of the week
- Over hours of the day
We’re interested in the total number of car thefts that occur in any particular hour of a day of the week: over the whole data set.

Eye on Crime

Criminal activity-related data often has both components of time and location
Sometimes all that is required is a line chart, but heatmaps can visualize data that would be too big for a table
Plotting data on maps is much more effective than a table for location based data, and is eye-catching

Predicting Policing

Many police forces are exploiting their databases to focus finite resources on problem areas
Not only do analytics help improve policework, the outputs are also good communication tools to decision makers in government and to the wider public
The application of analytics to data like this is new and growing, with companies like PredPol and Palantir leading the effort.

The Analytical Policeman in R

A Basic Line Plot

# Load our data:
mvt = read.csv("mvt.csv", stringsAsFactors=FALSE)
# Output structure
str(mvt)
## 'data.frame':    191641 obs. of  3 variables:
##  $ Date     : chr  "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
##  $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...
# Convert the Date variable to a format that R will recognize:
mvt$Date = strptime(mvt$Date, format="%m/%d/%y %H:%M")
# Extract the hour and the day of the week:
mvt$Weekday = weekdays(mvt$Date)
mvt$Hour = mvt$Date$hour
# Let's take a look at the structure of our data again:
str(mvt)
## 'data.frame':    191641 obs. of  5 variables:
##  $ Date     : POSIXlt, format: "2012-12-31 23:15:00" "2012-12-31 22:00:00" "2012-12-31 22:00:00" "2012-12-31 22:00:00" ...
##  $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...
##  $ Weekday  : chr  "Monday" "Monday" "Monday" "Monday" ...
##  $ Hour     : int  23 22 22 22 21 20 20 20 19 18 ...
# Create a simple line plot - need the total number of crimes on each day of the week. We can get this information by creating a table:
z = table(mvt$Weekday)
kable(z)

Var1	Freq
Friday	29284
Monday	27397
Saturday	27118
Sunday	26316
Thursday	27319
Tuesday	26791
Wednesday	27416

# Save this table as a data frame:
WeekdayCounts = as.data.frame(table(mvt$Weekday))
str(WeekdayCounts) 
## 'data.frame':    7 obs. of  2 variables:
##  $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7
##  $ Freq: int  29284 27397 27118 26316 27319 26791 27416
# Load the ggplot2 library:
library(ggplot2)
# Create our plot
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1))

# Make the "Var1" variable an ORDERED factor variable
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))
# Try again:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1))

# Change our x and y labels:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1)) + xlab("Day of the Week") + ylab("Total Motor Vehicle Thefts")

Adding the Hour of the Day

# Create a counts table for the weekday and hour:
z = table(mvt$Weekday, mvt$Hour)
kable(z)

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23
Friday	1873	932	743	560	473	602	839	1203	1268	1286	938	822	1207	857	937	1140	1165	1318	1623	1652	1736	1881	2308	1921
Monday	1900	825	712	527	415	542	772	1123	1323	1235	971	737	1129	824	958	1059	1136	1252	1518	1503	1622	1815	2009	1490
Saturday	2050	1267	985	836	652	508	541	650	858	1039	946	789	1204	767	963	1086	1055	1084	1348	1390	1570	1702	2078	1750
Sunday	2028	1236	1019	838	607	461	478	483	615	864	884	787	1192	789	959	1037	1083	1160	1389	1342	1706	1696	2079	1584
Thursday	1856	816	696	508	400	534	799	1135	1298	1301	932	731	1093	752	831	1044	1131	1258	1510	1537	1668	1776	2134	1579
Tuesday	1691	777	603	464	414	520	845	1118	1175	1174	948	786	1108	762	908	1071	1090	1274	1553	1496	1696	1816	2044	1458
Wednesday	1814	790	619	469	396	561	862	1140	1329	1237	947	763	1225	804	863	1075	1076	1289	1580	1507	1718	1748	2093	1511

# Save this to a data frame:
DayHourCounts = as.data.frame(table(mvt$Weekday, mvt$Hour))
# Output structure
str(DayHourCounts)
## 'data.frame':    168 obs. of  3 variables:
##  $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ Var2: Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Freq: int  1873 1900 2050 2028 1856 1691 1814 932 825 1267 ...
# Convert the second variable, Var2, to numbers and call it Hour:
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
# Create out plot:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1))

# Change the colors
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Var1), size=2)

# Separate the weekends from the weekdays:
DayHourCounts$Type = ifelse((DayHourCounts$Var1 == "Sunday") | (DayHourCounts$Var1 == "Saturday"), "Weekend", "Weekday")
# Redo our plot, this time coloring by Type:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2)

# Make the lines a little transparent:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2, alpha=0.5)

# Fix the order of the days:
DayHourCounts$Var1 = factor(DayHourCounts$Var1, ordered=TRUE, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# Make a heatmap:
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq))

# Change the label on the legend, and get rid of the y-label:
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts") + theme(axis.title.y = element_blank())

# Change the color scheme
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="red") + theme(axis.title.y = element_blank())

VIDEO 5 - Maps

# Install and load two new packages:
library(maps)
library(devtools)  
library(ggmap)
register_google(key = "AIzaSyBlCZXGDK9dN3Vf_N1qdI6mPfFFCA34ubs")
# Load a map of Chicago into R:
chicago = get_map(location = "chicago", zoom = 11)
# Look at the map
ggmap(chicago)

# Plot the first 100 motor vehicle thefts:
ggmap(chicago) + geom_point(data = mvt[1:100,], aes(x = Longitude, y = Latitude))

# Round our latitude and longitude to 2 digits of accuracy, and create a crime counts data frame for each area:
LatLonCounts = as.data.frame(table(round(mvt$Longitude,2), round(mvt$Latitude,2)))
# Subset for Freq > 0
LatLonCounts2 = subset(LatLonCounts, Freq > 0)
#Output structure
str(LatLonCounts)
## 'data.frame':    1638 obs. of  3 variables:
##  $ Var1: Factor w/ 42 levels "-87.93","-87.92",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Var2: Factor w/ 39 levels "41.64","41.65",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  0 0 0 0 0 0 0 0 0 0 ...
str(LatLonCounts2)
## 'data.frame':    686 obs. of  3 variables:
##  $ Var1: Factor w/ 42 levels "-87.93","-87.92",..: 33 40 32 33 34 35 38 39 40 41 ...
##  $ Var2: Factor w/ 39 levels "41.64","41.65",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ Freq: int  3 1 54 60 28 1 8 63 81 8 ...
# Convert our Longitude and Latitude variable to numbers:
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat = as.numeric(as.character(LatLonCounts$Var2))
LatLonCounts2$Long = as.numeric(as.character(LatLonCounts2$Var1))
LatLonCounts2$Lat = as.numeric(as.character(LatLonCounts2$Var2))
# Plot these points on our map:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq))

# Change the color scheme:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq)) + scale_colour_gradient(low="yellow", high="red")

# We can also use the geom_tile geometry
ggmap(chicago) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill="red")

#freq > 0
ggmap(chicago) + geom_tile(data=LatLonCounts2, aes(x = Long, y = Lat, alpha=Freq), fill="red")

Geographical Map on US

# Load our data:
murders = read.csv("murders.csv")
# Output structure
str(murders)
## 'data.frame':    51 obs. of  6 variables:
##  $ State            : Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Population       : int  4779736 710231 6392017 2915918 37253956 5029196 3574097 897934 601723 19687653 ...
##  $ PopulationDensity: num  94.65 1.26 57.05 56.43 244.2 ...
##  $ Murders          : int  199 31 352 130 1811 117 131 48 131 987 ...
##  $ GunMurders       : int  135 19 232 93 1257 65 97 38 99 669 ...
##  $ GunOwnership     : num  0.517 0.578 0.311 0.553 0.213 0.347 0.167 0.255 0.036 0.245 ...
# Load the map of the US
statesMap = map_data("state")
# Output structure
str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...
# Plot the map:
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")

# Create a new variable called region with the lowercase names to match the statesMap:
murders$region = tolower(murders$State)
# Join the statesMap data and the murders data into one dataframe:
murderMap = merge(statesMap, murders, by="region")
str(murderMap)
## 'data.frame':    15537 obs. of  12 variables:
##  $ region           : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ long             : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat              : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ subregion        : chr  NA NA NA NA ...
##  $ State            : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population       : int  4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 ...
##  $ PopulationDensity: num  94.7 94.7 94.7 94.7 94.7 ...
##  $ Murders          : int  199 199 199 199 199 199 199 199 199 199 ...
##  $ GunMurders       : int  135 135 135 135 135 135 135 135 135 135 ...
##  $ GunOwnership     : num  0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 ...
# Plot the number of murder on our map of the United States:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Murders)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

# Plot a map of the population:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Population)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

# Create a new variable that is the number of murders per 100,000 population:
murderMap$MurderRate = murderMap$Murders / murderMap$Population * 100000
# Redo our plot with murder rate:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

# Redo the plot, removing any states with murder rates above 10:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend", limits = c(0,10))

Analytics Edge: Unit 7 - Visualization for Law and Order

Sulman Khan

October 26, 2018

The Analytical Policeman

The Analytical Policeman

Example: Los Angeles Police Dept.

The Role of Analytics

Understanding the Past

Crime Over Time

Heatmaps

A Chicao Crime Heatmap

Eye on Crime

Predicting Policing

The Analytical Policeman in R

A Basic Line Plot

Adding the Hour of the Day

VIDEO 5 - Maps

Geographical Map on US