Introduction

The explosion of computerized data affects all parts of society, including law and order. In the past, human judgment and experience was the only tool in identifying patterns in criminal behavior. Police forces around the US and the world are augmenting human judgment with analytics – sometimes described as predictive policing.

In this notebook, we will take a look at two case studies:

  1. Visualising the Total Number of Motor Vehicle Thefts in the State of Chicago in the United States

  2. Visualising the Totoal Number of Murders on the Map of the United States

Prerequisites

The following packages are required for the analysis:

library(ggplot2); library(dplyr); library(readr); library(leaflet)
library(lubridate); library(plotly); library(maps); library(ggmap) 
library(ggthemes); library(DT)

Note: The ggthemes and DT packages are not required for analysis. Only for enhancing the appearance of plots and of printed tables respectively

Motor Vehicle Thefts in Chicago

We will use data from Chicago Police Department to explore patterns of crime:

We’re interested in the total number of car thefts that occur in any particular hour of a day of the week over the whole data set

Reading and Preparing Data for Analsyis

# mvt stands for motor vehicle theft
mvt <- read_csv('mvt.csv')

# have a look at the structure of  data
str(mvt, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    191641 obs. of  3 variables:
##  $ Date     : chr  "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
##  $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...

We notice that the Date variable is recognized as a character variable. We need to convert it into an R date-time structure so that we can extract the day of the week and the hour of the day. We will use the package lubridate for date and time manipulation to accomplish that easily:

mvt$Date <- mdy_hm(mvt$Date)

# now let's look again
str(mvt, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    191641 obs. of  3 variables:
##  $ Date     : POSIXct, format: "2012-12-31 23:15:00" "2012-12-31 22:00:00" ...
##  $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...

We can see now the Date variable has been converted to date and time data structure.

# have a look at the first few rows of our data
head(mvt, 50) %>% datatable(style = 'bootstrap', 
                            caption = "Motor Vehicle Thefts in Chicago")

In this format, we can extract the hour and the day of the week from the Date variable, and we can add these as new variables to our data frame.

# add a new variable representing the day of the week 
mvt$Weekday <- wday(mvt$Date, label = T)

# add another new variable representing the hour
mvt$Hour <- hour(mvt$Date)

# let's look at our data again
head(mvt, 50) %>% datatable(style = 'bootstrap',
                            caption = "Motor Vehicle Thefts in Chicago")

Now, we’re ready to make some data visualization and exploration.

Exploratory Data Analysis

Exploring Two Variables

Let’s start by plotting the number of crimes per each day of the week. First, we will use dplyr package to create a data frame of the weekdays and the corresponding number of crimes per each weekday.

ncrimesPerWeekday <- mvt %>% 
  group_by(Weekday) %>% 
  summarise(CrimeFreq = n()) %>% 
  ungroup()

# have a look
ncrimesPerWeekday %>% datatable(style = 'bootstrap',
                                caption = "Motor Vehicle Thefts in Chicago by Day of the Week")

Now, we’re ready to make our plot:

p_crimeRatePerWeekday <- ggplot(data = ncrimesPerWeekday, 
                                aes(x=Weekday, y = CrimeFreq)) +
  
  geom_point(color = 'red') +
  geom_line(aes(group=1), color = "light blue") +
  xlab("Day of the Week") +
  ylab("Total Motor Vehicle Thefts") +
  ggtitle("Total Motor Vehicle Thefts vs. by Week Day")

ggplotly(p_crimeRatePerWeekday)

Exploring Three Variables

Now let’s add the hour of the day to our line plot, and then create an alternative visualization. We can do this by creating a plot for each day of the week and making the x-axis the hour of the day.

We first need to create a counts table for the weekday, and hour. We will again use dplyr to accomplish this. This time we will group by two variables: day of the week, and hour of the day.

ncrimesDayHour <- mvt %>% 
  group_by(Weekday, Hour) %>% 
  summarize(CrimeFreq = n())

ncrimesDayHour %>% datatable(style = 'bootstrap',
                             caption = "Total Motor Vehicle Thefts by Day and Hour")

This table gives, for each day of the week and each hour, the total number of motor vehicle thefts that occurred. For example, on Sunday at 4 AM, there were 607 motor vehicle thefts.

Now, plotting:

p_ncrimesDayHour <-ggplot(data = ncrimesDayHour, aes(x = Hour, y = CrimeFreq)) +
  geom_line(aes(group = Weekday, color = Weekday)) +
  labs(x = 'Hour of the Day',
       y = "Total Motor Vehicle Thefts",
       title = "Total Motor Vehicle Thefts by Day and Hour",
       color = "Week Day") 

ggplotly(p_ncrimesDayHour)

Note: Play around with the interactive plot. For example, click on any day (on the right hand side of the plot) to remove it from the plot. Click again to add it back.

We notice that although we can see the differences between each line and the other, the plot is still quite hard to interpret. Seven lines is a lot. We can make a better visualization using facets:

p_ncrimesDayHour <-ggplot(data = ncrimesDayHour, aes(x = Hour, y = CrimeFreq)) +
  geom_line(aes(color = Weekday)) +
  facet_wrap(~Weekday) +
  labs(x = 'Hour of the Day',
       y = "Total Motor Vehicle Thefts",
       title = "Total Motor Vehicle Thefts by Day and Hour",
       color = "Week Day") 


ggplotly(p_ncrimesDayHour)

Although this is better than before, but we still need a better visualization that provide us with clear comparisons across the hour of the day and the day of the week.

Let’s instead visualize the same information with a heat map. To make a heat map, we’ll use our data in our data frame, ncrimesDayHour.

p_heatMap <- ggplot(data = ncrimesDayHour, aes(x = Hour, y = Weekday)) +
  # add the heat map, and map CrimeFreq to the color of each square
  geom_tile(aes(fill = CrimeFreq)) +
  # change the name of the legend
  scale_fill_gradient(name = "No. of Thefts", low = 'white', high = 'red') +
  labs(x = "Hour of the Day",
       title = "Motor Vehicle Thefts Heat Map") +
  # remove the y label
  theme(axis.title.y = element_blank())

ggplotly(p_heatMap)

Now this color scheme shows the hot spots, or the places with more crime, in red. So, the most crime is shown by the red spots and the least crime is shown by the lighter areas. It looks like Friday night is a pretty common time for motor vehicle thefts.

Plotting on Maps

Now let’s plot crime on a map of Chicago. we will use the 2 packages: maps and ggmap to accomplish that.

Let’s load a map of Chicago into R. We can easily do this by using the get_map function:

# load the map of chicago
chicago <- get_map(location = 'chicago', zoom = 11)

# have a look at chicago's map
ggmap(chicago) + 
  labs(x = "Longitudes", y = "Latitudes", title = "Raw Chicago Map")

Now let’s plot the total number of vehicle thefts in our data set on this map. We’re interested in whether or not an area has a high amount of crime, so let’s round our latitudes and longitudes to two digits of accuracy and create a crime counts data frame for each area:

ncrimesLatLong <- mvt %>%
  mutate(Long = round(Longitude, 2),
         Lat = round(Latitude, 2)) %>% 
  filter(!is.na(Lat) | !is.na(Long)) %>% 
  group_by(Long, Lat) %>% 
  summarize(CrimeCount = n())

head(ncrimesLatLong, 50) %>% datatable(style = 'bootstrap')

Now, let’s plot these points on our map, making the size and color of the points depend on the total number of motor vehicle thefts.

ggmap(chicago) + 
  geom_point(data = ncrimesLatLong, aes(x = Long, y = Lat, 
                                        color = CrimeCount,
                                        size = CrimeCount)) +
  scale_color_gradient(low = 'yellow', high = 'red') + 
  labs(x = "Longitudes", y = "Latitudes", 
       title = "Total MV Thefts in Chicago")

We can also use geom_tile to make something that looks more like a traditional heat map

ggmap(chicago) + 
  geom_tile(data = ncrimesLatLong, aes(x = Long, y = Lat, alpha = CrimeCount), fill ='red') +
  labs(x = "Longitudes", y = "Latitudes", 
       title = "Total MV Thefts in Chicago ",
       subtitle = "Another Color Scheme",
       alpha = "Crime Count")

Interactive Maps

Now we will produce the same plot on the map of Chicago, but we will make it interactive. We will use the leaflet package to achieve that.

ncrimesLatLong <- ncrimesLatLong %>% 
  mutate(CrimeBins = findInterval(CrimeCount, c(0, 250, 500, 750, 1000)))


# create a function to create a palette of colors
colfunc <- colorRampPalette(c("white", "red"))

pal <- colorNumeric(
  palette = colfunc(10),
  domain = ncrimesLatLong$CrimeCount
)

m <- leaflet(ncrimesLatLong) %>% 
  addTiles() %>% 
  addCircles(lng = ~Long, lat = ~Lat,
             #color = 'red',
             stroke = T,
             color = ~pal(CrimeCount),
             fillColor = ~pal(CrimeCount), 
             opacity = ~CrimeCount,
             fillOpacity = ~CrimeCount,
             radius = ~CrimeCount,
             popup = paste0("coordinates: ", "<br>", 
                            ncrimesLatLong$Long, ",", ncrimesLatLong$Lat, "<br>", 
                            "Crime Count: ", ncrimesLatLong$CrimeCount),
             popupOptions(zoomAnimation = T)) %>% 
  addLegend("topright", pal = pal, values = ~CrimeCount, title = "Crime Count")
m

Interact with the map. Try zoom in and click on any of the circles to see the number of crimes associated with each longitude, latitude pair. The size of the circles is proportionate to the number of crimes as well as its color. White and small circles indicate low crime rate areas, while red and big circles indicate high crime rate areas.

Murders in the United States

In this section, we’ll create a heat map representing the murder rat on the map of the United States. We’ll be using data provided by the U.S. Census Bureau and the FBI, and is decribed here.

Reading and Exploring Data

Let’s start by reading in our data set:

murders <- read_csv("murders.csv")

Let’s take a look at the structure of this data using the str function:

str(murders, give.attr = F)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51 obs. of  6 variables:
##  $ State            : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Population       : int  4779736 710231 6392017 2915918 37253956 5029196 3574097 897934 601723 19687653 ...
##  $ PopulationDensity: num  94.65 1.26 57.05 56.43 244.2 ...
##  $ Murders          : int  199 31 352 130 1811 117 131 48 131 987 ...
##  $ GunMurders       : int  135 19 232 93 1257 65 97 38 99 669 ...
##  $ GunOwnership     : num  0.517 0.578 0.311 0.553 0.213 0.347 0.167 0.255 0.036 0.245 ...

We have 51 observations for the 50 states plus Washington, DC, and six different variables: the name of the state, the population, the population density, the number of murders, the number of murders that used guns, and the rate of gun ownership.

A map of the United States is included in R. Let’s load the map and call it statesMap. We can do so using the map_data function, where the only argument is “state” in quotes.

statesMap <- map_data('state')

# Let's see what this looks like by typing in str(statesMap).
str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...

This is just a data frame summarizing how to draw the United States. To plot the map, we’ll use the polygons geometry of ggplot:

statesMap_p <-ggplot(data = statesMap, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = 'lightblue', color= 'black', size = 0.5) +
  labs(title = "Raw Map of the United States") + 
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

ggplotly(statesMap_p)

Before we can plot our data on this map, we need to make sure that the state names are the same in the murders data frame and in the statesMap data frame. In the murders data frame, our state names are in the State variable, and they start with a capital letter. But in the statesMap data frame, our state names are in the region variable, and they’re all lowercase.

So let’s create a new variable called region in our murders data frame to match the state name variable in the statesMap data frame. So we’ll add to our murders data frame the variable region. We will use dplyr package to accomplish that:

murders <- murders %>% 
  mutate(region = tolower(State)) %>% 
  select(State, region, everything())

# let's have a look again
head(murders, 50) %>% datatable(style = 'bootstrap', 
                                caption = "Murders Statistics by State")

Now we can join the statesMap data frame with the murders data frame using dplyr

murderMap <- statesMap %>% 
  inner_join(murders, by = "region") %>% 
  # add a column (region_abbr) that matches states full names to its appreviations
  mutate(region_abbr = state.abb[match(State,state.name)]) %>% 
  select(long, lat, group, order, region, region_abbr, everything())

# look at the new talbe
head(murderMap, 50) %>% datatable(style = 'bootstrap', 
                                  caption = "The Joined Table")

We have the same number of observations here that we had in the statesMap data frame, but now we have both the variables from the statesMap data frame and the variables from the murders data frame, which were matched up based on the region variable.

So now, let’s plot the number of murders on our map of the United States. We’ll again use the ggplot function, but this time, our data frame is murderMap:

# create a df to store the coordinates of the centers of each state
# to place name abbreviation of each state on this center
centroids <- murderMap %>% 
  group_by(State, region, region_abbr) %>% 
  summarise(long = mean(range(long), na.rm=T),
            lat  = mean(range(lat), na.rm=T))

murderMap_p <-ggplot(data = murderMap, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = Murders)) +
  geom_text(data = centroids, aes(long, lat, label = region_abbr), size = 2) +
  scale_fill_gradient(low = 'white', high = 'red') +
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  labs(title = "Frequency of Murders <br> Depicted on The Map of the United States")

ggplotly(murderMap_p)

So it looks like California and Texas have the largest number of murders. But is that just because they’re the most populous states? Let’s create a map of the population of each state to check.

murderMap %>% str()
## 'data.frame':    15537 obs. of  13 variables:
##  $ long             : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat              : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region           : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ region_abbr      : chr  "AL" "AL" "AL" "AL" ...
##  $ subregion        : chr  NA NA NA NA ...
##  $ State            : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Population       : int  4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 ...
##  $ PopulationDensity: num  94.7 94.7 94.7 94.7 94.7 ...
##  $ Murders          : int  199 199 199 199 199 199 199 199 199 199 ...
##  $ GunMurders       : int  135 135 135 135 135 135 135 135 135 135 ...
##  $ GunOwnership     : num  0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 ...
populationMap_p <-ggplot(data = murderMap, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = Population)) +
  geom_text(data = centroids, aes(long, lat, label = region_abbr), size = 2) +
  scale_fill_gradient(low = 'white', high = 'red') +
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  labs(title = "Population by State")

ggplotly(populationMap_p)

We notice that the population map looks exactly the same as our murders map. So we need to plot the murder rate (per each 100,000 of the population) instead of the number of murders to make sure we’re not just plotting a population map:

murderMap <- murderMap %>% 
  mutate(murderRate = Murders/Population * 100000)

murderRateMap_p <-ggplot(data = murderMap, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = murderRate)) +
  geom_text(data = centroids, aes(long, lat, label = region_abbr), size = 2) +
  scale_fill_gradient(low = 'white', high = 'red') +
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  labs(title = "Murder Rate <br> Per each 100,000 of Population for each State",
       fill = "Murder Rate")

ggplotly(murderRateMap_p)

We can see that the plot surprisingly doesn’t contain any red states. Why? It turns out that Washington, DC is an outlier with a very high murder rate, but it’s such a small region on the map that we can’t even see it. So let’s redo our plot, removing any observations with murder rates above 10, which we know will only exclude Washington, DC.

murderRateMap2_p <-ggplot(data = murderMap, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = murderRate)) +
  geom_text(data = centroids, aes(long, lat, label = region_abbr), size = 2) +
  # by adding the argument (limits), we can remove the unwanted values
  scale_fill_gradient(low = 'white', high = 'red', limits = c(0, 10)) +
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  labs(title = "Murder Rate <br> Per each 100,000 of Population for each State <br> Except Washington DC",
       fill = "Murder Rate")

ggplotly(murderRateMap2_p)

Now we can see some states colored by red on the map, especially LA. These are the highest states in terms of murder rate.

Conclusion

Criminal activity-related data often has both components of time and location. Sometimes, all that is required is a line chart, but heatmaps can better visualize data that are too big for a table, and is also more effective and intuitive in making fast comparisons.

Also, we saw that plotting data on maps is much more effective than a table for location based data, and is eye catching.

The edge of predictive policing is that police forces have finite resources. In order to optimize these resources, they need to focus on the most problematic areas. We saw that by combining heatmaps and plotting on geographic maps, not only do analytics help improve policework, but the outputs are also good communication tools to decision makers in government, and the wider public.