HW #4: Crime Analysis Using R

Problem definition
Data extraction
Data Exploration and pre-processing
Visualizations

Problem definition

Predictive policing is a multi-dimensional optimization problem in which law enforcement agencies try to use a scarce resource efficiently to minimize instances of crime over time and across geographies. But how do we optimize? This is precisely what we answer in this assignment by using real and publicly available crime data for Chicago. Using statistics and computer-aided technologies, we try to devise a solution for this optimization problem. In an attempt to address a real-world problem, our primary focus here is on the fundamentals of data rather than on acrobatics with techniques.

Crime analysis involves looking at the data along two different dimensions: spatial and temporal. The spatial dimension involves observing the characteristics of a particular region along with its neighbors. The temporal dimension involves observing the characteristics of a particular region over time. The question then is how far away from the epicenter we look for a similar pattern, and how far back in time from the date of the event we go to capture the trend. Ideally we would like to have as much data as possible. Often, in reality, we don’t, and that makes data science a creative process: one bound by mathematical logic and centered on statistical validity.

Crime data are not easy to deal with. With both spatial and temporal attributes, processing them can be a challenging task. The challenge is not limited to handling spatial and temporal data but extends to deriving information from them at these levels. Any predictive model for crime will have to have these two dimensions attached to it. To make an effort toward effective predictive policing strategies, this inherent structure of the data needs to be leveraged.

Data extraction

For this exercise, we use crime data for the city of Chicago, which are available from 2001 onwards on the city’s open data portal at https://data.cityofchicago.org/. To keep the analysis manageable, we use only the most recent year of data, counted back from the current date.

R has the capability of reading files and data tables directly from the Web. We can do this by specifying the connection string instead of the file name in the read_csv() function. We can access the Chicago crime data from the following url: https://data.cityofchicago.org/api/views/x2n5-8w5q/rows.csv?accessType=DOWNLOAD

library(tidyverse)
# Download one year of crime data from the open data portal of city of Chicago
# NOTE: This may take a while depending on the strength of your internet connection
# First I ran read_csv() to find the default col_types() then I updated them to this:
type=cols( `CASE#` = col_character(),
           `DATE  OF OCCURRENCE` = col_datetime(format="%m/%d/%Y %I:%M:%S %p"),
           BLOCK = col_factor(),
           IUCR = col_factor(),
           `PRIMARY DESCRIPTION` = col_factor(),
           `SECONDARY DESCRIPTION` = col_factor(),
           `LOCATION DESCRIPTION` = col_factor(),
           ARREST = col_factor(),
           DOMESTIC = col_factor(),
           BEAT = col_factor(),
           WARD = col_factor(),
           `FBI CD` = col_factor(),
           `X COORDINATE` = col_double(),
           `Y COORDINATE` = col_double(),
           LATITUDE = col_double(),
           LONGITUDE = col_double(),
           LOCATION = col_character()
)

# Specify download url
url.data <- "https://data.cityofchicago.org/api/views/x2n5-8w5q/rows.csv?accessType=DOWNLOAD"

# Read in data
crime_raw <- read_csv(url.data, na='',col_types = type)

# Fix column names
names(crime_raw)<-str_to_lower(names(crime_raw)) %>%
  str_replace_all(" ","_") %>%
  str_replace_all("__","_") %>%
  str_replace_all("#","_num")
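
To make the renaming steps concrete, here is a minimal sketch that applies the same four string operations to three of the raw column names from the column specification. The vectors raw_names and clean_names are illustrative names only and are not used elsewhere in the analysis.

```r
library(stringr)

# Three raw column names copied from the column specification
raw_names <- c("CASE#", "DATE  OF OCCURRENCE", "FBI CD")

clean_names <- str_to_lower(raw_names)                  # lower-case everything
clean_names <- str_replace_all(clean_names, " ", "_")   # spaces become underscores
clean_names <- str_replace_all(clean_names, "__", "_")  # collapse the doubled underscore
clean_names <- str_replace_all(clean_names, "#", "_num") # replace "#" with a readable suffix

clean_names
```

The third step matters for the date column, whose raw name contains two consecutive spaces.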

Data Exploration and pre-processing

Before we start playing with data, it is important to understand how the data are organized, what fields are present in the table, and how they are stored. We can investigate the internal structure of this data easily since it’s stored as a tibble.

# Print out the tibble
crime_raw

# Understanding the data fields
str(crime_raw)

## tibble [234,401 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ case_num             : chr [1:234401] "JD163753" "JD212847" "JC320782" "JC497784" ...
##  $ date_of_occurrence   : POSIXct[1:234401], format: "2020-02-24 20:15:00" "2020-04-10 22:56:00" ...
##  $ block                : Factor w/ 27277 levels "031XX W LEXINGTON ST",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ iucr                 : Factor w/ 314 levels "1153","0560",..: 1 2 3 4 2 5 3 3 6 3 ...
##  $ primary_description  : Factor w/ 32 levels "DECEPTIVE PRACTICE",..: 1 2 3 3 2 3 3 3 4 3 ...
##  $ secondary_description: Factor w/ 427 levels "FINANCIAL IDENTITY THEFT OVER $ 300",..: 1 2 3 4 2 5 3 3 6 3 ...
##  $ location_description : Factor w/ 162 levels "RESIDENCE","RESIDENTIAL YARD (FRONT/BACK)",..: NA 1 2 3 4 5 6 7 1 8 ...
##  $ arrest               : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ domestic             : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 2 1 ...
##  $ beat                 : Factor w/ 274 levels "1134","2232",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ward                 : Factor w/ 50 levels "24","9","4","44",..: 1 2 3 4 3 5 2 6 7 8 ...
##  $ fbi_cd               : Factor w/ 26 levels "11","08A","06",..: 1 2 3 3 2 3 3 3 4 3 ...
##  $ x_coordinate         : num [1:234401] NA 1174583 NA NA NA ...
##  $ y_coordinate         : num [1:234401] NA 1836593 NA NA NA ...
##  $ latitude             : num [1:234401] NA 41.7 NA NA NA ...
##  $ longitude            : num [1:234401] NA -87.6 NA NA NA ...
##  $ location             : chr [1:234401] NA "(41.707000821, -87.636288063)" NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `CASE#` = col_character(),
##   ..   `DATE  OF OCCURRENCE` = col_datetime(format = "%m/%d/%Y %I:%M:%S %p"),
##   ..   BLOCK = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   IUCR = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   `PRIMARY DESCRIPTION` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   `SECONDARY DESCRIPTION` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   `LOCATION DESCRIPTION` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   ARREST = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   DOMESTIC = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   BEAT = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   WARD = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   `FBI CD` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   `X COORDINATE` = col_double(),
##   ..   `Y COORDINATE` = col_double(),
##   ..   LATITUDE = col_double(),
##   ..   LONGITUDE = col_double(),
##   ..   LOCATION = col_character()
##   .. )

# Summarize the data
summary(crime_raw)

##    case_num         date_of_occurrence                             block       
##  Length:234401      Min.   :2019-06-22 05:25:00   001XX N STATE ST    :   854  
##  Class :character   1st Qu.:2019-09-04 13:15:00   008XX N MICHIGAN AVE:   384  
##  Mode  :character   Median :2019-11-27 10:33:00   0000X W TERMINAL ST :   340  
##                     Mean   :2019-12-05 01:45:20   011XX S CANAL ST    :   286  
##                     3rd Qu.:2020-02-27 09:00:00   076XX S CICERO AVE  :   265  
##                     Max.   :2020-06-20 23:49:00   0000X S STATE ST    :   264  
##                                                   (Other)             :232008  
##       iucr                primary_description             secondary_description
##  0486   : 22362   THEFT             :54387    SIMPLE                 : 26734   
##  0820   : 21286   BATTERY           :45773    DOMESTIC BATTERY SIMPLE: 22362   
##  0460   : 14179   CRIMINAL DAMAGE   :25821    $500 AND UNDER         : 21286   
##  0810   : 12884   ASSAULT           :18911    OVER $500              : 12884   
##  1310   : 12861   DECEPTIVE PRACTICE:15785    TO PROPERTY            : 12861   
##  0560   : 12426   OTHER OFFENSE     :14223    TO VEHICLE             : 11998   
##  (Other):138403   (Other)           :59501    (Other)                :126276   
##  location_description arrest     domestic        beat             ward       
##  STREET   :52306      N:189233   N:193827   1834   :  2715   42     : 12721  
##  RESIDENCE:39001      Y: 45168   Y: 40574   421    :  1995   28     : 11577  
##  APARTMENT:34001                            624    :  1900   27     : 10562  
##  SIDEWALK :17434                            111    :  1884   24     :  9842  
##  OTHER    : 7101                            511    :  1858   6      :  9086  
##  (Other)  :83537                            1112   :  1842   (Other):180602  
##  NA's     : 1021                            (Other):222207   NA's   :    11  
##      fbi_cd       x_coordinate      y_coordinate        latitude    
##  06     :54387   Min.   :      0   Min.   :      0   Min.   :36.62  
##  08B    :38530   1st Qu.:1153345   1st Qu.:1858454   1st Qu.:41.77  
##  14     :25821   Median :1166888   Median :1892182   Median :41.86  
##  26     :17210   Mean   :1165099   Mean   :1885688   Mean   :41.84  
##  08A    :16583   3rd Qu.:1176592   3rd Qu.:1908190   3rd Qu.:41.90  
##  11     :14305   Max.   :1205112   Max.   :1951507   Max.   :42.02  
##  (Other):67565   NA's   :1352      NA's   :1352      NA's   :1352   
##    longitude        location        
##  Min.   :-91.69   Length:234401     
##  1st Qu.:-87.71   Class :character  
##  Median :-87.66   Mode  :character  
##  Mean   :-87.67                     
##  3rd Qu.:-87.63                     
##  Max.   :-87.52                     
##  NA's   :1352

# Identify duplicate identifiers
crime_raw %>%
  group_by(case_num) %>%
  mutate(count=n()) %>%
  filter(count>1)

The data are stored at the crime incident level, that is, there is one observation for each crime incident in the data table. Each incident has a unique identifier associated with it, which is stored in the case_num variable. By definition, then, case_num should have all unique values; however, we see that some instances are duplicated, i.e., there are two or more rows that have the same case_num value. For example, there are two rows in the data that have a case_num value equal to “JC438604”.

# Get row names for display
getrow<-t(filter(crime_raw,case_num=='JC438604'))

# Create tibble for example duplicate record
JC438604<-as_tibble(t(filter(crime_raw,case_num=='JC438604')),.name_repair=NULL,validate=NULL)

# add row names and reorganize the duplicate for display
JC438604 %>% 
  mutate(Variable=rownames(getrow)) %>%
  rename(Row1=V1,Row2=V2,Row3=V3) %>%
  select(Variable, Row1, Row2)

These duplicated rows need to be removed. Since the differences exist in only one variable and are minor, these duplications are likely a recording error. We can exclude the duplicated case_num values with the distinct() function, called here inside dplyr::filter().

# Remove duplicates
crime_no_dup<-filter(distinct(crime_raw,case_num,.keep_all=TRUE))

# Check to make sure 
crime_no_dup %>%
  group_by(case_num) %>%
  summarize(count=n()) %>%
  filter(count>1)
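
As a small illustration of what distinct() does with .keep_all = TRUE, the sketch below uses a made-up three-row table (the tibble toy and its values are hypothetical, not taken from the Chicago data). It keeps the first row it encounters for each case_num and drops the rest.

```r
library(dplyr)

# Hypothetical toy data: case "A" was recorded twice with a different ward
toy <- tibble(case_num = c("A", "A", "B"),
              ward     = c("1", "2", "3"))

# Keep one row per case_num; .keep_all = TRUE retains the other columns
toy_no_dup <- distinct(toy, case_num, .keep_all = TRUE)
toy_no_dup
```

Because the first occurrence wins, which version of a duplicated record survives depends on the row order of the input.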

The date_of_occurrence gives an approximate date and time stamp as to when the crime incident might have happened. This variable was initially read in as a character, but I used the col_datetime() designation with the correct format to make R recognize that this is in fact a date.

crime_no_dup %>% 
  select(date_of_occurrence) %>% 
  head()

crime_no_dup %>% 
  select(date_of_occurrence) %>% 
  tail()
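
To show how the format string in col_datetime() maps onto a raw value, here is a minimal sketch using readr's parse_datetime(), which accepts the same format specification. The string raw_stamp is an assumption: it is reconstructed from the first parsed value in the str() output, not copied from the raw file.

```r
library(readr)

# Assumed raw layout: month/day/year followed by a 12-hour clock with AM/PM
raw_stamp <- "02/24/2020 08:15:00 PM"

# %I is the 12-hour hour and %p is the AM/PM marker
parsed <- parse_datetime(raw_stamp, format = "%m/%d/%Y %I:%M:%S %p")
parsed
```

With no time zone supplied, readr labels the parsed value as UTC.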

The time zone for date_of_occurrence should be America/Chicago, even though it is stored in R as Coordinated Universal Time (UTC). The time zone is actually unnecessary here, since all of the crimes occurred in the same time zone. As such, I’m going to leave the time zone as UTC for this analysis. Moreover, setting the time zone to America/Chicago coerces four date_of_occurrence values to NA, since the hour beginning at 02:00:00 on March 8, 2020 doesn’t exist in local time (thank you, daylight saving time). As there were definitely crimes committed, those records should remain in the data set.
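
The effect of the missing hour can be reproduced on a few made-up time stamps. The sketch below assumes lubridate's force_tz() is used to relabel UTC clock times as Chicago local time; the vector stamps is hypothetical and straddles the 2020 spring-forward transition in America/Chicago (02:00 on March 8).

```r
library(lubridate)

# Three wall-clock times around the transition, stored as UTC
stamps <- ymd_hms(c("2020-03-08 01:30:00",
                    "2020-03-08 02:30:00",
                    "2020-03-08 03:30:00"), tz = "UTC")

# Relabel the same clock times as Chicago local time
chicago <- force_tz(stamps, tzone = "America/Chicago")

# The 02:30 entry has no valid local equivalent and becomes NA
is.na(chicago)
```

Leaving the column in UTC sidesteps this, at the cost of the time zone label being nominally wrong.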

R understands the data stored in the date_of_occurrence column is a date and time stamp. Processing the data a bit further we can separate the time stamps from the date part using the functions from the lubridate library.

The frequency of crimes is probably not consistent throughout the day. There could be certain time intervals of the day where criminal activity is more prevalent compared to other intervals. To check this, we can bucket the time stamps into a few categories and then look at the distribution across the buckets. As an example, we create four 6-hour time windows beginning at midnight to bucket the time stamps. The four time intervals we get are midnight to 6 AM, 6 AM to noon, noon to 6 PM, and 6 PM to midnight.

For bucketing, we first create bins using the four time intervals mentioned above. Once the bins are created, the next step is to match each time stamp in the data to one of these time intervals. This can be done using the cut function.

# Split the date-time into separate date and time-of-day columns
library(lubridate)
crime_clean<-crime_no_dup %>%
  mutate(time=hms::hms(minutes=hour(date_of_occurrence)*60+minute(date_of_occurrence)), # Time of day; hms stores it as seconds after midnight
         date=date(date_of_occurrence), # Separate date part from date time
         time_group=cut(as.numeric(time)/60,breaks=c(0,6*60,12*60,18*60,23*60+59),labels=c("00-06","06-12","12-18","18-00"),include.lowest = TRUE)) # Bucket minutes after midnight into four 6-hour windows

crime_clean %>% select(case_num, date_of_occurrence, date, time, time_group)

crime_clean %>% group_by(time_group) %>% summarize(count=n())

The distribution of crime incidents across the day suggests that crimes are more frequent during the latter half of the day.
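
One detail of cut() worth checking is which bucket an incident on an exact boundary (such as 06:00) falls into. The sketch below runs the same breaks used above on a few hypothetical minutes-after-midnight values and compares them with a left-closed alternative. The names minutes, bucket_labels, right_closed, and left_closed are illustrative only.

```r
# Minutes after midnight for a few hypothetical incidents
minutes <- c(0, 359, 360, 361, 720, 1439)
bucket_labels <- c("00-06", "06-12", "12-18", "18-00")

# Same breaks as above: cut() closes intervals on the right by default,
# so 06:00 (360) lands in "00-06" and 12:00 (720) lands in "06-12"
right_closed <- cut(minutes, breaks = c(0, 6*60, 12*60, 18*60, 23*60 + 59),
                    labels = bucket_labels, include.lowest = TRUE)

# Alternative: close intervals on the left so 06:00 starts the "06-12" bucket
left_closed <- cut(minutes, breaks = c(0, 6, 12, 18, 24) * 60,
                   labels = bucket_labels, right = FALSE)

data.frame(minutes, right_closed, left_closed)
```

Either convention is defensible. The choice only affects incidents recorded exactly on the hour at 06:00, 12:00, or 18:00.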

One of the core aspects of data mining is deriving increasingly more information from the limited data that we have. We will see a few examples of what we mean by this as we go along. Let’s start with something simple and intuitive.

We can use the date of incidence to determine which day of the week and which month of the year the crime occurred. It is possible that there is a pattern in the way crimes occur (or are committed) depending on the day of the week and month.

crime_clean <- crime_clean %>%
         mutate(
           day=wday(date,label=TRUE,abbr=TRUE),
           month=month(date,label=TRUE,abbr=TRUE)
         )

crime_clean %>% select(case_num, date_of_occurrence, day, month)
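
As a quick check of what wday() and month() return, the sketch below applies them to a single date, the last date that appears in the data summary. The variable d is an illustrative name; the printed labels depend on the locale, so no output is shown.

```r
library(lubridate)

# The last date in the data summary
d <- as.Date("2020-06-20")

wday(d)                              # day of week as a number, Sunday = 1 by default
wday(d, label = TRUE, abbr = TRUE)   # ordered factor with abbreviated day names
month(d)                             # month as a number
month(d, label = TRUE, abbr = TRUE)  # ordered factor with abbreviated month names
```

Because the labelled versions are ordered factors, plots and summaries keep the days and months in calendar order rather than alphabetical order.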

There are two fields in the data which provide the description of the crime incident. The first, primary_description, provides a broad category of the crime type, and the second, secondary_description, provides more detailed information about the first. We use the primary description to categorize different crime types.

# Specific crime types
(t<-crime_clean %>% 
  group_by(primary_description) %>%
  summarize(count=n()) %>%
  arrange(desc(count)))

The data contain 32 crime types, not all of which are mutually exclusive. We can combine two or more similar categories into one to reduce this number and make the analysis a bit more manageable.

# Some categories can be combined to reduce this number
crime_clean<-crime_clean %>%
  mutate(
    crime=fct_recode(primary_description,
                     "DAMAGE"="CRIMINAL DAMAGE",
                     "DRUG"="NARCOTICS",
                     "DRUG"="OTHER NARCOTIC VIOLATION",
                     "FRAUD"="DECEPTIVE PRACTICE",
                     "MVT"="MOTOR VEHICLE THEFT",
                     "NONVIOLENT"="LIQUOR LAW VIOLATION",
                     "NONVIOLENT"="CONCEALED CARRY LICENSE VIOLATION",
                     "NONVIOLENT"="STALKING",
                     "NONVIOLENT"="INTIMIDATION",
                     "NONVIOLENT"="GAMBLING",
                     "NONVIOLENT"="OBSCENITY",
                     "NONVIOLENT"="PUBLIC INDECENCY",
                     "NONVIOLENT"="INTERFERENCE WITH PUBLIC OFFICER",
                     "NONVIOLENT"="PUBLIC PEACE VIOLATION",
                     "NONVIOLENT"="NON-CRIMINAL",
                     "OTHER"="OTHER OFFENSE",
                     "SEX"="HUMAN TRAFFICKING",
                     "SEX"="CRIMINAL SEXUAL ASSAULT",
                     "SEX"="SEX OFFENSE",
                     "SEX"="CRIM SEXUAL ASSAULT",
                     "SEX"="PROSTITUTION",
                     "TRESPASS"="CRIMINAL TRESPASS",
                     "VIOLENT"="KIDNAPPING",
                     "VIOLENT"="WEAPONS VIOLATION",
                     "VIOLENT"="OFFENSE INVOLVING CHILDREN"
                     ),
    crime_type=fct_recode(crime,
                          "VIOLENT"="SEX",
                          "VIOLENT"="ARSON",
                          "VIOLENT"="ASSAULT",
                          "VIOLENT"="HOMICIDE",
                          "VIOLENT"="VIOLENT",
                          "VIOLENT"="BATTERY",
                          "NONVIOLENT"="BURGLARY",
                          "NONVIOLENT"="DAMAGE",
                          "NONVIOLENT"="DRUG",
                          "NONVIOLENT"="FRAUD",
                          "NONVIOLENT"="MVT",
                          "NONVIOLENT"="NONVIOLENT",
                          "NONVIOLENT"="ROBBERY",
                          "NONVIOLENT"="THEFT",
                          "NONVIOLENT"="TRESPASS",
                          "NONVIOLENT"="OTHER"
                          ) # Further combination into violent and non-violent crime types
  )
crime_clean %>%
  group_by(crime) %>%
  summarize(count=n()) %>%
  arrange(desc(count))

crime_clean %>%
  group_by(crime_type) %>%
  summarize(count=n()) %>%
  arrange(count)
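
The merging behavior of fct_recode() can be seen on a small made-up factor. In the sketch below, the factor f is hypothetical and contains only three of the primary descriptions. fct_collapse() is shown as an alternative from the same forcats package; it is not what the analysis above uses.

```r
library(forcats)

# Hypothetical toy factor with three of the primary descriptions
f <- factor(c("NARCOTICS", "OTHER NARCOTIC VIOLATION", "THEFT", "NARCOTICS"))

# Repeating a new name on the left merges the old levels on the right;
# levels that are not mentioned (THEFT) are left unchanged
recoded <- fct_recode(f,
                      "DRUG" = "NARCOTICS",
                      "DRUG" = "OTHER NARCOTIC VIOLATION")

# fct_collapse() expresses the same merge with one entry per new level
collapsed <- fct_collapse(f, DRUG = c("NARCOTICS", "OTHER NARCOTIC VIOLATION"))

levels(recoded)
table(recoded)
```

With many old levels mapping to a few new ones, fct_collapse() can be easier to read, while fct_recode() keeps each mapping on its own line.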

With a couple of basic variables in place, we can start with a few visualizations to see how, when, and where crime incidents occur.

Visualizations

Visualizing data is a powerful way to derive high-level insights about the underlying patterns in the data. Visualizations provide helpful clues as to where we need to investigate further. To see a few examples, we start with some simple plots of variables we processed in the previous section using the powerful ggplot2 library.

# Frequency of crime
library(scales)
crime_clean %>% 
  group_by(crime) %>%
  summarise(count=n()) %>%
  ggplot(aes(x = reorder(crime,count), y = count)) +
  geom_bar(stat = "identity", fill = "#756bb1") +
  labs(x ="Crimes", y = "Number of crimes", title = "Crimes in Chicago") + 
  scale_y_continuous(label = comma) +
  coord_flip()

The prevalence of different crimes is not evenly distributed in Chicago, with theft and battery being much more frequent than the other crime types. It would be interesting to look at how crimes are distributed with respect to time of day, day of week, and month.

# Time of day
crime_clean %>%
  ggplot(aes(x = time_group)) +
  geom_bar(fill = "#756bb1") +
  labs(x = "Time of day", y= "Number of crimes", title = "Crimes by time of day")

# Day of week
crime_clean %>%
  ggplot(aes(x = day)) +
  geom_bar(fill = "#756bb1") +
  labs(x = "Day of week", y = "Number of crimes", title = "Crimes by day of week")

# Month
crime_clean %>%
  ggplot(aes(x = month)) +
  geom_bar(fill = "#756bb1") +
  labs(x = "Month", y = "Number of crimes", title = "Crimes by month")

There does seem to be a pattern in the occurrence of crime with respect to time. The latter part of the day, Fridays, and the summer months see more crime incidents, on average, than the other corresponding time periods.

These plots show the combined distribution of all crime with respect to different intervals of time. We can demonstrate the same plots with additional information by splitting out the different crime types. For example, we can see how different crimes vary by different times of the day. To get the number of different crimes by time of day, we need to aggregate the data at a crime – time group level. That is, four rows for each crime type – one for each time interval of the day. An easy way to aggregate data is to use the summarize function.

library(viridis)
library(scales)
crime_clean %>%
  group_by(crime,time_group) %>%
  summarise(count=n()) %>%
  ggplot(aes(x=crime, y=time_group)) +
  geom_tile(aes(fill=count)) +
  labs(x="Crime", y = "Time of day", title="Theft occurs most often between noon and 6pm") +
  scale_fill_viridis_c("Number of Crimes",label=comma) +
  coord_flip()

A quick look at the heat map shows that most of the theft incidents occur in the afternoon, whereas drug-related crimes are more prevalent in the evening.
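
The aggregation step behind the heat map can be illustrated on a tiny made-up table. The sketch below assumes dplyr 1.0 or later (for the .groups argument); the tibble toy and its values are hypothetical. It also shows count(), a dplyr shorthand that produces the same result as group_by() followed by summarise(count = n()).

```r
library(dplyr)

# Hypothetical toy incidents
toy <- tibble(crime      = c("THEFT", "THEFT", "DRUG", "THEFT", "DRUG"),
              time_group = c("12-18", "12-18", "18-00", "06-12", "18-00"))

# group_by() + summarise(): one row per crime and time bucket
by_summarise <- toy %>%
  group_by(crime, time_group) %>%
  summarise(count = n(), .groups = "drop")

# count() is shorthand for the same aggregation
by_count <- toy %>% count(crime, time_group, name = "count")

by_summarise
```

Note that combinations with no incidents simply do not appear as rows, which is why sparse crime types can leave empty tiles in a heat map.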

We can perform a similar analysis by day of week and month as well.

# Crimes by day of the week
crime_clean %>%
  group_by(crime,day) %>%
  summarise(count=n()) %>%
  ggplot(aes(x=crime, y=day)) +
  geom_tile(aes(fill=count)) +
  labs(x="Crime", y = "Day of week", title="Battery is more prevalent on Sundays") +
  scale_fill_viridis_c("Number of Crimes",label=comma) +
  coord_flip()

# Crimes by month
# Aggregate to crime-by-month counts with group_by() and summarise(), as above
crime_clean %>%
  group_by(crime,month) %>%
  summarise(count=n()) %>%
  ggplot(aes(x=crime, y=month)) +
  geom_tile(aes(fill=count)) +
  labs(x="Crime", y = "Month of year", title="Summer is popular for crimes") +
  scale_fill_viridis_c("Number of Crimes",label=comma) +
  coord_flip()

Until now we have only looked at the temporal distribution of crimes. But there is also a spatial element attached to them. Crimes vary considerably across geographies. Typically, within an area like a zip code, city, or county, there will be pockets or zones which observe higher criminal activity than the others. These zones are labeled crime hot spots and are often the focus areas for effective predictive policing. We have the location of each crime incident in our data, which can be used to look for these spatial patterns in the city of Chicago. For this purpose, we will use the shape files for the Chicago Police Department’s beats, processing them in R with the maptools library. The shape files for CPD beats can be downloaded from https://data.cityofchicago.org/Public-Safety/BoundariesPolice-Beats/kd6k-pxkv.