Using data to understand crime

Introduction

Purpose of the study

We constantly hear statements on crime from famous personalities or our known circles like these:

  1. “Cime is rising” - Donald Trump (link)
  2. We are not safe anymore
  3. Xyz neighbourhood is bad. Never visit that place.

This study focuses to provide some insights on these views using actual data from the Los Angeles Police Department.

Proposed Methodology and Approach

Since this is a descriptive analysis, major focus of the analysis would be preparing data, slicing, dicing, vizualization and generating insights. Following is the step by step approach:

  1. Import data
  2. Data preparation
  3. Following are the different data cuts which we would look into:
    1. Time (year, month, day) vs. number of crimes and type of crimes
    2. Victim’s characteristics (age and descent)
    3. Geospatial data
    4. Types of weapons used in crimes
    5. Modus operandi vs. crime types
  4. Summarize the results from step 3 using various types of plots
  5. Generate insights

Outcomes

This analysis will provide readers basic findings and insights on the crime in LA. Some of the expected outcomes:

  1. Relatively safe and unsafe areas in LA
  2. Seasonality in crime. For example: specific day of the week or month of the year which is more prone to criminal activities
  3. Year on year increase in over-all crimes and also specific types of crimes
  4. Modus operandi of some major type of crimes
  5. Relationship of crime types and victim’s characteristics like sex and race descent

Packages Required

Following are the packages used for this analysis:

  1. DT - To display formatted tables in R markdown
  2. data.table - To import files using fread function
  3. dplyr - To perform data manipulation on tibble or dataframes
  4. tidyr - To convert data from wide to long format
  5. lubridate - To perform operations on date formats
  6. stringr - To manipulate strings
  7. pdftools - To import PDF files and convert them to text
  8. ggplot2 - To create plots
  9. ggthemes - To use themes for plots created
  10. scales - Formats for axes
  11. forecast - For time series forecasting
  12. leaflet - To create geospatial plots
################ Loading the required packages ################################

library(DT)         #To display scrollable tables in r markdown
library(data.table) #For import using fread
library(dplyr)      #For data manipulation
library(tidyr)      #For getting the data in tidy format
library(lubridate)  #For extracting and working with dates
library(stringr)    #For working of strings
library(pdftools)   #To import PDF files and convert them to text
library(ggplot2)    #To create plots
library(ggthemes)   #To use themes for plots created
library(scales)     #Formats for axes
library(forecast)   #For time series forecasting
library(leaflet)    #To create geospatial plots

Data Preparation

This section provides details on steps involved in preparing analytical dataset.

Data Source

Data for this analysis is sourced from the Los Angeles Police Department (LAPD) and is available here.

The data is transcribed from the original crime reports that are typed on paper and reflects criminal incidents in the City of Los Angeles. Due to the nature of data collection, we might encounter some inaccuracies. Also, address field is provided to the nearest hundred block to maintain privacy. Some important characteristics of data are provided below:

  1. Time Period - The data includes crimes from 01 January, 2010 to 28 October, 2017

  2. Attributes - For each crime, this data provides 26 different variables like date when crime occurred, date when crime was reported, location of the crime, type of crime, victim’s description and current status of the investigation. Detailed description of each variable is available in this Codebook.

  3. Missing values - Location field with missing values are replaced with (0o,0o). Also, unknown values of victim’s sex and victim’s descent are represented with character ‘X’.

Data Import

Following two files are imported directly from the LA city’s data website:

  1. A .csv file which contains data of the crimes in city since 01 Janaury, 2010

  2. A .PDF file which provides description of Modus Operandi(MO) codes of suspect in the crime data

################################## Importing Data #############################################

#Assign the URL for data to a variable
link_data <- "https://data.lacity.org/api/views/y8tr-7khq/rows.csv?accessType=DOWNLOAD&bom=true&format=true"

#Imort the data in R and ensure proper naming convention for column names
crime_data <- fread(link_data, col.names = c("DR_number",   "date_reported",    "date_occurred",    
                                             "time_occurred",   "area_id",  "area_name",    
                                             "reporting_district",  "crime_code",   
                                             "crime_code_description",  "MO_codes", 
                                             "victim_age",  "victim_sex",   "victim_descent",   
                                             "premise_code",    "premise_description",  
                                             "weapon_used_code",    "weapon_description",   
                                             "status_code", "status_description",
                                             "crime_code_1", "crime_code_2",    "crime_code_3", 
                                             "crime_code_4",    "address",  "cross_street", 
                                             "location"))

#Import the PDF files with MO code descriptions and convert it to text
link_MO_codes <- "https://data.lacity.org/api/views/y8tr-7khq/files/3a967fbd-f210-4857-bc52-60230efe256c?download=true&filename=MO%20CODES%20(numerical%20order).pdf"
MO_codes_txt <- pdf_text(link_MO_codes)

The crime_data contains 1,674,208 rows and 26 columns. Following code provides a glimpse of the data set imported:

#View the imported data
glimpse(crime_data)
## Observations: 1,674,208
## Variables: 26
## $ DR_number              <int> 1208575, 102005556, 418, 101822289, 421...
## $ date_reported          <chr> "03/14/2013", "01/25/2010", "03/19/2013...
## $ date_occurred          <chr> "03/11/2013", "01/22/2010", "03/18/2013...
## $ time_occurred          <int> 1800, 2300, 2030, 1800, 2300, 1400, 223...
## $ area_id                <int> 12, 20, 18, 18, 21, 1, 11, 16, 19, 9, 1...
## $ area_name              <chr> "77th Street", "Olympic", "Southeast", ...
## $ reporting_district     <int> 1241, 2071, 1823, 1803, 2133, 111, 1125...
## $ crime_code             <int> 626, 510, 510, 510, 745, 110, 510, 510,...
## $ crime_code_description <chr> "INTIMATE PARTNER - SIMPLE ASSAULT", "V...
## $ MO_codes               <chr> "0416 0446 1243 2000", "", "", "", "032...
## $ victim_age             <int> 30, NA, 12, NA, 84, 49, NA, NA, NA, 27,...
## $ victim_sex             <chr> "F", "", "", "", "M", "F", "", "", "", ...
## $ victim_descent         <chr> "W", "", "", "", "W", "W", "", "", "", ...
## $ premise_code           <int> 502, 101, 101, 101, 501, 501, 108, 101,...
## $ premise_description    <chr> "MULTI-UNIT DWELLING (APARTMENT, DUPLEX...
## $ weapon_used_code       <int> 400, NA, NA, NA, NA, 400, NA, NA, NA, N...
## $ weapon_description     <chr> "STRONG-ARM (HANDS, FIST, FEET OR BODIL...
## $ status_code            <chr> "AO", "IC", "IC", "IC", "IC", "AA", "IC...
## $ status_description     <chr> "Adult Other", "Invest Cont", "Invest C...
## $ crime_code_1           <int> 626, 510, 510, 510, 745, 110, 510, 510,...
## $ crime_code_2           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crime_code_3           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crime_code_4           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ address                <chr> "6300    BRYNHURST                    A...
## $ cross_street           <chr> "", "15TH", "", "WALL", "", "", "AVENUE...
## $ location               <chr> "(33.9829, -118.3338)", "(34.0454, -118...

The PDF file imported contains raw text. In order to convert it into a data frame, we will have to perform string manipulations and cleaning which is covered in the next section.

Data Cleaning

Currently, the imported datasets do not follow priciples of tidy data. This section provides step by step approach to data cleaning and creating new dataframes and features.

DF from PDF

Creating a dataframe from the raw text extracted from the PDF file containing MO descriptions. This dataframe will be useful when understanding MO codes associated with the crimes in crime_data.

######################## Cleaning PDF with Look up values for MO codes ########################

#Split the text using new line character in Windows i.e. \r\n
MO_codes_1 <- str_split(MO_codes_txt, pattern = '\r\n')

#Convert the list of chracter vector into single vector
MO_codes_2 <- unlist(lapply(1:length(MO_codes_1), function(x) MO_codes_1[[x]]))

#Convert the chracter vector into a dataframe
MO_codes_df <- as.data.frame(MO_codes_2, stringsAsFactors = FALSE)

#Split the strings into code and description; and remove the blank MO_Codes
MO_codes_f <- MO_codes_df %>%
                separate(col = MO_codes_2, into = c("MO_codes","MO_description"), sep = 4) %>%
                filter(str_trim(MO_codes) != "") %>%
                mutate(MO_description = str_trim(MO_description))

#View the data
glimpse(MO_codes_f)
## Observations: 532
## Variables: 2
## $ MO_codes       <chr> "0100", "0101", "0102", "0103", "0104", "0105",...
## $ MO_description <chr> "Suspect Impersonate", "Aid victim", "Blind", "...

The MO_codes_f dataframe contains 532 MO codes and its description.

Tidying Data

The crime_data does not follow the concepts of tidy data. The field MO_codes has multiple codes associated with one crime within same cell and also the location field has latitude and longitude within the same cell. In this section we will deal these two problems.

Since one crime has many MO codes within one cell, analyzing Modus Operandi with respect to crime will be problematic. Hence, let us create a new dataframe which has only crime codes and modus operandi in tall format. This dataframe will be unique at a DR_number (unique identifier for crime) and MO code level.

##################### Create separate table for DR_Number & MO Codes ##########################

#Keep only two columns - DR number and codes & Drop MO_codes from main table
crime_MO_codes <- crime_data[,c("DR_number","MO_codes")]

#Remove data with blank MO_codes
crime_MO_codes <- crime_MO_codes %>%
                      filter(MO_codes != '')

#Split the column MO_codes into individual MO_codes
crime_MO_codes_sp <- cbind(crime_MO_codes$DR_number, data.frame(do.call('rbind',
                          strsplit(as.character(crime_MO_codes$MO_codes)," ",fixed=TRUE)), 
                          stringsAsFactors = FALSE))
colnames(crime_MO_codes_sp)[1] <- "DR_number"

#Make this a tall dataset instead of wide
crime_MO_map <- crime_MO_codes_sp %>% 
                        gather("var_for_MO_codes", "MO_codes", -DR_number) %>%
                        #Keeping distinct pairs of crime and MO codes
                        group_by(DR_number, MO_codes) %>% 
                        summarise(n = n()) %>%
                        select(DR_number, MO_codes) #Keeping the required columnns

#Delete MO Codes from crime data
crime_data <- crime_data[,-"MO_codes"]

#View the mapping of crime with MO Codes
glimpse(crime_MO_map)
## Observations: 4,045,004
## Variables: 2
## $ DR_number <int> 1208575, 1208575, 1208575, 1208575, 1307355, 1307355...
## $ MO_codes  <chr> "0416", "0446", "1243", "2000", "0913", "1814", "200...

Now, let us split the location column in the crime_data into latitude and longitude so that the dataset follows the principles of tidy data.

###################### Split location into latitude and longitude ##########################

crime_data <- crime_data %>%
                  #Split the location variable using ','
                separate(location, into = c("latitude", "longitude"), sep = ',') %>%
                  #Replace brackets in the strings and then convert them to numeric
                mutate(latitude = as.numeric(str_replace(latitude,'\\(','')),
                longitude = as.numeric(str_replace(longitude,'\\)','')))

#Replacing (0,0) to NAs as per the codebook
crime_data$latitude[crime_data$latitude==0] <- NA
crime_data$longitude[crime_data$longitude==0] <- NA
Formatting Columns

Converting the date columns (imported as character) into date formats so that we can operate on them.

############################ Converting dates & time to proper format ########################

crime_data <- crime_data %>%
                    #Splitting using -3 to incorporate values like 800, 530, etc.
                  separate(col = time_occurred, into = c("hour_occurred", "min_occurred"), 
                           sep = -3) %>% 
                    #Padding extra zeros so that lubridate function hm doesn't 
                    #generate NAs for cases where hour = ""
                  mutate (hour_occurred = str_pad(hour_occurred, 2, side="left", 
                                                  pad = 0)) %>%
                    #Adding a separator between hour and minutes 
                    #for lubridate to identify time correctly
                  unite(time_occurred, hour_occurred, min_occurred, sep = ':', 
                        remove = FALSE) %>%
                    #formatting the date and time variables
                  mutate(date_reported = mdy(date_reported), 
                         date_occurred = mdy(date_occurred),
                         time_occurred = hm(time_occurred))

Converting victim’s sex and victim’s descent from characters to factors for ease of summary operations:

####################### Converting variables like sex into factor format ######################

#Convert the variables sex, descent and status code into factors
crime_data <- crime_data %>%
                  mutate(victim_sex = as.factor(victim_sex),
                         victim_descent = as.factor(victim_descent))
Cleaning Observations

Looking at the summary of all the variables in the dataset, we observe that victim_sex and victim_descent have invalid observations with value ‘-’. Assiging these values to ‘X’ i.e. unknown as per the Codebook:

####################### Replacing ivalid observations #######################################

#For sex and descent

levels(crime_data$victim_sex)[levels(crime_data$victim_sex) == '-'] <- 'X'
levels(crime_data$victim_descent)[levels(crime_data$victim_descent) == '-'] <- 'X'

Also, from the summary we notice the fact that 6 observations have missing values for crime_code_1 but crime_code does not have any missing values. According to the code book crime_code_1 should be same as crime_code. On more inspection, we realise that crime_code_2 has values for crime_code_1. Hence, we can rectify this error by replacing crime_code_1 values with crime_code and assigning NA to crime_code_2 for those observations.

####################### Replacing ivalid observations #######################################

#For crime_code_1 and crime_code_2

index <- which(is.na(crime_data$crime_code_1))
crime_data[index, "crime_code_1"] <- crime_data[index, "crime_code"]
crime_data[index, "crime_code_2"] <- NA

New features

In this section we will create few additional variables which will be used to analyze the data:

  1. days_before_reporting - This variable provides the difference between the crime reported date and crime occurred date in days
  2. year_occurred - This variable indicates the year in which crime occurred
  3. month_occurred - This variable indicates the month in which crime occurred
  4. day_occurred - This variable indicates the day of month on which crime occurred
  5. day_of_week - This variable indicates the day of the week on which crime occurred
##################### Create new variables for analysis ##################################

crime_data <- crime_data %>%
                  #Creating a variable for days between crime occurred and crime reported
                mutate(days_before_reporting = as.numeric(date_reported - date_occurred),
                          #Creating variables on date when crime occurred
                        year_occurred = year(date_occurred),
                        month_occurred = month(date_occurred),
                        day_occurred = day(date_occurred),
                        day_of_week = wday(date_occurred, label = TRUE))

# Finally Filtering the data to restrict dates till latest full month available i.e. Oct 2017
crime_data_oct17 <- crime_data %>%
                        filter(date_occurred <= '2017-10-31')

Summary

From the raw crime dataset we created following two datsets:

  1. crime_data - After cleaning and creating new variables, the final crime dataset has 1,624,790 rows and 33 columns. Each row in this data represents a crime incident and each column provides one attriubte of the crime. DR_number is the unique identifier for this data. We have filtered the data to restrict date of crime till Oct 2017 as that is the latest full month of data available.

  2. crime_MO_map - MO_codes from the crime data were removed as single crime record has multiple modus operandi codes. A new dataset crime_MO_map was created which has 4,045,004 rows and 2 columns to analyze data at MO_code level. This dataset is unique at DR_number and MO_code level.

A PDF with explanations of MO_codes was imported and a dataset MO_codes_f with 532 rows and 2 columns was created.

Following table provides basic summary of variables which will be used for the analysis:

Following table represents a 100 rows preview of final crime data which will be used for analysis:

EDA Plots

Following sections display different data cuts and visualization techniques used to generate insights from the LA crime data.

Following is the format used in this section:

  1. Code
  2. Plot
  3. Inference

Seasonality Analysis

This section tries to explore the seasonality aspect of number of crimes. We start by looking at the data at year level and then eventually drill down to hour level analysis.

################# Over-all year level numbers #############################

#Identify top 5 crimes in the dataset
top_5_crimes <- crime_data_oct17 %>%
  group_by(crime_code, crime_code_description) %>%
  summarise(n = n_distinct(DR_number)) %>%
  arrange(desc(n)) %>%
  head(5) %>%
  select(crime_code, crime_code_description) %>%
  rename("top_5_crime_desc" = "crime_code_description")

#Filter the data for top 5 crimes by using joins
crime_data_oct17 %>%
  left_join(top_5_crimes, by = "crime_code") %>%
  mutate(top_5_crime_desc = ifelse(is.na(top_5_crime_desc), 
                                   "OTHERS", top_5_crime_desc)) %>%
  group_by(year_occurred, top_5_crime_desc) %>%
  summarise(n = n_distinct(DR_number)) %>%
  mutate(pct = n/sum(n), #Calculate percentage for labels
         ypos = cumsum(n) - 0.5*n) %>% #Calculate postions for label
  
  #Plot the stacked bar graph
  ggplot(aes(x = year_occurred, y = n, fill = top_5_crime_desc)) +
  geom_bar(position = position_stack(reverse = TRUE), stat = "identity", width = 0.7, 
           alpha = 0.5, color = "black") +
  
  #Axis titles and labels
  geom_text(aes(label = paste0(sprintf("%1.0f", pct*100),"%"), y = ypos), size = 3) +
  labs(x = "Year", y = "Number of crimes occurred", 
       title = "Number of crimes occurred per year split by top 5 crime categories",
       fill = "Crime description") +
  scale_y_continuous(labels = comma) +
  theme_classic() 

From the graph above, we can infer that number of crimes each year were decreasing from 2010 to 2013 but the trend reversed from 2014 and it started increasing. We cannot look at 2017 as it just has data available for 10 months.

The proportion of major crime types within each year remains more or less similar.

################### Month wise roll up and plot ########################

#Roll up the data at monthe level
crime_data_oct17 %>%
  group_by(month_occurred) %>%
  summarise(n = n_distinct(DR_number)) %>%

  #Plot the data  
  ggplot(aes(x = as.factor(month_occurred), y = n)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.3) +
  
  #Format axes, titles and lables
  geom_text(aes(x = month_occurred, y = 1, label = comma(n)),
            hjust=0, vjust=.5, size = 3, colour = 'black', fontface = 'bold') +
  scale_y_continuous(labels = comma) +
  labs(y = "Number of crime occurred", x = "Month", 
       title = "Number of crimes by month") +
  coord_flip() +
  theme_bw() 

January has highest number of crimes reported while November and December has the lowest number of crimes reported. We can hypothesize that during holiday season either the crime rates are low or crimes are being reported or recorded late due to the holiday season.

########################## Day of week vs. hour of the day ####################

#Roll up the data at week and hour level
crime_data_oct17 %>%
  group_by(day_of_week, hour_occurred) %>%
  summarise(n = n_distinct(DR_number)) %>%
  
  #Plot the tiles  
  ggplot(aes(x = day_of_week, y = hour_occurred)) +
  geom_tile(aes(fill = n)) +
  scale_fill_gradient(low = "light blue", high = "dark blue") +
  
  #Axes labels and titles
  labs(x = "Day of the week", y = "Hour of the day", 
       title = "Day of the week Vs. Hour of the day",
       fill = "Number of crimes") +
  theme_bw()

A counter intuitive trend is noticed from the graph above. The crimes are maximum at 12 in the afternoon. Digging deeper into data below to analyze this trend.

################## Looking at crimes at 12:00 ########################

#Filter and roll up the data for hour == 12
crime_data_oct17 %>%
  filter(hour_occurred == '12') %>% 
  group_by(crime_code_description) %>%
  summarise(num_crimes = n_distinct(DR_number)) %>%
  arrange(desc(num_crimes)) %>%
  head(10) %>%
  mutate(crime_code_description = reorder(crime_code_description,num_crimes)) %>%
  
  #Plot the data
  ggplot(aes(x = crime_code_description, y = num_crimes)) +
  geom_bar(stat = "identity", fill = "maroon", alpha = 0.5) +
  
  #Format axes, titles and labels
  geom_text(aes(x = crime_code_description, y = 1, label = comma(num_crimes)),
            hjust=-0.15, vjust=.5, size = 3, colour = 'black',
            fontface = 'bold') +
  scale_y_continuous(labels = comma) +
  labs(y = "Number of crime occurred", x = "Crime Description", 
       title = "Top 10 crimes at 12:00 hours") +
  coord_flip() +
  theme_bw() 

The above plot shows that maximum number of crimes occuring during noon is theft of identity. This may be because time for these type of crimes cannot be determined and 12:00 hrs may be used by the police to report them.

To analyze when does harmful crime occur during the day or week let us look at the crimes where weapons were used. This will provide a good proxy for harmful crimes.

############ Day of the week Vs. Hour for crimes where weapons was used ########

#Filter and roll up the data
crime_data_oct17 %>%
  filter(!is.na(weapon_used_code)) %>% #Filter for crimes where weapon was used
  group_by(day_of_week, hour_occurred) %>%
  summarise(n = n_distinct(DR_number)) %>%
  
  #Plot the tiles  
  ggplot(aes(x = day_of_week, y = hour_occurred)) +
  geom_tile(aes(fill = n)) +
  scale_fill_gradient(low = "light blue", high = "dark blue") +
  
  #Axes labels and titles
  labs(x = "Day of the week", y = "Hour of the day", 
       title = "Day of the week Vs. Hour of the day for crimes where a Weapon was used",
       fill = "Number of crimes") +
  theme_bw()

The above heat map provides some good information regarding crime occurrence in which a weapon was used. We can notice that: 1. During a day, crime generally is low during morning hours i.e. around 6 AM and then gradually increases till 11 PM or 1 AM and then again falls down. 2. Weekends (Saturday and Sunday) have higher crime occurrences compared to other days 3. Saturday and Sunday around 1 AM has very high crime occurrences

Crime Type and Victim Analysis

This section explores data on victims’ characteristics, crime types and modus operandi of top 2 crimes in LA

######################### Victims Age distiribution by sex ##########################

#Filter the data to remove missing values and filter for victim's sex
crime_data_oct17 %>%
  filter(!is.na(victim_age) & victim_sex %in% c('F', 'M')) %>%
  
  #Plot the histograms
  ggplot(aes(victim_age, fill = victim_sex)) + 
  geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') +
  
  #Axes labels and titles
  labs(x = "Victim's Age", y = "Density", 
       title = "Distribution of Victim's Age by Victim's Sex",
       fill = "Victim's Sex") +
  theme_bw()

For the age group around 25, number of crimes against women is typically higher than men. Also, overall the crime against age group of 20 to 30 is highest. This may be due to the outward and independent lifestyle of this age group compared to other age groups.

############Top 10 crime with showing distirbution of sex in them ##################

#Creating table with top 10 crimes where victim's sex is identified
top_10_crimes_victims <- crime_data_oct17 %>%
  filter(victim_sex %in% c('F', 'M')) %>%
  group_by(crime_code, crime_code_description) %>%
  summarise(n = n_distinct(DR_number)) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  select(crime_code, crime_code_description) %>%
  rename("top_10_crime_desc" = "crime_code_description")

#Filtering the data by joining with top_10_crimes identified
crime_data_oct17 %>%
  filter(victim_sex %in% c('F', 'M')) %>%
  inner_join(top_10_crimes_victims, by = "crime_code") %>%
  group_by(victim_sex, top_10_crime_desc) %>%
  summarise(n = n_distinct(DR_number)) %>%

  #Plot the line  
  ggplot(aes(x = top_10_crime_desc, y = n, fill = victim_sex)) +
  geom_bar(position = position_stack(reverse = TRUE), stat = "identity", width = 0.7, 
           alpha = 0.5, color = "black") +
  labs(x = "Crime Description", y = "Number of crimes occurred", 
       title = "Top 10 crimes split by Gender",
       fill = "Victim's gender") +
  scale_y_continuous(labels = comma) +
  coord_flip() +
  theme_bw() 

The top 10 crimes for the above graph are identified after filtering data for crimes where victim’s sex was identified. We can infer from the graph that proportion of male and female victims crimes depends on the nature of the crime. For example, Intimate Partner Simple Assualt has higher proportion of female victims while Assault With Deadly Weapons has higher proportion of male victims. Theft of Identity has similar proportion of male and female victims.

############### Victims descent vs. age boxplots ####################

#Filter the data
crime_data_oct17 %>%
  filter(!is.na(victim_age) & victim_descent != '') %>% #Remove blanks and NAs
  
  #Box plots
  ggplot (aes(x=victim_descent, y=victim_age)) +
  geom_boxplot(aes(color = victim_descent)) +

  #Axes labels and titles
  labs(x = "Victim's Age", y = "Victim's Descent", 
       title = "Distribution of Victim's Age across Victim's Descent") +
  ggtitle("Victim's Age Vs. Victim's Descent") +
  scale_x_discrete(labels=c('Other Asian',
                            'Black',
                            'Chinese',
                            'Cambodian',
                            'Filipino',
                            'Guamanian',
                            'Hispanic/Latin/Mexican',
                            'American Indian/Alaskan Native',
                            'Japanese',
                            'Korean',
                            'Laotian',
                            'Other',
                            'Pacific Islander',
                            'Samoan',
                            'Hawaiian',
                            'Vietnamese',
                            'White',
                            'Unknown',
                            'Asian Indian')) +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none")

Crimes against people of Other Asian descent clearly stands out. The average age of victim is significantly low compared to victims from other descents. This might be due to higher immigrant population from Asia.

Analyzing modus operandi of crimes helps in taking precautionary measures to avoid the crimes, in future. Following section explores the modus operandi for two major crime types.

############## Modus Operandi for burglary from vehicles ##################

#Filter data for burglary from vehicle
crime_data_oct17 %>%
  filter(crime_code == 330) %>%
  
  #Merge it with MO Codes based on DR_number
  inner_join(crime_MO_map, by = "DR_number") %>%
  group_by(MO_codes) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  
  #Merge it with description file to get the MO_codes description
  inner_join(MO_codes_f, by = "MO_codes") %>%
  #Removing first row because it states that burglary took place 
  #which is obvious in this scenario as we are look at theft from car
    filter(row_number(desc(n)) > 1 & row_number(desc(n)) <= 11) %>%
    mutate(MO_description = reorder(MO_description,n)) %>%

  #Plot the data
  ggplot(aes(x = MO_description, y = n)) +
  geom_bar(stat = "identity", fill = "purple", alpha = 0.7) +
  
  #Format axes, labels and titles
  geom_text(aes(x = MO_description, y = 1, label = comma(n)),
            hjust=-0.25, vjust=.5, size = 3, colour = 'white',
            fontface = 'bold') +
  scale_y_continuous(labels = comma) +
  labs(y = "Number of cases", x = "Modus Operandi", 
       title = "Top Modus Operandi for burglary from vehicle") +
  coord_flip() +
  theme_bw() 

The bar plot depicts top 10 ways in which Burglary from vehicle is executed. The crime may involve one or more aspects of the above modus operendi.

In LA, Burglary from a vehicle is generally done by smashing, breaking windows, using another vehicle and forceful entry.

#Filter data for battery - simple assault
crime_data_oct17 %>%
  filter(crime_code == 624) %>%
  
  #Merge it with MO Codes based on DR_number
  inner_join(crime_MO_map, by = "DR_number") %>%
  group_by(MO_codes) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  
  #Merge it with description file to get the MO_codes description
  inner_join(MO_codes_f, by = "MO_codes") %>%
  head(10) %>%
  mutate(MO_description = reorder(MO_description,n)) %>%
  
  #Plot the data
  ggplot(aes(x = MO_description, y = n)) +
  geom_bar(stat = "identity", fill = "dark green", alpha = 0.7) +
  
  #Format axes, titles and labels
  geom_text(aes(x = MO_description, y = 1, label = comma(n)),
            hjust=-0.25, vjust=.5, size = 3, colour = 'white',
            fontface = 'bold') +
  scale_y_continuous(labels = comma) +
  labs(y = "Number of cases", x = "Modus Operandi", 
       title = "Top Modus Operandi for simple assault") +
  coord_flip() +
  theme_bw() 

Again, the bar plot above depicts top 10 modus operandi involved in Battery - Simple Assault. In LA, most of the simple assaults involve a hit with or without weapon. We can also notice that generally victim knows the suspect.

Geospatial Analysis

In this we will analyze the location of crime. Since, overall number of crimes is very high, plotting them using latitude and longitude won’t reveal any patterns. Hence, let us look at the data at police station level.

Following graph depicts 21 LAPD community police stations (denoted by circles). The size of the circle on the map is directly proportional to the number of crimes occurred in that area.

#################### Number of crimes based on Area  ###################

#Identify center of LA to position the maps accordingly in leaflet
center_lat = median(crime_data_oct17$latitude,na.rm = TRUE)
center_lon = median(crime_data_oct17$longitude, na.rm = TRUE)

#Crime at Area level - size of the circle is directly proportional to number of crime incidents
#Remove nulls and roll up the data at area level
crime_data_oct17 %>%
    filter(!is.na(latitude)) %>%
    group_by(area_name) %>%
    summarise(num_crimes = n(),
              # calculate area lat long to identify center of area
              latitude = median(latitude),
              longitude = median(longitude)) %>% 
  
    #plot the data
    leaflet() %>% 
    addProviderTiles("OpenStreetMap") %>%
    addCircles(lng = ~longitude, lat = ~latitude, color = c("maroon"), 
               radius = ~num_crimes/30) %>%
    setView(lng = center_lon, lat = center_lat, zoom = 10)

77th Street and Southwest has maximum number of crimes reported while Foothill and Hollenbeck has least number of crimes reported.

Interactive Dashboard

Following is a shiny app (hosted here) which we have created to allow the users to interact with data i.e. filter for crimes based on year, time of the day, major crime types and victim’s sex. The app shows the data for crime in which adults were arrested. This provides a proxy to analyze severe crime patterns.

Trend Analysis

This section deals with the time series trends in crime reported over the years.

######## Plot number of crimes occurred on each day ########

#Summarize the data on daily level
crime_data_oct17 %>%
  group_by(date_occurred) %>%
  summarise(n = n_distinct(DR_number)) %>%
  arrange(date_occurred) %>%

  #Plot the line  
  ggplot(aes(x = date_occurred, y = n)) +
  geom_line() +
  
  #Point out the spikes in the data
  annotate("rect",xmin = as.Date('2010-12-25'), xmax = as.Date('2011-01-10'),
                ymin = 350, ymax = 2100, fill = "red", alpha = 0.5) +
  annotate("rect",xmin = as.Date('2012-05-25'), xmax = as.Date('2012-06-05'),
           ymin = 450, ymax = 1050, fill = "blue", alpha = 0.5) +
  annotate("text", label = "Spikes on 1st January of each year", 
           x = as.Date('2011-07-15'), y = 2150, size = 3.5, color = "red", alpha = 0.5) +
  annotate("text", label = "Small Spikes on 1st every month", 
           x = as.Date('2012-11-10'), y = 1100, size = 3.5, color = "blue", alpha = 0.5) +

  #Axes labels and titles
  labs(x = "Date on which crime occurred", y = "Number of crimes", 
       title = "Number of crimes Vs. Date Occurred") +
  scale_y_continuous(labels = comma) +
  theme_bw()

When plotting data at day level from 2010, we noticed that number of crimes reported on 1st of January has a very high spike. This may be due to the Christmas break but then this spike reduces significantly as we move towards 2017. We can infer that there might be some issue in data reporting. Also, we observe small spikes at the start of each month.

Since, daily level trends have a very high variation, looking at month level trends in following section.

################ Summarize the data on month level & then plot ########

crime_data_oct17 %>%
    mutate(year_month = format(as.Date(date_occurred),"%Y_%m")) %>%
    group_by(year_month) %>%
    summarise(n = n_distinct(DR_number)) %>%
    
    #Plot the line  
    ggplot(aes(x = year_month, y = n, group = 1)) +
    geom_line(color = "blue") +
    geom_point(color = "blue") +
    geom_smooth(se = FALSE, color = "dark grey") +

    #Axes labels and titles
    labs(x = "Year and Month of crime occurrence",
        y = "Number of crimes",
        title = "Number of crimes Vs. Year and Month of crime occurrence") +
    scale_y_continuous(limits = c(10000,20500), labels = comma) +
    scale_x_discrete(breaks = 
      levels(as.factor(format(as.Date(crime_data_oct17$date_occurred),"%Y_%m")))[c(T, rep(F, 11))]) +
    theme_bw()

As expected, the month level trends has less variation compared to day level data. We can observe that numnber of crimes has a slight decreasing trend till 2014 and an increasing trend henceforth. Also, we can notice that there is always a spike in number of crimes in January and a dip in the month of February.

####################### trends for top 5 crimes ########################

#Filter the data by joining with top 5 crimes data created earlier 
crime_data_oct17 %>%
  inner_join(top_5_crimes, by = "crime_code") %>%
  mutate(year_month = format(as.Date(date_occurred),"%Y_%m")) %>%
  group_by(year_month, crime_code_description) %>%
  summarise(n = n_distinct(DR_number)) %>%
  
  #Plot the line  
  ggplot(aes(x = year_month, y = n, group = 1)) +
  geom_line(color = "blue") +
  geom_point(color = "blue", size = 0.5) +
  facet_wrap(~ crime_code_description, nrow = 5, ncol = 1) +
  
  #Axes labels and titles
  xlab("Year and Month of crime occurrence") +
  ylab("Number of crimes") +
  ggtitle("Monthly trends for top 5 crimes in LA") +
  scale_y_continuous(labels = comma) +
  scale_x_discrete(breaks = 
                     levels(as.factor(format(as.Date(crime_data_oct17$date_occurred),"%Y_%m")))[c(T, rep(F, 23))]) +
  theme_bw()

The above plots show monthly trends of top 5 crimes in LA. Simple Assault and Burglary remains more or less similar over time. Burglary from vehicle and vehicles stolen initially have a decreasing trend and then from 2014 it increases (similar to overall number of crimes). Petty thefts have a slight increasing trend overall.

Let us use ARIMA to foreast number of crimes for next 5 months.

#################### Time series for top 5 crimes ####################

#Filter for data for top 5 crimes
top_crimes_monthly_data <- crime_data_oct17 %>%
                            inner_join(top_5_crimes, by = "crime_code") %>%
                            mutate(year_month = format(as.Date(date_occurred),"%Y_%m")) %>%
                            group_by(year_month, crime_code_description) %>%
                            summarise(num_crimes = n_distinct(DR_number))

top_crime_desc <- unique(top_crimes_monthly_data$crime_code_description)

#Creating a for loop to predict and plot trends for top 5 crimes
for(i in 1:length(top_crime_desc))
{
num_crimes <- filter(top_crimes_monthly_data, 
                     crime_code_description == top_crime_desc[i])$num_crimes

#Creating time series object
crimes_ts <- ts(num_crimes)

#Fitting the best arima model
model <- auto.arima(crimes_ts, stepwise = FALSE, approximation = FALSE)

#Predict next 3 months based on the model created
predict <- model %>% forecast(level = c(95), h = 5)

#Plot the graph 
print(predict %>% 
        autoplot() +
        labs(x = "Year and Month from Oct 2010 to Mar 2018",
             y = "Number of crimes",
             title = paste0("Crime Forecast for ", top_crime_desc[i]),
             label = "XYZ") +
        scale_y_continuous(labels = comma) +
        theme_bw())

#Print the table for predictions
year_month <- c("2017_11", "2017_12", "2018_01", "2018_02", "2018_03")
table <- cbind(year_month, round(predict$mean, 0), 
               round(predict$lower, 0), round(predict$upper, 0))
colnames(table) <- c("Year_Month", "Predicted_Mean", "95%_CI_lower_estimate", "95%_CI_upper_estimate")
print(paste0("Predictions for ", top_crime_desc[i], ":"))
print(as.data.frame(table))

}

## [1] "Predictions for BATTERY - SIMPLE ASSAULT:"
##   Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1    2017_11           1592                  1372                  1812
## 2    2017_12           1496                  1262                  1730
## 3    2018_01           1399                  1145                  1653
## 4    2018_02           1378                  1123                  1634
## 5    2018_03           1402                  1146                  1657

## [1] "Predictions for BURGLARY:"
##   Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1    2017_11           1272                  1072                  1473
## 2    2017_12           1257                  1042                  1472
## 3    2018_01           1253                  1034                  1472
## 4    2018_02           1251                  1030                  1472
## 5    2018_03           1251                  1028                  1473

## [1] "Predictions for BURGLARY FROM VEHICLE:"
##   Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1    2017_11           1571                  1364                  1778
## 2    2017_12           1571                  1343                  1799
## 3    2018_01           1571                  1324                  1818
## 4    2018_02           1571                  1306                  1836
## 5    2018_03           1571                  1289                  1853

## [1] "Predictions for THEFT PLAIN - PETTY ($950 & UNDER):"
##   Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1    2017_11           1246                  1081                  1411
## 2    2017_12           1284                  1101                  1468
## 3    2018_01           1263                  1053                  1474
## 4    2018_02           1271                  1029                  1512
## 5    2018_03           1272                  1015                  1529

## [1] "Predictions for VEHICLE - STOLEN:"
##   Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1    2017_11           1520                  1320                  1721
## 2    2017_12           1576                  1347                  1806
## 3    2018_01           1559                  1293                  1824
## 4    2018_02           1556                  1269                  1843
## 5    2018_03           1558                  1266                  1849

The above forecasts are generated from auto selection of ARIMA models. We need to do proper model fit analysis before coming to conclusions. This part is beyond the scope of this project for now.

We can draw some inferences from the predictions like Simple Assault crime cases might decrease for next few months, Burglary and Burglary from vehicle crime cases might remain similar for next few months and vehicle stolen might increase in next few months.

Summary

The purpose of this study was to generate some insights on the views on crime like “Crime is rising” using actual data from the Los Angeles Police Department. Since this is a descriptive analysis, we approached this problem by cleaning the data and then slicing and dicing it to generate insights. Mainly Visualizations were used to generate insights. Following are some key insights on LA crime:

  1. Battery - Simple Assault, Burglary, Burglary from vehicle, Petty theft and Vehicle stolen are top 5 crimes in the city
  2. The total number of crimes in LA showed a decreasing trend from 2010 till 2014. But, an increasing trend since 2014
  3. Proportion of top five crimes in a year remains similar since 2010
  4. January month has the highest number of reported crime whereas Novemeber has the least number of crimes reported.
  5. We generally percieve that crimes occur at odd times of the day. But, that is true only for dangerous crime. Crimes like identity theft majorly occur during the day.
  6. Saturday and Sunday around 1 AM has highest number of crimes reported when compared to all other times during the week
  7. Based on age, major victims of crime belong to 20 to 30. This may be due to the outward and independent lifestyle
  8. Victims belonging to Other Asian descent have a very low average age when compared to victims from other descent
  9. Burglary from a vehicle is generally done by smashing, breaking windows, using another vehicle and forceful entry
  10. Most of the simple assaults involve a hit with or without weapon. We can also notice that generally victim knows the suspect
  11. Areas under 77th Street Police Department and Southwest Police Department have maximum number of crimes reported

This analysis will provide consumers a general idea about the crime in LA with data to support the conclusions. Consumers can also explore the data on a map using the following shiny app

Following are the next steps to improvise this analysis:

  1. Improve the ARIMA time series models by performing model fit analysis
  2. Improvise shiny app to reduce the time it takes to load initially
  3. Statistical significance is not checked for the inferences made above. Hypothesis testing will provide strong support to the conclusions