Exploratory Data Analysis

Introduction

Purpose of the study

We constantly hear statements on crime from famous personalities or our known circles like these:

  1. “Cime is rising” - Donald Trump (link)
  2. We are not safe anymore
  3. Xyz neighbourhood is bad. Never visit that place.

This study focuses to provide some insights on these views using actual data from the Los Angeles Police Department.

Proposed Methodology and Approach

Since this a descriptive analysis, major focus of the analysis would be preparing data, slicing, dicing, vizualization and generating insights. Following is the step by step approach:

  1. Import data
  2. Data preparation
  3. Following are the different data cuts which we will look into:
    1. Time (year, month, day) vs. number of crimes and type of crimes
    2. Victim’s characteristics (age and descent)
    3. Geospatial data
    4. Arrested cases vs. crime types
    5. Types of weapons used in crimes
    6. Modus operandi vs. crime types
  4. Summarize the results from step 3 using various types of plots
  5. Generate insights

Outcomes

This analysis will provide readers basic findings and insights on the crime in LA. Some of the expected outcomes:

  1. Relatively safe and unsafe areas in LA
  2. Seasonality in crime. For example: specific day of the week or month of the year which is more prone to criminal activities
  3. Year on year increase in over-all crimes and also specific types of crimes
  4. Modus operandi of some major type of crimes
  5. Relationship of crime types and victim’s characteristics like sex and race descent
  6. Relationship between arrests made and time taken to report the crimes

Packages Required

Following are the packages used for this analysis:

  1. DT - To display formatted tables in R markdown
  2. data.table - To import files using fread function
  3. dplyr - To perform data manipulation on tibble or dataframes
  4. tidyr - To convert data from wide to long format
  5. lubridate - To perform operations on date formats
  6. stringr - To manipulate strings
  7. pdftools - To import PDF files and convert them to text
################################ Loading the required packages ################################

library(DT)         #To display scrollable tables in r markdown
library(data.table) #For import using fread
library(dplyr)      #For data manipulation
library(tidyr)      #For getting the data in tidy format
library(lubridate)  #For extracting and working with dates
library(stringr)    #For working of strings
library(pdftools)   #To import PDF files and convert them to text

Data Preparation

This section provides details on steps involved in preparing analytical dataset.

Data Source

Data for this analysis is sourced from the Los Angeles Police Department (LAPD) and is available here.

The data is transcribed from the original crime reports that are typed on paper and reflects criminal incidents in the City of Los Angeles. Due to the nature of data collection, we might encounter some inaccuracies. Also, address field is provided to the nearest hundred block to maintain privacy. Some important characteristics of data are provided below:

  1. Time Period - The data includes crimes from 01 January, 2010 to 28 October, 2017

  2. Attributes - For each crime, this data provides 26 different variables like date when crime occurred, date when crime was reported, location of the crime, type of crime, victim’s description and current status of the investigation. Detailed description of each variable is available in this Codebook.

  3. Missing values - Location field with missing values are replaced with (0o,0o). Also, unknown values of victim’s sex and victim’s descent are represented with charachter ‘X’.

Data Import

Following two files are imported directly from the LA city’s data website:

  1. A .csv file which contains data of the crimes in city since 01 Janaury, 2010

  2. A .PDF file which provides description of Modus Operandi(MO) codes of suspect in the crime data

################################## Importing Data #############################################

#Assign the URL for data to a variable
link_data <- "https://data.lacity.org/api/views/y8tr-7khq/rows.csv?accessType=DOWNLOAD&bom=true&format=true"

#Imort the data in R and ensure proper naming convention for column names
crime_data <- fread(link_data, col.names = c("DR_number",   "date_reported",    "date_occurred",    
                                             "time_occurred",   "area_id",  "area_name",    
                                             "reporting_district",  "crime_code",   
                                             "crime_code_description",  "MO_codes", 
                                             "victim_age",  "victim_sex",   "victim_descent",   
                                             "premise_code",    "premise_description",  
                                             "weapon_used_code",    "weapon_description",   
                                             "status_code", "status_description",
                                             "crime_code_1", "crime_code_2",    "crime_code_3", 
                                             "crime_code_4",    "address",  "cross_street", 
                                             "location"))

#Import the PDF files with MO code descriptions and convert it to text
link_MO_codes <- "https://data.lacity.org/api/views/y8tr-7khq/files/3a967fbd-f210-4857-bc52-60230efe256c?download=true&filename=MO%20CODES%20(numerical%20order).pdf"
MO_codes_txt <- pdf_text(link_MO_codes)

The crime_data contains 1,617,731 rows and 26 columns. Following code provides a glimpse of the data set imported:

#View the imported data
glimpse(crime_data)
## Observations: 1,617,731
## Variables: 26
## $ DR_number              <int> 1208575, 102005556, 418, 101822289, 421...
## $ date_reported          <chr> "03/14/2013", "01/25/2010", "03/19/2013...
## $ date_occurred          <chr> "03/11/2013", "01/22/2010", "03/18/2013...
## $ time_occurred          <int> 1800, 2300, 2030, 1800, 2300, 1400, 223...
## $ area_id                <int> 12, 20, 18, 18, 21, 1, 11, 16, 19, 9, 1...
## $ area_name              <chr> "77th Street", "Olympic", "Southeast", ...
## $ reporting_district     <int> 1241, 2071, 1823, 1803, 2133, 111, 1125...
## $ crime_code             <int> 626, 510, 510, 510, 745, 110, 510, 510,...
## $ crime_code_description <chr> "INTIMATE PARTNER - SIMPLE ASSAULT", "V...
## $ MO_codes               <chr> "0416 0446 1243 2000", "", "", "", "032...
## $ victim_age             <int> 30, NA, 12, NA, 84, 49, NA, NA, NA, 27,...
## $ victim_sex             <chr> "F", "", "", "", "M", "F", "", "", "", ...
## $ victim_descent         <chr> "W", "", "", "", "W", "W", "", "", "", ...
## $ premise_code           <int> 502, 101, 101, 101, 501, 501, 108, 101,...
## $ premise_description    <chr> "MULTI-UNIT DWELLING (APARTMENT, DUPLEX...
## $ weapon_used_code       <int> 400, NA, NA, NA, NA, 400, NA, NA, NA, N...
## $ weapon_description     <chr> "STRONG-ARM (HANDS, FIST, FEET OR BODIL...
## $ status_code            <chr> "AO", "IC", "IC", "IC", "IC", "AA", "IC...
## $ status_description     <chr> "Adult Other", "Invest Cont", "Invest C...
## $ crime_code_1           <int> 626, 510, 510, 510, 745, 110, 510, 510,...
## $ crime_code_2           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crime_code_3           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crime_code_4           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ address                <chr> "6300    BRYNHURST                    A...
## $ cross_street           <chr> "", "15TH", "", "WALL", "", "", "AVENUE...
## $ location               <chr> "(33.9829, -118.3338)", "(34.0454, -118...

The PDF file imported contains raw text. In order to convert it into a data frame, we will have to perform string manipulations and cleaning which is covered in the next section.

Data Cleaning

Currently, the imported datasets do not follow priciples of tidy data. This section provides step by step approach to data cleaning and creating new dataframes and features.

DF from PDF

Creating a dataframe from the raw text extracted from the PDF file containing MO descriptions. This dataframe will be useful when understanding MO codes associated with the crimes in crime_data.

######################## Cleaning PDF with Look up values for MO codes ########################

#Split the text using new line character in Windows i.e. \r\n
MO_codes_1 <- str_split(MO_codes_txt, pattern = '\r\n')

#Convert the list of chracter vector into single vector
MO_codes_2 <- unlist(lapply(1:length(MO_codes_1), function(x) MO_codes_1[[x]]))

#Convert the chracter vector into a dataframe
MO_codes_df <- as.data.frame(MO_codes_2, stringsAsFactors = FALSE)

#Split the strings into code and description; and remove the blank MO_Codes
MO_codes_f <- MO_codes_df %>%
                separate(col = MO_codes_2, into = c("MO_codes","MO_description"), sep = 4) %>%
                filter(str_trim(MO_codes) != "") %>%
                mutate(MO_description = str_trim(MO_description))

#View the data
glimpse(MO_codes_f)
## Observations: 532
## Variables: 2
## $ MO_codes       <chr> "0100", "0101", "0102", "0103", "0104", "0105",...
## $ MO_description <chr> "Suspect Impersonate", "Aid victim", "Blind", "...

The MO_codes_f dataframe contains 532 MO codes and its description.

Tidying Data

The crime_data do not follow the concepts of tidy data. The field MO_codes has multiple codes associated with one crime within same cell and also the location field has latitude and longitude within the same cell. In this section we will deal these two problems.

Since one crime has many MO codes within one cell, analyzing Modus Operandi with respect to crime will be problematic. Hence, let us create a new dataframe which has only crime codes and modus operandi in tall format. This dataframe will be unique at a DR_number (unique identifier for crime) and MO code level.

##################### Create separate table for DR_Number & MO Codes ##########################

#Keep only two columns - DR number and codes & Drop MO_codes from main table
crime_MO_codes <- crime_data[,c("DR_number","MO_codes")]

#Remove data with blank MO_codes
crime_MO_codes <- crime_MO_codes %>%
                      filter(MO_codes != '')

#Split the column MO_codes into individual MO_codes
crime_MO_codes_sp <- cbind(crime_MO_codes$DR_number, data.frame(do.call('rbind',
                          strsplit(as.character(crime_MO_codes$MO_codes)," ",fixed=TRUE)), 
                          stringsAsFactors = FALSE))
colnames(crime_MO_codes_sp)[1] <- "DR_number"

#Make this a tall dataset instead of wide
crime_MO_map <- crime_MO_codes_sp %>% 
                        gather("var_for_MO_codes", "MO_codes", -DR_number) %>%
                        #Keeping distinct pairs of crime and MO codes
                        group_by(DR_number, MO_codes) %>% 
                        summarise(n = n()) %>%
                        select(DR_number, MO_codes) #Keeping the required columnns

#Delete MO Codes from crime data
crime_data <- crime_data[,-"MO_codes"]

#View the mapping of crime with MO Codes
glimpse(crime_MO_map)
## Observations: 3,904,150
## Variables: 2
## $ DR_number <int> 1208575, 1208575, 1208575, 1208575, 1307355, 1307355...
## $ MO_codes  <chr> "0416", "0446", "1243", "2000", "0913", "1814", "200...

Now, let us split the location column in the crime_data into latitude and longitude so that the dataset follows the principles of tidy data.

###################### Split location into latitude and longitude ##########################

crime_data <- crime_data %>%
                  #Split the location variable using ','
                separate(location, into = c("latitude", "longitude"), sep = ',') %>%
                  #Replace brackets in the strings and then convert them to numeric
                mutate(latitude = as.numeric(str_replace(latitude,'\\(','')),
                longitude = as.numeric(str_replace(longitude,'\\)','')))

#Replacing (0,0) to NAs as per the codebook
crime_data$latitude[crime_data$latitude==0] <- NA
crime_data$longitude[crime_data$longitude==0] <- NA
Formatting Columns

Converting the date columns (imported as character) into date formats so that we can operate on them.

############################ Converting dates & time to proper format ########################

crime_data <- crime_data %>%
                    #Splitting using -3 to incorporate values like 800, 530, etc.
                  separate(col = time_occurred, into = c("hour_occurred", "min_occurred"), 
                           sep = -3) %>% 
                    #Padding extra zeros so that lubridate function hm doesn't 
                    #generate NAs for cases where hour = ""
                  mutate (hour_occurred = str_pad(hour_occurred, 2, side="left", pad = 0)) %>%
                    #Adding a separator between hour and minutes 
                    #for lubridate to identify time correctly
                  unite(time_occurred, hour_occurred, min_occurred, sep = ':') %>%
                    #formatting the date and time variables
                  mutate(date_reported = mdy(date_reported), 
                         date_occurred = mdy(date_occurred),
                         time_occurred = hm(time_occurred))

Converting victim’s sex and victim’s descent from characters to facotrs for ease of summary operations:

####################### Converting variables like sex into factor format ######################

#Convert the variables sex, descent and status code into factors
crime_data <- crime_data %>%
                  mutate(victim_sex = as.factor(victim_sex),
                         victim_descent = as.factor(victim_descent))
Cleaning Observations

Looking at the summary of all the variables in the dataset, we observe that victim_sex and victim_descent have invalid observations with value ‘-’. Assiging these values to ‘X’ i.e. unknown as per the Codebook:

####################### Replacing ivalid observations #######################################

#For sex and descent

levels(crime_data$victim_sex)[levels(crime_data$victim_sex) == '-'] <- 'X'
levels(crime_data$victim_descent)[levels(crime_data$victim_descent) == '-'] <- 'X'

Also, from the summary we notice the fact that 6 observations have missing values for crime_code_1 but crime_code does not have any missing values. According to the code book crime_code_1 should be same as crime_code. On more inspection, we realise that crime_code_2 has values for crime_code_1. Hence, we can rectify this error by replacing crime_code_1 values with crime_code and assigning NA to crime_code_2 for those observations.

####################### Replacing ivalid observations #######################################

#For crime_code_1 and crime_code_2

index <- which(is.na(crime_data$crime_code_1))
crime_data[index, "crime_code_1"] <- crime_data[index, "crime_code"]
crime_data[index, "crime_code_2"] <- NA

New features

In this section we will create few additional variables which will be used to analyze the data:

  1. days_before_reporting - This variable provides the difference between the reported date and occurred date in days.
  2. year_occurred - This variable indicates the year in which crime occurred
  3. month_occurred - This variable indicates the month in which crime occurred
  4. day_occurred - This variable indicates the day of month on which crime occurred
  5. day_of_week - This variable indicates the day of the week on which crime occurred
##################### Create new variables for analysis ##################################

crime_data <- crime_data %>%
                  #Creating a variable for days between crime occurred and crime reported
                mutate(days_before_reporting = as.numeric(date_reported - date_occurred),
                          #Creating variables on date when crime occurred
                        year_occurred = year(date_occurred),
                        month_occurred = month(date_occurred),
                        day_occurred = day(date_occurred),
                        day_of_week = wday(date_occurred, label = TRUE))

Summary

From the raw crime dataset we created following two datsets:

  1. crime_data - After cleaning and creating new viables, the final crime dataset has 1,617,731 rows and 31 columns. Each row in this data represents a crime incident and each column provides one attriubte of the crime. DR_number is the unique identifier for this data.

  2. crime_MO_map - MO_codes from the crime data were removed as single crime record has multiple modus operandi codes. A new dataset crime_MO_map was created which has 3,904,150 rows and 2 columns to analyze data at MO_code level. This dataset is unique at DR_number and MO_code level.

A PDF with explanations of MO_codes was imported and a dataset MO_codes_f with 532 rows and 2 columns was created.

Following table provides basic summary of variables which will be used for the analysis:

Following table represents a 100 rows preview of final crime data which will be used for analysis:

Proposed Exploratory Analysis

Details on next steps:

Following are the detailed next steps in analyzing the data:

  1. Crime vs. Time - Number of crimes vs. month and year from January 2010 will be plotted to look at the trends. We will also look at trends of major crime types over time.

  2. Seasonality - Look at the trends of number of crimes and crime types vs. month, day and day of week. We can use trend lines or stacked bar charts to visualize this data subset.

  3. Victim’s characteristics - Look at victim’s sex and descent vs. crime type and number of crimes. We can provide a statistically significant answer to questions like females are more prone to theft than males using t-test.

  4. Geospatial data - Viualize the crime rates in LA neighborhoods using latitude and longitude. We can also plot using area and reporting district shape files provided by the LAPD.

  5. Areas - Look at the areas with respect to number of crimes and type of crimes. Then categorize them using deciling tecniques to provide level of safety.

  6. Arrests vs. crime type - We can see which crime categories have highest arrest percentages. We can also add time dimension in this analysis to look at the arrested percentages over years.

  7. Weapons - We can look at the association of weapons with the crime types and identify commonly used weapons for major crime types. We can again add time dimension to see how the usage of particular weapon like guns have increased over time in LA

  8. Modus Operandi vs. crime type - We can analyze the most common ways of executing particular types of crime. This analysis can provide insights into common mistakes victims make

  9. Number of days before reporting - Analyze the effect of number of days before reporting and arrests made for major type of crimes

Machine learning techniques:

If time permits, I would also like to explore time series analysis and use it to forecast number of crimes of particular type like theft or assault in upcoming months.

Following are the areas that I need to explore to finish this analysis:

  1. Creating good quality plots
  2. Creating geospatial plots
  3. Learn new type of plots to better visualize data
  4. Learn time series forecasting in R