We constantly hear statements on crime from public figures and from people around us, such as these:
This study aims to provide some insight into these views using actual data from the Los Angeles Police Department.
Since this is a descriptive analysis, the major focus is on preparing the data, slicing and dicing it, visualizing it and generating insights. The following is the step-by-step approach:
This analysis will provide readers with basic findings and insights on crime in LA. Some of the expected outcomes:
The following packages are used for this analysis:
################ Loading the required packages ################################
library(DT) #To display scrollable tables in r markdown
library(data.table) #For import using fread
library(dplyr) #For data manipulation
library(tidyr) #For getting the data in tidy format
library(lubridate) #For extracting and working with dates
library(stringr) #For working with strings
library(pdftools) #To import PDF files and convert them to text
library(ggplot2) #To create plots
library(ggthemes) #To use themes for plots created
library(scales) #Formats for axes
library(forecast) #For time series forecasting
library(leaflet) #To create geospatial plots
This section details the steps involved in preparing the analytical dataset.
Data for this analysis is sourced from the Los Angeles Police Department (LAPD) and is available here.
The data is transcribed from the original crime reports, which are typed on paper, and reflects criminal incidents in the City of Los Angeles. Due to the nature of data collection, we might encounter some inaccuracies. Also, the address field is provided only to the nearest hundred block to maintain privacy. Some important characteristics of the data are provided below:
Time Period - The data includes crimes from 01 January, 2010 to 28 October, 2017
Attributes - For each crime, this data provides 26 different variables, such as the date when the crime occurred, the date when it was reported, the location and type of the crime, the victim's description and the current status of the investigation. A detailed description of each variable is available in this Codebook.
Missing values - Location fields with missing values are replaced with (0°, 0°). Also, unknown values of victim's sex and victim's descent are represented with the character 'X'.
The following two files are imported directly from the LA city's data website:
A .csv file which contains data on crimes in the city since 01 January, 2010
A .pdf file which provides descriptions of the Modus Operandi (MO) codes used for suspects in the crime data
################################## Importing Data #############################################
#Assign the URL for data to a variable
link_data <- "https://data.lacity.org/api/views/y8tr-7khq/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
#Import the data into R and ensure a consistent naming convention for column names
crime_data <- fread(link_data, col.names = c("DR_number", "date_reported", "date_occurred",
"time_occurred", "area_id", "area_name",
"reporting_district", "crime_code",
"crime_code_description", "MO_codes",
"victim_age", "victim_sex", "victim_descent",
"premise_code", "premise_description",
"weapon_used_code", "weapon_description",
"status_code", "status_description",
"crime_code_1", "crime_code_2", "crime_code_3",
"crime_code_4", "address", "cross_street",
"location"))
#Import the PDF file with MO code descriptions and convert it to text
link_MO_codes <- "https://data.lacity.org/api/views/y8tr-7khq/files/3a967fbd-f210-4857-bc52-60230efe256c?download=true&filename=MO%20CODES%20(numerical%20order).pdf"
MO_codes_txt <- pdf_text(link_MO_codes)
The crime_data contains 1,674,208 rows and 26 columns. The following code provides a glimpse of the imported dataset:
#View the imported data
glimpse(crime_data)
## Observations: 1,674,208
## Variables: 26
## $ DR_number <int> 1208575, 102005556, 418, 101822289, 421...
## $ date_reported <chr> "03/14/2013", "01/25/2010", "03/19/2013...
## $ date_occurred <chr> "03/11/2013", "01/22/2010", "03/18/2013...
## $ time_occurred <int> 1800, 2300, 2030, 1800, 2300, 1400, 223...
## $ area_id <int> 12, 20, 18, 18, 21, 1, 11, 16, 19, 9, 1...
## $ area_name <chr> "77th Street", "Olympic", "Southeast", ...
## $ reporting_district <int> 1241, 2071, 1823, 1803, 2133, 111, 1125...
## $ crime_code <int> 626, 510, 510, 510, 745, 110, 510, 510,...
## $ crime_code_description <chr> "INTIMATE PARTNER - SIMPLE ASSAULT", "V...
## $ MO_codes <chr> "0416 0446 1243 2000", "", "", "", "032...
## $ victim_age <int> 30, NA, 12, NA, 84, 49, NA, NA, NA, 27,...
## $ victim_sex <chr> "F", "", "", "", "M", "F", "", "", "", ...
## $ victim_descent <chr> "W", "", "", "", "W", "W", "", "", "", ...
## $ premise_code <int> 502, 101, 101, 101, 501, 501, 108, 101,...
## $ premise_description <chr> "MULTI-UNIT DWELLING (APARTMENT, DUPLEX...
## $ weapon_used_code <int> 400, NA, NA, NA, NA, 400, NA, NA, NA, N...
## $ weapon_description <chr> "STRONG-ARM (HANDS, FIST, FEET OR BODIL...
## $ status_code <chr> "AO", "IC", "IC", "IC", "IC", "AA", "IC...
## $ status_description <chr> "Adult Other", "Invest Cont", "Invest C...
## $ crime_code_1 <int> 626, 510, 510, 510, 745, 110, 510, 510,...
## $ crime_code_2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crime_code_3 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crime_code_4 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ address <chr> "6300 BRYNHURST A...
## $ cross_street <chr> "", "15TH", "", "WALL", "", "", "AVENUE...
## $ location <chr> "(33.9829, -118.3338)", "(34.0454, -118...
The imported PDF file contains raw text. In order to convert it into a data frame, we have to perform string manipulation and cleaning, which is covered in the next section.
Currently, the imported datasets do not follow the principles of tidy data. This section provides a step-by-step approach to cleaning the data and creating new dataframes and features.
We start by creating a dataframe from the raw text extracted from the PDF file containing MO descriptions. This dataframe will be useful for understanding the MO codes associated with the crimes in crime_data.
######################## Cleaning PDF with Look up values for MO codes ########################
#Split the text using new line character in Windows i.e. \r\n
MO_codes_1 <- str_split(MO_codes_txt, pattern = '\r\n')
#Convert the list of character vectors into a single vector
MO_codes_2 <- unlist(MO_codes_1)
#Convert the character vector into a dataframe
MO_codes_df <- as.data.frame(MO_codes_2, stringsAsFactors = FALSE)
#Split the strings into code and description; and remove the blank MO_Codes
MO_codes_f <- MO_codes_df %>%
separate(col = MO_codes_2, into = c("MO_codes","MO_description"), sep = 4) %>%
filter(str_trim(MO_codes) != "") %>%
mutate(MO_description = str_trim(MO_description))
#View the data
glimpse(MO_codes_f)
## Observations: 532
## Variables: 2
## $ MO_codes <chr> "0100", "0101", "0102", "0103", "0104", "0105",...
## $ MO_description <chr> "Suspect Impersonate", "Aid victim", "Blind", "...
The MO_codes_f dataframe contains 532 MO codes along with their descriptions.
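As a quick illustration, the description for a single code, say 0416 (which appears in the sample rows shown earlier), can be looked up as follows:
#Look up the description of one MO code, e.g. 0416 from the sample rows above
MO_codes_f %>%
filter(MO_codes == "0416")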
The crime_data does not follow the principles of tidy data: the MO_codes field has multiple codes for one crime within the same cell, and the location field holds latitude and longitude within the same cell. In this section we will deal with these two problems.
Since one crime can have many MO codes within one cell, analyzing modus operandi with respect to crimes will be problematic. Hence, let us create a new dataframe which holds only crime identifiers and MO codes in tall format. This dataframe will be unique at the DR_number (the unique identifier for a crime) and MO code level.
##################### Create separate table for DR_Number & MO Codes ##########################
#Keep only two columns - DR_number and MO_codes; MO_codes is dropped from the main table later
crime_MO_codes <- crime_data[,c("DR_number","MO_codes")]
#Remove data with blank MO_codes
crime_MO_codes <- crime_MO_codes %>%
filter(MO_codes != '')
#Split the column MO_codes into individual MO_codes
crime_MO_codes_sp <- cbind(crime_MO_codes$DR_number, data.frame(do.call('rbind',
strsplit(as.character(crime_MO_codes$MO_codes)," ",fixed=TRUE)),
stringsAsFactors = FALSE))
colnames(crime_MO_codes_sp)[1] <- "DR_number"
#Make this a tall dataset instead of wide
crime_MO_map <- crime_MO_codes_sp %>%
gather("var_for_MO_codes", "MO_codes", -DR_number) %>%
#Keeping distinct pairs of crime and MO codes
group_by(DR_number, MO_codes) %>%
summarise(n = n()) %>%
select(DR_number, MO_codes) #Keeping the required columns
#Delete MO Codes from crime data
crime_data <- crime_data[,-"MO_codes"]
#View the mapping of crime with MO Codes
glimpse(crime_MO_map)
## Observations: 4,045,004
## Variables: 2
## $ DR_number <int> 1208575, 1208575, 1208575, 1208575, 1307355, 1307355...
## $ MO_codes <chr> "0416", "0446", "1243", "2000", "0913", "1814", "200...
Now, let us split the location column in the crime_data into latitude and longitude so that the dataset follows the principles of tidy data.
###################### Split location into latitude and longitude ##########################
crime_data <- crime_data %>%
#Split the location variable using ','
separate(location, into = c("latitude", "longitude"), sep = ',') %>%
#Replace brackets in the strings and then convert them to numeric
mutate(latitude = as.numeric(str_replace(latitude,'\\(','')),
longitude = as.numeric(str_replace(longitude,'\\)','')))
#Replacing (0,0) to NAs as per the codebook
crime_data$latitude[crime_data$latitude==0] <- NA
crime_data$longitude[crime_data$longitude==0] <- NA
Converting the date columns (imported as character) into date formats so that we can operate on them.
############################ Converting dates & time to proper format ########################
crime_data <- crime_data %>%
#Splitting using -3 to incorporate values like 800, 530, etc.
separate(col = time_occurred, into = c("hour_occurred", "min_occurred"),
sep = -3) %>%
#Padding extra zeros so that lubridate function hm doesn't
#generate NAs for cases where hour = ""
mutate (hour_occurred = str_pad(hour_occurred, 2, side="left",
pad = 0)) %>%
#Adding a separator between hour and minutes
#for lubridate to identify time correctly
unite(time_occurred, hour_occurred, min_occurred, sep = ':',
remove = FALSE) %>%
#formatting the date and time variables
mutate(date_reported = mdy(date_reported),
date_occurred = mdy(date_occurred),
time_occurred = hm(time_occurred))
Converting victim’s sex and victim’s descent from characters to factors for ease of summary operations:
####################### Converting variables like sex into factor format ######################
#Convert the variables sex, descent and status code into factors
crime_data <- crime_data %>%
mutate(victim_sex = as.factor(victim_sex),
victim_descent = as.factor(victim_descent))
Looking at the summary of all the variables in the dataset, we observe that victim_sex and victim_descent have invalid observations with the value '-'.
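The check behind this observation can be reproduced with a quick tabulation of the two factors, for example:
#Tabulate the factor levels to surface the invalid '-' value
table(crime_data$victim_sex)
table(crime_data$victim_descent)
We now assign these '-' values to 'X', i.e. unknown, as per the Codebook: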
####################### Replacing invalid observations #######################################
#For sex and descent
levels(crime_data$victim_sex)[levels(crime_data$victim_sex) == '-'] <- 'X'
levels(crime_data$victim_descent)[levels(crime_data$victim_descent) == '-'] <- 'X'
Also, from the summary we notice that 6 observations have missing values for crime_code_1, while crime_code has no missing values. According to the codebook, crime_code_1 should be the same as crime_code. On closer inspection, we realize that for these observations crime_code_2 holds the values that belong in crime_code_1. Hence, we can rectify this error by replacing the missing crime_code_1 values with crime_code and assigning NA to crime_code_2 for those observations.
####################### Replacing invalid observations #######################################
#For crime_code_1 and crime_code_2
index <- which(is.na(crime_data$crime_code_1))
crime_data[index, "crime_code_1"] <- crime_data[index, "crime_code"]
crime_data[index, "crime_code_2"] <- NA
In this section we will create a few additional variables which will be used to analyze the data:
##################### Create new variables for analysis ##################################
crime_data <- crime_data %>%
#Creating a variable for days between crime occurred and crime reported
mutate(days_before_reporting = as.numeric(date_reported - date_occurred),
#Creating variables on date when crime occurred
year_occurred = year(date_occurred),
month_occurred = month(date_occurred),
day_occurred = day(date_occurred),
day_of_week = wday(date_occurred, label = TRUE))
#Finally, filter the data to restrict dates to the latest full month available, i.e. Oct 2017
crime_data_oct17 <- crime_data %>%
filter(date_occurred <= '2017-10-31')
From the raw crime dataset we created the following two datasets:
crime_data - After cleaning and creating new variables, the final crime dataset has 1,624,790 rows and 33 columns. Each row in this data represents a crime incident and each column provides one attribute of the crime. DR_number is the unique identifier for this data. We have filtered the data to crimes occurring up to Oct 2017, as that is the latest full month of data available.
crime_MO_map - MO_codes were removed from the crime data as a single crime record can have multiple modus operandi codes. A new dataset, crime_MO_map, with 4,045,004 rows and 2 columns was created to analyze the data at the MO_code level. This dataset is unique at the DR_number and MO_code level.
A PDF with explanations of MO_codes was imported and a dataset MO_codes_f with 532 rows and 2 columns was created.
The following table provides a basic summary of the variables which will be used for the analysis:
The following table shows a 100-row preview of the final crime data which will be used for the analysis:
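Both tables are rendered with the DT package loaded earlier. A minimal sketch of how the 100-row preview could be produced (the display options here are illustrative, not necessarily those used in the original report):
#Render a scrollable 100-row preview of the final crime data
datatable(head(crime_data_oct17, 100),
options = list(scrollX = TRUE, pageLength = 10))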
The following sections display different data cuts and the visualization techniques used to generate insights from the LA crime data.
The following format is used in this section:
This section explores the seasonality of the number of crimes. We start by looking at the data at the year level and eventually drill down to an hour-level analysis.
################# Over-all year level numbers #############################
#Identify top 5 crimes in the dataset
top_5_crimes <- crime_data_oct17 %>%
group_by(crime_code, crime_code_description) %>%
summarise(n = n_distinct(DR_number)) %>%
arrange(desc(n)) %>%
head(5) %>%
select(crime_code, crime_code_description) %>%
rename("top_5_crime_desc" = "crime_code_description")
#Filter the data for top 5 crimes by using joins
crime_data_oct17 %>%
left_join(top_5_crimes, by = "crime_code") %>%
mutate(top_5_crime_desc = ifelse(is.na(top_5_crime_desc),
"OTHERS", top_5_crime_desc)) %>%
group_by(year_occurred, top_5_crime_desc) %>%
summarise(n = n_distinct(DR_number)) %>%
mutate(pct = n/sum(n), #Calculate percentage for labels
ypos = cumsum(n) - 0.5*n) %>% #Calculate positions for labels
#Plot the stacked bar graph
ggplot(aes(x = year_occurred, y = n, fill = top_5_crime_desc)) +
geom_bar(position = position_stack(reverse = TRUE), stat = "identity", width = 0.7,
alpha = 0.5, color = "black") +
#Axis titles and labels
geom_text(aes(label = paste0(sprintf("%1.0f", pct*100),"%"), y = ypos), size = 3) +
labs(x = "Year", y = "Number of crimes occurred",
title = "Number of crimes occurred per year split by top 5 crime categories",
fill = "Crime description") +
scale_y_continuous(labels = comma) +
theme_classic()
From the graph above, we can infer that the number of crimes per year was decreasing from 2010 to 2013, but the trend reversed from 2014 onwards and started increasing. We cannot compare 2017 directly as it only has data available for 10 months.
The proportion of major crime types within each year remains more or less similar.
################### Month wise roll up and plot ########################
#Roll up the data at the month level
crime_data_oct17 %>%
group_by(month_occurred) %>%
summarise(n = n_distinct(DR_number)) %>%
#Plot the data
ggplot(aes(x = as.factor(month_occurred), y = n)) +
geom_bar(stat = "identity", fill = "blue", alpha = 0.3) +
#Format axes, titles and labels
geom_text(aes(x = month_occurred, y = 1, label = comma(n)),
hjust=0, vjust=.5, size = 3, colour = 'black', fontface = 'bold') +
scale_y_continuous(labels = comma) +
labs(y = "Number of crime occurred", x = "Month",
title = "Number of crimes by month") +
coord_flip() +
theme_bw()
January has the highest number of crimes reported, while November and December have the lowest. We can hypothesize that during the holiday season either crime rates are genuinely low or crimes are being reported or recorded late.
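One way to probe the late-reporting hypothesis is to compare the reporting lag across months using the days_before_reporting variable created earlier; a minimal sketch:
#Median and mean reporting lag (in days) by month of occurrence
crime_data_oct17 %>%
group_by(month_occurred) %>%
summarise(median_lag_days = median(days_before_reporting, na.rm = TRUE),
mean_lag_days = mean(days_before_reporting, na.rm = TRUE))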
########################## Day of week vs. hour of the day ####################
#Roll up the data at week and hour level
crime_data_oct17 %>%
group_by(day_of_week, hour_occurred) %>%
summarise(n = n_distinct(DR_number)) %>%
#Plot the tiles
ggplot(aes(x = day_of_week, y = hour_occurred)) +
geom_tile(aes(fill = n)) +
scale_fill_gradient(low = "light blue", high = "dark blue") +
#Axes labels and titles
labs(x = "Day of the week", y = "Hour of the day",
title = "Day of the week Vs. Hour of the day",
fill = "Number of crimes") +
theme_bw()
A counter-intuitive trend is noticeable in the graph above: crime counts peak at 12 noon. Let us dig deeper into the data to analyze this trend.
################## Looking at crimes at 12:00 ########################
#Filter and roll up the data for hour == 12
crime_data_oct17 %>%
filter(hour_occurred == '12') %>%
group_by(crime_code_description) %>%
summarise(num_crimes = n_distinct(DR_number)) %>%
arrange(desc(num_crimes)) %>%
head(10) %>%
mutate(crime_code_description = reorder(crime_code_description,num_crimes)) %>%
#Plot the data
ggplot(aes(x = crime_code_description, y = num_crimes)) +
geom_bar(stat = "identity", fill = "maroon", alpha = 0.5) +
#Format axes, titles and labels
geom_text(aes(x = crime_code_description, y = 1, label = comma(num_crimes)),
hjust=-0.15, vjust=.5, size = 3, colour = 'black',
fontface = 'bold') +
scale_y_continuous(labels = comma) +
labs(y = "Number of crime occurred", x = "Crime Description",
title = "Top 10 crimes at 12:00 hours") +
coord_flip() +
theme_bw()
The above plot shows that the most frequent crime occurring at noon is theft of identity. This may be because the time of occurrence for this type of crime cannot be determined, and 12:00 hrs may be used by the police as a default when recording them.
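If 12:00 is indeed used as a default, records logged at exactly 12:00 should dominate the noon hour. A quick check along these lines, using lubridate's minute() on the parsed time_occurred (assuming the hour/minute split earlier parsed cleanly):
#Share of noon-hour crimes recorded at exactly 12:00 - a default-time signature
crime_data_oct17 %>%
filter(hour_occurred == '12') %>%
summarise(total = n(),
at_1200_sharp = sum(minute(time_occurred) == 0),
share = at_1200_sharp/total)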
To analyze when harmful crime occurs during the day or week, let us look at the crimes where weapons were used. This provides a good proxy for harmful crimes.
############ Day of the week Vs. Hour for crimes where a weapon was used ########
#Filter and roll up the data
crime_data_oct17 %>%
filter(!is.na(weapon_used_code)) %>% #Filter for crimes where weapon was used
group_by(day_of_week, hour_occurred) %>%
summarise(n = n_distinct(DR_number)) %>%
#Plot the tiles
ggplot(aes(x = day_of_week, y = hour_occurred)) +
geom_tile(aes(fill = n)) +
scale_fill_gradient(low = "light blue", high = "dark blue") +
#Axes labels and titles
labs(x = "Day of the week", y = "Hour of the day",
title = "Day of the week Vs. Hour of the day for crimes where a Weapon was used",
fill = "Number of crimes") +
theme_bw()
The above heat map provides some useful information about crimes in which a weapon was used. We can notice that:
1. During the day, crime is generally low in the morning hours, i.e. around 6 AM, then gradually increases until 11 PM or 1 AM, and then falls again.
2. Weekends (Saturday and Sunday) have higher crime occurrence compared to other days.
3. Saturday and Sunday around 1 AM have particularly high crime occurrence.
This section explores data on victims' characteristics, crime types and the modus operandi of the top 2 crimes in LA.
######################### Victim's age distribution by sex ##########################
#Filter the data to remove missing values and filter for victim's sex
crime_data_oct17 %>%
filter(!is.na(victim_age) & victim_sex %in% c('F', 'M')) %>%
#Plot the histograms
ggplot(aes(victim_age, fill = victim_sex)) +
geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') +
#Axes labels and titles
labs(x = "Victim's Age", y = "Density",
title = "Distribution of Victim's Age by Victim's Sex",
fill = "Victim's Sex") +
theme_bw()
For the age group around 25, the number of crimes against women is typically higher than against men. Also, overall, crime against the 20 to 30 age group is the highest. This may be due to the more outgoing and independent lifestyle of this age group compared to others.
############ Top 10 crimes showing the distribution of victim's sex ##################
#Creating table with top 10 crimes where victim's sex is identified
top_10_crimes_victims <- crime_data_oct17 %>%
filter(victim_sex %in% c('F', 'M')) %>%
group_by(crime_code, crime_code_description) %>%
summarise(n = n_distinct(DR_number)) %>%
arrange(desc(n)) %>%
head(10) %>%
select(crime_code, crime_code_description) %>%
rename("top_10_crime_desc" = "crime_code_description")
#Filtering the data by joining with top_10_crimes identified
crime_data_oct17 %>%
filter(victim_sex %in% c('F', 'M')) %>%
inner_join(top_10_crimes_victims, by = "crime_code") %>%
group_by(victim_sex, top_10_crime_desc) %>%
summarise(n = n_distinct(DR_number)) %>%
#Plot the stacked bars
ggplot(aes(x = top_10_crime_desc, y = n, fill = victim_sex)) +
geom_bar(position = position_stack(reverse = TRUE), stat = "identity", width = 0.7,
alpha = 0.5, color = "black") +
labs(x = "Crime Description", y = "Number of crimes occurred",
title = "Top 10 crimes split by Gender",
fill = "Victim's gender") +
scale_y_continuous(labels = comma) +
coord_flip() +
theme_bw()
The top 10 crimes for the above graph are identified after filtering for crimes where the victim's sex was recorded. We can infer from the graph that the proportion of male and female victims depends on the nature of the crime. For example, Intimate Partner Simple Assault has a higher proportion of female victims, while Assault With Deadly Weapon has a higher proportion of male victims. Theft of Identity has a similar proportion of male and female victims.
############### Victims descent vs. age boxplots ####################
#Filter the data
crime_data_oct17 %>%
filter(!is.na(victim_age) & victim_descent != '') %>% #Remove blanks and NAs
#Box plots
ggplot (aes(x=victim_descent, y=victim_age)) +
geom_boxplot(aes(color = victim_descent)) +
#Axes labels and titles
labs(x = "Victim's Age", y = "Victim's Descent",
title = "Distribution of Victim's Age across Victim's Descent") +
ggtitle("Victim's Age Vs. Victim's Descent") +
scale_x_discrete(labels=c('Other Asian',
'Black',
'Chinese',
'Cambodian',
'Filipino',
'Guamanian',
'Hispanic/Latin/Mexican',
'American Indian/Alaskan Native',
'Japanese',
'Korean',
'Laotian',
'Other',
'Pacific Islander',
'Samoan',
'Hawaiian',
'Vietnamese',
'White',
'Unknown',
'Asian Indian')) +
coord_flip() +
theme_bw() +
theme(legend.position = "none")
Crimes against people of Other Asian descent clearly stand out: the average victim age is significantly lower than that of victims of other descents. This might be due to the higher immigrant population from Asia.
Analyzing the modus operandi of crimes helps in taking precautionary measures to avoid such crimes in the future. The following section explores the modus operandi for two major crime types.
############## Modus Operandi for burglary from vehicles ##################
#Filter data for burglary from vehicle
crime_data_oct17 %>%
filter(crime_code == 330) %>%
#Merge it with MO Codes based on DR_number
inner_join(crime_MO_map, by = "DR_number") %>%
group_by(MO_codes) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
#Merge it with description file to get the MO_codes description
inner_join(MO_codes_f, by = "MO_codes") %>%
#Remove the first row because it merely states that a burglary took place,
#which is obvious here as we are looking at theft from a vehicle
filter(row_number(desc(n)) > 1 & row_number(desc(n)) <= 11) %>%
mutate(MO_description = reorder(MO_description,n)) %>%
#Plot the data
ggplot(aes(x = MO_description, y = n)) +
geom_bar(stat = "identity", fill = "purple", alpha = 0.7) +
#Format axes, labels and titles
geom_text(aes(x = MO_description, y = 1, label = comma(n)),
hjust=-0.25, vjust=.5, size = 3, colour = 'white',
fontface = 'bold') +
scale_y_continuous(labels = comma) +
labs(y = "Number of cases", x = "Modus Operandi",
title = "Top Modus Operandi for burglary from vehicle") +
coord_flip() +
theme_bw()
The bar plot depicts the top 10 ways in which burglary from a vehicle is executed. A given crime may involve one or more aspects of the above modus operandi.
In LA, burglary from a vehicle is generally carried out by smashing or breaking windows, using another vehicle, or forcing entry.
#Filter data for battery - simple assault
crime_data_oct17 %>%
filter(crime_code == 624) %>%
#Merge it with MO Codes based on DR_number
inner_join(crime_MO_map, by = "DR_number") %>%
group_by(MO_codes) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
#Merge it with description file to get the MO_codes description
inner_join(MO_codes_f, by = "MO_codes") %>%
head(10) %>%
mutate(MO_description = reorder(MO_description,n)) %>%
#Plot the data
ggplot(aes(x = MO_description, y = n)) +
geom_bar(stat = "identity", fill = "dark green", alpha = 0.7) +
#Format axes, titles and labels
geom_text(aes(x = MO_description, y = 1, label = comma(n)),
hjust=-0.25, vjust=.5, size = 3, colour = 'white',
fontface = 'bold') +
scale_y_continuous(labels = comma) +
labs(y = "Number of cases", x = "Modus Operandi",
title = "Top Modus Operandi for simple assault") +
coord_flip() +
theme_bw()
Again, the bar plot above depicts the top 10 modus operandi involved in Battery - Simple Assault. In LA, most simple assaults involve a hit, with or without a weapon. We can also notice that the victim generally knows the suspect.
In this section we will analyze the location of crimes. Since the overall number of crimes is very high, plotting each one by latitude and longitude won't reveal any patterns. Hence, let us look at the data at the police station level.
The following graph depicts the 21 LAPD community police stations (denoted by circles). The size of each circle on the map is directly proportional to the number of crimes that occurred in that area.
#################### Number of crimes based on Area ###################
#Identify center of LA to position the maps accordingly in leaflet
center_lat = median(crime_data_oct17$latitude,na.rm = TRUE)
center_lon = median(crime_data_oct17$longitude, na.rm = TRUE)
#Crime at Area level - size of the circle is directly proportional to number of crime incidents
#Remove nulls and roll up the data at area level
crime_data_oct17 %>%
filter(!is.na(latitude)) %>%
group_by(area_name) %>%
summarise(num_crimes = n(),
# calculate area lat long to identify center of area
latitude = median(latitude),
longitude = median(longitude)) %>%
#plot the data
leaflet() %>%
addProviderTiles("OpenStreetMap") %>%
addCircles(lng = ~longitude, lat = ~latitude, color = c("maroon"),
radius = ~num_crimes/30) %>%
setView(lng = center_lon, lat = center_lat, zoom = 10)
77th Street and Southwest have the highest number of crimes reported, while Foothill and Hollenbeck have the lowest.
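The ranking behind this statement can be verified with a simple roll-up by area:
#Count crimes per community police station area, highest first
crime_data_oct17 %>%
count(area_name, sort = TRUE)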
The following is a shiny app (hosted here) which we created to allow users to interact with the data, i.e. filter crimes by year, time of day, major crime types and victim's sex. The app shows data for crimes in which adults were arrested, which provides a proxy for analyzing severe crime patterns.
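The hosted app itself is not reproduced here; the sketch below illustrates only its core idea, assuming the shiny package is installed and that status_code 'AA' denotes Adult Arrest as per the codebook:
#Minimal sketch of the app's core idea - not the hosted app itself
library(shiny)
#Assumption: status_code 'AA' marks crimes in which an adult was arrested
adult_arrests <- crime_data_oct17 %>%
filter(status_code == "AA", !is.na(latitude))
ui <- fluidPage(
selectInput("year", "Year", choices = sort(unique(adult_arrests$year_occurred))),
leafletOutput("map")
)
server <- function(input, output) {
output$map <- renderLeaflet({
adult_arrests %>%
filter(year_occurred == input$year) %>%
leaflet() %>%
addProviderTiles("OpenStreetMap") %>%
addCircleMarkers(lng = ~longitude, lat = ~latitude,
radius = 2, stroke = FALSE, color = "maroon")
})
}
shinyApp(ui, server)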
This section deals with the time series trends in crime reported over the years.
######## Plot number of crimes occurred on each day ########
#Summarize the data on daily level
crime_data_oct17 %>%
group_by(date_occurred) %>%
summarise(n = n_distinct(DR_number)) %>%
arrange(date_occurred) %>%
#Plot the line
ggplot(aes(x = date_occurred, y = n)) +
geom_line() +
#Point out the spikes in the data
annotate("rect",xmin = as.Date('2010-12-25'), xmax = as.Date('2011-01-10'),
ymin = 350, ymax = 2100, fill = "red", alpha = 0.5) +
annotate("rect",xmin = as.Date('2012-05-25'), xmax = as.Date('2012-06-05'),
ymin = 450, ymax = 1050, fill = "blue", alpha = 0.5) +
annotate("text", label = "Spikes on 1st January of each year",
x = as.Date('2011-07-15'), y = 2150, size = 3.5, color = "red", alpha = 0.5) +
annotate("text", label = "Small Spikes on 1st every month",
x = as.Date('2012-11-10'), y = 1100, size = 3.5, color = "blue", alpha = 0.5) +
#Axes labels and titles
labs(x = "Date on which crime occurred", y = "Number of crimes",
title = "Number of crimes Vs. Date Occurred") +
scale_y_continuous(labels = comma) +
theme_bw()
When plotting the data at day level, we notice a very high spike in the number of crimes recorded on 1st January of each year. This may be due to the Christmas break, but the spike reduces significantly as we move towards 2017, so we can infer that there might be some issue in data reporting. We also observe small spikes at the start of each month.
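The fading of the 1st January spike can be checked directly by counting crimes recorded on that date in each year:
#Crimes recorded as occurring on 1st January, by year
crime_data_oct17 %>%
filter(month_occurred == 1, day_occurred == 1) %>%
group_by(year_occurred) %>%
summarise(n = n_distinct(DR_number))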
Since daily trends have very high variation, we look at month-level trends in the following section.
################ Summarize the data on month level & then plot ########
crime_data_oct17 %>%
mutate(year_month = format(as.Date(date_occurred),"%Y_%m")) %>%
group_by(year_month) %>%
summarise(n = n_distinct(DR_number)) %>%
#Plot the line
ggplot(aes(x = year_month, y = n, group = 1)) +
geom_line(color = "blue") +
geom_point(color = "blue") +
geom_smooth(se = FALSE, color = "dark grey") +
#Axes labels and titles
labs(x = "Year and Month of crime occurrence",
y = "Number of crimes",
title = "Number of crimes Vs. Year and Month of crime occurrence") +
scale_y_continuous(limits = c(10000,20500), labels = comma) +
scale_x_discrete(breaks =
levels(as.factor(format(as.Date(crime_data_oct17$date_occurred),"%Y_%m")))[c(T, rep(F, 11))]) +
theme_bw()
As expected, the month-level trends have less variation compared to day-level data. We can observe that the number of crimes has a slight decreasing trend until 2014 and an increasing trend thereafter. Also, we can notice that there is always a spike in the number of crimes in January and a dip in February.
####################### trends for top 5 crimes ########################
#Filter the data by joining with top 5 crimes data created earlier
crime_data_oct17 %>%
inner_join(top_5_crimes, by = "crime_code") %>%
mutate(year_month = format(as.Date(date_occurred),"%Y_%m")) %>%
group_by(year_month, crime_code_description) %>%
summarise(n = n_distinct(DR_number)) %>%
#Plot the line
ggplot(aes(x = year_month, y = n, group = 1)) +
geom_line(color = "blue") +
geom_point(color = "blue", size = 0.5) +
facet_wrap(~ crime_code_description, nrow = 5, ncol = 1) +
#Axes labels and titles
xlab("Year and Month of crime occurrence") +
ylab("Number of crimes") +
ggtitle("Monthly trends for top 5 crimes in LA") +
scale_y_continuous(labels = comma) +
scale_x_discrete(breaks =
levels(as.factor(format(as.Date(crime_data_oct17$date_occurred),"%Y_%m")))[c(T, rep(F, 23))]) +
theme_bw()
The above plots show monthly trends of the top 5 crimes in LA. Simple Assault and Burglary remain more or less flat over time. Burglary from Vehicle and Vehicle Stolen initially show a decreasing trend and then increase from 2014 onwards (similar to the overall number of crimes). Petty thefts show a slight increasing trend overall.
Let us use ARIMA to forecast the number of crimes for the next 5 months.
#################### Time series for top 5 crimes ####################
#Filter for data for top 5 crimes
top_crimes_monthly_data <- crime_data_oct17 %>%
inner_join(top_5_crimes, by = "crime_code") %>%
mutate(year_month = format(as.Date(date_occurred),"%Y_%m")) %>%
group_by(year_month, crime_code_description) %>%
summarise(num_crimes = n_distinct(DR_number))
top_crime_desc <- unique(top_crimes_monthly_data$crime_code_description)
#Creating a for loop to predict and plot trends for top 5 crimes
for(i in 1:length(top_crime_desc))
{
num_crimes <- filter(top_crimes_monthly_data,
crime_code_description == top_crime_desc[i])$num_crimes
#Creating time series object
crimes_ts <- ts(num_crimes)
#Fitting the best arima model
model <- auto.arima(crimes_ts, stepwise = FALSE, approximation = FALSE)
#Predict the next 5 months based on the fitted model
predict <- model %>% forecast(level = c(95), h = 5)
#Plot the graph
print(predict %>%
autoplot() +
labs(x = "Year and Month from Oct 2010 to Mar 2018",
y = "Number of crimes",
title = paste0("Crime Forecast for ", top_crime_desc[i]),
label = "XYZ") +
scale_y_continuous(labels = comma) +
theme_bw())
#Print the table for predictions
year_month <- c("2017_11", "2017_12", "2018_01", "2018_02", "2018_03")
table <- cbind(year_month, round(predict$mean, 0),
round(predict$lower, 0), round(predict$upper, 0))
colnames(table) <- c("Year_Month", "Predicted_Mean", "95%_CI_lower_estimate", "95%_CI_upper_estimate")
print(paste0("Predictions for ", top_crime_desc[i], ":"))
print(as.data.frame(table))
}
## [1] "Predictions for BATTERY - SIMPLE ASSAULT:"
## Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1 2017_11 1592 1372 1812
## 2 2017_12 1496 1262 1730
## 3 2018_01 1399 1145 1653
## 4 2018_02 1378 1123 1634
## 5 2018_03 1402 1146 1657
## [1] "Predictions for BURGLARY:"
## Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1 2017_11 1272 1072 1473
## 2 2017_12 1257 1042 1472
## 3 2018_01 1253 1034 1472
## 4 2018_02 1251 1030 1472
## 5 2018_03 1251 1028 1473
## [1] "Predictions for BURGLARY FROM VEHICLE:"
## Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1 2017_11 1571 1364 1778
## 2 2017_12 1571 1343 1799
## 3 2018_01 1571 1324 1818
## 4 2018_02 1571 1306 1836
## 5 2018_03 1571 1289 1853
## [1] "Predictions for THEFT PLAIN - PETTY ($950 & UNDER):"
## Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1 2017_11 1246 1081 1411
## 2 2017_12 1284 1101 1468
## 3 2018_01 1263 1053 1474
## 4 2018_02 1271 1029 1512
## 5 2018_03 1272 1015 1529
## [1] "Predictions for VEHICLE - STOLEN:"
## Year_Month Predicted_Mean 95%_CI_lower_estimate 95%_CI_upper_estimate
## 1 2017_11 1520 1320 1721
## 2 2017_12 1576 1347 1806
## 3 2018_01 1559 1293 1824
## 4 2018_02 1556 1269 1843
## 5 2018_03 1558 1266 1849
The above forecasts are generated by automatic selection of ARIMA models. We would need to do a proper model-fit analysis before drawing firm conclusions; that is beyond the scope of this project for now.
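As a first step in that direction, the forecast package provides residual diagnostics; a sketch for any one of the fitted models above:
#Ljung-Box test and residual ACF plot for the fitted ARIMA model
checkresiduals(model)
#In-sample accuracy measures such as RMSE and MAPE
accuracy(model)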
We can draw some tentative inferences from the predictions: Simple Assault cases might decrease over the next few months, Burglary and Burglary from Vehicle cases might remain at similar levels, and Vehicle Stolen cases might increase slightly.
The purpose of this study was to generate some insights into views on crime such as "Crime is rising" using actual data from the Los Angeles Police Department. Since this is a descriptive analysis, we approached the problem by cleaning the data and then slicing and dicing it, mainly using visualizations to generate insights. The following are some key insights on LA crime:
This analysis gives readers a general idea about crime in LA, with data to support the conclusions. Readers can also explore the data on a map using the following shiny app.
The following are the next steps to improve this analysis: