Introduction
The main objective of this study is to analyze crime data from the Los Angeles Police Department (LAPD), which is also the source of the data.
The dataset provides information on crime incident reports in the city of Los Angeles. It is the combined raw crime data for 2010 through 2017. Crime incident reports are provided by the LAPD to document the initial details surrounding an incident to which LAPD officers respond. The dataset contains records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. The dataset is 349 MB, contains over 1.6 million observations, and includes several types of variables, such as numeric, character and date. It has to be cleaned with a programming language, which in this case is R.
Since this is a descriptive analysis, the major focus is on preparing the data, slicing and dicing it, visualizing it and generating insights.
What is the question you hope to answer?
How has the crime rate in Los Angeles (LA) changed since 2010, which areas of LA have the highest crime rates and could be the most dangerous, and which areas are among the safest?
What data are you planning to use to answer that question?
The data I’m planning to use is sourced from the LAPD. The dataset provides information on crime incident reports in the city of Los Angeles. It is the combined raw crime data for 2010 through 2017.
Packages Required
################################ Loading the required packages ################################
library(DT) #To display scrollable tables in r markdown
library(data.table) #For import using fread
library(dplyr) #For data manipulation
library(tidyr) #For getting the data in tidy format
library(lubridate) #For extracting and working with dates
library(stringr) #For working with strings
library(ggplot2) ## visualize data
library(janitor) #For frequency tables (tabyl)
library(cowplot) #For combining multiple plots
library(knitr) ## present table in webpage
library(shiny) ## develop shiny app
library(googleVis) #For interactive Google charts
library(sf) #For reading and handling spatial data
options(tigris_class = "sf")
What do you know about the data so far?
So far, the raw data has only been examined through its description.
Why did you choose this topic?
The crime dataset was chosen because it is a good starting point for understanding different classes of variables and how to use them in data analysis. It would be interesting to apply data science and visualization techniques to see what insights the data can provide and how data science can be leveraged in the field of crime and protection of the general public.
Data Preparation
This section provides details on steps involved in preparing analytical dataset.
Data import
Some important characteristics of data are provided below:
Time Period - The data includes crimes from 01 January, 2010 to 28 October, 2017
Attributes - For each crime, this data provides 26 different variables like date when crime occurred, date when crime was reported, location of the crime, type of crime, victim’s description and current status of the investigation.
Missing values - Location fields with missing values are recorded as (0,0). Unknown values of victim’s sex and victim’s descent are represented by the character ‘X’.
The raw data was imported into R using fread:
setwd("C:/Users/Sanjiv/Desktop/SanjivKukadia_FinalProject")
rawdata <- fread("Crime_Data_from_2010_to_Present.csv", header = T, sep= ",", dec = "." ,stringsAsFactors = T,
na.strings=c("NA","","(0, 0)","X","-"),
col.names =
c("DR_number", "date_reported", "date_occurred","time_occurred", "area_id", "area_name",
"reporting_district", "crime_code","crime_code_description", "MO_codes",
"victim_age", "victim_sex", "victim_descent","premise_code", "premise_description",
"weapon_used_code", "weapon_description", "status_code", "status_description",
"crime_code_1", "crime_code_2", "crime_code_3", "crime_code_4", "address", "cross_street","location"))
The rawdata contains 1621438 rows and 26 columns:
str(rawdata)#View the imported data
## Classes 'data.table' and 'data.frame': 1621438 obs. of 26 variables:
## $ DR_number : int 1208575 102005556 418 101822289 42104479 120125367 101105609 101620051 101910498 120908292 ...
## $ date_reported : Factor w/ 2865 levels "01/01/2010","01/01/2011",..: 582 193 622 2509 85 60 217 2509 771 702 ...
## $ date_occurred : Factor w/ 2865 levels "01/01/2010","01/01/2011",..: 558 169 614 2502 29 60 209 2481 771 116 ...
## $ time_occurred : int 1800 2300 2030 1800 2300 1400 2230 1600 1600 800 ...
## $ area_id : int 12 20 18 18 21 1 11 16 19 9 ...
## $ area_name : Factor w/ 21 levels "77th Street",..: 1 12 15 15 17 2 11 4 8 18 ...
## $ reporting_district : int 1241 2071 1823 1803 2133 111 1125 1641 1902 904 ...
## $ crime_code : int 626 510 510 510 745 110 510 510 510 668 ...
## $ crime_code_description: Factor w/ 135 levels "ABORTION/ILLEGAL",..: 72 131 131 131 129 38 131 131 131 57 ...
## $ MO_codes : Factor w/ 356621 levels "0100","0100 0104",..: 162934 NA NA NA 46964 254850 NA NA NA 104184 ...
## $ victim_age : int 30 NA 12 NA 84 49 NA NA NA 27 ...
## $ victim_sex : Factor w/ 3 levels "F","H","M": 1 NA NA NA 3 1 NA NA NA 1 ...
## $ victim_descent : Factor w/ 18 levels "A","B","C","D",..: 17 NA NA NA 17 17 NA NA NA 12 ...
## $ premise_code : int 502 101 101 101 501 501 108 101 101 203 ...
## $ premise_description : Factor w/ 210 levels "ABANDONED BUILDING ABANDONED HOUSE",..: 114 178 178 178 168 168 135 178 178 126 ...
## $ weapon_used_code : int 400 NA NA NA NA 400 NA NA NA NA ...
## $ weapon_description : Factor w/ 79 levels "AIR PISTOL/REVOLVER/RIFLE/BB GUN",..: 66 NA NA NA NA 66 NA NA NA NA ...
## $ status_code : Factor w/ 9 levels "13","19","AA",..: 4 6 6 6 6 3 6 6 6 6 ...
## $ status_description : Factor w/ 6 levels "Adult Arrest",..: 2 3 3 3 3 1 3 3 3 3 ...
## $ crime_code_1 : int 626 510 510 510 745 110 510 510 510 668 ...
## $ crime_code_2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ crime_code_3 : int NA NA NA NA NA NA NA NA NA NA ...
## $ crime_code_4 : int NA NA NA NA NA NA NA NA NA NA ...
## $ address : Factor w/ 71283 levels "00","00 17TH AV",..: 49016 70335 24424 58467 52720 47587 71242 64574 65307 52870 ...
## $ cross_street : Factor w/ 11247 levels "10 FY",..: NA 136 NA 10680 NA NA 1107 9956 3305 NA ...
## $ location : Factor w/ 60818 levels "(0, 0)","(33.3427, -118.3258)",..: 9124 20743 4847 6165 45434 23970 34320 52232 60220 45589 ...
## - attr(*, ".internal.selfref")=<externalptr>
neighbourhoods.geojson : GeoJSON file of neighbourhoods of the city.
la_nhoods <- sf::read_sf("neighbourhoods.geojson")
Data Cleaning
The raw data do not follow the concepts of tidy data, since some issues were found:
- The location field has latitude and longitude within the same cell.
- The format of date is mm/dd/yyyy, which is fine in daily life, but I want to separate it so I can analyze on different scales of time.
- Some columns, such as crime code, case number and area number, are not my concern, so I want to delete some columns.
- Observations that contain missing values are deleted.
- I also created a new variable “crime_type1” from the variable “crime_code_description” because it has different values that have the same meaning.
Missing values
table(is.na(rawdata))
##
## FALSE TRUE
## 33199366 8958022
There are 8,958,022 missing values in the data. Variables with missing values are excluded from the analysis, such as victim_age, victim_sex, premise_code, weapon_used_code, crime_code_1, crime_code_2, crime_code_3, crime_code_4, victim_descent.
# Share of missing values in each column
sapply(rawdata, function(col) {
mean(is.na(col))
})
## DR_number date_reported date_occurred time_occurred
## 0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00
## area_id area_name reporting_district crime_code
## 0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00
## crime_code_description MO_codes victim_age victim_sex
## 2.78148e-04 1.08441e-01 8.09238e-02 1.07604e-01
## victim_descent premise_code premise_description weapon_used_code
## 1.18705e-01 4.74887e-05 2.13946e-03 6.68695e-01
## weapon_description status_code status_description crime_code_1
## 6.68696e-01 1.23347e-06 0.00000e+00 3.70042e-06
## crime_code_2 crime_code_3 crime_code_4 address
## 9.36779e-01 9.98598e-01 9.99954e-01 0.00000e+00
## cross_street location
## 8.33867e-01 5.55063e-06
rawdata <- rawdata %>%
select(-MO_codes,-victim_age, -victim_sex,-victim_descent,-premise_code, -weapon_used_code,-weapon_description, -crime_code_1, -crime_code_2, -crime_code_3, -crime_code_4,-cross_street) %>% ## drop useless columns,
na.omit() ## omit missing values.
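The column-dropping step above can be sketched on a toy data frame. This is only an illustration and not the code used in the study: the data frame `toy`, its column names and the 50% cutoff are all assumptions chosen for the example.

```r
# Toy data: names and values are hypothetical, for illustration only
toy <- data.frame(
  id     = 1:4,
  area   = c("Central", NA, "Pacific", "Harbor"),
  weapon = c(NA, NA, NA, "KNIFE"),
  stringsAsFactors = FALSE
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Keep columns whose NA share is at most an assumed cutoff of 50%
cutoff <- 0.5
kept <- toy[, na_share <= cutoff, drop = FALSE]

# Drop the rows that still contain a missing value
clean <- na.omit(kept)

print(na_share)
print(clean)
```

Dropping the sparse columns before calling na.omit() matters: applied to all columns, na.omit() would discard every row that lacks, for example, a weapon code.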
Formatting Columns
###################### Split location into latitude and longitude ##########################
rawdata <- rawdata %>%
#Split the location variable using ','
separate(location, into = c("latitude", "longitude"), sep = ',') %>%
#Split the crime_code_description variable using '-'
separate(crime_code_description, into = c("crime_type1", "crime_type2"), sep = '-') %>%
#Replace brackets in the strings and then convert them to numeric
mutate(latitude = as.numeric(str_replace(latitude,'\\(','')),
longitude = as.numeric(str_replace(longitude,'\\)','')),
# Remove leading and/or trailing whitespace from character strings.
crime_type1 = trimws(crime_type1, which="both") ,
crime_type2 = trimws(crime_type2, which="both"),
crime_type1 = gsub("UNK|Unknown", "unknown", crime_type1)
)
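As a rough base-R sketch of the location split, the same idea can be tried on two made-up coordinate strings. The values in `loc` are hypothetical and only mimic the "(lat, lon)" layout of the raw location field.

```r
# Hypothetical location strings in the "(lat, lon)" layout of the raw data
loc <- c("(34.0522, -118.2437)", "(33.9425, -118.4081)")

# Remove the parentheses, then split on the comma and any following spaces
parts <- strsplit(gsub("[()]", "", loc), ",\\s*")

# First piece is the latitude, second piece is the longitude
latitude  <- as.numeric(sapply(parts, `[`, 1))
longitude <- as.numeric(sapply(parts, `[`, 2))

print(data.frame(latitude, longitude))
```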
Creating new variables
In this section, a few additional variables are created:
TimeOfDay
TimeOfDay splits time_occurred into six-hour segments:
-Midnight - 6am: Overnight
-6am - Midday: Morning
-Midday - 6pm: Afternoon
-6pm - Midnight: Evening
############################ Converting time to proper format ########################
rawdata$TimeOfDay <- cut(rawdata$time_occurred,
c(0, 600, 1200, 1800, 2400),
labels = c("Overnight", "Morning", "Afternoon", "Evening"),
right = FALSE)
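To see how the boundaries behave, the same cut() call can be applied to a few sample times. The values in `times` are hypothetical and were picked to sit on or next to the break points.

```r
# Sample times in HHMM integer form (hypothetical values)
times <- c(0, 559, 600, 1159, 1200, 1800, 2359, 2400)

bins <- cut(times,
            breaks = c(0, 600, 1200, 1800, 2400),
            labels = c("Overnight", "Morning", "Afternoon", "Evening"),
            right = FALSE)  # intervals are closed on the left: [0, 600), [600, 1200), ...

print(data.frame(times, bins))
```

With right = FALSE, a time of exactly 600 is labelled Morning, and a value of 2400 falls outside the last interval and becomes NA. If such a value can occur in the data, adding include.lowest = TRUE would place it in the Evening bin.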
year_occurred, month_occurred, day_occurred and dayOfWeek
##################### Create new variables for analysis ##################################
rawdata <- rawdata %>%
mutate(
#formatting the date and time variables
date_occurred = mdy(date_occurred),
#Creating variables on date when crime occurred
year_occurred = year(date_occurred),
month_occurred = month(date_occurred),
day_occurred = day(date_occurred),
dayOfWeek = wday(date_occurred, label = TRUE))
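A small sketch of what these lubridate calls return, using two made-up date strings in the same month/day/year layout as the raw data:

```r
library(lubridate)

# Hypothetical date strings in the month/day/year layout of the raw data
d <- mdy(c("01/01/2010", "10/28/2017"))

parts <- data.frame(
  date  = d,
  year  = year(d),
  month = month(d),
  day   = day(d),
  dow   = wday(d, label = TRUE)  # ordered factor; the labels depend on the locale
)
print(parts)
```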
Data Dictionary
This dictionary can help you understand the type and meaning of the variables in my dataset.
var_name <- colnames(rawdata)
var_des <- c("Division of Records Number","Date crime was reported","Date crime occurred",
"Time crime occurred", "ID of area where crime happened", "Name of area where crime happened","Sub-area within a Geographic Area.","crime_code" , "Type of crime 1", "Type of crime 2", "The type of structure, vehicle, or location where the crime took place.","Status of the case. (IC is the default)","Defines the Status Code provided.",
"Street where crime happened", "Latitude of location where crime happened", "Longitude of location where crime happened", "Time Of Day", "Year crime occurred","Month crime occurred", "Day crime occurred","the day of the week on which crime occurred")
var_type <- sapply(rawdata, class)
data_d <- as.data.frame(cbind(var_name, var_type, var_des))
colnames(data_d) <- c("Variable Name", "Type of Variable", "Description")
kable(data_d, caption = "Variable Dictionary")
| | Variable Name | Type of Variable | Description |
|---|---|---|---|
| DR_number | DR_number | integer | Division of Records Number |
| date_reported | date_reported | factor | Date crime was reported |
| date_occurred | date_occurred | Date | Date crime occurred |
| time_occurred | time_occurred | integer | Time crime occurred |
| area_id | area_id | integer | ID of area where crime happened |
| area_name | area_name | factor | Name of area where crime happened |
| reporting_district | reporting_district | integer | Sub-area within a Geographic Area. |
| crime_code | crime_code | integer | crime_code |
| crime_type1 | crime_type1 | character | Type of crime 1 |
| crime_type2 | crime_type2 | character | Type of crime 2 |
| premise_description | premise_description | factor | The type of structure, vehicle, or location where the crime took place. |
| status_code | status_code | factor | Status of the case. (IC is the default) |
| status_description | status_description | factor | Defines the Status Code provided. |
| address | address | factor | Street where crime happened |
| latitude | latitude | numeric | Latitude of location where crime happened |
| longitude | longitude | numeric | Longitude of location where crime happened |
| TimeOfDay | TimeOfDay | factor | Time Of Day |
| year_occurred | year_occurred | numeric | Year crime occurred |
| month_occurred | month_occurred | numeric | Month crime occurred |
| day_occurred | day_occurred | integer | Day crime occurred |
| dayOfWeek | dayOfWeek | c("ordered", "factor") | the day of the week on which crime occurred |
#save(file="crime_LA.Rdata",rawdata)
#load("crime_LA.Rdata")
Exploratory Data Analysis
Crime by Time of Day
rawdata %>% tabyl(TimeOfDay)
## TimeOfDay n percent
## Overnight 222039 0.137272
## Morning 331295 0.204817
## Afternoon 534518 0.330456
## Evening 529665 0.327456
The above table shows that most crimes were committed in the afternoon (33.0%), followed by the evening (32.7%) and the morning (20.5%), while only 13.7% were committed overnight.
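The percentages can be recomputed directly from the counts in the tabyl() output. The short check below copies those four counts and uses only base R.

```r
# Counts copied from the tabyl() output above
n <- c(Overnight = 222039, Morning = 331295, Afternoon = 534518, Evening = 529665)

# Share of each period, as a percentage rounded to one decimal place
pct <- round(100 * n / sum(n), 1)
print(pct)
```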
dcount <- as.data.frame(table(rawdata$TimeOfDay))
day <- gvisPieChart(dcount, labelvar = "Var1", numvar = "Freq", options=list(title="Number of Crimes by Time of Day"))
plot(day)
The most dangerous areas in LA
A Shiny app will be developed so the audience can select a time and location to see the number and most common types of crimes in LA.
The number of crimes in each area is given in the following charts:
## barchart for crimes in different areas ##
#reorder the table and reset the factor to that ordering
rawdata %>%
tabyl(area_name) %>% # calculate the counts
arrange(percent) %>% # sort by counts
mutate(area_name = factor(area_name, area_name)) %>% # reset factor
tail(15) %>%
ggplot(aes(x=area_name, y=n)) + # plot
geom_bar(aes(fill = area_name),stat="identity") + # draw one bar per area
labs(title = "Dangerous Areas in LA",
subtitle = "Number of Crimes in Different Areas",x="",y = "Number of Crimes", fill="area_name") +
theme(legend.position="none", axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
From this bar chart and the map below, the most dangerous areas over the 2010 to 2017 period are 77th Street, Southwest and N Hollywood.
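The bar chart relies on resetting the factor levels after sorting, so that the bars appear in order of count rather than alphabetically. A minimal base-R sketch of that step follows; the area names and counts are hypothetical.

```r
# Hypothetical area counts, for illustration only
counts <- data.frame(
  area_name = c("Area A", "Area B", "Area C"),
  n         = c(120, 300, 210)
)

# Sort by count, then fix the factor levels to that order so that
# plotting functions keep the sorted order instead of the alphabetical one
counts <- counts[order(counts$n), ]
counts$area_name <- factor(counts$area_name, levels = counts$area_name)

print(levels(counts$area_name))
```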
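## Line chart for the number of crimes by area in different years ##
rawdata %>% tabyl(area_name, year_occurred) %>% # calculate the cross table (one column per year)
pivot_longer(-area_name, names_to = "variable", values_to = "value") %>% # reshape to one row per area and year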
arrange(desc(value)) %>%
head(45) %>%
ggplot(aes(x=variable, y=value,group = area_name)) + # plot
geom_line(aes(linetype = area_name, color = area_name)) +
geom_point(aes(shape = area_name, color = area_name)) +
ylab("Count")+xlab("YEAR")+
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Most Dangerous Areas: Number of Crime in Different Years")
The number of crimes decreased in 77th Street and Southwest from 2010 to 2013 and increased from 2013 to 2016. For the Pacific and Southwest areas, there is no big variation in the number of crimes by year.
library(ggplot2)
library(tidyverse)
library(tidyr)
ggplot(la_nhoods) +
geom_count(data = subset(rawdata, year_occurred == 2017), aes(x = longitude, y = latitude, color = area_name))+
labs(title = "Dangerous Areas in LA: YEAR 2017")+
theme(
panel.background = element_rect(colour = "black",fill="white",size=.1),
legend.text = element_text(size = 8),
legend.title = element_text(size = 8),
panel.grid.major = element_line(colour = 'grey'),
panel.grid.minor = element_line(colour = 'grey'),
plot.title = element_text(lineheight=.5, face="bold"),
axis.text.x = element_text(angle = -45, hjust = -0.05))+
guides(col = guide_legend(reverse = TRUE))
Top 15 types of crime
## barchart for Top 15 types of crime ##
rawdata %>%
tabyl(crime_type1) %>% # calculate the counts
arrange(percent) %>% # sort by counts
mutate(crime_type1 = factor(crime_type1, crime_type1)) %>% # reset factor
tail(15) -> crime_index
crime_index %>%
ggplot(aes(x=crime_type1, y=percent)) + # plot
geom_bar(aes(fill = crime_type1),stat="identity") + # draw one bar per crime type
geom_text(aes( label = scales::percent(percent),
y= percent ), vjust = -.5) +
labs(title = "Top 15 Types of Crime",x="",y = "Percent", fill="DISTRICT") +
theme(legend.position="none", axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(size=8)) +
coord_flip()
Over the 2010 to 2017 period, Vandalism, Battery and Vehicle Accident kept their place at the top of the crime list in LA.
## Line chart for the number of crimes by type in different years ##
rawdata %>% tabyl(crime_type1, year_occurred) %>% # calculate the cross table (one column per year)
pivot_longer(-crime_type1, names_to = "variable", values_to = "value") %>% # reshape to one row per type and year
arrange(desc(value)) %>%
head(45) %>%
ggplot(aes(x=variable, y=value,group = crime_type1)) + # plot
geom_line(aes(linetype = crime_type1, color = crime_type1)) +
geom_point(aes(shape = crime_type1, color = crime_type1)) +
ylab("Count")+xlab("YEAR")+
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Number of Crime in Different Years")
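The line charts need one row per category and year, while a cross table has one column per year. The sketch below shows that wide-to-long reshaping on a toy table using tidyr's pivot_longer(), one of several functions that can do this. The crime types and counts are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one row per crime type, one column per year
wide <- data.frame(
  crime_type1 = c("TYPE A", "TYPE B"),
  `2010` = c(50, 30),
  `2011` = c(45, 35),
  check.names = FALSE
)

# One row per (crime type, year) pair, the layout ggplot2 works with
long <- pivot_longer(wide, -crime_type1, names_to = "variable", values_to = "value")
print(long)
```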
Crime based on year(s)
## linechart for crime in different years ##
table(rawdata$year_occurred) %>%
as.data.frame() %>%
arrange(Var1) %>%
ggplot() +
geom_line(mapping = aes(x = Var1, y = Freq, group = 1),
linetype = 1, color = "#006699") +
labs(title = "Number of Crime in Different Years",
x = "Year", y = "Count") +
theme(axis.text.x = element_text(angle = 0)) +
geom_point(aes(x = Var1, y = Freq), size = 4, color = "#FF9933")
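From the results above, the number of crimes in LA increased until 2016. Note that 2017 is a partial year, since the data end on 28 October 2017, so its total is not directly comparable with the earlier years.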
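Because the data end on 28 October 2017, one way to compare the last year fairly is to restrict every year to the same months. The sketch below illustrates the idea on made-up monthly counts. The data frame `crimes` and its values are assumptions, not figures from the LAPD data.

```r
# Hypothetical monthly counts: 2016 is complete, 2017 stops after October
crimes <- data.frame(
  year  = c(rep(2016, 12), rep(2017, 10)),
  month = c(1:12, 1:10),
  n     = 100
)

# Last month observed in the most recent (partial) year
last_month <- max(crimes$month[crimes$year == max(crimes$year)])

# Compare years on the same January-to-last_month window
same_window <- crimes[crimes$month <= last_month, ]
totals <- tapply(same_window$n, same_window$year, sum)
print(totals)
```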
Conclusion
From our findings, it can be concluded that the crime rate in LA was increasing until 2016. We can also identify the safest areas to visit or live in LA. On the other hand, we can identify the most dangerous areas in LA, which should be avoided and might need extra measures from crime prevention agencies. We also determined which types of crime were at the top of the list.
The results also show that the number of crimes decreased in 77th Street and Southwest from 2010 to 2013 and increased from 2013 to 2016. However, for the Pacific and Southwest areas, there is no big variation in the number of crimes by year.
Further studies could examine how demographics and other factors, such as age and sex, might contribute to the crime rate.