Introduction

The main objective of this study is to analyze real crime data from the Los Angeles Police Department (LAPD), so the data was sourced directly from the LAPD.

The dataset provides information on crime incident reports in the city of Los Angeles. It is the combined raw crime data for 2010 through 2017. Crime incident reports are provided by the Los Angeles Police Department (LAPD) to document the initial details surrounding an incident to which LAPD officers respond. The dataset contains records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. The file is 349 MB, contains over 1.6 million observations, and has variables of several types, such as numeric, character and date. It has to be cleaned programmatically, which in this case is done in R.

Since this is a descriptive analysis, the major focus is on preparing the data, slicing and dicing it, visualization, and generating insights.

What is the question you hope to answer?

How has the crime rate in Los Angeles (LA) changed since 2010, which areas of LA have the highest crime rates and could be the most dangerous, and which areas are among the safest?

What data are you planning to use to answer that question?

The data I’m planning to use is sourced from the LAPD. The dataset provides information on crime incident reports in the city of Los Angeles. It is the combined raw crime data for 2010 through 2017.

Packages Required

################################ Loading the required packages ################################

library(DT)         # To display scrollable tables in R Markdown
library(data.table) # For import using fread
library(dplyr)      # For data manipulation
library(tidyr)      # For getting the data in tidy format
library(lubridate)  # For extracting and working with dates
library(stringr)    # For working with strings
library(ggplot2)    # For visualizing data
library(janitor)    # For frequency tables (tabyl)
library(cowplot)    # For combining multiple plots
library(knitr)      # For presenting tables in the web page
library(shiny)      # For developing the Shiny app
library(googleVis)  # For Google Charts such as gvisPieChart
library(sf)         # For reading and plotting spatial data
options(tigris_class = "sf")

What do you know about the data so far?

So far, the raw data is known only from its description.

Why did you choose this topic?

The crime dataset was chosen because it is a good starting point for understanding different classes of variables and how to use them in data analysis. It is also interesting to apply data science and visualization techniques to see what insights the data can provide and how data science can be leveraged in the field of crime prevention and the protection of the general public.

Data Preparation

This section provides details on the steps involved in preparing the analytical dataset.

Data import

Some important characteristics of data are provided below:

Time Period - The data includes crimes from 01 January, 2010 to 28 October, 2017.
Attributes - For each crime, the data provides 26 variables, such as the date the crime occurred, the date it was reported, the location of the crime, the type of crime, the victim’s description and the current status of the investigation.
Missing values - Missing values in the location field are replaced with (0,0). Unknown values of the victim’s sex and descent are represented by the character ‘X’.

The raw data was imported into R using fread:

setwd("C:/Users/Sanjiv/Desktop/SanjivKukadia_FinalProject")
rawdata <- fread("Crime_Data_from_2010_to_Present.csv",
                 header = TRUE, sep = ",", dec = ".", stringsAsFactors = TRUE,
                 # Strings that should be read as missing values
                 na.strings = c("NA", "", "(0, 0)", "X", "-"),
                 col.names =
                   c("DR_number", "date_reported", "date_occurred", "time_occurred", "area_id", "area_name",
                     "reporting_district", "crime_code", "crime_code_description", "MO_codes",
                     "victim_age", "victim_sex", "victim_descent", "premise_code", "premise_description",
                     "weapon_used_code", "weapon_description", "status_code", "status_description",
                     "crime_code_1", "crime_code_2", "crime_code_3", "crime_code_4", "address", "cross_street", "location"))

The rawdata contains 1621438 rows and 26 columns:

str(rawdata)#View the imported data

## Classes 'data.table' and 'data.frame':   1621438 obs. of  26 variables:
##  $ DR_number             : int  1208575 102005556 418 101822289 42104479 120125367 101105609 101620051 101910498 120908292 ...
##  $ date_reported         : Factor w/ 2865 levels "01/01/2010","01/01/2011",..: 582 193 622 2509 85 60 217 2509 771 702 ...
##  $ date_occurred         : Factor w/ 2865 levels "01/01/2010","01/01/2011",..: 558 169 614 2502 29 60 209 2481 771 116 ...
##  $ time_occurred         : int  1800 2300 2030 1800 2300 1400 2230 1600 1600 800 ...
##  $ area_id               : int  12 20 18 18 21 1 11 16 19 9 ...
##  $ area_name             : Factor w/ 21 levels "77th Street",..: 1 12 15 15 17 2 11 4 8 18 ...
##  $ reporting_district    : int  1241 2071 1823 1803 2133 111 1125 1641 1902 904 ...
##  $ crime_code            : int  626 510 510 510 745 110 510 510 510 668 ...
##  $ crime_code_description: Factor w/ 135 levels "ABORTION/ILLEGAL",..: 72 131 131 131 129 38 131 131 131 57 ...
##  $ MO_codes              : Factor w/ 356621 levels "0100","0100 0104",..: 162934 NA NA NA 46964 254850 NA NA NA 104184 ...
##  $ victim_age            : int  30 NA 12 NA 84 49 NA NA NA 27 ...
##  $ victim_sex            : Factor w/ 3 levels "F","H","M": 1 NA NA NA 3 1 NA NA NA 1 ...
##  $ victim_descent        : Factor w/ 18 levels "A","B","C","D",..: 17 NA NA NA 17 17 NA NA NA 12 ...
##  $ premise_code          : int  502 101 101 101 501 501 108 101 101 203 ...
##  $ premise_description   : Factor w/ 210 levels "ABANDONED BUILDING ABANDONED HOUSE",..: 114 178 178 178 168 168 135 178 178 126 ...
##  $ weapon_used_code      : int  400 NA NA NA NA 400 NA NA NA NA ...
##  $ weapon_description    : Factor w/ 79 levels "AIR PISTOL/REVOLVER/RIFLE/BB GUN",..: 66 NA NA NA NA 66 NA NA NA NA ...
##  $ status_code           : Factor w/ 9 levels "13","19","AA",..: 4 6 6 6 6 3 6 6 6 6 ...
##  $ status_description    : Factor w/ 6 levels "Adult Arrest",..: 2 3 3 3 3 1 3 3 3 3 ...
##  $ crime_code_1          : int  626 510 510 510 745 110 510 510 510 668 ...
##  $ crime_code_2          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ crime_code_3          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ crime_code_4          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ address               : Factor w/ 71283 levels "00","00    17TH                         AV",..: 49016 70335 24424 58467 52720 47587 71242 64574 65307 52870 ...
##  $ cross_street          : Factor w/ 11247 levels "10                           FY",..: NA 136 NA 10680 NA NA 1107 9956 3305 NA ...
##  $ location              : Factor w/ 60818 levels "(0, 0)","(33.3427, -118.3258)",..: 9124 20743 4847 6165 45434 23970 34320 52232 60220 45589 ...
##  - attr(*, ".internal.selfref")=<externalptr>

neighbourhoods.geojson : GeoJSON file of neighbourhoods of the city.

la_nhoods <- sf::read_sf("neighbourhoods.geojson")

Data Cleaning

The raw data does not follow the principles of tidy data, since several issues were found:

The location field holds latitude and longitude in the same cell.
The date format is mm/dd/yyyy, which is fine in daily life, but the date needs to be split into parts so that it can be analyzed on different time scales.
Some columns, such as crime code, case number and area number, are not of interest, so some columns will be deleted.
Observations that contain missing values are deleted.
A new variable “crime_type1” is created from the variable “crime_code_description”, because the original has different values that share the same meaning.

Missing values

table(is.na(rawdata))

## 
##    FALSE     TRUE 
## 33199366  8958022

There are 8958022 missing values in the data (about 21% of all cells). We will exclude variables with missing values from the analysis, such as victim_age, victim_sex, premise_code, weapon_used_code, crime_code_1, crime_code_2, crime_code_3, crime_code_4 and victim_descent. The share of missing values in each variable is:

sapply(rawdata, function(col) {
  # Share of missing values in the column
  mean(is.na(col))
})

##              DR_number          date_reported          date_occurred          time_occurred 
##            0.00000e+00            0.00000e+00            0.00000e+00            0.00000e+00 
##                area_id              area_name     reporting_district             crime_code 
##            0.00000e+00            0.00000e+00            0.00000e+00            0.00000e+00 
## crime_code_description               MO_codes             victim_age             victim_sex 
##            2.78148e-04            1.08441e-01            8.09238e-02            1.07604e-01 
##         victim_descent           premise_code    premise_description       weapon_used_code 
##            1.18705e-01            4.74887e-05            2.13946e-03            6.68695e-01 
##     weapon_description            status_code     status_description           crime_code_1 
##            6.68696e-01            1.23347e-06            0.00000e+00            3.70042e-06 
##           crime_code_2           crime_code_3           crime_code_4                address 
##            9.36779e-01            9.98598e-01            9.99954e-01            0.00000e+00 
##           cross_street               location 
##            8.33867e-01            5.55063e-06

rawdata <- rawdata %>%
  # Drop the columns that are not used in the analysis
  select(-MO_codes, -victim_age, -victim_sex, -victim_descent, -premise_code,
         -weapon_used_code, -weapon_description,
         -crime_code_1, -crime_code_2, -crime_code_3, -crime_code_4, -cross_street) %>%
  # Omit the rows that still contain missing values
  na.omit()
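
An alternative to listing the dropped columns by hand is to drop every column whose missing share exceeds a cutoff and then remove the remaining incomplete rows. This is only a sketch under stated assumptions: the `toy` data frame and the `cutoff` value are hypothetical, and this rule is not the one that produced the column list above.

```r
# Hypothetical toy data (not the LAPD data)
toy <- data.frame(
  area_name   = c("Central", "Pacific", "Harbor", "Central", NA),
  crime_code  = c(510, 626, 510, 745, 110),
  weapon_code = c(NA, 400, NA, NA, NA),
  stringsAsFactors = FALSE
)

cutoff <- 0.5  # assumed threshold, chosen only for illustration

# Keep the columns whose share of missing values is at or below the cutoff
missing_share <- colMeans(is.na(toy))
keep_cols <- names(missing_share)[missing_share <= cutoff]

# Drop the sparse columns first, then the rows that are still incomplete
cleaned <- na.omit(toy[, keep_cols, drop = FALSE])

# Number of rows lost to na.omit()
rows_removed <- nrow(toy) - nrow(cleaned)
print(cleaned)
print(rows_removed)
```

Dropping sparse columns before calling `na.omit()` matters: if the sparse `weapon_code` column were kept, almost every row would be removed.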

Formatting Columns

###################### Split location into latitude and longitude ##########################
rawdata <- rawdata %>%
  # Split the location variable using ','
  separate(location, into = c("latitude", "longitude"), sep = ',') %>%
  # Split the crime_code_description variable using '-'
  separate(crime_code_description, into = c("crime_type1", "crime_type2"), sep = '-') %>%
  mutate(
    # Remove the brackets from the strings and then convert them to numeric
    latitude = as.numeric(str_replace(latitude, '\\(', '')),
    longitude = as.numeric(str_replace(longitude, '\\)', '')),
    # Remove leading and trailing whitespace from the crime type strings
    crime_type1 = trimws(crime_type1, which = "both"),
    crime_type2 = trimws(crime_type2, which = "both"),
    # Map the different spellings of "unknown" to a single value
    crime_type1 = gsub("UNK|Unknown", "unknown", crime_type1)
  )
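
The location parsing can be checked on a few hand-written strings. The sketch below uses a hypothetical `toy` data frame, and it additionally sets a (0, 0) pair to NA, on the assumption (taken from the data description) that (0, 0) is a placeholder for a missing location rather than a real coordinate.

```r
library(dplyr)
library(tidyr)

# Hypothetical location strings in the same "(lat, lon)" layout as the raw data
toy <- data.frame(
  location = c("(34.0522, -118.2437)", "(33.9425, -118.4081)", "(0, 0)"),
  stringsAsFactors = FALSE
)

parsed <- toy %>%
  # Split "(lat, lon)" into two text columns at the comma
  separate(location, into = c("latitude", "longitude"), sep = ",") %>%
  mutate(
    # Strip brackets and spaces, then convert to numeric
    latitude  = as.numeric(gsub("[() ]", "", latitude)),
    longitude = as.numeric(gsub("[() ]", "", longitude)),
    # Assumption: (0, 0) marks a missing location
    is_placeholder = latitude == 0 & longitude == 0,
    latitude  = if_else(is_placeholder, NA_real_, latitude),
    longitude = if_else(is_placeholder, NA_real_, longitude)
  ) %>%
  select(-is_placeholder)

print(parsed)
```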

Creating new variables

In this section, a few additional variables are created:

TimeOfDay

TimeOfDay splits time_occurred into four six-hour segments:

- Midnight to 6am: Overnight
- 6am to Midday: Morning
- Midday to 6pm: Afternoon
- 6pm to Midnight: Evening

############################ Binning time into six-hour segments ########################

# right = FALSE makes each interval closed on the left, e.g. 600 falls in "Morning"
rawdata$TimeOfDay <- cut(rawdata$time_occurred,
                        breaks = c(0, 600, 1200, 1800, 2400),
                        labels = c("Overnight", "Morning", "Afternoon", "Evening"),
                        right = FALSE)
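
To see how the boundaries behave, the same breaks can be applied to a handful of example times. This is a small sketch with made-up values in `times`. With `right = FALSE` each interval includes its lower bound and excludes its upper bound, so a value of exactly 2400, if one were present, would fall outside the last interval and become NA.

```r
# Made-up example times in 24-hour HHMM form
times <- c(0, 559, 600, 1159, 1200, 1800, 2359)

breaks <- c(0, 600, 1200, 1800, 2400)
labels <- c("Overnight", "Morning", "Afternoon", "Evening")

# Left-closed intervals: [0, 600), [600, 1200), [1200, 1800), [1800, 2400)
time_of_day <- cut(times, breaks = breaks, labels = labels, right = FALSE)
print(data.frame(time_occurred = times, TimeOfDay = time_of_day))

# A value equal to the last break is not covered by any interval
edge_case <- cut(2400, breaks = breaks, labels = labels, right = FALSE)
print(edge_case)
```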

year_occurred, month_occurred, day_occurred and dayOfWeek

##################### Create new variables for analysis ##################################

rawdata <- rawdata %>%
  mutate(
    # Convert the date the crime occurred from text (mm/dd/yyyy) to a Date
    date_occurred = mdy(date_occurred),
    # Create variables from the date the crime occurred
    year_occurred = year(date_occurred),
    month_occurred = month(date_occurred),
    day_occurred = day(date_occurred),
    dayOfWeek = wday(date_occurred, label = TRUE)
  )
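
The date handling can be tried on two example strings in the same mm/dd/yyyy layout. This is a minimal sketch with hypothetical input dates. It uses the numeric form of `wday()` with `week_start = 7` (Sunday = 1) so that the result does not depend on the locale used for weekday labels.

```r
library(lubridate)

# Hypothetical date strings in mm/dd/yyyy form
d <- mdy(c("01/15/2010", "10/28/2017"))

parts <- data.frame(
  date    = d,
  year    = year(d),
  month   = month(d),
  day     = day(d),
  # Numeric weekday with Sunday = 1, ..., Saturday = 7
  weekday = wday(d, week_start = 7)
)
print(parts)
```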

Data Dictionary

This dictionary describes the type and meaning of each variable in the dataset.

var_name <- colnames(rawdata)
var_des <- c("Division of Records Number", "Date crime was reported", "Date crime occurred",
             "Time crime occurred", "ID of area where crime happened", "Name of area where crime happened",
             "Sub-area within a Geographic Area.", "Crime code", "Type of crime 1", "Type of crime 2",
             "The type of structure, vehicle, or location where the crime took place.",
             "Status of the case. (IC is the default)", "Defines the Status Code provided.",
             "Street where crime happened", "Latitude of location where crime happened",
             "Longitude of location where crime happened", "Time of day", "Year crime occurred",
             "Month crime occurred", "Day crime occurred", "Day of the week on which crime occurred")
# Collapse multi-class columns (e.g. ordered factors) into a single string
var_type <- sapply(rawdata, function(col) paste(class(col), collapse = ", "))

data_d <- as.data.frame(cbind(var_name, var_type, var_des))

colnames(data_d) <- c("Variable Name", "Type of Variable", "Description")

kable(data_d, caption = "Variable Dictionary", row.names = FALSE)

Variable Dictionary
Variable Name	Type of Variable	Description
DR_number	integer	Division of Records Number
date_reported	factor	Date crime was reported
date_occurred	Date	Date crime occurred
time_occurred	integer	Time crime occurred
area_id	integer	ID of area where crime happened
area_name	factor	Name of area where crime happened
reporting_district	integer	Sub-area within a Geographic Area.
crime_code	integer	Crime code
crime_type1	character	Type of crime 1
crime_type2	character	Type of crime 2
premise_description	factor	The type of structure, vehicle, or location where the crime took place.
status_code	factor	Status of the case. (IC is the default)
status_description	factor	Defines the Status Code provided.
address	factor	Street where crime happened
latitude	numeric	Latitude of location where crime happened
longitude	numeric	Longitude of location where crime happened
TimeOfDay	factor	Time of day
year_occurred	numeric	Year crime occurred
month_occurred	numeric	Month crime occurred
day_occurred	integer	Day crime occurred
dayOfWeek	ordered, factor	Day of the week on which crime occurred

#save(file="crime_LA.Rdata",rawdata)
#load("crime_LA.Rdata")

Exploratory Data Analysis

Crime by Time of Day

rawdata %>% tabyl(TimeOfDay)

##  TimeOfDay      n  percent
##  Overnight 222039 0.137272
##    Morning 331295 0.204817
##  Afternoon 534518 0.330456
##    Evening 529665 0.327456

The table above shows that most crimes were committed in the afternoon (33.0%), followed by the evening (32.7%) and the morning (20.5%), while only 13.7% occurred overnight.
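
The percentages can be recomputed directly from the counts in the table. The short sketch below copies the four counts from the output above into a named vector (the vector name `counts` is just for illustration) and converts them to shares with base R's `prop.table()`.

```r
# Counts copied from the tabyl output above
counts <- c(Overnight = 222039, Morning = 331295,
            Afternoon = 534518, Evening = 529665)

# Share of each time segment, in percent, rounded to one decimal place
shares <- round(100 * prop.table(counts), 1)
print(shares)

# Segment with the most crimes
busiest <- names(which.max(counts))
print(busiest)
```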

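dcount <- as.data.frame(table(rawdata$TimeOfDay))

# Pie chart of crime counts by time of day (Var1 and Freq are the columns created by table())
day <- gvisPieChart(dcount, labelvar = "Var1", numvar = "Freq",
                    options = list(title = "Number of Crimes by Time of Day"))

plot(day)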

The most dangerous areas in LA

A Shiny app will be developed so that the audience can select a time and location and see the number and most common types of crimes in LA.

The number of crimes in each area is shown in the following charts:

## Bar chart of crimes in different areas ##
# Reorder the table and reset the factor to that ordering
rawdata %>%
  tabyl(area_name) %>%                                   # calculate the counts
  arrange(percent) %>%                                   # sort by share of crimes
  mutate(area_name = factor(area_name, area_name)) %>%   # reset factor levels to the sorted order
  tail(15) %>%                                           # keep the 15 areas with the most crimes
  ggplot(aes(x = area_name, y = n)) +
  geom_bar(aes(fill = area_name), stat = "identity") +   # bar chart of the counts
  labs(title = "Dangerous Areas in LA",
       subtitle = "Number of Crimes in Different Areas", x = "", y = "Number of Crimes", fill = "area_name") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

From this bar chart and the map below, the most dangerous areas from 2010 to 2017 are 77th Street, Southwest and N Hollywood.
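
The ranking step behind such a chart can be sketched with plain dplyr and ggplot2. The example below is a hedged alternative rather than the code that produced the chart: the `toy` data frame and its area labels are hypothetical, `count()` replaces `tabyl()`, and `reorder()` orders the bars by count without resetting factor levels by hand.

```r
library(dplyr)
library(ggplot2)

# Hypothetical incident records, one row per crime
toy <- data.frame(
  area_name = c("A", "B", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Count crimes per area and rank the areas from most to fewest crimes
area_counts <- toy %>%
  count(area_name) %>%
  arrange(desc(n))
print(area_counts)

# Horizontal bar chart with the bars ordered by count
p <- ggplot(area_counts, aes(x = reorder(area_name, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "", y = "Number of Crimes")
```

Calling `print(p)` draws the chart. Adding `head(15)` after `arrange()` would keep only the 15 areas with the most crimes.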

## Line chart of the number of crimes per area in different years ##

rawdata %>%
  tabyl(area_name, year_occurred) %>%                        # cross table of area by year
  gather(key = "variable", value = "value", -area_name) %>%  # reshape to long format
  arrange(desc(value)) %>%
  head(45) %>%                                               # keep the 45 largest area-year counts
  ggplot(aes(x = variable, y = value, group = area_name)) +
  geom_line(aes(linetype = area_name, color = area_name)) +
  geom_point(aes(shape = area_name, color = area_name)) +
  ylab("Count") + xlab("YEAR") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Most Dangerous Areas: Number of Crimes in Different Years")

The number of crimes decreased in 77th Street and South West from 2010 to 2013 and then increased from 2013 to 2016. For the Pacific and Southwest areas, there is no big variation in the number of crimes by year.
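
The area-by-year counts behind such a line chart can also be produced directly in long format. The following is a sketch of an alternative approach, not the code used for the chart above: the `toy` data frame and its values are hypothetical, and `dplyr::count()` with two grouping variables replaces the cross table followed by a reshape.

```r
library(dplyr)

# Hypothetical incident records with an area and a year
toy <- data.frame(
  area_name     = c("A", "A", "B", "A", "B"),
  year_occurred = c(2010, 2010, 2010, 2011, 2011),
  stringsAsFactors = FALSE
)

# One row per area and year, with the number of crimes in column n
area_year <- toy %>%
  count(area_name, year_occurred)
print(area_year)
```

The result can be passed straight to `ggplot(aes(x = year_occurred, y = n, group = area_name))` with `geom_line()`.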

ggplot(la_nhoods) +
  # Keep only the crimes that occurred in 2017 (== compares, = would not filter)
  geom_count(data = subset(rawdata, year_occurred == 2017),
             aes(x = longitude, y = latitude, color = area_name)) +
  labs(title = "Dangerous Areas in LA: YEAR 2017") +
  theme(
    panel.background = element_rect(colour = "black", fill = "white", size = .1),
    legend.text = element_text(size = 8),
    legend.title = element_text(size = 8),
    panel.grid.major = element_line(colour = 'grey'),
    panel.grid.minor = element_line(colour = 'grey'),
    plot.title = element_text(lineheight = .5, face = "bold"),
    axis.text.x = element_text(angle = -45, hjust = -0.05)) +
  guides(col = guide_legend(reverse = TRUE))

Top 15 types of crime

## Bar chart of the top 15 types of crime ##

crime_index <- rawdata %>%
  tabyl(crime_type1) %>%                                       # calculate the counts
  arrange(percent) %>%                                         # sort by share of crimes
  mutate(crime_type1 = factor(crime_type1, crime_type1)) %>%   # reset factor levels to the sorted order
  tail(15)                                                     # keep the 15 most common types

crime_index %>%
  ggplot(aes(x = crime_type1, y = percent)) +
  geom_bar(aes(fill = crime_type1), stat = "identity") +       # bar chart of the shares
  geom_text(aes(label = scales::percent(percent), y = percent), vjust = -.5) +
  labs(title = "Top 15 Types of Crime", x = "", y = "Percent", fill = "Crime type") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(size = 8)) +
  coord_flip()

From 2010 to 2017, Vandalism, Battery and Vehicle Accident kept their place at the top of the crime list in LA.
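
The crime types in this chart are the text before the first hyphen of the crime description. The sketch below shows that split on a few hypothetical description strings (they are illustrative and not copied from the dataset). By default `separate()` drops any pieces beyond the second with a warning. This sketch instead uses `extra = "merge"` to keep them in the second column and `fill = "right"` to pad descriptions without a hyphen with NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical crime descriptions
toy <- data.frame(
  crime_code_description = c("BATTERY - SIMPLE ASSAULT", "VEHICLE - STOLEN",
                             "ROBBERY", "THEFT - GRAND - AUTO"),
  stringsAsFactors = FALSE
)

types <- toy %>%
  # Split at the first hyphen only; everything after it stays in crime_type2
  separate(crime_code_description, into = c("crime_type1", "crime_type2"),
           sep = "-", extra = "merge", fill = "right") %>%
  # Remove the spaces left around the hyphen
  mutate(crime_type1 = trimws(crime_type1),
         crime_type2 = trimws(crime_type2))
print(types)
```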

## Line chart of the number of crimes per crime type in different years ##

rawdata %>%
  tabyl(crime_type1, year_occurred) %>%                        # cross table of crime type by year
  gather(key = "variable", value = "value", -crime_type1) %>%  # reshape to long format
  arrange(desc(value)) %>%
  head(45) %>%                                                 # keep the 45 largest type-year counts
  ggplot(aes(x = variable, y = value, group = crime_type1)) +
  geom_line(aes(linetype = crime_type1, color = crime_type1)) +
  geom_point(aes(shape = crime_type1, color = crime_type1)) +
  ylab("Count") + xlab("YEAR") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Number of Crimes in Different Years")

Crime based on year(s)

## linechart for crime in different years ##

table(rawdata$year_occurred) %>% 
  as.data.frame() %>% 
  arrange(Var1) %>% 
  ggplot() +
  geom_line(mapping = aes(x = Var1, y = Freq, group = 1), 
            linetype = 1, color = "#006699") +
  labs(title = "Number of Crime in Different Years",
       x = "Year", y = "Count") +
  theme(axis.text.x = element_text(angle = 0)) +
  geom_point(aes(x = Var1, y = Freq), size = 4, color = "#FF9933")

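The results above show that the number of crimes in LA was increasing until 2016. Note that the 2017 count covers only 01 January to 28 October, so it is not directly comparable with the full years.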
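
Because the data ends on 28 October 2017, the last point of the chart covers only part of a year. The sketch below computes how much of 2017 is covered and shows one simple way to scale a partial-year count to a full year. The value of `observed_2017` is a hypothetical placeholder, not a figure from the dataset, and the scaling assumes that crimes are spread evenly over the year.

```r
# Period of 2017 covered by the data (from the data description)
first_day <- as.Date("2017-01-01")
last_day  <- as.Date("2017-10-28")

# Number of days covered, counting both end points
days_covered <- as.numeric(last_day - first_day) + 1
coverage <- days_covered / 365
print(days_covered)
print(round(coverage, 3))

# Hypothetical partial-year count, used only to illustrate the scaling
observed_2017 <- 180000
annualised_2017 <- observed_2017 / coverage
print(round(annualised_2017))
```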

Conclusion

From our findings, it can be concluded that the crime rate in LA was increasing until 2016. We can also determine the safest areas to visit or live in LA. On the other hand, we can determine the most dangerous areas in LA, which should be avoided and might need extra measures from crime prevention agencies. We also determined which types of crime were at the top of the list.

The results also show that the crime rate decreased in 77th Street and South West from 2010 to 2013 and increased from 2013 to 2016. However, for the Pacific and Southwest areas, there is no big variation in the number of crimes by year.

Further studies can be conducted to examine whether demographics and other factors, such as age and sex, contribute to the crime rate.

DATA PREPROCESS AND VISUALIZATION
Crime Analysis - Los Angeles City

Sanjiv Kukadia

2018-11-23
