For my data wrangling project, I am trying to analyse the crime patterns for the city of Cincinnati. While the city has come a long way from 2001 riots, I wanted to check whether trend of falling crime rates has been sustained over the years and answer the question of whether or not Cincinnati’s neighborhoods are really getting safer.
I am using data from the last 5 years to understand how crime rates have changed by time and location and hopefully show drop in reported crimes over the entire city.The conclusion of my analysis should answer following specific questions.
This data can be leveraged by city officials and police department to roll out targeted interventions in the crime prone neighborhoods. This analysis would also help people take precautions while travelling through crime prone areas.
For my analysis , I would be using dataset provided by Police Data Initiative for which Cincinnati Police Department is a member.
During the project, I would be using following packages.
library(dplyr) ## For Data manipulation
library(lubridate) ## Used for date manipulation
library(readr) ## To read the csv input file
library(tidyr) ## For Data Cleaning
library(DT) ## To render R data objects as tables in HTML page.
The following section would explain the required steps to in preparing data for analysis.
We would be using read_csv function to import the dataset into R.
# Read the file and view data
dataset <- read.csv("https://www.dropbox.com/s/cpya6okvbopi81y/city_of_cincinnati_police_data_initiative_crime_incidents.csv?dl=1")
View(head(dataset))
While head() outputs the result to console, the View() function displays the output in a new window.This is useful when dataset has a large number of columns.
Since our analysis relies on time , we focus DATE_FROM column. We convert this column from text to date datatype.Further, I created two separate fields, one to hold the full date with time and the other to hold just the time for any further analysis on time column.
#Establish time of occurence
event_datetime_occurence <- substr(dataset$DATE_FROM, 1, 22)
event_datetime_occurence <- (strptime(event_datetime_occurence, '%m/%d/%Y %I:%M:%S %p'))
event_time_occurence <- substr(dataset$DATE_FROM, 12, 22)
event_time_occurence <- as.difftime(event_time_occurence, '%I:%M:%S %p', units = "hours")
#round to nearest hour for easier analysis
event_time_occurence <- round(event_time_occurence, 1)
#Bind to Dataset
dataset <- cbind(dataset, event_datetime_occurence, event_time_occurence)
#Convert to a table
dataset <- tbl_df(dataset)
As the analysis is restricted last 5 years , use filter to select the required data only.
#Filter for years 2014 - 2018
dataset_filtered <- filter(dataset, event_datetime_occurence >= "2014-1-1" & event_datetime_occurence < "2019-1-1")
The original dataset has over 40 columns.We would need only the subset of those for our analysis.
# select only the required columns
dataset_final <- select(dataset_filtered, INSTANCEID,
event_datetime_occurence,
event_time_occurence,
OFFENSE,
DAYOFWEEK,
CPD_NEIGHBORHOOD,
WEAPONS,
LONGITUDE_X,LATITUDE_X,VICTIM_AGE,ZIP)
I would like to rename a couple of columns to make it more readable.
colnames(dataset_final)[colnames(dataset_final)=="LATITUDE_X"] <- "LATITUDE"
colnames(dataset_final)[colnames(dataset_final)=="LONGITUDE_X"] <- "LONGITUDE"
Next check and remove NA from the dataset.
dataset_final <- na.omit(dataset_final)
dataset_final <- with(dataset_final, dataset_final[!(WEAPONS == "" | is.na(WEAPONS)), ])
dataset_final <- with(dataset_final, dataset_final[!(OFFENSE == "" | is.na(OFFENSE)), ])
dataset_final <- with(dataset_final, dataset_final[!(VICTIM_AGE == "" | is.na(VICTIM_AGE)), ])
dataset_final <- with(dataset_final, dataset_final[!(INSTANCEID == "" | is.na(INSTANCEID)), ])
dataset_final <- with(dataset_final, dataset_final[!(DAYOFWEEK == "" | is.na(DAYOFWEEK)), ])
dataset_final <- with(dataset_final, dataset_final[!(CPD_NEIGHBORHOOD == "" | is.na(CPD_NEIGHBORHOOD)), ])
dataset_final <- with(dataset_final, dataset_final[!(ZIP == "" | is.na(ZIP)), ])
dataset_final <- with(dataset_final, dataset_final[!(WEAPONS == "" | is.na(WEAPONS)), ])
dataset_final <- with(dataset_final, dataset_final[!(OFFENSE == "" | is.na(OFFENSE)), ])
dataset_final <- with(dataset_final, dataset_final[!(VICTIM_AGE == "" | is.na(VICTIM_AGE)), ])
dataset_final <- with(dataset_final, dataset_final[!(INSTANCEID == "" | is.na(INSTANCEID)), ])
dataset_final <- with(dataset_final, dataset_final[!(DAYOFWEEK == "" | is.na(DAYOFWEEK)), ])
dataset_final <- with(dataset_final, dataset_final[!(CPD_NEIGHBORHOOD == "" | is.na(CPD_NEIGHBORHOOD)), ])
dataset_final <- with(dataset_final, dataset_final[!(ZIP == "" | is.na(ZIP)), ])
Finally check the dimensions of the dataset
dim(dataset_final)
## [1] 161616 11
Now we take a look a structure of final dataset
#looking at structure of final dataset
str(dataset_final)
## Classes 'tbl_df', 'tbl' and 'data.frame': 161616 obs. of 11 variables:
## $ INSTANCEID : Factor w/ 280111 levels "00002938-D5AD-4F41-890B-1D10A6611A6F",..: 235246 202681 64816 277019 71716 175854 204673 8016 112024 2230 ...
## $ event_datetime_occurence: POSIXct, format: "2015-03-16 15:02:00" "2018-03-13 16:45:00" ...
## $ event_time_occurence : 'difftime' num 15 16.8 0 1.2 ...
## ..- attr(*, "units")= chr "hours"
## $ OFFENSE : Factor w/ 202 levels "","ABDUCTION",..: 15 24 69 24 175 31 24 173 175 175 ...
## $ DAYOFWEEK : Factor w/ 8 levels "","FRIDAY","MONDAY",..: 3 7 2 6 5 7 5 5 6 5 ...
## $ CPD_NEIGHBORHOOD : Factor w/ 54 levels "","AVONDALE",..: 50 51 18 38 29 11 4 39 22 38 ...
## $ WEAPONS : Factor w/ 70 levels "","11--FIREARM (TYPE NOT STATED)",..: 51 36 69 36 51 51 36 51 51 51 ...
## $ LONGITUDE : num -84.5 -84.5 -84.5 -84.5 -84.5 ...
## $ LATITUDE : num 39.1 39.1 39.1 39.1 39.1 ...
## $ VICTIM_AGE : Factor w/ 13 levels "","00","18-25",..: 4 6 12 6 3 6 6 12 8 7 ...
## $ ZIP : num 45206 45214 45207 45202 45219 ...
## - attr(*, "na.action")= 'omit' Named int 89 93 178 248 294 340 384 396 405 435 ...
## ..- attr(*, "names")= chr "89" "93" "178" "248" ...
We see most of columns are factors apart from date and time columns. The explanation of each of those column is given below.
We would be using visual analysis by plotting crimes over time for each neighborhood.This should help us confirm or reject our assumption.This would also help us identify the crime prone neighborhood.We can also use MapR function to plot the same.Also we should be able to check trend of most common crimes across neighborhoods.We can also check to see if there is any pattern when plotting crimes against day of the week and time.
While we could use linear regression to predict future crimes, i feel that 5 years would not constitute big enough time-frame.