Cincinnati Crime Analysis

How safe is Cincinnati

For my data wrangling project, I am trying to analyse the crime patterns for the city of Cincinnati. While the city has come a long way from 2001 riots, I wanted to check whether trend of falling crime rates has been sustained over the years and answer the question of whether or not Cincinnati’s neighborhoods are really getting safer.

I am using data from the last 5 years to understand how crime rates have changed by time and location and hopefully show drop in reported crimes over the entire city.The conclusion of my analysis should answer following specific questions.

Which neighborhoods are getting safer?
- Neighborhoods show drop in number of crime occurrences.
- Drop in violent crimes
What times are most dangerous?
- Which days of week is most crime prone?
- Which hours of day is most crime prone?
Which areas are unsafe for minor(Age less than 18)?

This data can be leveraged by city officials and police department to roll out targeted interventions in the crime prone neighborhoods. This analysis would also help people take precautions while travelling through crime prone areas.

DataSet

For my analysis , I would be using dataset provided by Police Data Initiative for which Cincinnati Police Department is a member.

Pacakges required

During the project, I would be using following packages.

library(dplyr)      ## For Data manipulation
library(lubridate)  ## Used for date manipulation
library(readr)      ## To read the csv input file
library(tidyr)      ## For Data Cleaning
library(DT)         ## To render R data objects as tables in HTML page.

Data Preparation

The following section would explain the required steps to in preparing data for analysis.

Importing Data

We would be using read_csv function to import the dataset into R.

# Read the file and view data
dataset <- read.csv("https://www.dropbox.com/s/cpya6okvbopi81y/city_of_cincinnati_police_data_initiative_crime_incidents.csv?dl=1")
View(head(dataset))

While head() outputs the result to console, the View() function displays the output in a new window.This is useful when dataset has a large number of columns.

Data Cleaning

Since our analysis relies on time , we focus DATE_FROM column. We convert this column from text to date datatype.Further, I created two separate fields, one to hold the full date with time and the other to hold just the time for any further analysis on time column.

#Establish time of occurence
event_datetime_occurence <- substr(dataset$DATE_FROM, 1, 22)
event_datetime_occurence <- (strptime(event_datetime_occurence, '%m/%d/%Y %I:%M:%S %p'))

event_time_occurence <- substr(dataset$DATE_FROM, 12, 22)
event_time_occurence <- as.difftime(event_time_occurence, '%I:%M:%S %p', units = "hours")

#round to nearest hour for easier analysis
event_time_occurence <- round(event_time_occurence, 1)

#Bind to Dataset
dataset <- cbind(dataset, event_datetime_occurence, event_time_occurence)

#Convert to a table
dataset <- tbl_df(dataset)

As the analysis is restricted last 5 years , use filter to select the required data only.

#Filter for years 2014 - 2018
dataset_filtered <- filter(dataset, event_datetime_occurence >= "2014-1-1" & event_datetime_occurence < "2019-1-1")

The original dataset has over 40 columns.We would need only the subset of those for our analysis.

# select only the required columns
dataset_final <- select(dataset_filtered, INSTANCEID,
                        event_datetime_occurence,
                        event_time_occurence,
                        OFFENSE,
                        DAYOFWEEK,
                        CPD_NEIGHBORHOOD,
                        WEAPONS,
                        LONGITUDE_X,LATITUDE_X,VICTIM_AGE,ZIP)

I would like to rename a couple of columns to make it more readable.

colnames(dataset_final)[colnames(dataset_final)=="LATITUDE_X"] <- "LATITUDE"
colnames(dataset_final)[colnames(dataset_final)=="LONGITUDE_X"] <- "LONGITUDE"

Next check and remove NA from the dataset.

dataset_final <- na.omit(dataset_final)

dataset_final <- with(dataset_final, dataset_final[!(WEAPONS == "" | is.na(WEAPONS)), ])
dataset_final <- with(dataset_final, dataset_final[!(OFFENSE == "" | is.na(OFFENSE)), ])
dataset_final <- with(dataset_final, dataset_final[!(VICTIM_AGE == "" | is.na(VICTIM_AGE)), ])
dataset_final <- with(dataset_final, dataset_final[!(INSTANCEID == "" | is.na(INSTANCEID)), ])
dataset_final <- with(dataset_final, dataset_final[!(DAYOFWEEK == "" | is.na(DAYOFWEEK)), ])
dataset_final <- with(dataset_final, dataset_final[!(CPD_NEIGHBORHOOD == "" | is.na(CPD_NEIGHBORHOOD)), ])
dataset_final <- with(dataset_final, dataset_final[!(ZIP == "" | is.na(ZIP)), ])
dataset_final <- with(dataset_final, dataset_final[!(WEAPONS == "" | is.na(WEAPONS)), ])
dataset_final <- with(dataset_final, dataset_final[!(OFFENSE == "" | is.na(OFFENSE)), ])
dataset_final <- with(dataset_final, dataset_final[!(VICTIM_AGE == "" | is.na(VICTIM_AGE)), ])
dataset_final <- with(dataset_final, dataset_final[!(INSTANCEID == "" | is.na(INSTANCEID)), ])
dataset_final <- with(dataset_final, dataset_final[!(DAYOFWEEK == "" | is.na(DAYOFWEEK)), ])
dataset_final <- with(dataset_final, dataset_final[!(CPD_NEIGHBORHOOD == "" | is.na(CPD_NEIGHBORHOOD)), ])
dataset_final <- with(dataset_final, dataset_final[!(ZIP == "" | is.na(ZIP)), ])

Finally check the dimensions of the dataset

dim(dataset_final)

## [1] 161616     11

Data dictionary

Now we take a look a structure of final dataset

#looking at structure of final dataset
str(dataset_final)

## Classes 'tbl_df', 'tbl' and 'data.frame':    161616 obs. of  11 variables:
##  $ INSTANCEID              : Factor w/ 280111 levels "00002938-D5AD-4F41-890B-1D10A6611A6F",..: 235246 202681 64816 277019 71716 175854 204673 8016 112024 2230 ...
##  $ event_datetime_occurence: POSIXct, format: "2015-03-16 15:02:00" "2018-03-13 16:45:00" ...
##  $ event_time_occurence    : 'difftime' num  15 16.8 0 1.2 ...
##   ..- attr(*, "units")= chr "hours"
##  $ OFFENSE                 : Factor w/ 202 levels "","ABDUCTION",..: 15 24 69 24 175 31 24 173 175 175 ...
##  $ DAYOFWEEK               : Factor w/ 8 levels "","FRIDAY","MONDAY",..: 3 7 2 6 5 7 5 5 6 5 ...
##  $ CPD_NEIGHBORHOOD        : Factor w/ 54 levels "","AVONDALE",..: 50 51 18 38 29 11 4 39 22 38 ...
##  $ WEAPONS                 : Factor w/ 70 levels "","11--FIREARM (TYPE NOT STATED)",..: 51 36 69 36 51 51 36 51 51 51 ...
##  $ LONGITUDE               : num  -84.5 -84.5 -84.5 -84.5 -84.5 ...
##  $ LATITUDE                : num  39.1 39.1 39.1 39.1 39.1 ...
##  $ VICTIM_AGE              : Factor w/ 13 levels "","00","18-25",..: 4 6 12 6 3 6 6 12 8 7 ...
##  $ ZIP                     : num  45206 45214 45207 45202 45219 ...
##  - attr(*, "na.action")= 'omit' Named int  89 93 178 248 294 340 384 396 405 435 ...
##   ..- attr(*, "names")= chr  "89" "93" "178" "248" ...

We see most of columns are factors apart from date and time columns. The explanation of each of those column is given below.

Instance ID - This is used to uniquely identify a particular crime in the dataset.
event_datetime_occurrence - The date and time the crime was reported.
event_time_occurrence - The time of day the crime was reported.
Offense - The type of offence that the occurred.
Dayofweek - Day of the week the crime occurred
CPD_Neighborhood - Identifies the neighborhood in which the crime occurred.
Weapons - Were any weapons were used? If so what weapons were they?
Latitude - Latitude the crime location
Longitude - Longitude the crime location
Victim Age - Age of the victim affected by crime

Proposed Exploratory Data Analysis

We would be using visual analysis by plotting crimes over time for each neighborhood.This should help us confirm or reject our assumption.This would also help us identify the crime prone neighborhood.We can also use MapR function to plot the same.Also we should be able to check trend of most common crimes across neighborhoods.We can also check to see if there is any pattern when plotting crimes against day of the week and time.

While we could use linear regression to predict future crimes, i feel that 5 years would not constitute big enough time-frame.