Cincinnati Crime Data

Header Image

Introduction

This dataset includes proactive and reactive police incident data for the city of Cincinnati, OH. The Computer-Aided Dispatch (CAD) system is used to capture calls for service and facilitate the incident response.

The problem statement I am trying to address from this dataset is:

“What are the most common crimes committed against women in Cincinnati”

Addressing the problem:

The variable “Offense” in the datset tells us what offense was committed. The variable VICTIM_GENDER tells us the gender of the victim was. The CPD_NEIGHBORHOOD variable tells us where the location of the crime was. These variables coupled with a few more can give us some insights into how safe the city of Cincinnati is for women and help us flag unsafe neighborhoods as safe.

Proposed Approach:

To find what percentage of the total crime incidents has women victims.
What are the most commonly committed crimes against women?
To find the most usafe neighborhoods for women in Cincinnati

Goal:

To be able to identify the most commonly committed crimes against women in Cincinnati

Packages Required:

The following packages will be required:

tidyr - This package helps us clean and reorganize data for different kinds of analysis. knitr - This package helps with dynamic report generation. I’m using it to display values from my dataframes.

Rcode:

library(tidyr)
library(knitr)

Data Preparation

This dataset was obtained from the Cincinnati Open Data Portal. It can be directly downloaded from here

Original purpose of the dataset

The indicents of reported crimes in the city of Cincinnati were originally collected and collated by an agency for management. Incidents are typically housed in a Records Management System (RMS) that stores agency-wide data about law enforcement operations. This data is open everyone for analysis. The data is updated everyday and is always current.

Steps used for data cleaning

First, we read the .csv file into a dataframe in R and proceed to check the dimensions of the dataset.

setwd("C:/Users/Tanvi/Documents")
crimedata <- read.csv("Crime.csv")
dim(crimedata)

## [1] 357880     40

View(crimedata)

We can see that there are 4 columns representing 4 dates and times they are DATE_REPORTED, DATE_TO, DATE_FROM and DATE_OF_CLEARANCE These columns have both the date as well as the time. For our business problem, granularity of upto days is good enough. We will split the data and store them in different variables as follows:

time_reported <- substr(crimedata$DATE_REPORTED, 12, 22)
date_reported <- substr(crimedata$DATE_REPORTED, 1, 10)


time_from <- substr(crimedata$DATE_FROM, 12, 22)
date_from <- substr(crimedata$DATE_FROM, 1, 10)

time_to <- substr(crimedata$DATE_TO, 12, 22)
date_to <- substr(crimedata$DATE_TO, 1, 10)

time_of_clearance <- substr(crimedata$DATE_OF_CLEARANCE, 12, 22)
date_of_clearance <- substr(crimedata$DATE_OF_CLEARANCE, 1, 10)

Many of the columns in the dataset can be taken out right away as they won’t help us with our goal. We will remove the indentifiably unnecessary columns right away.

crime_clean <- crimedata[, -c(3:6, 13:15, 22:24, 39)]
dim(crime_clean)

## [1] 357880     29

Columns Removed:

DATE_REPORTED, DATE_FROM, DATE_TO = We have already added dates with a higher granularity to the datset which is why we can remove these.
Floor, Side, Opening = These columns refer to the venue where the crime occured and is irrelevant to our business case.
Hour_From, Hour_To = These columns refer to the duration of the incident.
CLSD = When the case was closed
Day of Clearance = The date and time when the incident was cleared.
Community Counsil Neighborhood = The place where the neighborhood community counsil is located

Now that the columns we don’t need have been dropped, we can add the columns we created by splitting the date columns

crime_clean <- cbind(crime_clean, date_reported, date_from, date_to)

Let us now check the dimensions of the resultant dataframe

dim(crime_clean)

## [1] 357880     32

The dataframe still has a column called THEFT_CODE which contains both the code of the offense committed as well as its description in the same columns. We would want to separate these into 2 different columns for ease of querying. We can separate them using the separate() function in tidyr.

crime_clean <- separate(crime_clean, THEFT_CODE, into = c("THEFT_CODE", "CODE_DESC"), sep = "-")

The dataset has lots of missing values, however omitting all the rows containing even a single null value will result in us having very few rows of data to work with. It might not be a very useful subset. We therefore want to take out values which are either null or UNKNOWN because these might affect our result numbers. I am also displaying the dimensions of the dataframe with each reduction so that we can see how many rows and columns we are ending up with each time.

crime_clean <- with(crime_clean, crime_clean[!(VICTIM_GENDER == "" | is.na(VICTIM_GENDER) | VICTIM_GENDER == "UNKNOWN"), ])
dim(crime_clean)

## [1] 302926     33

crime_clean <- with(crime_clean, crime_clean[!(ZIP == "" | is.na(ZIP)), ])
dim(crime_clean)

## [1] 302833     33

crime_clean <- with(crime_clean, crime_clean[!(OFFENSE == "" | is.na(OFFENSE)), ])
dim(crime_clean)

## [1] 302783     33

crime_clean <- with(crime_clean, crime_clean[!(ADDRESS_X == "" | is.na(ADDRESS_X)), ])
dim(crime_clean)

## [1] 299697     33

crime_clean <- with(crime_clean, crime_clean[!(LOCATION == "" | is.na(LOCATION)), ])
dim(crime_clean)

## [1] 299688     33

Let us see what our cleaned dataset looks like

kable(crime_clean[1:10,], caption = "Glimpse of cleaned data")

Glimpse of cleaned data
	INSTANCEID	INCIDENT_NO	UCR	DST	BEAT	OFFENSE	LOCATION	THEFT_CODE	CODE_DESC	HATE_BIAS	DAYOFWEEK	RPT_AREA	CPD_NEIGHBORHOOD	SNA_NEIGHBORHOOD	WEAPONS	ADDRESS_X	LONGITUDE_X	LATITUDE_X	VICTIM_AGE	VICTIM_RACE	VICTIM_ETHNICITY	VICTIM_GENDER	SUSPECT_AGE	SUSPECT_RACE	SUSPECT_ETHNICITY	SUSPECT_GENDER	TOTALNUMBERVICTIMS	TOTALSUSPECTS	UCR_GROUP	ZIP	date_reported	date_from	date_to
1	CBF5B090-FBF2-480A-9F4A-FCD1342FB75F	179002225	401	3	2	FELONIOUS ASSAULT	47-STREET		NA	N–NO BIAS/NOT APPLICABLE	SUNDAY	223	EAST PRICE HILL	EAST PRICE HILL	12 - HANDGUN	9XX HAWTHORNE AV	-84.56178	39.10665	18-25	WHITE	NOT OF HISPANIC ORIG	MALE	UNKNOWN	UNKNOWN	UNKNOWN	UNKNOWN	1	3	AGGRAVATED ASSAULTS	45205	01/23/2017	01/22/2017	01/22/2017
3	6C10F3A8-AA93-47B2-8AB9-F7E317197351	21005801	551	2	2	BURGLARY	02-MULTI FAMILY APARTMENT		NA	N–NO BIAS/NOT APPLICABLE	WEDNESDAY	128	OAKLEY	OAKLEY	99–NONE	31XX MADISON RD	-84.42805	39.15390	51-60	WHITE	UNKNOWN	MALE	UNKNOWN				1	NA	BURGLARY/BREAKING ENTERING	45208	12/22/2010	12/22/2010	12/22/2010
4	2A65A08F-8C74-403B-8636-6824B50B002D	129016167	600	2	1	THEFT	01-SINGLE FAMILY HOME	23H	ALL OTHER LARCENY	N–NO BIAS/NOT APPLICABLE	SATURDAY	68	EVANSTON	EVANSTON	99–NONE	32XX HACKBERRY ST	-84.47504	39.13586	51-60	BLACK	NOT OF HISPANIC ORIG	FEMALE	51-60	BLACK	NOT OF HISPANIC ORIG	MALE	1	1	THEFT	45207	06/12/2012	05/12/2012	05/12/2012
5	F7D48934-5C4A-4339-98AA-A98DB5DAB9FC	189020414	600	3	3	THEFT	47-STREET	24O	MOTOR VEHICLE THEFT	N–NO BIAS/NOT APPLICABLE	SATURDAY	311	EAST WESTWOOD	N/A	99 - NONE	23XX MCHENRY AV	NA	NA	61-70	BLACK	NOT OF HISPANIC ORIG	FEMALE	UNDER 18	BLACK	NOT OF HISPANIC ORIG	FEMALE	1	3	THEFT	45211	07/28/2018	07/28/2018	07/28/2018
8	23D84A29-D13F-44FC-AC87-4B30257759B2	179014947	551	3	1	BURGLARY	02-MULTI FAMILY APARTMENT		NA	N–NO BIAS/NOT APPLICABLE	THURSDAY	266	SAYLER PARK	SAYLER PARK	99 - NONE	1XX RICHARDSON PL	-84.68554	39.10888	51-60	WHITE	NOT OF HISPANIC ORIG	FEMALE	UNKNOWN				1	NA	BURGLARY/BREAKING ENTERING	45233	04/27/2017	04/20/2017	04/27/2017
9	8D6B4610-F302-48C5-A5E5-532CC00E368C	41106838	600	4	1	THEFT	47-STREET	23F	THEFT FROM MOTOR VEHICLE	N–NO BIAS/NOT APPLICABLE	FRIDAY	363	CORRYVILLE	N/A	99–NONE	2XX E KING DR	NA	NA	51-60	BLACK	NOT OF HISPANIC ORIG	MALE	UNKNOWN				1	NA	THEFT	45219	10/22/2011	10/21/2011	10/22/2011
10	C5CB70FD-7249-4A76-A9F1-2569FE11E32F	179002343	802	1	2	AGGRAVATED MENACING	47-STREET		NA	N–NO BIAS/NOT APPLICABLE	TUESDAY	23	PENDLETON	N/A	99 - NONE	3XX E LIBERTY ST	NA	NA	31-40	BLACK	NOT OF HISPANIC ORIG	MALE	26-30	WHITE	NOT OF HISPANIC ORIG	FEMALE	1	1	PART 2 MINOR	45202	01/24/2017	01/24/2017	01/24/2017
12	709D8B61-2ADB-4AB0-9F9A-E30CD5002FA0	129037156	303	4	3	AGGRAVATED ROBBERY	02-MULTI FAMILY		NA	N–NO BIAS/NOT APPLICABLE	SATURDAY	366	AVONDALE	AVONDALE	11 - FIREARM (TYPE NOT STATED)	5XX HALE AV	-84.49682	39.14022	41-50	BLACK	NOT OF HISPANIC ORIG	FEMALE	UNKNOWN	BLACK	NOT OF HISPANIC ORIG	MALE	1	3	ROBBERY	45229	12/22/2012	12/22/2012	12/22/2012
13	DA964133-D9A6-4BE4-8710-46856B2F3576	31105587	600	3	3	THEFT	37-GROCERY/SUPERMARKET	23F	THEFT FROM MOTOR VEHICLE	N–NO BIAS/NOT APPLICABLE	WEDNESDAY	271	WESTWOOD	WESTWOOD	99–NONE	23XX FERGUSON RD	-84.59760	39.12662	51-60	WHITE	NOT OF HISPANIC ORIG	MALE	UNKNOWN				1	NA	THEFT	45238	06/22/2011	06/22/2011	06/22/2011
14	30B5114C-2861-48D2-B049-5A54AEAE810A	149C000042	600	4	1	THEFT	25-OTHER COMMERCIAL SERVICE LOCATION	23F	THEFT FROM MOTOR VEHICLE	N–NO BIAS/NOT APPLICABLE		367	CORRYVILLE	N/A	99 - NONE	2XX ERKENBRECHER AV	NA	NA	18-25	WHITE	NOT OF HISPANIC ORIG	MALE	UNKNOWN				1	NA	THEFT	45229	10/06/2014	10/04/2014	10/04/2014

Summary of variables of concern and what they mean

VICTIM_GENDER - The gender of the victim for every incident recorded.
date_reported - The date the incident was reported
CPD_NEIGHBORHOOD - The neighborhood where the incident took place
OFFENSE - The category under which the offense falls
INSTANCEID - A unique identifier for each incident
THEFT_CODE - The code for the offense committed
CODE_DESC - The description of the corresponding THEFT_CODE

Exploratory Data Analysis

Exploratory data analysis usually consists of closely examining variables to understand what they represent and find their distributions. Distributions can be observed by plotting histograms.

I will be using additional libraries such as gglplot to uncover trands over time to visualize information and dplyr for data manipulation to subset and generate interpretable views of the data.

Let us take a look at some of the statstics involved:

We will be using the column “VICTIM_GENDER” to calculate the percentage of crimes against women in Cincinnati.

total_crimes <- nrow(crime_clean[crime_clean$VICTIM_GENDER])
total_crimes

## [1] 299688

against_women <- nrow(crime_clean[crime_clean$VICTIM_GENDER == "FEMALE" | crime_clean$VICTIM_GENDER == "F - FEMALE",])
against_women

## [1] 163172

perc_women <- (against_women/total_crimes)*100
perc_women

## [1] 54.44729

against_business <- nrow(crime_clean[crime_clean$VICTIM_GENDER == "NON-PERSON (BUSINESS",])
perc_business <- (against_business/total_crimes)*100
perc_business

## [1] 0.01768506

against_men <- nrow(crime_clean[crime_clean$VICTIM_GENDER == "MALE" | crime_clean$VICTIM_GENDER == "M - MALE",])
perc_men <- (against_men/total_crimes)*100
perc_men

## [1] 45.53502

pie_frame <- data.frame("Gender" = c("Female","Male","Non-Person"),
             "Percentage" = c(perc_women,perc_men,perc_business))

kable(pie_frame, caption = "Percentage of crimes")

Percentage of crimes
Gender	Percentage
Female	54.4472919
Male	45.5350231
Non-Person	0.0176851

In the code above, I have found out the percentage of crimes against men (perc_men), percentage of crimes against women (perc_women) and the percentage of crimes against businesses (perc_business)

library(ggplot2)
Gender <- c(perc_women, perc_men, perc_business)
lbls <- c("Women","Men","Non-Person")
pie(Gender, labels = c(round(perc_women, digits = 3), round(perc_men, digits = 3), round(perc_business, digits = 3)),col = c("pink", "blue", "white"), main = "Percentage of Crime by Gender")
legend("topright", c("Women","Men","Non-Person"), cex = 0.8, fill = c("pink", "blue", "white"))

The pie chart above shows that crimes against women are in fact a whopping 9% higher than crimes against men. Crimes against non-persons/business are only ~0.02%.

So now that we have established you’d be safer as a man in Cincinnati than as a woman, let us actually see what are the most common crimes against women.

library(dplyr)
crime_2 <- crime_clean[,-31:-33]
crime_3 <- crime_2[which(crime_2$VICTIM_GENDER == 'FEMALE'),] %>%
  group_by(OFFENSE) %>%
  summarize(SumOfIncidents = n()) %>%
  arrange(desc(SumOfIncidents)) %>%
  head(10)
  ggplot(data = crime_3, aes(x = OFFENSE, y = SumOfIncidents, fill = OFFENSE)) +
  geom_bar(stat = "identity") + labs(x = "Offense", y = "Number of Incidents", title = "Top 10 crimes against women" ) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

The plot above tells us that “Theft” is the most commonly commited crime against women in Cincinnati. This is then followed by Assault, Criminal Damaging/Endangering, Burglary and Domestic violence.

What are the most unsafe neighbourhoods in Cincinnati?

Let us try to find out which are the most unsafe neighbourhoods in Cincinnati by plotting a bar graph that shows count of incidents against different neighbourhoods.

crime_clean %>%
  ggplot(aes(factor(CPD_NEIGHBORHOOD), fill = CPD_NEIGHBORHOOD)) + 
  geom_bar(stat = "count", position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs( y = "Number of Occurences",
        x = "Neighborhoods",
        title = "Most Unsafe neighbourhoods in Cincinnati") +
  guides(fill = guide_legend(title = "Neighborhoods"))

From the graph above, we can decude that the following are the top 10 most unsafe neighborhoods:

1- Westwood
2- West Price Hill
3- Walnut Hills
4- Over-The-Rhine
5- Northside
6- Mount Auburn
7- East Price Hill
8- College Hill
9- C.B.D/ Riverfront
10- Avondale

Let us now observe the trend over time in the number of crimes in these neighborhoods in general.

crime_clean$occurence_year <- substr(crime_clean$date_reported,7,10)
crime_clean %>%
  filter(crime_clean$CPD_NEIGHBORHOOD == "WESTWOOD" |
           crime_clean$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
           crime_clean$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
           crime_clean$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
           crime_clean$CPD_NEIGHBORHOOD == "C. B. D. / RIVERFRONT" |
           crime_clean$CPD_NEIGHBORHOOD == "AVONDALE" |
           crime_clean$CPD_NEIGHBORHOOD == "WALNUT HILLS" |
           crime_clean$CPD_NEIGHBORHOOD == "COLLEGE  HILL" |
           crime_clean$CPD_NEIGHBORHOOD == "MOUNT  AUBURN" |
           crime_clean$CPD_NEIGHBORHOOD == "NORTHSIDE") %>% 
  group_by(CPD_NEIGHBORHOOD, occurence_year) %>%
  tally() %>% 
  ggplot(aes(x = occurence_year, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) + 
  geom_line() +
  labs(y = "Number of Incidents",
       x = "Year",
       title = "Trend of Crimes in unsafe neighbourhoods over time",
       color = "Neighborhoods")

Now let us observe the trend over time in the same neighborhoods but only against women and then compare the two.

crime_female <- crime_clean[which(crime_clean$VICTIM_GENDER == 'FEMALE'),] 

crime_clean$occurence_year <- substr(crime_clean$date_reported,7,10)

crime_female %>%
  filter(crime_female$CPD_NEIGHBORHOOD == "WESTWOOD" |
           crime_female$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
           crime_female$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
           crime_female$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
           crime_female$CPD_NEIGHBORHOOD == "C. B. D. / RIVERFRONT" |
           crime_female$CPD_NEIGHBORHOOD == "AVONDALE" |
           crime_female$CPD_NEIGHBORHOOD == "WALNUT HILLS" |
           crime_female$CPD_NEIGHBORHOOD == "COLLEGE  HILL" |
           crime_female$CPD_NEIGHBORHOOD == "MOUNT  AUBURN" |
           crime_female$CPD_NEIGHBORHOOD == "NORTHSIDE" ) %>% 
  group_by(CPD_NEIGHBORHOOD, occurence_year) %>%
tally() %>% 
ggplot(aes(x = occurence_year, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) + 
  geom_line() +
  labs(y = "Number of Incidents",
       x = "Year",
       title = "Trend of Crimes in unsafe neighborhoods over time against women",
       color = "Neighborhood")

From both the trends we can see a sharp increase in the crime rate during the 2010 - 2011 period. Could an increase in time during this period be attributed to the recession that preceeded these years? Accodring to the Department of Labor, 8.7 million people in America lost their jobs by 2010 and there was a decrease in the GDP by 4.2%. It might not be unreasonable to assume that this might be a contributing factor.

Then we see a dip until 2013 and another increase in 2014. It might also be noted that crimes in 2018 were the lowest recorded in the entire decade. This tells us that our law enforcement is doing something right. Westwood tops the charts with the highest crime rate in both the graphs. There are no obviouos differences in the trends of crimes against women vs those against everyone.

Summary

In summarizing my findings, crimes against women are significantly higher than crimes against men or businesses. This must be a cause for concern to everyone. Criminals think of women as easy targets specially for crimes like theft. As women, we must be careful and learn self defense techniques so we can defend outselves against criminals.

Westwood is the most unsafe neighborhood for everyone including women followed by West Price Hill and Walnut Hills. We are living in the safest years with the lowest number of crimes around 2018.

Limitations

I wanted to flag zipcodes as safe or unsafe but I realised I would have to create a separate data set for it a zipcodes in this one are repeated. There is no place I can store only zipcodes and the flag in the same dataset.

Future Work

I could then build a model which could predict given the vicinity of a neighborhood whether an area would be safe or unsafe. This could be done using a logistic regression model. Here the response variable would be the new column that is added to the aggregated dataset - flag and the possible covariates or independent variables could be neighborhood and date (date_reported).