Header Image

Header Image

Introduction

This dataset includes proactive and reactive police incident data for the city of Cincinnati, OH. The Computer-Aided Dispatch (CAD) system is used to capture calls for service and facilitate the incident response.

The problem statement I am trying to address from this dataset is:

“What are the most common crimes committed against women in Cincinnati”

Addressing the problem:

The variable “Offense” in the datset tells us what offense was committed. The variable VICTIM_GENDER tells us the gender of the victim was. The CPD_NEIGHBORHOOD variable tells us where the location of the crime was. These variables coupled with a few more can give us some insights into how safe the city of Cincinnati is for women and help us flag unsafe neighborhoods as safe.

Proposed Approach:

  1. To find what percentage of the total crime incidents has women victims.
  2. What are the most commonly committed crimes against women?
  3. To find the most usafe neighborhoods for women in Cincinnati

Goal:

To be able to identify the most commonly committed crimes against women in Cincinnati

Packages Required:

The following packages will be required:

tidyr - This package helps us clean and reorganize data for different kinds of analysis. knitr - This package helps with dynamic report generation. I’m using it to display values from my dataframes.

Rcode:

library(tidyr)
library(knitr)

Data Preparation

This dataset was obtained from the Cincinnati Open Data Portal. It can be directly downloaded from here

Original purpose of the dataset

The indicents of reported crimes in the city of Cincinnati were originally collected and collated by an agency for management. Incidents are typically housed in a Records Management System (RMS) that stores agency-wide data about law enforcement operations. This data is open everyone for analysis. The data is updated everyday and is always current.

Steps used for data cleaning

First, we read the .csv file into a dataframe in R and proceed to check the dimensions of the dataset.

setwd("C:/Users/Tanvi/Documents")
crimedata <- read.csv("Crime.csv")
dim(crimedata)
## [1] 357880     40
View(crimedata)

We can see that there are 4 columns representing 4 dates and times they are DATE_REPORTED, DATE_TO, DATE_FROM and DATE_OF_CLEARANCE These columns have both the date as well as the time. For our business problem, granularity of upto days is good enough. We will split the data and store them in different variables as follows:

time_reported <- substr(crimedata$DATE_REPORTED, 12, 22)
date_reported <- substr(crimedata$DATE_REPORTED, 1, 10)


time_from <- substr(crimedata$DATE_FROM, 12, 22)
date_from <- substr(crimedata$DATE_FROM, 1, 10)

time_to <- substr(crimedata$DATE_TO, 12, 22)
date_to <- substr(crimedata$DATE_TO, 1, 10)

time_of_clearance <- substr(crimedata$DATE_OF_CLEARANCE, 12, 22)
date_of_clearance <- substr(crimedata$DATE_OF_CLEARANCE, 1, 10)

Many of the columns in the dataset can be taken out right away as they won’t help us with our goal. We will remove the indentifiably unnecessary columns right away.

crime_clean <- crimedata[, -c(3:6, 13:15, 22:24, 39)]
dim(crime_clean)
## [1] 357880     29

Columns Removed:

Now that the columns we don’t need have been dropped, we can add the columns we created by splitting the date columns

crime_clean <- cbind(crime_clean, date_reported, date_from, date_to)

Let us now check the dimensions of the resultant dataframe

dim(crime_clean)
## [1] 357880     32

The dataframe still has a column called THEFT_CODE which contains both the code of the offense committed as well as its description in the same columns. We would want to separate these into 2 different columns for ease of querying. We can separate them using the separate() function in tidyr.

crime_clean <- separate(crime_clean, THEFT_CODE, into = c("THEFT_CODE", "CODE_DESC"), sep = "-")

The dataset has lots of missing values, however omitting all the rows containing even a single null value will result in us having very few rows of data to work with. It might not be a very useful subset. We therefore want to take out values which are either null or UNKNOWN because these might affect our result numbers. I am also displaying the dimensions of the dataframe with each reduction so that we can see how many rows and columns we are ending up with each time.

crime_clean <- with(crime_clean, crime_clean[!(VICTIM_GENDER == "" | is.na(VICTIM_GENDER) | VICTIM_GENDER == "UNKNOWN"), ])
dim(crime_clean)
## [1] 302926     33
crime_clean <- with(crime_clean, crime_clean[!(ZIP == "" | is.na(ZIP)), ])
dim(crime_clean)
## [1] 302833     33
crime_clean <- with(crime_clean, crime_clean[!(OFFENSE == "" | is.na(OFFENSE)), ])
dim(crime_clean)
## [1] 302783     33
crime_clean <- with(crime_clean, crime_clean[!(ADDRESS_X == "" | is.na(ADDRESS_X)), ])
dim(crime_clean)
## [1] 299697     33
crime_clean <- with(crime_clean, crime_clean[!(LOCATION == "" | is.na(LOCATION)), ])
dim(crime_clean)
## [1] 299688     33

Let us see what our cleaned dataset looks like

kable(crime_clean[1:10,], caption = "Glimpse of cleaned data")
Glimpse of cleaned data
INSTANCEID INCIDENT_NO UCR DST BEAT OFFENSE LOCATION THEFT_CODE CODE_DESC HATE_BIAS DAYOFWEEK RPT_AREA CPD_NEIGHBORHOOD SNA_NEIGHBORHOOD WEAPONS ADDRESS_X LONGITUDE_X LATITUDE_X VICTIM_AGE VICTIM_RACE VICTIM_ETHNICITY VICTIM_GENDER SUSPECT_AGE SUSPECT_RACE SUSPECT_ETHNICITY SUSPECT_GENDER TOTALNUMBERVICTIMS TOTALSUSPECTS UCR_GROUP ZIP date_reported date_from date_to
1 CBF5B090-FBF2-480A-9F4A-FCD1342FB75F 179002225 401 3 2 FELONIOUS ASSAULT 47-STREET NA N–NO BIAS/NOT APPLICABLE SUNDAY 223 EAST PRICE HILL EAST PRICE HILL 12 - HANDGUN 9XX HAWTHORNE AV -84.56178 39.10665 18-25 WHITE NOT OF HISPANIC ORIG MALE UNKNOWN UNKNOWN UNKNOWN UNKNOWN 1 3 AGGRAVATED ASSAULTS 45205 01/23/2017 01/22/2017 01/22/2017
3 6C10F3A8-AA93-47B2-8AB9-F7E317197351 21005801 551 2 2 BURGLARY 02-MULTI FAMILY APARTMENT NA N–NO BIAS/NOT APPLICABLE WEDNESDAY 128 OAKLEY OAKLEY 99–NONE 31XX MADISON RD -84.42805 39.15390 51-60 WHITE UNKNOWN MALE UNKNOWN 1 NA BURGLARY/BREAKING ENTERING 45208 12/22/2010 12/22/2010 12/22/2010
4 2A65A08F-8C74-403B-8636-6824B50B002D 129016167 600 2 1 THEFT 01-SINGLE FAMILY HOME 23H ALL OTHER LARCENY N–NO BIAS/NOT APPLICABLE SATURDAY 68 EVANSTON EVANSTON 99–NONE 32XX HACKBERRY ST -84.47504 39.13586 51-60 BLACK NOT OF HISPANIC ORIG FEMALE 51-60 BLACK NOT OF HISPANIC ORIG MALE 1 1 THEFT 45207 06/12/2012 05/12/2012 05/12/2012
5 F7D48934-5C4A-4339-98AA-A98DB5DAB9FC 189020414 600 3 3 THEFT 47-STREET 24O MOTOR VEHICLE THEFT N–NO BIAS/NOT APPLICABLE SATURDAY 311 EAST WESTWOOD N/A 99 - NONE 23XX MCHENRY AV NA NA 61-70 BLACK NOT OF HISPANIC ORIG FEMALE UNDER 18 BLACK NOT OF HISPANIC ORIG FEMALE 1 3 THEFT 45211 07/28/2018 07/28/2018 07/28/2018
8 23D84A29-D13F-44FC-AC87-4B30257759B2 179014947 551 3 1 BURGLARY 02-MULTI FAMILY APARTMENT NA N–NO BIAS/NOT APPLICABLE THURSDAY 266 SAYLER PARK SAYLER PARK 99 - NONE 1XX RICHARDSON PL -84.68554 39.10888 51-60 WHITE NOT OF HISPANIC ORIG FEMALE UNKNOWN 1 NA BURGLARY/BREAKING ENTERING 45233 04/27/2017 04/20/2017 04/27/2017
9 8D6B4610-F302-48C5-A5E5-532CC00E368C 41106838 600 4 1 THEFT 47-STREET 23F THEFT FROM MOTOR VEHICLE N–NO BIAS/NOT APPLICABLE FRIDAY 363 CORRYVILLE N/A 99–NONE 2XX E KING DR NA NA 51-60 BLACK NOT OF HISPANIC ORIG MALE UNKNOWN 1 NA THEFT 45219 10/22/2011 10/21/2011 10/22/2011
10 C5CB70FD-7249-4A76-A9F1-2569FE11E32F 179002343 802 1 2 AGGRAVATED MENACING 47-STREET NA N–NO BIAS/NOT APPLICABLE TUESDAY 23 PENDLETON N/A 99 - NONE 3XX E LIBERTY ST NA NA 31-40 BLACK NOT OF HISPANIC ORIG MALE 26-30 WHITE NOT OF HISPANIC ORIG FEMALE 1 1 PART 2 MINOR 45202 01/24/2017 01/24/2017 01/24/2017
12 709D8B61-2ADB-4AB0-9F9A-E30CD5002FA0 129037156 303 4 3 AGGRAVATED ROBBERY 02-MULTI FAMILY NA N–NO BIAS/NOT APPLICABLE SATURDAY 366 AVONDALE AVONDALE 11 - FIREARM (TYPE NOT STATED) 5XX HALE AV -84.49682 39.14022 41-50 BLACK NOT OF HISPANIC ORIG FEMALE UNKNOWN BLACK NOT OF HISPANIC ORIG MALE 1 3 ROBBERY 45229 12/22/2012 12/22/2012 12/22/2012
13 DA964133-D9A6-4BE4-8710-46856B2F3576 31105587 600 3 3 THEFT 37-GROCERY/SUPERMARKET 23F THEFT FROM MOTOR VEHICLE N–NO BIAS/NOT APPLICABLE WEDNESDAY 271 WESTWOOD WESTWOOD 99–NONE 23XX FERGUSON RD -84.59760 39.12662 51-60 WHITE NOT OF HISPANIC ORIG MALE UNKNOWN 1 NA THEFT 45238 06/22/2011 06/22/2011 06/22/2011
14 30B5114C-2861-48D2-B049-5A54AEAE810A 149C000042 600 4 1 THEFT 25-OTHER COMMERCIAL SERVICE LOCATION 23F THEFT FROM MOTOR VEHICLE N–NO BIAS/NOT APPLICABLE 367 CORRYVILLE N/A 99 - NONE 2XX ERKENBRECHER AV NA NA 18-25 WHITE NOT OF HISPANIC ORIG MALE UNKNOWN 1 NA THEFT 45229 10/06/2014 10/04/2014 10/04/2014

Summary of variables of concern and what they mean

Exploratory Data Analysis

Exploratory data analysis usually consists of closely examining variables to understand what they represent and find their distributions. Distributions can be observed by plotting histograms.

I will be using additional libraries such as gglplot to uncover trands over time to visualize information and dplyr for data manipulation to subset and generate interpretable views of the data.

Let us take a look at some of the statstics involved:

We will be using the column “VICTIM_GENDER” to calculate the percentage of crimes against women in Cincinnati.

total_crimes <- nrow(crime_clean[crime_clean$VICTIM_GENDER])
total_crimes
## [1] 299688
against_women <- nrow(crime_clean[crime_clean$VICTIM_GENDER == "FEMALE" | crime_clean$VICTIM_GENDER == "F - FEMALE",])
against_women
## [1] 163172
perc_women <- (against_women/total_crimes)*100
perc_women
## [1] 54.44729
against_business <- nrow(crime_clean[crime_clean$VICTIM_GENDER == "NON-PERSON (BUSINESS",])
perc_business <- (against_business/total_crimes)*100
perc_business
## [1] 0.01768506
against_men <- nrow(crime_clean[crime_clean$VICTIM_GENDER == "MALE" | crime_clean$VICTIM_GENDER == "M - MALE",])
perc_men <- (against_men/total_crimes)*100
perc_men
## [1] 45.53502
pie_frame <- data.frame("Gender" = c("Female","Male","Non-Person"),
             "Percentage" = c(perc_women,perc_men,perc_business))

kable(pie_frame, caption = "Percentage of crimes")
Percentage of crimes
Gender Percentage
Female 54.4472919
Male 45.5350231
Non-Person 0.0176851

In the code above, I have found out the percentage of crimes against men (perc_men), percentage of crimes against women (perc_women) and the percentage of crimes against businesses (perc_business)

library(ggplot2)
Gender <- c(perc_women, perc_men, perc_business)
lbls <- c("Women","Men","Non-Person")
pie(Gender, labels = c(round(perc_women, digits = 3), round(perc_men, digits = 3), round(perc_business, digits = 3)),col = c("pink", "blue", "white"), main = "Percentage of Crime by Gender")
legend("topright", c("Women","Men","Non-Person"), cex = 0.8, fill = c("pink", "blue", "white"))

The pie chart above shows that crimes against women are in fact a whopping 9% higher than crimes against men. Crimes against non-persons/business are only ~0.02%.

So now that we have established you’d be safer as a man in Cincinnati than as a woman, let us actually see what are the most common crimes against women.

library(dplyr)
crime_2 <- crime_clean[,-31:-33]
crime_3 <- crime_2[which(crime_2$VICTIM_GENDER == 'FEMALE'),] %>%
  group_by(OFFENSE) %>%
  summarize(SumOfIncidents = n()) %>%
  arrange(desc(SumOfIncidents)) %>%
  head(10)
  ggplot(data = crime_3, aes(x = OFFENSE, y = SumOfIncidents, fill = OFFENSE)) +
  geom_bar(stat = "identity") + labs(x = "Offense", y = "Number of Incidents", title = "Top 10 crimes against women" ) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

The plot above tells us that “Theft” is the most commonly commited crime against women in Cincinnati. This is then followed by Assault, Criminal Damaging/Endangering, Burglary and Domestic violence.

What are the most unsafe neighbourhoods in Cincinnati?

Let us try to find out which are the most unsafe neighbourhoods in Cincinnati by plotting a bar graph that shows count of incidents against different neighbourhoods.

crime_clean %>%
  ggplot(aes(factor(CPD_NEIGHBORHOOD), fill = CPD_NEIGHBORHOOD)) + 
  geom_bar(stat = "count", position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs( y = "Number of Occurences",
        x = "Neighborhoods",
        title = "Most Unsafe neighbourhoods in Cincinnati") +
  guides(fill = guide_legend(title = "Neighborhoods"))

From the graph above, we can decude that the following are the top 10 most unsafe neighborhoods:

Let us now observe the trend over time in the number of crimes in these neighborhoods in general.

crime_clean$occurence_year <- substr(crime_clean$date_reported,7,10)
crime_clean %>%
  filter(crime_clean$CPD_NEIGHBORHOOD == "WESTWOOD" |
           crime_clean$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
           crime_clean$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
           crime_clean$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
           crime_clean$CPD_NEIGHBORHOOD == "C. B. D. / RIVERFRONT" |
           crime_clean$CPD_NEIGHBORHOOD == "AVONDALE" |
           crime_clean$CPD_NEIGHBORHOOD == "WALNUT HILLS" |
           crime_clean$CPD_NEIGHBORHOOD == "COLLEGE  HILL" |
           crime_clean$CPD_NEIGHBORHOOD == "MOUNT  AUBURN" |
           crime_clean$CPD_NEIGHBORHOOD == "NORTHSIDE") %>% 
  group_by(CPD_NEIGHBORHOOD, occurence_year) %>%
  tally() %>% 
  ggplot(aes(x = occurence_year, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) + 
  geom_line() +
  labs(y = "Number of Incidents",
       x = "Year",
       title = "Trend of Crimes in unsafe neighbourhoods over time",
       color = "Neighborhoods")

Now let us observe the trend over time in the same neighborhoods but only against women and then compare the two.

crime_female <- crime_clean[which(crime_clean$VICTIM_GENDER == 'FEMALE'),] 

crime_clean$occurence_year <- substr(crime_clean$date_reported,7,10)

crime_female %>%
  filter(crime_female$CPD_NEIGHBORHOOD == "WESTWOOD" |
           crime_female$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
           crime_female$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
           crime_female$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
           crime_female$CPD_NEIGHBORHOOD == "C. B. D. / RIVERFRONT" |
           crime_female$CPD_NEIGHBORHOOD == "AVONDALE" |
           crime_female$CPD_NEIGHBORHOOD == "WALNUT HILLS" |
           crime_female$CPD_NEIGHBORHOOD == "COLLEGE  HILL" |
           crime_female$CPD_NEIGHBORHOOD == "MOUNT  AUBURN" |
           crime_female$CPD_NEIGHBORHOOD == "NORTHSIDE" ) %>% 
  group_by(CPD_NEIGHBORHOOD, occurence_year) %>%
tally() %>% 
ggplot(aes(x = occurence_year, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) + 
  geom_line() +
  labs(y = "Number of Incidents",
       x = "Year",
       title = "Trend of Crimes in unsafe neighborhoods over time against women",
       color = "Neighborhood")

From both the trends we can see a sharp increase in the crime rate during the 2010 - 2011 period. Could an increase in time during this period be attributed to the recession that preceeded these years? Accodring to the Department of Labor, 8.7 million people in America lost their jobs by 2010 and there was a decrease in the GDP by 4.2%. It might not be unreasonable to assume that this might be a contributing factor.

Then we see a dip until 2013 and another increase in 2014. It might also be noted that crimes in 2018 were the lowest recorded in the entire decade. This tells us that our law enforcement is doing something right. Westwood tops the charts with the highest crime rate in both the graphs. There are no obviouos differences in the trends of crimes against women vs those against everyone.

Summary

In summarizing my findings, crimes against women are significantly higher than crimes against men or businesses. This must be a cause for concern to everyone. Criminals think of women as easy targets specially for crimes like theft. As women, we must be careful and learn self defense techniques so we can defend outselves against criminals.

Westwood is the most unsafe neighborhood for everyone including women followed by West Price Hill and Walnut Hills. We are living in the safest years with the lowest number of crimes around 2018.

Limitations

I wanted to flag zipcodes as safe or unsafe but I realised I would have to create a separate data set for it a zipcodes in this one are repeated. There is no place I can store only zipcodes and the flag in the same dataset.

Future Work

I could then build a model which could predict given the vicinity of a neighborhood whether an area would be safe or unsafe. This could be done using a logistic regression model. Here the response variable would be the new column that is added to the aggregated dataset - flag and the possible covariates or independent variables could be neighborhood and date (date_reported).