Header Image
This dataset includes proactive and reactive police incident data for the city of Cincinnati, OH. The Computer-Aided Dispatch (CAD) system is used to capture calls for service and facilitate the incident response.
The problem statement I am trying to address from this dataset is:
“What are the most common crimes committed against women in Cincinnati”
Addressing the problem:
The variable “Offense” in the datset tells us what offense was committed. The variable VICTIM_GENDER tells us the gender of the victim was. The CPD_NEIGHBORHOOD variable tells us where the location of the crime was. These variables coupled with a few more can give us some insights into how safe the city of Cincinnati is for women and help us flag unsafe neighborhoods as safe.
Proposed Approach:
Goal:
To be able to identify the most commonly committed crimes against women in Cincinnati
Packages Required:
The following packages will be required:
tidyr - This package helps us clean and reorganize data for different kinds of analysis. knitr - This package helps with dynamic report generation. I’m using it to display values from my dataframes.
Rcode:
library(tidyr)
library(knitr)
This dataset was obtained from the Cincinnati Open Data Portal. It can be directly downloaded from here
Original purpose of the dataset
The indicents of reported crimes in the city of Cincinnati were originally collected and collated by an agency for management. Incidents are typically housed in a Records Management System (RMS) that stores agency-wide data about law enforcement operations. This data is open everyone for analysis. The data is updated everyday and is always current.
Steps used for data cleaning
First, we read the .csv file into a dataframe in R and proceed to check the dimensions of the dataset.
setwd("C:/Users/Tanvi/Documents")
crimedata <- read.csv("Crime.csv")
dim(crimedata)
## [1] 357880 40
View(crimedata)
We can see that there are 4 columns representing 4 dates and times they are DATE_REPORTED, DATE_TO, DATE_FROM and DATE_OF_CLEARANCE These columns have both the date as well as the time. For our business problem, granularity of upto days is good enough. We will split the data and store them in different variables as follows:
time_reported <- substr(crimedata$DATE_REPORTED, 12, 22)
date_reported <- substr(crimedata$DATE_REPORTED, 1, 10)
time_from <- substr(crimedata$DATE_FROM, 12, 22)
date_from <- substr(crimedata$DATE_FROM, 1, 10)
time_to <- substr(crimedata$DATE_TO, 12, 22)
date_to <- substr(crimedata$DATE_TO, 1, 10)
time_of_clearance <- substr(crimedata$DATE_OF_CLEARANCE, 12, 22)
date_of_clearance <- substr(crimedata$DATE_OF_CLEARANCE, 1, 10)
Many of the columns in the dataset can be taken out right away as they won’t help us with our goal. We will remove the indentifiably unnecessary columns right away.
crime_clean <- crimedata[, -c(3:6, 13:15, 22:24, 39)]
dim(crime_clean)
## [1] 357880 29
Columns Removed:
Now that the columns we don’t need have been dropped, we can add the columns we created by splitting the date columns
crime_clean <- cbind(crime_clean, date_reported, date_from, date_to)
Let us now check the dimensions of the resultant dataframe
dim(crime_clean)
## [1] 357880 32
The dataframe still has a column called THEFT_CODE which contains both the code of the offense committed as well as its description in the same columns. We would want to separate these into 2 different columns for ease of querying. We can separate them using the separate() function in tidyr.
crime_clean <- separate(crime_clean, THEFT_CODE, into = c("THEFT_CODE", "CODE_DESC"), sep = "-")
The dataset has lots of missing values, however omitting all the rows containing even a single null value will result in us having very few rows of data to work with. It might not be a very useful subset. We therefore want to take out values which are either null or UNKNOWN because these might affect our result numbers. I am also displaying the dimensions of the dataframe with each reduction so that we can see how many rows and columns we are ending up with each time.
crime_clean <- with(crime_clean, crime_clean[!(VICTIM_GENDER == "" | is.na(VICTIM_GENDER) | VICTIM_GENDER == "UNKNOWN"), ])
dim(crime_clean)
## [1] 302926 33
crime_clean <- with(crime_clean, crime_clean[!(ZIP == "" | is.na(ZIP)), ])
dim(crime_clean)
## [1] 302833 33
crime_clean <- with(crime_clean, crime_clean[!(OFFENSE == "" | is.na(OFFENSE)), ])
dim(crime_clean)
## [1] 302783 33
crime_clean <- with(crime_clean, crime_clean[!(ADDRESS_X == "" | is.na(ADDRESS_X)), ])
dim(crime_clean)
## [1] 299697 33
crime_clean <- with(crime_clean, crime_clean[!(LOCATION == "" | is.na(LOCATION)), ])
dim(crime_clean)
## [1] 299688 33
Let us see what our cleaned dataset looks like
kable(crime_clean[1:10,], caption = "Glimpse of cleaned data")
| INSTANCEID | INCIDENT_NO | UCR | DST | BEAT | OFFENSE | LOCATION | THEFT_CODE | CODE_DESC | HATE_BIAS | DAYOFWEEK | RPT_AREA | CPD_NEIGHBORHOOD | SNA_NEIGHBORHOOD | WEAPONS | ADDRESS_X | LONGITUDE_X | LATITUDE_X | VICTIM_AGE | VICTIM_RACE | VICTIM_ETHNICITY | VICTIM_GENDER | SUSPECT_AGE | SUSPECT_RACE | SUSPECT_ETHNICITY | SUSPECT_GENDER | TOTALNUMBERVICTIMS | TOTALSUSPECTS | UCR_GROUP | ZIP | date_reported | date_from | date_to | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CBF5B090-FBF2-480A-9F4A-FCD1342FB75F | 179002225 | 401 | 3 | 2 | FELONIOUS ASSAULT | 47-STREET | NA | N–NO BIAS/NOT APPLICABLE | SUNDAY | 223 | EAST PRICE HILL | EAST PRICE HILL | 12 - HANDGUN | 9XX HAWTHORNE AV | -84.56178 | 39.10665 | 18-25 | WHITE | NOT OF HISPANIC ORIG | MALE | UNKNOWN | UNKNOWN | UNKNOWN | UNKNOWN | 1 | 3 | AGGRAVATED ASSAULTS | 45205 | 01/23/2017 | 01/22/2017 | 01/22/2017 | |
| 3 | 6C10F3A8-AA93-47B2-8AB9-F7E317197351 | 21005801 | 551 | 2 | 2 | BURGLARY | 02-MULTI FAMILY APARTMENT | NA | N–NO BIAS/NOT APPLICABLE | WEDNESDAY | 128 | OAKLEY | OAKLEY | 99–NONE | 31XX MADISON RD | -84.42805 | 39.15390 | 51-60 | WHITE | UNKNOWN | MALE | UNKNOWN | 1 | NA | BURGLARY/BREAKING ENTERING | 45208 | 12/22/2010 | 12/22/2010 | 12/22/2010 | ||||
| 4 | 2A65A08F-8C74-403B-8636-6824B50B002D | 129016167 | 600 | 2 | 1 | THEFT | 01-SINGLE FAMILY HOME | 23H | ALL OTHER LARCENY | N–NO BIAS/NOT APPLICABLE | SATURDAY | 68 | EVANSTON | EVANSTON | 99–NONE | 32XX HACKBERRY ST | -84.47504 | 39.13586 | 51-60 | BLACK | NOT OF HISPANIC ORIG | FEMALE | 51-60 | BLACK | NOT OF HISPANIC ORIG | MALE | 1 | 1 | THEFT | 45207 | 06/12/2012 | 05/12/2012 | 05/12/2012 |
| 5 | F7D48934-5C4A-4339-98AA-A98DB5DAB9FC | 189020414 | 600 | 3 | 3 | THEFT | 47-STREET | 24O | MOTOR VEHICLE THEFT | N–NO BIAS/NOT APPLICABLE | SATURDAY | 311 | EAST WESTWOOD | N/A | 99 - NONE | 23XX MCHENRY AV | NA | NA | 61-70 | BLACK | NOT OF HISPANIC ORIG | FEMALE | UNDER 18 | BLACK | NOT OF HISPANIC ORIG | FEMALE | 1 | 3 | THEFT | 45211 | 07/28/2018 | 07/28/2018 | 07/28/2018 |
| 8 | 23D84A29-D13F-44FC-AC87-4B30257759B2 | 179014947 | 551 | 3 | 1 | BURGLARY | 02-MULTI FAMILY APARTMENT | NA | N–NO BIAS/NOT APPLICABLE | THURSDAY | 266 | SAYLER PARK | SAYLER PARK | 99 - NONE | 1XX RICHARDSON PL | -84.68554 | 39.10888 | 51-60 | WHITE | NOT OF HISPANIC ORIG | FEMALE | UNKNOWN | 1 | NA | BURGLARY/BREAKING ENTERING | 45233 | 04/27/2017 | 04/20/2017 | 04/27/2017 | ||||
| 9 | 8D6B4610-F302-48C5-A5E5-532CC00E368C | 41106838 | 600 | 4 | 1 | THEFT | 47-STREET | 23F | THEFT FROM MOTOR VEHICLE | N–NO BIAS/NOT APPLICABLE | FRIDAY | 363 | CORRYVILLE | N/A | 99–NONE | 2XX E KING DR | NA | NA | 51-60 | BLACK | NOT OF HISPANIC ORIG | MALE | UNKNOWN | 1 | NA | THEFT | 45219 | 10/22/2011 | 10/21/2011 | 10/22/2011 | |||
| 10 | C5CB70FD-7249-4A76-A9F1-2569FE11E32F | 179002343 | 802 | 1 | 2 | AGGRAVATED MENACING | 47-STREET | NA | N–NO BIAS/NOT APPLICABLE | TUESDAY | 23 | PENDLETON | N/A | 99 - NONE | 3XX E LIBERTY ST | NA | NA | 31-40 | BLACK | NOT OF HISPANIC ORIG | MALE | 26-30 | WHITE | NOT OF HISPANIC ORIG | FEMALE | 1 | 1 | PART 2 MINOR | 45202 | 01/24/2017 | 01/24/2017 | 01/24/2017 | |
| 12 | 709D8B61-2ADB-4AB0-9F9A-E30CD5002FA0 | 129037156 | 303 | 4 | 3 | AGGRAVATED ROBBERY | 02-MULTI FAMILY | NA | N–NO BIAS/NOT APPLICABLE | SATURDAY | 366 | AVONDALE | AVONDALE | 11 - FIREARM (TYPE NOT STATED) | 5XX HALE AV | -84.49682 | 39.14022 | 41-50 | BLACK | NOT OF HISPANIC ORIG | FEMALE | UNKNOWN | BLACK | NOT OF HISPANIC ORIG | MALE | 1 | 3 | ROBBERY | 45229 | 12/22/2012 | 12/22/2012 | 12/22/2012 | |
| 13 | DA964133-D9A6-4BE4-8710-46856B2F3576 | 31105587 | 600 | 3 | 3 | THEFT | 37-GROCERY/SUPERMARKET | 23F | THEFT FROM MOTOR VEHICLE | N–NO BIAS/NOT APPLICABLE | WEDNESDAY | 271 | WESTWOOD | WESTWOOD | 99–NONE | 23XX FERGUSON RD | -84.59760 | 39.12662 | 51-60 | WHITE | NOT OF HISPANIC ORIG | MALE | UNKNOWN | 1 | NA | THEFT | 45238 | 06/22/2011 | 06/22/2011 | 06/22/2011 | |||
| 14 | 30B5114C-2861-48D2-B049-5A54AEAE810A | 149C000042 | 600 | 4 | 1 | THEFT | 25-OTHER COMMERCIAL SERVICE LOCATION | 23F | THEFT FROM MOTOR VEHICLE | N–NO BIAS/NOT APPLICABLE | 367 | CORRYVILLE | N/A | 99 - NONE | 2XX ERKENBRECHER AV | NA | NA | 18-25 | WHITE | NOT OF HISPANIC ORIG | MALE | UNKNOWN | 1 | NA | THEFT | 45229 | 10/06/2014 | 10/04/2014 | 10/04/2014 |
Summary of variables of concern and what they mean
Exploratory data analysis usually consists of closely examining variables to understand what they represent and find their distributions. Distributions can be observed by plotting histograms.
I will be using additional libraries such as gglplot to uncover trands over time to visualize information and dplyr for data manipulation to subset and generate interpretable views of the data.
Let us take a look at some of the statstics involved:
We will be using the column “VICTIM_GENDER” to calculate the percentage of crimes against women in Cincinnati.
total_crimes <- nrow(crime_clean[crime_clean$VICTIM_GENDER])
total_crimes
## [1] 299688
against_women <- nrow(crime_clean[crime_clean$VICTIM_GENDER == "FEMALE" | crime_clean$VICTIM_GENDER == "F - FEMALE",])
against_women
## [1] 163172
perc_women <- (against_women/total_crimes)*100
perc_women
## [1] 54.44729
against_business <- nrow(crime_clean[crime_clean$VICTIM_GENDER == "NON-PERSON (BUSINESS",])
perc_business <- (against_business/total_crimes)*100
perc_business
## [1] 0.01768506
against_men <- nrow(crime_clean[crime_clean$VICTIM_GENDER == "MALE" | crime_clean$VICTIM_GENDER == "M - MALE",])
perc_men <- (against_men/total_crimes)*100
perc_men
## [1] 45.53502
pie_frame <- data.frame("Gender" = c("Female","Male","Non-Person"),
"Percentage" = c(perc_women,perc_men,perc_business))
kable(pie_frame, caption = "Percentage of crimes")
| Gender | Percentage |
|---|---|
| Female | 54.4472919 |
| Male | 45.5350231 |
| Non-Person | 0.0176851 |
In the code above, I have found out the percentage of crimes against men (perc_men), percentage of crimes against women (perc_women) and the percentage of crimes against businesses (perc_business)
library(ggplot2)
Gender <- c(perc_women, perc_men, perc_business)
lbls <- c("Women","Men","Non-Person")
pie(Gender, labels = c(round(perc_women, digits = 3), round(perc_men, digits = 3), round(perc_business, digits = 3)),col = c("pink", "blue", "white"), main = "Percentage of Crime by Gender")
legend("topright", c("Women","Men","Non-Person"), cex = 0.8, fill = c("pink", "blue", "white"))
The pie chart above shows that crimes against women are in fact a whopping 9% higher than crimes against men. Crimes against non-persons/business are only ~0.02%.
So now that we have established you’d be safer as a man in Cincinnati than as a woman, let us actually see what are the most common crimes against women.
library(dplyr)
crime_2 <- crime_clean[,-31:-33]
crime_3 <- crime_2[which(crime_2$VICTIM_GENDER == 'FEMALE'),] %>%
group_by(OFFENSE) %>%
summarize(SumOfIncidents = n()) %>%
arrange(desc(SumOfIncidents)) %>%
head(10)
ggplot(data = crime_3, aes(x = OFFENSE, y = SumOfIncidents, fill = OFFENSE)) +
geom_bar(stat = "identity") + labs(x = "Offense", y = "Number of Incidents", title = "Top 10 crimes against women" ) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
The plot above tells us that “Theft” is the most commonly commited crime against women in Cincinnati. This is then followed by Assault, Criminal Damaging/Endangering, Burglary and Domestic violence.
Let us try to find out which are the most unsafe neighbourhoods in Cincinnati by plotting a bar graph that shows count of incidents against different neighbourhoods.
crime_clean %>%
ggplot(aes(factor(CPD_NEIGHBORHOOD), fill = CPD_NEIGHBORHOOD)) +
geom_bar(stat = "count", position = "dodge") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs( y = "Number of Occurences",
x = "Neighborhoods",
title = "Most Unsafe neighbourhoods in Cincinnati") +
guides(fill = guide_legend(title = "Neighborhoods"))
From the graph above, we can decude that the following are the top 10 most unsafe neighborhoods:
Let us now observe the trend over time in the number of crimes in these neighborhoods in general.
crime_clean$occurence_year <- substr(crime_clean$date_reported,7,10)
crime_clean %>%
filter(crime_clean$CPD_NEIGHBORHOOD == "WESTWOOD" |
crime_clean$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
crime_clean$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
crime_clean$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
crime_clean$CPD_NEIGHBORHOOD == "C. B. D. / RIVERFRONT" |
crime_clean$CPD_NEIGHBORHOOD == "AVONDALE" |
crime_clean$CPD_NEIGHBORHOOD == "WALNUT HILLS" |
crime_clean$CPD_NEIGHBORHOOD == "COLLEGE HILL" |
crime_clean$CPD_NEIGHBORHOOD == "MOUNT AUBURN" |
crime_clean$CPD_NEIGHBORHOOD == "NORTHSIDE") %>%
group_by(CPD_NEIGHBORHOOD, occurence_year) %>%
tally() %>%
ggplot(aes(x = occurence_year, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) +
geom_line() +
labs(y = "Number of Incidents",
x = "Year",
title = "Trend of Crimes in unsafe neighbourhoods over time",
color = "Neighborhoods")
Now let us observe the trend over time in the same neighborhoods but only against women and then compare the two.
crime_female <- crime_clean[which(crime_clean$VICTIM_GENDER == 'FEMALE'),]
crime_clean$occurence_year <- substr(crime_clean$date_reported,7,10)
crime_female %>%
filter(crime_female$CPD_NEIGHBORHOOD == "WESTWOOD" |
crime_female$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
crime_female$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
crime_female$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
crime_female$CPD_NEIGHBORHOOD == "C. B. D. / RIVERFRONT" |
crime_female$CPD_NEIGHBORHOOD == "AVONDALE" |
crime_female$CPD_NEIGHBORHOOD == "WALNUT HILLS" |
crime_female$CPD_NEIGHBORHOOD == "COLLEGE HILL" |
crime_female$CPD_NEIGHBORHOOD == "MOUNT AUBURN" |
crime_female$CPD_NEIGHBORHOOD == "NORTHSIDE" ) %>%
group_by(CPD_NEIGHBORHOOD, occurence_year) %>%
tally() %>%
ggplot(aes(x = occurence_year, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) +
geom_line() +
labs(y = "Number of Incidents",
x = "Year",
title = "Trend of Crimes in unsafe neighborhoods over time against women",
color = "Neighborhood")
From both the trends we can see a sharp increase in the crime rate during the 2010 - 2011 period. Could an increase in time during this period be attributed to the recession that preceeded these years? Accodring to the Department of Labor, 8.7 million people in America lost their jobs by 2010 and there was a decrease in the GDP by 4.2%. It might not be unreasonable to assume that this might be a contributing factor.
Then we see a dip until 2013 and another increase in 2014. It might also be noted that crimes in 2018 were the lowest recorded in the entire decade. This tells us that our law enforcement is doing something right. Westwood tops the charts with the highest crime rate in both the graphs. There are no obviouos differences in the trends of crimes against women vs those against everyone.
In summarizing my findings, crimes against women are significantly higher than crimes against men or businesses. This must be a cause for concern to everyone. Criminals think of women as easy targets specially for crimes like theft. As women, we must be careful and learn self defense techniques so we can defend outselves against criminals.
Westwood is the most unsafe neighborhood for everyone including women followed by West Price Hill and Walnut Hills. We are living in the safest years with the lowest number of crimes around 2018.
I wanted to flag zipcodes as safe or unsafe but I realised I would have to create a separate data set for it a zipcodes in this one are repeated. There is no place I can store only zipcodes and the flag in the same dataset.
I could then build a model which could predict given the vicinity of a neighborhood whether an area would be safe or unsafe. This could be done using a logistic regression model. Here the response variable would be the new column that is added to the aggregated dataset - flag and the possible covariates or independent variables could be neighborhood and date (date_reported).