Cincinnati is home to many prestigious universities and 9 Fortune 500 companies. With nine Fortune 500 company headquarters in the area, the region ranks in the United States Top 10 markets for number of Fortune 500 headquarters per million residents, higher than New York, Boston, Chicago or Los Angeles. So, Cincinnati became main attraction to college students and young professionals, who are looking to start their career or those looking to make their next big switch.
Safety is one of the paramount factors people consider while making a switch. Cincinnati Police Department and Cincinnati Government has taken many initiatives to make Cincinnati a safer place. In this project, we are analyzing the crime scenario in Cincinnati over last 10 years (Mar 2010 to Mar 2019). Through this project we are trying to map the places which are more dangerous particularly for age groups below 30.
In order to find whether Cincinnati is a safer placer or not, I have obtained a crime dataset from Cincinnati open data portal. I will find the hidden patterns / trends associated with the data. I will find the answers to below questions to gain some helpful insights.
I will then dive deeper in to the worst neighborhoods to observe the above patterns and see if any thing in particular stands out. Doing this analysis will help the consumer make a better decision and remove any pre formed predjudices.
I will made use of various R libraries to clean the data, perform data manipulations, perform the Exploratory Data Analysis to make reasonable assumptions on the safety in Queen city.
library(data.table)
library(tidyverse)
library(dplyr)
library(DT)
library(rmarkdown)
library(leaflet)
I has used the dataset released by City of Cincinnati under PDI (Police Data Initiative) Crime Incidents. The dataset can be found here
# Import data from local machine
cincy.data <- fread(file = "city_of_cincinnati_police_data_initiative_crime_incidents.csv")
glimpse(cincy.data)
Our Dataset contains 355379 observations and 40 variables. Data is spread over 1991 to 2019. The data is stored and recorded in the Record Management System (RMS) which stores agency-wide data about low enforcement operations and is updated on a dialy basis. As per privacy laws, some data, for example, addresses and latitude longitude information has been masked before it was made public for use.
table(cincy.data$VICTIM_AGE)
##
## 00 18-25
## 1 4 72619
## 26-30 31-40 41-50
## 41978 58768 43554
## 51-60 61-70 ADULT (18+)
## 35961 16074 1925
## JUVENILE (UNDER 18) OVER 70 UNDER 18
## 1134 8102 17722
## UNKNOWN
## 57537
#Select particular age groups
cincy.data <- filter(cincy.data, VICTIM_AGE == "18-25" | VICTIM_AGE == "26-30" | VICTIM_AGE == "UNDER 18" | VICTIM_AGE == "JUVENILE (UNDER 18)" )
#Establish time of occurence
DATE_TIME_OCCURENCE <- substr(cincy.data$DATE_FROM, 1, 22)
DATE_TIME_OCCURENCE <- (strptime(DATE_TIME_OCCURENCE, '%m/%d/%Y %I:%M:%S %p'))
TIME_OCCURENCE <- substr(cincy.data$DATE_FROM, 12, 22)
TIME_OCCURENCE <- as.difftime(TIME_OCCURENCE, '%I:%M:%S %p', units = "hours")
#rounding the filed for future use
TIME_OCCURENCE <- round(TIME_OCCURENCE, 1)
#Bind to Data Frame
cincy.data <- cbind(cincy.data, DATE_TIME_OCCURENCE, TIME_OCCURENCE)
#Creating a clomun which stores the year
cincy.data <- cincy.data %>% mutate(Reported_year = substr(cincy.data$DATE_TIME_OCCURENCE, 1, 4))
#Select only required columns to reduce the number of variables
cincy.data <- cincy.data %>% select(INSTANCEID,DATE_TIME_OCCURENCE,TIME_OCCURENCE,
OFFENSE,LOCATION,DAYOFWEEK,CPD_NEIGHBORHOOD,WEAPONS,
LONGITUDE_X:VICTIM_GENDER,Reported_year)
#Select required time frame
cincy.data <- cincy.data %>% filter(DATE_TIME_OCCURENCE < "2019-03-06" & DATE_TIME_OCCURENCE > "2010-03-06")
# Finding unique names in the Gender column
unique(cincy.data$VICTIM_GENDER)
## [1] "FEMALE" "MALE" "UNKNOWN"
## [4] "NON-PERSON (BUSINESS" "F - FEMALE" "M - MALE"
# Updating the values so that FEMALE will be represented as F and Males will be represented as M
cincy.data$VICTIM_GENDER <- ifelse(test = cincy.data$VICTIM_GENDER == "F - FEMALE" | cincy.data$VICTIM_GENDER == "FEMALE", yes = "F", no = cincy.data$VICTIM_GENDER)
cincy.data <- cincy.data %>%
mutate(WEAPONS = gsub(".*11.*", "FIREARM", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*12.*", "HANDGUN", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*13.*", "RIFLE", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*14.*", "SHOTGUN", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*15.*", "FIREARM", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*16.*", "FIREARM", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*17.*", "FIREARM", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*18.*", "BB AND PELLET GUNS", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*20.*", "KNIFE/CUTTING INSTRUMENT", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*30.*", "BLUNT OBJECT", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*35.*", "MOTOR VEHICLE", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*40.*", "PERSONAL WEAPON", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*60.*", "EXPLOSIVES", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*70.*", "DRUGS", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*80.*", "OTHER WEAPONS", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*U.*", "UNKNOWN", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*65.*", "FIRE/INCENDIARY DEVICE", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*50.*", "POISON", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*99.*", "NONE", WEAPONS)) %>%
mutate(WEAPONS = gsub(".*85.*", "ASPHYXIATION", WEAPONS))
cincy.data <- cincy.data %>%
mutate(OFFENSE = gsub(".*ASSAULT.*", "ASSAULT", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*BURGLARY.*", "BURGLARY", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*RAPE.*", "RAPE", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*ROBBERY.*", "ROBBERY", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*MURDER.*", "MURDER", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*ABDUCTION.*", "ABDUCTION", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*MENACING.*", "MENACING", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*FORGERY.*", "FORGERY", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*KIDNAPPING.*", "KIDNAPPING", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*ARSON.*", "ARSON", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*SEX.*", "SEX", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*INTIMID.*", "INTIMIDATION", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*HARRASS.*", "HARRASS", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*VANDALISM.*", "VANDALISM", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*THEFT.*", "THEFT", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*CRIMINAL.*", "CRIMINAL", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*DISORDERLY CONDUCT.*", "DISORDERLY CONDUCT", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*ENDANGERING CHILDREN.*", "ENDANGERING CHILDREN", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*VIOL.*", "VIOLATE PROTECTION ORDER", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*CREDIT CARD.*", "CREDIT CARD FRAUD", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*TELEPHONE HARRASSMENT.*", "TELEPHONE HARRASSMENT", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*PATIENT ABUSE.*", "PATIENT ABUSE", OFFENSE)) %>%
mutate(OFFENSE = gsub(".*UNAUTHORISED USE.*", "UNAUTHORISED USE", OFFENSE))
#Finding missing values in each column
colSums(is.na(cincy.data))
## INSTANCEID DATE_TIME_OCCURENCE TIME_OCCURENCE
## 0 0 0
## OFFENSE LOCATION DAYOFWEEK
## 0 0 0
## CPD_NEIGHBORHOOD WEAPONS LONGITUDE_X
## 0 0 19537
## LATITUDE_X VICTIM_AGE VICTIM_RACE
## 19537 0 0
## VICTIM_ETHNICITY VICTIM_GENDER Reported_year
## 0 0 0
#Removing missing values
cincy.data <- na.omit(cincy.data)
After removing missing values, we have 113456 observations and 15 variables. We will use it as a final dataste to do further analysis.
#Final Dataset
datatable(head(cincy.data,200),caption = "Cincy Crime Data ", class = 'cell-border stripe')
I am not converting some of the variables to factors at this stage. I will convert them dynamically when they are required.
We will explore the dataset and plot the trends for identifying some of the patterns
Below plot shows the trend on the number of incidents reported over years
ggplot(cincy.data) +
aes(x = Reported_year) +
geom_bar(stat = "count",fill = 'red') +
geom_text(stat = "count",aes(label = ..count..),vjust = -0.5) +
labs(title = "Incidents reported YOY",x = "Year",y = "Number of Incidents") +
theme(axis.text.x = element_text(size = 10, angle = 60)) +
scale_y_continuous(limit = c(0,20000))
From the plot, we could observe that number of incidents reported have been gradually reducing since 2011 (except for 2014). However, the number of incidents remained almost same from 2015
Below is the trend based on the month
ggplot(cincy.data) +
aes(x = as.numeric(substr(cincy.data$DATE_TIME_OCCURENCE, 6, 7))) +
geom_bar(stat = "count",fill = 'red') +
geom_text(stat = "count",aes(label = ..count..),vjust = -0.5) +
labs(title = "Number of incidents reported based on Month",x = "Month",y = "Number of Incidents") +
theme(axis.text.x = element_text(size = 10, angle = 60)) +
scale_x_discrete(limits = month_names) +
scale_y_continuous(limit = c(0,13000))
We can see that the months July to October has most number of incidents reported compared to other months. This is worrying to observe as these are the months new school year is starting.
Common Offense
cincy.data %>% count(OFFENSE, sort = TRUE) %>% top_n(5)
## # A tibble: 5 x 2
## OFFENSE n
## <chr> <int>
## 1 THEFT 28950
## 2 ASSAULT 24312
## 3 BURGLARY 13252
## 4 CRIMINAL 12341
## 5 ROBBERY 10391
Common Weapon used
## # A tibble: 5 x 2
## WEAPONS n
## <chr> <int>
## 1 PERSONAL WEAPON 25912
## 2 FIREARM 4684
## 3 OTHER WEAPONS 1141
## 4 PERSONAL WEAPONS (HANDS, FEET, TEETH, ETC.) 662
## 5 MOTOR VEHICLE 450
Most dangerous neighborhoods
## # A tibble: 5 x 2
## CPD_NEIGHBORHOOD n
## <chr> <int>
## 1 WESTWOOD 9896
## 2 WEST PRICE HILL 7175
## 3 EAST PRICE HILL 6339
## 4 AVONDALE 5525
## 5 FAIRVIEW 5149
Dangerous Hours
## # A tibble: 5 x 2
## TIME_OCCURENCE n
## <time> <int>
## 1 0 hours 3502
## 2 12 hours 2859
## 3 21 hours 2824
## 4 22 hours 2758
## 5 23 hours 2678
Below are the statistics for the incidents reported based on the age group
##
## 18-25 26-30 UNDER 18
## 62455 36496 14505
Below are the most dangerous neighborhoods based on the number of incidents reported.
## # A tibble: 5 x 2
## CPD_NEIGHBORHOOD n
## <chr> <int>
## 1 WESTWOOD 9896
## 2 WEST PRICE HILL 7175
## 3 EAST PRICE HILL 6339
## 4 AVONDALE 5525
## 5 FAIRVIEW 5149
In order to do the further analysis, I am subsetting the data so that only the above shown neighborhoods are present
sub.data <- cincy.data %>%
filter(cincy.data$CPD_NEIGHBORHOOD == "WESTWOOD" |
cincy.data$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
cincy.data$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
cincy.data$CPD_NEIGHBORHOOD == "AVONDALE" |
cincy.data$CPD_NEIGHBORHOOD == "FAIRVIEW")
Below plot shows the incident count in terms of histograms. Showing the trend in histograms will tend for easy comparison. Neighborhood “WESTWOOD” stands out for the incidents reported. The count in WESTWOOD is almost 2 times the incidetnts reported in FAIRVIEW
ggplot(sub.data, aes(factor(CPD_NEIGHBORHOOD), fill = CPD_NEIGHBORHOOD)) +
geom_bar(stat = "count", position = "dodge") +
geom_text(stat = "count",aes(label = ..count..),vjust = -0.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs( y = "Number of Reported incidents",
x = "Neighborhoods",
title = "Most Dangerous Neighborhoods in Cincinnati") +
scale_y_continuous(limit = c(0,11000)) +
guides(fill = guide_legend(title = "Neighborhood in Cincinnati"))
Trends in these neigborhoods over Year to Year
This comparison will help us in understanding whether the crime has grown / subdued in the particular neighborhood over the years.
sub.data %>% group_by(CPD_NEIGHBORHOOD, Reported_year) %>%
tally() %>%
ggplot(aes(x = Reported_year, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) + geom_line() +
labs(y = "Number of Occurences",
x = "Years",
title = "Trend of Crimes in Cincinnati's Worst Neightborhoods",
color = "Neighborhood in Cincinnati")
Some of the important findings from above plot are
Crime Trends in neighborhoods based on month of the year
month_name <- as.numeric(substr(sub.data$DATE_TIME_OCCURENCE, 6, 7))
sub.data <- cbind(sub.data, month_name)
sub.data$month <- as.factor(sub.data$month_name)
sub.data %>%
group_by(CPD_NEIGHBORHOOD, month_name) %>%
tally() %>%
ggplot(aes(x = month_name, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) + geom_line() +
labs(y = "Number of Occurences",
x = "Months",
title = "Trend of Crimes in different months",
color = "Neighborhood") +
scale_x_discrete(limits = month_names)
The crime trend in these areas is in accordance with the overall month over month trend marking July to September / October as the troublesome months.
Crime Trends in neighborhoods based on time of the day
sub.data$TIME_OCCURENCE <- round(sub.data$TIME_OCCURENCE,digits = 0)
sub.data %>% group_by(CPD_NEIGHBORHOOD, TIME_OCCURENCE) %>%
tally() %>%
ggplot(aes(x = TIME_OCCURENCE, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) +
geom_point(alpha = .5) +
stat_smooth(aes(x = TIME_OCCURENCE, y = n),method = "lm", formula = y ~ poly(x, 10), se = FALSE) +
labs(y = "Number of Occurences",
x = "Time of day",
title = "Worst Times of Day in Cincinnati's Most Dangerous Neighborhoods",
color = "Neighborhood in Cincinnati")
The crime rate tends to go higher during late hours of the day. The trend remains relitively same for these neighborhoods as well.
We are plotting the incidents reported on the map of cincinnati to observe the heat areas on the map.
Below map is plotted using leaflet package. Leaflet package uses the latitude and longtidue values and mark them as pointers on the map. We can cluster these markers based on the zoom level. The map which is generated below can be zoomed in further to check the incidents at neighborhood level / street level. You can even find the location (from which incident is reported) once you have zoomed in enough.
leaflet() %>% addTiles() %>% addMarkers(lng = cincy.data$LONGITUDE_X,
lat = cincy.data$LATITUDE_X,
clusterOptions = markerClusterOptions())
From the very high level, we could see more incidents were reported near the clifton and downtown areas. This is understandable, as we have selected the data only for the ager groups below 30. Most of the people in that age group are either students or early career professionals and tend to live in more accessible areas like clifton or downtown.
When zoomed in further, we could observe many incidents are reported in areas of Clifton heights and Over The Rhine (OTR)
I am further adding an option to check the crime scenario based on the year. I am adding the map for the year 2018. But this funtinality can be extended to any year.
#Creating Function for the map by years
map_year <- function(YEAR) {
cincy.data.subset <- cincy.data %>% filter(YEAR == Reported_year)
map_y <- leaflet() %>% addTiles() %>%
addMarkers(lng = cincy.data.subset$LONGITUDE_X,
lat = cincy.data.subset$LATITUDE_X,
clusterOptions = markerClusterOptions())
}
Map of year 2018
(map_year("2018"))
To view map for different year, change the value between the " " in above code
I have started the analysis by cleaning the dataset and subsetting the data as per our needs. We summarized the data at different levels to observe the trends / patterns. By looking at the YOY situation, I have found that there was a general downward trend to crime occurrences. I have then checked for the trends of Month over month, Day of the week, worst neighborhoods, which weapons were used the most, which offences were the most common and what time of the day that had the greatest number of occurrences. I have decided to do more detailed analysis surrounding the worst neighborhoods starting by simply plotting the worst of them to get a visual representation of how bad they really were. I have then checked the trends based on the different granules (Day of week, time of day, month of the year). Most of the analysis done on these neighborhoods are in line with the overall trend of cincinnati.
My analysis has lead me to beleive that Cincinnati is slowly but surely becoming a safe place to live. In addition to this conclusion, my analysis can help the general public to observe the trends of crime in their neighborhood. It can also help others (who are not familiar with cincinnati) to understand the crime scenario in cincinnati and help in making right decision (where to stay, which places to avoid). Finally, this can help the law enforcements officers to measure the effect of their efforts to curb the crime.
The above work can be improved further by building an interactive Shiny app with the features of sliders and radio buttons. Building this app can be very helpful as the decisions can be taken the higher officials by just looking at the dashboard or interacting with the dashboards as per requirement. I have been working on developing this shiny app. I will post it once I am done with that.