Before the riot of 2001, Cincinnati’s overall crime rate was dropping dramatically and had reached its lowest point since 1992. After the riot violent crime increased. Reflecting national trends, crime rates in Cincinnati have dropped in all categories from 2006 to 2010.
In 2011, using FBI data, the CQ Press ranked Cincinnati the 16th most dangerous city in the United States.However, the FBI web site recommends against using its data for rankings, since there are many factors that influence crime rates. Also, the CQ Press did not include Chicago, Illinois in its ranking.[6]
According to the Hamilton County Prosecutor, the vast majority of people being murdered are inside the city and are African American. In 2009, 44 black men and 11 black women were murdered, but no whites.[8] According to the Prosecutor, people almost always commit murders inside their racial classifications, and there is a subset in the underclass of Cincinnati which “commit a lot of violent crime and tend to be black,”[8] similar to national trends.
We have analysed the data to understand how crime rates have changed by time and location.
The final conclusions of the analysis will answer:
We will be leveraging this information using various libraries in R and answer the question: Is Cincinnati a safe place as it was earlier?
The following packages are used :
library(tidyr) #for data cleaning
library(tidyverse) #to install the required packages
library(readr) #to import the csv file
library(dplyr) #to manipulate the data
library(ggplot2) #for plotting different graphs
library(knitr) #for R markdown
library(leaflet) #for visulaizing the data on map
DATA SOURCE
I am using the crime dataset for the city of Cincinnati to explore the situation of crimes in the city
The link to the data source is here
PURPOSE OF THE DATA
The purpose is to identify crime rate and analyze different factors that lead to the crime. The original data contains 314000 rows and 39 columns and for the years 1991 to 2018.
DATA CLEANING
The first step is to import the data from a CSV file. #Importing data
#Import CSV file
crime_data <- read.csv('crimedata.csv')
On initial analysis on the data I found that I could use the date reported column and split it into two columns. The two columns would represent the date of crime and the time of occurrence.
#Date of crime and time of crime
Date_Rpt <- substr(crime_data$DATE_FROM, 1, 22)
Date_Rpt <- (strptime(Date_Rpt, '%m/%d/%Y %I:%M:%S %p'))
Time_of_rpt <- substr(crime_data$DATE_FROM, 12, 22)
Time_of_rpt<- as.difftime(Time_of_rpt,'%I:%M:%S %p', units="hours")
Date_Rpt <- substr(crime_data$DATE_FROM, 1, 10)
crime_data$DATE_FROM <- Date_Rpt
crime_data <- cbind(crime_data, Time_of_rpt)
Now I converted the data into a table and filtered out the required columns for analysis
#Converting to a table
crime_data <- tbl_df(crime_data)
#selecting the required columns
cd_cleandata <- select(crime_data,
INSTANCEID,
DATE_FROM,
Time_of_rpt,
OFFENSE,
DAYOFWEEK,
SNA_NEIGHBORHOOD,
WEAPONS,
LONGITUDE_X,
LATITUDE_X,
VICTIM_AGE,
VICTIM_GENDER
)
Providing the summary of the cleaned dataset
summary(cd_cleandata)
## INSTANCEID DATE_FROM
## C2CA9A5A-5C3C-419F-BFE1-F6650C936EFF: 127 Length:355379
## 932627CA-318A-4B93-8FE3-57914736D1F6: 126 Class :character
## 456C6412-EC72-4F59-B1B4-5980631026AD: 84 Mode :character
## 3EFCADA0-F622-403C-9B64-50E23EC5F00F: 80
## 46504773-3D1B-48A2-96F6-74BD8EBE5F73: 64
## 1AF3B2F3-411A-4A01-A3A3-1E0F0DDAD73E: 60
## (Other) :354838
## Time_of_rpt OFFENSE
## Length:355379 THEFT :120315
## Class :difftime CRIMINAL DAMAGING/ENDANGERING: 43536
## Mode :numeric ASSAULT : 38225
## BURGLARY : 28817
## AGGRAVATED ROBBERY : 18516
## BREAKING AND ENTERING : 17148
## (Other) : 88822
## DAYOFWEEK SNA_NEIGHBORHOOD
## FRIDAY :52163 N/A : 44776
## SATURDAY :51460 WESTWOOD : 32163
## SUNDAY :50299 EAST PRICE HILL: 23103
## MONDAY :50214 WEST PRICE HILL: 18588
## WEDNESDAY:49728 AVONDALE : 17911
## TUESDAY :49450 CUF : 14680
## (Other) :52065 (Other) :204158
## WEAPONS
## 99 - NONE :141022
## 99--NONE : 70866
## 40 - PERSONAL WEAPONS (HANDS, FEET, TEETH, ETC.): 41219
## U - UNKNOWN : 24493
## 40--PERSONAL WEAPONS (HANDS, FEET, TEETH, ETC.) : 16932
## 12 - HANDGUN : 15743
## (Other) : 45104
## LONGITUDE_X LATITUDE_X VICTIM_AGE
## Min. :-84.82 Min. :39.05 18-25 :72619
## 1st Qu.:-84.57 1st Qu.:39.12 31-40 :58768
## Median :-84.52 Median :39.14 UNKNOWN:57537
## Mean :-84.52 Mean :39.14 41-50 :43554
## 3rd Qu.:-84.49 3rd Qu.:39.16 26-30 :41978
## Max. :-84.25 Max. :39.36 51-60 :35961
## NA's :45864 NA's :45864 (Other):44962
## VICTIM_GENDER
## : 54361
## F - FEMALE : 25
## FEMALE :164757
## M - MALE : 18
## MALE :135901
## NON-PERSON (BUSINESS: 51
## UNKNOWN : 266
Removing missing values and blank fields from the dataset.
#cleaning to remove na
cd_cleandata <- na.omit(cd_cleandata)
cd_cleandata <- with(cd_cleandata, cd_cleandata[!(INSTANCEID == "" | is.na(INSTANCEID)), ])
cd_cleandata <- with(cd_cleandata, cd_cleandata[!(DAYOFWEEK == "" | is.na(DAYOFWEEK)), ])
cd_cleandata <- with(cd_cleandata, cd_cleandata[!(SNA_NEIGHBORHOOD == "" | is.na(SNA_NEIGHBORHOOD)), ])
cd_cleandata <- with(cd_cleandata, cd_cleandata[!(OFFENSE == "" | is.na(OFFENSE)), ])
cd_cleandata <- with(cd_cleandata, cd_cleandata[!(VICTIM_AGE == "" | is.na(VICTIM_AGE)), ])
cd_cleandata <- with(cd_cleandata, cd_cleandata[!(VICTIM_AGE == "UNKNOWN" | is.na(VICTIM_AGE)), ])
cd_cleandata <- with(cd_cleandata, cd_cleandata[!(VICTIM_GENDER == "" | is.na(VICTIM_GENDER)), ])
Combining the similar kinds of crimes to on and combinig the weapons used to find the unique weapons used for the crime. Also, grouped juvenile and under 18 years of age into one group called Under 18.
#cleaning offense column
cd_cleandata$OFFENSE <- gsub(".*HARRASS.*", "HARRASS", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*MENACING.*", "MENACING", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*FORGERY.*", "FORGERY", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*ROBBERY.*", "ROBBERY", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*MURDER.*", "MURDER", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*ABDUCTION.*", "ABDUCTION", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*SEX.*", "SEXUAL CRIME", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*INTIMID.*", "INTIMIDATION", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*KIDNAPPING.*", "KIDNAPPING", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*PUBLIC INDECENCY.*", "PUBLIC INDECENCY", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*ARSON.*", "ARSON", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*UNAUTHORIZED USE.*", "UNAUTHORIZED USE", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*PATIENT ABUSE.*", "PATIENT ABUSE", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*TELEPHONE HARRASSMENT.*", "TELEPHONE HARRASSMENT", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*VIOL.*", "VIOLATE PROTECTION ORDER", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*CREDIT CARD.*", "CREDIT CARD FRAUD", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*VANDALISM.*", "VANDALISM", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*THEFT.*", "THEFT", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*CRIMINAL.*", "CRIMINAL MISCHIEF", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*DISORDERLY CONDUCT.*", "DISORDERLY CONDUCT", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*ENDANGERING CHILDREN.*", "ENDANGERING CHILDREN", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*ASSAULT.*", "ASSAULT", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*BURGLARY.*", "BURGLARY", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$OFFENSE <- gsub(".*RAPE.*", "RAPE", (cd_cleandata$OFFENSE), perl = FALSE)
cd_cleandata$VICTIM_AGE <- gsub(".*JUVENILE.*", "UNDER 18", (cd_cleandata$VICTIM_AGE), perl = FALSE)
#Cleaning weapons column
cd_cleandata$WEAPONS <- gsub(".*11.*", "FIREARM", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*12.*", "HANDGUN", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*13.*", "RIFLE", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*14.*", "SHOTGUN", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*15.*", "FIREARM", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*16.*", "FIREARM", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*17.*", "FIREARM", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*18.*", "BB AND PELLET GUNS", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*20.*", "KNIFE/CUTTING INSTRUMENT", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*30.*", "BLUNT OBJECT", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*35.*", "MOTOR VEHICLE", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*40.*", "PERSONAL WEAPON", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*60.*", "EXPLOSIVES", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*70.*", "DRUGS", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*80.*", "OTHER WEAPONS", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*U.*", "UNKNOWN", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*65.*", "FIRE/INCENDIARY DEVICE", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*50.*", "POISON", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*99.*", "NONE", (cd_cleandata$WEAPONS), perl = FALSE)
cd_cleandata$WEAPONS <- gsub(".*85.*", "ASPHYXIATION", (cd_cleandata$WEAPONS), perl = FALSE)
The data set I used to analyze the problem is as follows:
kable(cd_cleandata[1:10,], caption = "Preview - Clean Data")
| INSTANCEID | DATE_FROM | Time_of_rpt | OFFENSE | DAYOFWEEK | SNA_NEIGHBORHOOD | WEAPONS | LONGITUDE_X | LATITUDE_X | VICTIM_AGE | VICTIM_GENDER |
|---|---|---|---|---|---|---|---|---|---|---|
| D6D7D173-E416-4571-AF34-A767ACF810D9 | 03/16/2015 | 15.033333 hours | MENACING | MONDAY | WALNUT HILLS | NONE | -84.49080 | 39.11996 | 26-30 | FEMALE |
| 2A569674-9603-4ED9-82BB-783B843B94C9 | 06/23/2011 | 10.333333 hours | ASSAULT | THURSDAY | AVONDALE | PERSONAL WEAPON | -84.49155 | 39.14167 | 41-50 | MALE |
| 5B95E5DC-E35D-4F86-B77C-02F25C950329 | 06/25/2012 | 15.583333 hours | ASSAULT | MONDAY | COLLEGE HILL | FIREARM | -84.54596 | 39.20102 | 18-25 | MALE |
| B908DB49-C51D-442B-ADFD-B6E64696071B | 03/13/2018 | 16.750000 hours | ASSAULT | TUESDAY | WEST END | PERSONAL WEAPON | -84.52521 | 39.11082 | 41-50 | FEMALE |
| 66F569F5-4C8E-4848-BD42-D0F8DE4D42B0 | 05/19/2013 | 14.500000 hours | BURGLARY | SUNDAY | MADISONVILLE | FIREARM | -84.39035 | 39.15479 | 18-25 | FEMALE |
| 3B4450F2-F94B-4BFB-8B0A-3FF04DBD73C4 | 07/24/2015 | 0.000000 hours | ENDANGERING CHILDREN | FRIDAY | EVANSTON | UNKNOWN | -84.47835 | 39.13468 | UNDER 18 | MALE |
| B095D1B2-9089-4CA4-85B2-A2657FB82793 | 09/05/2012 | 1.366667 hours | ROBBERY | WEDNESDAY | OVER-THE-RHINE | NONE | -84.51276 | 39.11108 | 31-40 | MALE |
| FD5706CD-7E2F-4CED-8C73-2E3AB142EF8C | 09/03/2015 | 1.166667 hours | ASSAULT | THURSDAY | OVER-THE-RHINE | PERSONAL WEAPON | -84.51640 | 39.11505 | 41-50 | FEMALE |
| 417FB2EC-1405-4F31-802B-2D3D7C9A47C3 | 08/31/2014 | 18.000000 hours | THEFT | SUNDAY | MT. AUBURN | NONE | -84.51008 | 39.11733 | 18-25 | MALE |
| A6B8356E-5A44-425A-8076-80BA963D09AC | 09/25/2010 | 14.500000 hours | ASSAULT | SATURDAY | WALNUT HILLS | PERSONAL WEAPON | -84.49454 | 39.12494 | 31-40 | MALE |
The exploratory data analysis is used to find the following answers:
#total number of crimes
length(unique(cd_cleandata$INSTANCEID))
## [1] 201198
#Violence of offense
cd_cleandata %>% count(OFFENSE, sort = TRUE) %>% top_n(5)
## Selecting by n
## # A tibble: 5 x 2
## OFFENSE n
## <chr> <int>
## 1 THEFT 76142
## 2 ASSAULT 42003
## 3 CRIMINAL MISCHIEF 32744
## 4 BURGLARY 30155
## 5 ROBBERY 18469
#Weapons used
cd_cleandata %>% count(WEAPONS, sort = TRUE) %>% filter(WEAPONS != "NONE" & WEAPONS!= "UNKNOWN") %>% top_n(5)
## Selecting by n
## # A tibble: 5 x 2
## WEAPONS n
## <chr> <int>
## 1 PERSONAL WEAPON 46461
## 2 FIREARM 7918
## 3 OTHER WEAPONS 2474
## 4 PERSONAL WEAPONS (HANDS, FEET, TEETH, ETC.) 1094
## 5 MOTOR VEHICLE 829
#unsafe days of the week
cd_cleandata %>% count(DAYOFWEEK, sort = TRUE) %>% top_n(5)
## Selecting by n
## # A tibble: 5 x 2
## DAYOFWEEK n
## <fct> <int>
## 1 SATURDAY 37445
## 2 FRIDAY 37005
## 3 SUNDAY 37004
## 4 MONDAY 36286
## 5 WEDNESDAY 35881
#unsafe neighborhoods
cd_cleandata %>%
count(SNA_NEIGHBORHOOD, sort = TRUE) %>% top_n(5)
## Selecting by n
## # A tibble: 5 x 2
## SNA_NEIGHBORHOOD n
## <fct> <int>
## 1 WESTWOOD 23857
## 2 EAST PRICE HILL 19441
## 3 WEST PRICE HILL 16044
## 4 AVONDALE 15413
## 5 CUF 12816
#Which gender is targeted more
cd_cleandata %>% count(VICTIM_GENDER, sort = TRUE) %>% top_n(1)
## Selecting by n
## # A tibble: 1 x 2
## VICTIM_GENDER n
## <fct> <int>
## 1 FEMALE 141323
#age group
cd_cleandata %>% count(VICTIM_AGE, sort = TRUE) %>% top_n(5)
## Selecting by n
## # A tibble: 5 x 2
## VICTIM_AGE n
## <chr> <int>
## 1 18-25 62000
## 2 31-40 50983
## 3 41-50 37963
## 4 26-30 36088
## 5 51-60 31584
Creating a bar plot to see the distribution of the crime over the years. The plot shows that the crime rates were high in the years 2010 to 2013 and after that it decreased in the further years.
#adding year column to the dataset
year = (substr(cd_cleandata$DATE_FROM,7,11))
cd_cleandata <- cbind(cd_cleandata, year)
#plotting barplot
barplot(table(substr(cd_cleandata$DATE_FROM, 7, 11)),
main = "Crimes committed by Year",
xlab = "Year",
ylab = "Total Crimes",
col = "cyan")
Filtering the offenses by type.
#5 offense types
cd_cleandata %>%
filter(cd_cleandata$OFFENSE == "THEFT" |
cd_cleandata$OFFENSE == "MENACING"|
cd_cleandata$OFFENSE == "ASSAULT"|
cd_cleandata$OFFENSE == "BURGLARY"|
cd_cleandata$OFFENSE == "CRIMINAL MISCHIEF") %>%
ggplot(aes(factor(OFFENSE), fill = OFFENSE)) +
geom_bar(stat="count", position = "dodge")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
labs( y = "Count of crimes",
x = "Offense type",
title = "Top 5 offenses in Cincinnati")+
guides(fill=guide_legend(title="Crime in Cincinnati"))
Here we can see that menacing has the lowest count and theft has the highest number of occurrence.
Using the leaflet package I am using the map functionality to find out the count of crimes per location.
#changing neighborhood to factor
cd_cleandata$SNA_NEIGHBORHOOD <- as.factor(cd_cleandata$SNA_NEIGHBORHOOD)
#creating function for visualization by location
mapforlocation <- function(location){
subdata <- filter(cd_cleandata, location == cd_cleandata$SNA_NEIGHBORHOOD)
occurence_map <- leaflet() %>%
addTiles()%>%
addMarkers(lng = subdata$LONGITUDE_X,
lat = subdata$LATITUDE_X,
clusterOptions = markerClusterOptions()
)
}
map <- mapforlocation(cd_cleandata$SNA_NEIGHBORHOOD)
map
Here we clearly see that the neighborhood Over the Rhine has the highest number of crimes.
Next I tried to find the trend to see the top 5 SNA neighborhoods with the highest number of crimes.
# finding most dangerous crimes by SNA neighborhood
cd_cleandata %>%
filter(cd_cleandata$SNA_NEIGHBORHOOD == "WESTWOOD" |
cd_cleandata$SNA_NEIGHBORHOOD == "WEST PRICE HILL"|
cd_cleandata$SNA_NEIGHBORHOOD == "EAST PRICE HILL"|
cd_cleandata$SNA_NEIGHBORHOOD == "OVER-THE-RHINE"|
cd_cleandata$SNA_NEIGHBORHOOD == "AVONDALE") %>%
ggplot(aes(factor(SNA_NEIGHBORHOOD), fill = SNA_NEIGHBORHOOD)) +
geom_bar(stat="count", position = "dodge",color ="blue")+
theme(axis.text.x = element_text(angle = 30, hjust = 1))+
labs( y = "Total number of crimes",
x = "Neighborhood",
title = "Unsafe Neighborhoods")
In the below plot we can see that the age group 31-40 is the most unsafe age group and the neighbohood has the highest number of crimes in that ge group.
#graph for age group and neighborhood
cd_cleandata %>% filter(cd_cleandata$SNA_NEIGHBORHOOD == "WESTWOOD"|
cd_cleandata$SNA_NEIGHBORHOOD == "WEST PRICE HILL"|
cd_cleandata$SNA_NEIGHBORHOOD == "EAST PRICE HILL"|
cd_cleandata$SNA_NEIGHBORHOOD == "OVER-THE-RHINE"|
cd_cleandata$SNA_NEIGHBORHOOD == "AVONDALE") %>%
group_by(VICTIM_AGE, SNA_NEIGHBORHOOD)%>%
tally()%>%
ggplot(aes(x = VICTIM_AGE, y = n, group = SNA_NEIGHBORHOOD, color = SNA_NEIGHBORHOOD)) +
geom_line()+
labs(y = "Number of Crimes",
x = "Age Group",
title = "Most Dangerous locations",
color = "Neighborhoods")
I worked to analyze the Cincinnati crime dataset to find trends in the data and determine the parameters which lead to the criminal activity in the city.
I addressed this problem by using the data columns like neighborhood and the number of crimes in that area. I also use the offense types to find the fators which lead to those crimes.
The methodology I used was that I created a barplot, map, histogram and trend lines to compare the various factors and thereby gaining the insights.
This analysis helped to find the most dangerous locations in the city which can help the police department to increase the security measures in those areas. We also found out the criminal activity against the different age groups which can help the police to focus people in that age group which seem to be the working class age. We also see that theft is the most comon type of crime and thereby security measure against thefts can be focused more. We also see that the crimes has been reducing over the years which is a good indicator for the police department.
Through my analysis the police department can enforce more security in the nieghborhood Avondale. The city can conduct more awareness for the people in the age group 31-40 about how to be safe and avoid such situations. To reduce the number of thefts various programs can be conducted to make common people aware about how to take care of their belongings.
This project can be further improved by working on the less utilized columns like gender of the victim, weapons used in the crime, bias of the criminal and times of the crime to get more deeper insights.