Safety and security are primary concerns for every citizen, and a city's crime rate is one of the most significant indicators of the quality of life it offers.
Strong and effective law enforcement is the biggest deterrent to crime. By analysing Cincinnati's crime data we can uncover new insights into the city's criminal history and help law enforcement agencies deploy their resources where they will prevent the most incidents. The analysis can also help people searching for a locality to move into or a property to purchase, and surveillance equipment manufacturers could use it to identify areas with frequent crime.
For our solution to prevent crime, our focus will be on the following questions:
We will use dplyr's select, filter, aggregation and grouping functions to answer the questions above. We could also use machine learning models to predict whether a person is likely to be a homicide suspect given attributes such as age, location and gender (a rough sketch of one such model appears at the end of this report).
With the help of this analysis, law enforcement agencies can decide when and where to increase patrolling. If we can build an accurate predictive model to identify homicide suspects, we can reduce the effort law enforcement spends identifying them.
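As a quick orientation, the style of dplyr pipeline used throughout this report looks like the sketch below. It is shown only for illustration and assumes the cleaned dataset crime_data_mutated (with lower-case column names) that we build later in this report.
# Sketch only: counts burglaries per neighbourhood using select, filter, grouping and aggregation
crime_data_mutated %>%
  select(offense, cpd_neighborhood) %>%   # keep only the columns we need
  filter(offense == "BURGLARY") %>%       # restrict to one offence type
  group_by(cpd_neighborhood) %>%          # group by neighbourhood
  summarise(incidents = n()) %>%          # aggregate: count incidents per group
  arrange(desc(incidents))                # most burglary-prone neighbourhoods first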
We need to load the following R packages to assist with importing, cleaning and exploring our dataset.
#1 Packages required
library(readr)     # to import the csv file
library(dplyr)     # to use functions like glimpse
library(lubridate) # for date formatting
library(DT)        # to use the datatable function
library(tidyverse) # for transformations
library(ggplot2)   # for EDA
Before we begin cleaning the data, we need to import it as a data frame in R.

### Loading Data

The data can be found at the following link. Information about the data can be found here.
#2 Data Preparation
#Loading the data
crime_data_initial <-
read_csv("city_of_cincinnati_police_data_initiative_crime_incidents.csv",
col_names = TRUE)
#Check dimensions
dim(crime_data_initial)
## [1] 355379 40
#Check column names are correct
colnames(crime_data_initial)
## [1] "INSTANCEID" "INCIDENT_NO"
## [3] "DATE_REPORTED" "DATE_FROM"
## [5] "DATE_TO" "CLSD"
## [7] "UCR" "DST"
## [9] "BEAT" "OFFENSE"
## [11] "LOCATION" "THEFT_CODE"
## [13] "FLOOR" "SIDE"
## [15] "OPENING" "HATE_BIAS"
## [17] "DAYOFWEEK" "RPT_AREA"
## [19] "CPD_NEIGHBORHOOD" "SNA_NEIGHBORHOOD"
## [21] "WEAPONS" "DATE_OF_CLEARANCE"
## [23] "HOUR_FROM" "HOUR_TO"
## [25] "ADDRESS_X" "LONGITUDE_X"
## [27] "LATITUDE_X" "VICTIM_AGE"
## [29] "VICTIM_RACE" "VICTIM_ETHNICITY"
## [31] "VICTIM_GENDER" "SUSPECT_AGE"
## [33] "SUSPECT_RACE" "SUSPECT_ETHNICITY"
## [35] "SUSPECT_GENDER" "TOTALNUMBERVICTIMS"
## [37] "TOTALSUSPECTS" "UCR_GROUP"
## [39] "COMMUNITY_COUNCIL_NEIGHBORHOOD" "ZIP"
# Check the structure using dplyr function glimpse
glimpse(crime_data_initial)
## Observations: 355,379
## Variables: 40
## $ INSTANCEID <chr> "D6D7D173-E416-4571-AF34-A767AC...
## $ INCIDENT_NO <chr> "159006170", "41103934", "21104...
## $ DATE_REPORTED <chr> "03/16/2015 05:19:00 PM +0000",...
## $ DATE_FROM <chr> "03/16/2015 03:02:00 PM +0000",...
## $ DATE_TO <chr> "03/16/2015 03:05:00 PM +0000",...
## $ CLSD <chr> "J--CLOSED", "J--CLOSED", "D--V...
## $ UCR <int> 802, 810, 1493, 401, 810, 550, ...
## $ DST <chr> "4", "4", "2", "5", "1", "2", "...
## $ BEAT <int> 2, 3, 3, 4, 3, 3, 1, 2, 2, 1, 2...
## $ OFFENSE <chr> "AGGRAVATED MENACING", "ASSAULT...
## $ LOCATION <chr> "48-PARKING LOT", "02-MULTI FAM...
## $ THEFT_CODE <chr> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ FLOOR <chr> NA, NA, NA, NA, NA, "2 - FIRST ...
## $ SIDE <chr> NA, NA, NA, NA, NA, "1 - FRONT"...
## $ OPENING <chr> NA, NA, NA, NA, NA, "1 - DOOR",...
## $ HATE_BIAS <chr> "N--NO BIAS/NOT APPLICABLE", "N...
## $ DAYOFWEEK <chr> "MONDAY", "THURSDAY", "SATURDAY...
## $ RPT_AREA <int> 45, 365, 138, 439, 203, 140, 68...
## $ CPD_NEIGHBORHOOD <chr> "WALNUT HILLS", "AVONDALE", "MA...
## $ SNA_NEIGHBORHOOD <chr> "WALNUT HILLS", "AVONDALE", "MA...
## $ WEAPONS <chr> "99 - NONE", "40--PERSONAL WEAP...
## $ DATE_OF_CLEARANCE <chr> "03/28/2015 12:00:00 AM +0000",...
## $ HOUR_FROM <chr> "152", "1020", "240", "1535", "...
## $ HOUR_TO <chr> "155", "1030", "244", "1540", "...
## $ ADDRESS_X <chr> "21XX FULTON AV", "34XX READING...
## $ LONGITUDE_X <dbl> -84.49080, -84.49155, -84.39439...
## $ LATITUDE_X <dbl> 39.11996, 39.14167, 39.15156, 3...
## $ VICTIM_AGE <chr> "26-30", "41-50", "UNKNOWN", "1...
## $ VICTIM_RACE <chr> "BLACK", "BLACK", NA, "BLACK", ...
## $ VICTIM_ETHNICITY <chr> "NOT OF HISPANIC ORIG", "NOT OF...
## $ VICTIM_GENDER <chr> "FEMALE", "MALE", NA, "MALE", "...
## $ SUSPECT_AGE <chr> "51-60", "UNKNOWN", "18-25", "1...
## $ SUSPECT_RACE <chr> "BLACK", "BLACK", "BLACK", "BLA...
## $ SUSPECT_ETHNICITY <chr> "NOT OF HISPANIC ORIG", "NOT OF...
## $ SUSPECT_GENDER <chr> "MALE", "MALE", "FEMALE", "MALE...
## $ TOTALNUMBERVICTIMS <int> 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2...
## $ TOTALSUSPECTS <int> 1, 1, 2, 4, NA, 2, 1, 2, NA, NA...
## $ UCR_GROUP <chr> "PART 2 MINOR", "PART 2 MINOR",...
## $ COMMUNITY_COUNCIL_NEIGHBORHOOD <chr> "WALNUT HILLS", "AVONDALE", "MA...
## $ ZIP <int> 45206, 45219, 45227, 45224, 452...
#Check summary to identify NA's , outliers etc
summary(crime_data_initial)
## INSTANCEID INCIDENT_NO DATE_REPORTED
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## DATE_FROM DATE_TO CLSD UCR
## Length:355379 Length:355379 Length:355379 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 552.0
## Mode :character Mode :character Mode :character Median : 600.0
## Mean : 803.9
## 3rd Qu.: 810.0
## Max. :2761.0
## NA's :71
## DST BEAT OFFENSE LOCATION
## Length:355379 Min. :1.000 Length:355379 Length:355379
## Class :character 1st Qu.:2.000 Class :character Class :character
## Mode :character Median :3.000 Mode :character Mode :character
## Mean :2.698
## 3rd Qu.:3.000
## Max. :6.000
## NA's :1340
## THEFT_CODE FLOOR SIDE
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## OPENING HATE_BIAS DAYOFWEEK
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## RPT_AREA CPD_NEIGHBORHOOD SNA_NEIGHBORHOOD
## Min. : 0.0 Length:355379 Length:355379
## 1st Qu.: 147.0 Class :character Class :character
## Median : 277.0 Mode :character Mode :character
## Mean : 266.6
## 3rd Qu.: 373.0
## Max. :424379.0
## NA's :2471
## WEAPONS DATE_OF_CLEARANCE HOUR_FROM
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## HOUR_TO ADDRESS_X LONGITUDE_X LATITUDE_X
## Length:355379 Length:355379 Min. :-84.82 Min. :39.05
## Class :character Class :character 1st Qu.:-84.57 1st Qu.:39.12
## Mode :character Mode :character Median :-84.52 Median :39.14
## Mean :-84.52 Mean :39.14
## 3rd Qu.:-84.49 3rd Qu.:39.16
## Max. :-84.25 Max. :39.36
## NA's :45864 NA's :45864
## VICTIM_AGE VICTIM_RACE VICTIM_ETHNICITY
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## VICTIM_GENDER SUSPECT_AGE SUSPECT_RACE
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## SUSPECT_ETHNICITY SUSPECT_GENDER TOTALNUMBERVICTIMS TOTALSUSPECTS
## Length:355379 Length:355379 Min. : 1.000 Min. : 1.00
## Class :character Class :character 1st Qu.: 1.000 1st Qu.: 1.00
## Mode :character Mode :character Median : 1.000 Median : 1.00
## Mean : 1.433 Mean : 1.64
## 3rd Qu.: 1.000 3rd Qu.: 2.00
## Max. :127.000 Max. :16.00
## NA's :146 NA's :194149
## UCR_GROUP COMMUNITY_COUNCIL_NEIGHBORHOOD ZIP
## Length:355379 Length:355379 Min. : 33
## Class :character Class :character 1st Qu.: 45206
## Mode :character Mode :character Median : 45216
## Mean : 46768
## 3rd Qu.: 45227
## Max. :531110230
## NA's :184
From the glimpse and summary output above, we can see which columns need cleaning. We make the following changes.
Replace all NAs with 0 for the total victims and total suspects columns, as we do not want to omit crime records in which no suspects were found or no victims were identified. Also replace latitude and longitude with 0 where they are unknown, since we will only be using neighbourhood names.
crime_data_initial$TOTALNUMBERVICTIMS[is.na(crime_data_initial$TOTALNUMBERVICTIMS)] <- 0
crime_data_initial$TOTALSUSPECTS[is.na(crime_data_initial$TOTALSUSPECTS)] <- 0
crime_data_initial$LATITUDE_X[is.na(crime_data_initial$LATITUDE_X)] <- 0
crime_data_initial$LONGITUDE_X[is.na(crime_data_initial$LONGITUDE_X)] <- 0
Column - theft code. Replace all NAs in the theft code column with NOT APPLICABLE, as we do not want records that are not thefts to be dropped when we later use na.omit().
unique(crime_data_initial$THEFT_CODE )
## [1] NA
## [2] "23F-THEFT FROM MOTOR VEHICLE"
## [3] "23H-ALL OTHER LARCENY"
## [4] "23D-THEFT FROM BUILDING"
## [5] "24O-MOTOR VEHICLE THEFT"
## [6] "23C-SHOPLIFTING"
## [7] "23G-THEFT OF MOTOR VEHICLE PARTS OR ACCESSORIES"
## [8] "24I-THEFT OF LICENSE PLATE"
## [9] "23A-POCKET-PICKING"
## [10] "23B-PURSE-SNATCHING"
## [11] "23E-THEFT FROM COIN-OPERATED MACHINE OR DEVICE"
crime_data_initial$THEFT_CODE[is.na(crime_data_initial$THEFT_CODE)] <- "NOT APPLICABLE"
Columns - floor, side and opening. Floor, side and opening are not applicable for all types of crimes, so we clean each of them in the same way.
#1) FLOOR
#check unique values floor
unique(crime_data_initial$FLOOR)
## [1] NA "2 - FIRST FLOOR" "4 - OTHER"
## [4] "2-FIRST FLOOR" "3 - SECOND FLOOR" "1 - BASEMENT"
## [7] "5 - UNKNOWN" "3-SECOND FLOOR" "1-BASEMENT"
## [10] "4-OTHER" "5-UNKNOWN"
# the same category appears with and without spaces; remove all spaces so the variants collapse into one value
crime_data_initial$FLOOR <- gsub(" ", "", crime_data_initial$FLOOR, fixed = TRUE)
#change NA to Not applicable
crime_data_initial$FLOOR[is.na(crime_data_initial$FLOOR)] <- "NOT APPLICABLE"
# 2) Side
#check unique values side
unique(crime_data_initial$SIDE)
## [1] NA "1 - FRONT" "3 - REAR" "1-FRONT" "2 - SIDE"
## [6] "3-REAR" "2-SIDE" "5-OTHER" "6-UNKNOWN" "4 - ROOF"
## [11] "6 - UNKNOWN" "5 - OTHER" "4-ROOF" "?"
# the same category appears with and without spaces; remove all spaces so the variants collapse into one value
crime_data_initial$SIDE <- gsub(" ", "", crime_data_initial$SIDE, fixed = TRUE)
# Replace ? with "6-UNKNOWN"
crime_data_initial$SIDE[crime_data_initial$SIDE == "?"] <- "6-UNKNOWN"
crime_data_initial$SIDE[is.na(crime_data_initial$SIDE)] <- "NOT APPLICABLE"
# 3) Opening
#check unique values opening
unique(crime_data_initial$OPENING)
## [1] NA "1 - DOOR" "3 - GARAGE" "1-DOOR"
## [5] "2 - WINDOW" "2-WINDOW" "5-OTHER" "6 - UNKNOWN"
## [9] "5 - OTHER" "6-UNKNOWN" "4 - SKYLIGHT" "3-GARAGE"
## [13] "4-SKYLIGHT" "?"
# the same category appears with and without spaces; remove all spaces so the variants collapse into one value
crime_data_initial$OPENING <- gsub(" ", "", crime_data_initial$OPENING, fixed = TRUE)
# Replace ? with "6-UNKNOWN"
crime_data_initial$OPENING[crime_data_initial$OPENING == "?"] <- "6-UNKNOWN"
crime_data_initial$OPENING[is.na(crime_data_initial$OPENING)] <- "NOT APPLICABLE"
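The three code blocks above repeat the same steps for each column. For reference, the same cleaning could be written once with dplyr::across (available in dplyr 1.0 and later); the sketch below is an equivalent alternative to the three blocks above, not an additional step to run on top of them. Applying the "?" replacement to FLOOR as well is harmless, because FLOOR contains no "?" values.
# Sketch of an alternative: clean FLOOR, SIDE and OPENING in one mutate()
crime_data_initial <- crime_data_initial %>%
  mutate(across(c(FLOOR, SIDE, OPENING), ~ {
    x <- gsub(" ", "", .x, fixed = TRUE)    # collapse spacing variants of the same category
    x[x == "?"] <- "6-UNKNOWN"              # treat "?" as unknown
    tidyr::replace_na(x, "NOT APPLICABLE")  # NA means the field does not apply to this crime
  }))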
Columns of gender. Check the unique values of the suspect and victim gender columns.
unique(crime_data_initial$SUSPECT_GENDER)
## [1] "MALE" "FEMALE" NA
## [4] "UNKNOWN" "M - MALE" "NON-PERSON (BUSINESS"
## [7] "F - FEMALE"
unique(crime_data_initial$VICTIM_GENDER)
## [1] "FEMALE" "MALE" NA
## [4] "UNKNOWN" "M - MALE" "NON-PERSON (BUSINESS"
## [7] "F - FEMALE"
As seen above, we have several values to clean. First we replace NA with "UNKNOWN". Then we replace "M - MALE" with "MALE" and "F - FEMALE" with "FEMALE", and finally "NON-PERSON (BUSINESS" with "NON-PERSON/BUSINESS".
crime_data_initial$SUSPECT_GENDER[is.na(crime_data_initial$SUSPECT_GENDER)] <-
"UNKNOWN"
crime_data_initial$VICTIM_GENDER[is.na(crime_data_initial$VICTIM_GENDER)] <-
"UNKNOWN"
#replace M - MALE with MALE & F - FEMALE with FEMALE
crime_data_initial$SUSPECT_GENDER[crime_data_initial$SUSPECT_GENDER == "M - MALE"] <-
"MALE"
crime_data_initial$VICTIM_GENDER[crime_data_initial$VICTIM_GENDER == "M - MALE"] <-
"MALE"
crime_data_initial$SUSPECT_GENDER[crime_data_initial$SUSPECT_GENDER == "F - FEMALE"] <-
"FEMALE"
crime_data_initial$VICTIM_GENDER[crime_data_initial$VICTIM_GENDER == "F - FEMALE"] <-
"FEMALE"
#replace"NON-PERSON (BUSINESS" with "NON-PERSON/BUSINESS"
crime_data_initial$SUSPECT_GENDER[crime_data_initial$SUSPECT_GENDER == "NON-PERSON (BUSINESS"] <-
"NON-PERSON/BUSINESS"
crime_data_initial$VICTIM_GENDER[crime_data_initial$VICTIM_GENDER == "NON-PERSON (BUSINESS"] <-
"NON-PERSON/BUSINESS"
Columns of race. Check the unique values of the suspect and victim race columns.
#Columns of Race
#check unique values Race
unique(crime_data_initial$SUSPECT_RACE)
## [1] "BLACK" NA "WHITE"
## [4] "UNKNOWN" "ASIAN/PACIFIC ISLAND" "HISPANIC"
## [7] "ASIAN OR PACIFIC ISL" "AMERICAN INDIAN/ALAS" "AMERICAN IINDIAN/ALA"
unique(crime_data_initial$VICTIM_RACE)
## [1] "BLACK" NA "WHITE"
## [4] "UNKNOWN" "ASIAN/PACIFIC ISLAND" "ASIAN OR PACIFIC ISL"
## [7] "AMERICAN INDIAN/ALAS" "AMERICAN IINDIAN/ALA" "HISPANIC"
As seen above, we have some values to clean. First we replace NA with "UNKNOWN".
#replace na with unknown
crime_data_initial$SUSPECT_RACE[is.na(crime_data_initial$SUSPECT_RACE)] <-
"UNKNOWN"
crime_data_initial$VICTIM_RACE[is.na(crime_data_initial$VICTIM_RACE)] <-
"UNKNOWN"
Then we standardise the variant spellings and fix the typos.
crime_data_initial$SUSPECT_RACE[crime_data_initial$SUSPECT_RACE == "ASIAN OR PACIFIC ISL"] <-
"ASIAN/PACIFIC ISLAND"
crime_data_initial$VICTIM_RACE[crime_data_initial$VICTIM_RACE == "ASIAN OR PACIFIC ISL"] <-
"ASIAN/PACIFIC ISLAND"
crime_data_initial$SUSPECT_RACE[crime_data_initial$SUSPECT_RACE == "AMERICAN IINDIAN/ALA"] <-
"AMERICAN INDIAN/ALAS"
crime_data_initial$VICTIM_RACE[crime_data_initial$VICTIM_RACE == "AMERICAN IINDIAN/ALA"] <-
"AMERICAN INDIAN/ALAS"
Columns of ethnicity. Check the unique values of the suspect and victim ethnicity columns.
unique(crime_data_initial$SUSPECT_ETHNICITY)
## [1] "NOT OF HISPANIC ORIG" NA "UNKNOWN"
## [4] "AFRICAN AMERICAN" "WHITE AMERICAN" "HISPANIC ORIGIN"
## [7] "BLACK NOT OF HISPANI" "WHITE HISPANIC ORIGI" "WHITE NOT OF HISPANI"
## [10] "AMERICAN INDIAN" "BLACK HISPANIC ORIGI" "AFRICAN"
unique(crime_data_initial$VICTIM_ETHNICITY)
## [1] "NOT OF HISPANIC ORIG" NA "UNKNOWN"
## [4] "HISPANIC ORIGIN" "AFRICAN AMERICAN" "WHITE AMERICAN"
## [7] "BLACK NOT OF HISPANI" "WHITE NOT OF HISPANI" "ASIAN INDIAN"
## [10] "EUROPEAN" "INDONESIAN" "AMERICAN INDIAN"
## [13] "AFRICAN" "CHINESE" "BLACK HISPANIC ORIGI"
## [16] "WHITE HISPANIC ORIGI" "PAKASTANI" "ALASKA NATIVE"
As seen above, we only have NAs to clean here; we replace them with "UNKNOWN".
crime_data_initial$SUSPECT_ETHNICITY[is.na(crime_data_initial$SUSPECT_ETHNICITY)]<-
"UNKNOWN"
crime_data_initial$VICTIM_ETHNICITY[is.na(crime_data_initial$VICTIM_ETHNICITY)]<-
"UNKNOWN"
Drop the first column, INSTANCEID, as it is only a row identifier, and drop DAYOFWEEK as it is not required for the analysis.
crime_data_initial <- crime_data_initial[,-1]
crime_data_initial<- subset(crime_data_initial, select= -DAYOFWEEK)
Mark ZIP codes with more than five digits as NA so that these unclean records are dropped later.
crime_data_initial$ZIP[crime_data_initial$ZIP > 99999 ] <- NA
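The summary above also showed a minimum ZIP of 33, i.e. some ZIP codes have fewer than five digits. A quick check of how many such values exist (a sketch, left here as a check only; the cleaning above does not alter them):
# How many ZIP codes have fewer than five digits?
sum(crime_data_initial$ZIP < 10000, na.rm = TRUE)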
Delete all records that contain NAs.
crime_data_no_na <- na.omit(crime_data_initial)
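It is worth checking how many records na.omit() removed; a quick sanity check (a sketch):
# Number of rows dropped because they contained at least one NA
nrow(crime_data_initial) - nrow(crime_data_no_na)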
Convert the data types that the glimpse and summary output showed to be incorrect. Then delete any records where the conversion produced NAs by coercion; this happens because there were 326 alphanumeric incident numbers.
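Before converting, we can verify how many incident numbers fail to parse as numbers (a sketch of an optional check; suppressWarnings() only hides the coercion warning):
# Count incident numbers that cannot be converted to numeric
sum(is.na(suppressWarnings(as.numeric(crime_data_no_na$INCIDENT_NO))))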
crime_data_no_na$INCIDENT_NO <- as.numeric(crime_data_no_na$INCIDENT_NO)
crime_data_no_na <- na.omit(crime_data_no_na)
Format all date and time columns
crime_data_no_na$DATE_REPORTED <- mdy_hms(crime_data_no_na$DATE_REPORTED)
crime_data_no_na$DATE_FROM <- mdy_hms(crime_data_no_na$DATE_FROM)
crime_data_no_na$DATE_TO <- mdy_hms(crime_data_no_na$DATE_TO)
crime_data_no_na$DATE_OF_CLEARANCE <-
mdy_hms(crime_data_no_na$DATE_OF_CLEARANCE)
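After parsing, it is worth confirming that no dates failed to parse, since mdy_hms() returns NA for values it cannot handle (a sketch of a quick check):
# Any NAs introduced by date parsing?
sum(is.na(crime_data_no_na$DATE_REPORTED)) + sum(is.na(crime_data_no_na$DATE_FROM)) +
  sum(is.na(crime_data_no_na$DATE_TO)) + sum(is.na(crime_data_no_na$DATE_OF_CLEARANCE))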
Rename columns which are not named correctly.
#Rename CLSD to CLEARANCES (by name rather than position)
colnames(crime_data_no_na)[colnames(crime_data_no_na) == "CLSD"] <- "CLEARANCES"
#Rename VICTIM_AGE to VICTIM_AGE_RANGE
colnames(crime_data_no_na)[colnames(crime_data_no_na) == "VICTIM_AGE"] <-
"VICTIM_AGE_RANGE"
#Rename SUSPECT_AGE TO SUSPECT_AGE_RANGE
colnames(crime_data_no_na)[colnames(crime_data_no_na) == "SUSPECT_AGE"] <- "SUSPECT_AGE_RANGE"
Split each date-time column into a separate date column and time column.
crime_data_mutated <-
  mutate(crime_data_no_na,
         REPORTED_DATE  = format(DATE_REPORTED, "%m/%d/%Y"),
         REPORTED_TIME  = format(DATE_REPORTED, "%H:%M:%S"),
         FROM_DATE      = format(DATE_FROM, "%m/%d/%Y"),
         FROM_TIME      = format(DATE_FROM, "%H:%M:%S"),
         TO_DATE        = format(DATE_TO, "%m/%d/%Y"),
         TO_TIME        = format(DATE_TO, "%H:%M:%S"),
         CLEARANCE_DATE = format(DATE_OF_CLEARANCE, "%m/%d/%Y"),
         CLEARANCE_TIME = format(DATE_OF_CLEARANCE, "%H:%M:%S"))
Clear old objects
rm(crime_data_initial,crime_data_no_na)
Drop the original date-time columns that we have just split into separate date and time columns. Also drop HOUR_FROM and HOUR_TO, as this information is already present in FROM_TIME and TO_TIME.
crime_data_mutated <-
subset(crime_data_mutated, select=-c(DATE_REPORTED,DATE_FROM,DATE_TO,
DATE_OF_CLEARANCE,HOUR_FROM,HOUR_TO
))
Convert all column names to lower case
names(crime_data_mutated) <- tolower(names(crime_data_mutated))
sum(is.na(crime_data_mutated))
## [1] 0
Now we have our final clean dataset, which we will view using the datatable function. A few of the fields are abbreviated or not commonly known; for example, UCR refers to the FBI's Uniform Crime Reporting classification, and the CPD neighbourhood columns use the Cincinnati Police Department's neighbourhood boundaries.
We display only the first 100 rows of the final clean dataset to avoid overloading RStudio.
dim(crime_data_mutated)
## [1] 337485 40
datatable(head(crime_data_mutated,100))
Now that we have successfully cleaned the data, we can start answering our questions.
Using ggplot and geom_col we will create a chart to plot neighbourhoods with frequent occurrences of domestic violence.
crime_data_mutated %>%
filter(offense == "DOMESTIC VIOLENCE") %>%
count(cpd_neighborhood,sort = TRUE) %>%
filter(n > 250) %>%
mutate(cpd_neighborhood = reorder(cpd_neighborhood, n)) %>%
ggplot(aes(cpd_neighborhood, n)) +
geom_col(fill = "#FFA07A") +
xlab(NULL)+
ylab("No. of reports") +
theme_minimal()+
coord_flip()
As seen above, the neighbourhoods of Westwood and West and East Price Hill show high numbers of domestic violence incidents. Helpline numbers should be publicised in these areas, and people should be made aware of the issue of domestic violence and how they can seek legal help or counselling.
Using ggplot and geom_col we will create another chart to plot neighbourhoods with frequent occurrences of theft, robbery and burglary.
crime_data_mutated %>%
filter(offense == "ROBBERY" | offense == "THEFT" | offense == "BURGLARY" ) %>%
count(cpd_neighborhood,sort = TRUE) %>%
filter(n > 5000) %>%
mutate(cpd_neighborhood = reorder(cpd_neighborhood, n)) %>%
ggplot(aes(cpd_neighborhood, n) ) +
geom_col(fill = "lightblue") +
xlab(NULL)+
ylab("No. of Theft incidents") +
theme_minimal()+
coord_flip()
As seen above, Westwood and the CBD/Riverfront have the highest numbers of theft-related incidents. The high count in the CBD/Riverfront may be attributed to it being a tourist area. Interestingly, all of the neighbourhoods with a high occurrence of domestic violence also appear in this list.
We create a line graph using geom_line() to plot the number of occurrences of crime by hour of the day.
crime_data_mutated %>%
mutate(hour = format(strptime(from_time,"%H:%M:%S"),'%H') ) %>%
count(hour) %>%
ggplot(aes(x = hour, y = n,group = 1) ) +
geom_line(colour = "red",size = 5)+
geom_point(size = 5)+
xlab("Hour of the day")+
ylab("No. of incidents") +
theme_classic()
As seen in the graph above, the hours from 5 PM to 12 AM have the most occurrences of crime, so it is better to take precautions while venturing out during these hours. It is also interesting to note the large spike of crimes at 12 noon.
crime_data_mutated %>%
filter(ucr_group == "HOMICIDE" ) %>%
mutate(hasSuspects = ifelse(totalsuspects>0, "YES" ,"NO") )%>%
count(hasSuspects) %>%
ggplot( aes(hasSuspects, n, fill=hasSuspects)) +
geom_bar(stat = "identity",size = 3)+
ggtitle("Homicide cases without suspects idetified") +
xlab("Suspects identified ") +
ylab("No of cases")
This is an interesting plot: it shows that nearly half of the homicide cases in Cincinnati have no suspects identified. This suggests there is room for improvement in the criminal investigation methodology of the Cincinnati Police Department.
crime_data_mutated %>%
count(location,sort = TRUE) %>%
top_n(8) %>%
mutate(location = reorder(location, n)) %>%
ggplot(aes(location, n)) +
geom_col(fill = "#410f51") +
xlab(NULL)+
ylab("No. of crimes") +
theme_classic()+
coord_flip()
Although it looks like most crimes occur on the street, the next six bars all denote different types of homes and areas around residences, such as parking lots and yards. Taken together, this suggests that, compared to any single street, the home is actually the least safe place to be!
crime_data_mutated %>%
mutate(hatecrime = ifelse(hate_bias == "N--NO BIAS/NOT APPLICABLE","No","Yes") ) %>%
count(hatecrime,sort = TRUE) %>%
ggplot( aes(hatecrime, n, fill=hatecrime)) +
geom_bar(stat = "identity",size = 3)+
ggtitle("Proportion of hate crimes") +
xlab("Hate Crime ? ") +
ylab("No of cases")
As seen above, most of the crimes are not hate crimes. This may be because most of the crimes are thefts and robberies, which usually have a financial motive.
There is a notion that males have a higher tendency to commit violent crimes such as homicide and assault. Let us see whether our data supports this.
crime_data_mutated %>%
filter(ucr_group == "HOMICIDE" | ucr_group == "AGGRAVATED ASSAULTS") %>%
filter(suspect_gender == "MALE" | suspect_gender == "FEMALE") %>%
count(suspect_gender) %>%
ggplot( aes(suspect_gender, n, fill=suspect_gender)) +
geom_bar(stat = "identity",size = 3)+
ggtitle("Violent crime tendency by Gender") +
xlab("Gender") +
ylab("No of violent crimes")
As seen above, males commit far more violent crimes, by a wide margin.
Next, let us look at what proportion of reported crimes were cleared by an arrest.
crime_data_mutated %>%
mutate(arrests = ifelse(grepl("CLEARED BY ARREST", clearances), "Yes", "No")) %>%
count(arrests) %>%
ggplot( aes(arrests, n, fill=arrests)) +
geom_bar(stat = "identity",size = 3)+
ggtitle("Proportion of arrests") +
xlab("Arrested ?") +
ylab("No of cases")+
theme_classic()
As seen above, there is a large number of crimes for which no arrest has been made.
What proportion of crimes is committed by different age groups?
crime_data_mutated %>%
count(suspect_age_range)%>%
filter(!suspect_age_range == "UNKNOWN") %>%
mutate(suspect_age_range = reorder(suspect_age_range, n)) %>%
ggplot(aes(x=suspect_age_range, y=n)) +
geom_col(fill = "#176770")+
ggtitle("Age range and Crimes") +
xlab(" Suspect Age bracket") +
ylab("No of criminal cases")+
coord_flip()
As seen above, the 18-25 age range shows the most criminal activity, and the number of crimes decreases as suspects get older than 25. People above 70 account for the fewest crimes, but young people below 18 do not!
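Finally, returning to the idea from the introduction of using machine learning to assist homicide investigations: rather than directly predicting who a suspect is, a simpler first step is to model which homicide cases end up with at least one suspect identified. The sketch below fits a logistic regression with a few predictors chosen purely for illustration (cpd_neighborhood, weapons and victim_gender); it is a starting point under these assumptions, not a validated model.
# Sketch: which homicide cases have at least one identified suspect?
homicide_cases <- crime_data_mutated %>%
  filter(ucr_group == "HOMICIDE") %>%
  mutate(has_suspect = factor(ifelse(totalsuspects > 0, "YES", "NO"))) %>%
  select(has_suspect, cpd_neighborhood, weapons, victim_gender)
# Logistic regression baseline; coefficients hint at where cases tend to go unsolved
suspect_fit <- glm(has_suspect ~ cpd_neighborhood + weapons + victim_gender,
                   data = homicide_cases, family = binomial)
summary(suspect_fit)
A model like this would need proper validation (train/test splits, attention to class balance) before it could inform any operational decision.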
The dataset had many issues and unclean values. With the help of powerful R packages we were able to clean more than 300,000 rows within seconds, something that would not have been nearly as easy with a tool like Excel. The exploratory data analysis helped us uncover many new insights: we learnt where we need to tread cautiously and at what time of day we should avoid venturing outside. It also showed us that many homicide cases still have no suspects; we hope justice one day comes for those victims. Overall, this analysis gave us a glimpse of the criminal underbelly of Cincinnati!