Introduction

Introduction

Safety and security is a primary concern for every citizen. The crime rate in a city is one of the most significant parameters to determine the quality of life offered in that city.

A strong and effective law enforcement is the biggest deterrent to crime. Through the analysis of Cincinnati’s crime data we can uncover new insights and facts of the criminal history of this city and assist law enforcement agencies to optimally use their resources to prevent the maximum possible incidents. This analysis can also help people searching for a locality to move into or a new property to purchase. It can be also used by surveillance equipment manufacturers to identify areas with frequent crimes.

For our solution to prevent crimes, our focus will be on the following :

  • Identify neighbourhoods with frequent occurrences of domestic violence.
  • Identify neighbourhoods with frequent occurrences of theft and robbery.
  • Identify time of the day and day of the week with the most number of crimes on an average.
  • How many homicide cases do not have a suspect identified yet?
  • Where do most of the crimes take place( Ex: RESTAURANT,SCHOOL,HOSPITAL etc.)?
  • What proportion of crimes are hate crimes ?
  • Do males commit the most violent crimes?
  • How many crimes have resulted in arrests ?
  • What is the age range of most number of criminals ?

We will be using dplyr select, filter and aggregation and grouping functions to gain answers to the questions above. We can also use machine learning models to predict if a person can be a homicide suspect given age, location, gender etc.

With the help of this analysis law enforcement agencies can know when and where to increase patrolling. If we can build an accurate predictive model to identify homicide suspects, we can reduce the effort required by law enforcements to identify suspects.

Packages required

Packages required

We would need to import the following R packages inorder to assist with the importing and cleaning of our dataset.

#1 Packages required
library(readr) # to import the csv file
library(dplyr) # to use functions like glimpse
library(lubridate) #for date formatting
library(DT)# to use the datatable function
library(tidyverse)# for transformations
library(ggplot2)# for EDA

Data Preparation

Data Preparation

Before we begin cleaning data, we need to import it as a dataframe in R. ### Loading Data The data can be found at the following link. Information about the data can be found here.

#2 Data Preparation
#Loading the data
crime_data_initial <- 
  read_csv("city_of_cincinnati_police_data_initiative_crime_incidents.csv",
           col_names = TRUE)

View unclean dataset

#Check dimensions 
dim(crime_data_initial)
## [1] 355379     40
#Check column names are correct
colnames(crime_data_initial)
##  [1] "INSTANCEID"                     "INCIDENT_NO"                   
##  [3] "DATE_REPORTED"                  "DATE_FROM"                     
##  [5] "DATE_TO"                        "CLSD"                          
##  [7] "UCR"                            "DST"                           
##  [9] "BEAT"                           "OFFENSE"                       
## [11] "LOCATION"                       "THEFT_CODE"                    
## [13] "FLOOR"                          "SIDE"                          
## [15] "OPENING"                        "HATE_BIAS"                     
## [17] "DAYOFWEEK"                      "RPT_AREA"                      
## [19] "CPD_NEIGHBORHOOD"               "SNA_NEIGHBORHOOD"              
## [21] "WEAPONS"                        "DATE_OF_CLEARANCE"             
## [23] "HOUR_FROM"                      "HOUR_TO"                       
## [25] "ADDRESS_X"                      "LONGITUDE_X"                   
## [27] "LATITUDE_X"                     "VICTIM_AGE"                    
## [29] "VICTIM_RACE"                    "VICTIM_ETHNICITY"              
## [31] "VICTIM_GENDER"                  "SUSPECT_AGE"                   
## [33] "SUSPECT_RACE"                   "SUSPECT_ETHNICITY"             
## [35] "SUSPECT_GENDER"                 "TOTALNUMBERVICTIMS"            
## [37] "TOTALSUSPECTS"                  "UCR_GROUP"                     
## [39] "COMMUNITY_COUNCIL_NEIGHBORHOOD" "ZIP"
# Check the structure using dplyr function glimpse
glimpse(crime_data_initial)
## Observations: 355,379
## Variables: 40
## $ INSTANCEID                     <chr> "D6D7D173-E416-4571-AF34-A767AC...
## $ INCIDENT_NO                    <chr> "159006170", "41103934", "21104...
## $ DATE_REPORTED                  <chr> "03/16/2015 05:19:00 PM +0000",...
## $ DATE_FROM                      <chr> "03/16/2015 03:02:00 PM +0000",...
## $ DATE_TO                        <chr> "03/16/2015 03:05:00 PM +0000",...
## $ CLSD                           <chr> "J--CLOSED", "J--CLOSED", "D--V...
## $ UCR                            <int> 802, 810, 1493, 401, 810, 550, ...
## $ DST                            <chr> "4", "4", "2", "5", "1", "2", "...
## $ BEAT                           <int> 2, 3, 3, 4, 3, 3, 1, 2, 2, 1, 2...
## $ OFFENSE                        <chr> "AGGRAVATED MENACING", "ASSAULT...
## $ LOCATION                       <chr> "48-PARKING LOT", "02-MULTI FAM...
## $ THEFT_CODE                     <chr> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ FLOOR                          <chr> NA, NA, NA, NA, NA, "2 - FIRST ...
## $ SIDE                           <chr> NA, NA, NA, NA, NA, "1 - FRONT"...
## $ OPENING                        <chr> NA, NA, NA, NA, NA, "1 - DOOR",...
## $ HATE_BIAS                      <chr> "N--NO BIAS/NOT APPLICABLE", "N...
## $ DAYOFWEEK                      <chr> "MONDAY", "THURSDAY", "SATURDAY...
## $ RPT_AREA                       <int> 45, 365, 138, 439, 203, 140, 68...
## $ CPD_NEIGHBORHOOD               <chr> "WALNUT HILLS", "AVONDALE", "MA...
## $ SNA_NEIGHBORHOOD               <chr> "WALNUT HILLS", "AVONDALE", "MA...
## $ WEAPONS                        <chr> "99 - NONE", "40--PERSONAL WEAP...
## $ DATE_OF_CLEARANCE              <chr> "03/28/2015 12:00:00 AM +0000",...
## $ HOUR_FROM                      <chr> "152", "1020", "240", "1535", "...
## $ HOUR_TO                        <chr> "155", "1030", "244", "1540", "...
## $ ADDRESS_X                      <chr> "21XX FULTON AV", "34XX READING...
## $ LONGITUDE_X                    <dbl> -84.49080, -84.49155, -84.39439...
## $ LATITUDE_X                     <dbl> 39.11996, 39.14167, 39.15156, 3...
## $ VICTIM_AGE                     <chr> "26-30", "41-50", "UNKNOWN", "1...
## $ VICTIM_RACE                    <chr> "BLACK", "BLACK", NA, "BLACK", ...
## $ VICTIM_ETHNICITY               <chr> "NOT OF HISPANIC ORIG", "NOT OF...
## $ VICTIM_GENDER                  <chr> "FEMALE", "MALE", NA, "MALE", "...
## $ SUSPECT_AGE                    <chr> "51-60", "UNKNOWN", "18-25", "1...
## $ SUSPECT_RACE                   <chr> "BLACK", "BLACK", "BLACK", "BLA...
## $ SUSPECT_ETHNICITY              <chr> "NOT OF HISPANIC ORIG", "NOT OF...
## $ SUSPECT_GENDER                 <chr> "MALE", "MALE", "FEMALE", "MALE...
## $ TOTALNUMBERVICTIMS             <int> 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2...
## $ TOTALSUSPECTS                  <int> 1, 1, 2, 4, NA, 2, 1, 2, NA, NA...
## $ UCR_GROUP                      <chr> "PART 2 MINOR", "PART 2 MINOR",...
## $ COMMUNITY_COUNCIL_NEIGHBORHOOD <chr> "WALNUT HILLS", "AVONDALE", "MA...
## $ ZIP                            <int> 45206, 45219, 45227, 45224, 452...
#Check summary to identify NA's , outliers etc
summary(crime_data_initial)
##   INSTANCEID        INCIDENT_NO        DATE_REPORTED     
##  Length:355379      Length:355379      Length:355379     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   DATE_FROM           DATE_TO              CLSD                UCR        
##  Length:355379      Length:355379      Length:355379      Min.   :   0.0  
##  Class :character   Class :character   Class :character   1st Qu.: 552.0  
##  Mode  :character   Mode  :character   Mode  :character   Median : 600.0  
##                                                           Mean   : 803.9  
##                                                           3rd Qu.: 810.0  
##                                                           Max.   :2761.0  
##                                                           NA's   :71      
##      DST                 BEAT         OFFENSE            LOCATION        
##  Length:355379      Min.   :1.000   Length:355379      Length:355379     
##  Class :character   1st Qu.:2.000   Class :character   Class :character  
##  Mode  :character   Median :3.000   Mode  :character   Mode  :character  
##                     Mean   :2.698                                        
##                     3rd Qu.:3.000                                        
##                     Max.   :6.000                                        
##                     NA's   :1340                                         
##   THEFT_CODE           FLOOR               SIDE          
##  Length:355379      Length:355379      Length:355379     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    OPENING           HATE_BIAS          DAYOFWEEK        
##  Length:355379      Length:355379      Length:355379     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     RPT_AREA        CPD_NEIGHBORHOOD   SNA_NEIGHBORHOOD  
##  Min.   :     0.0   Length:355379      Length:355379     
##  1st Qu.:   147.0   Class :character   Class :character  
##  Median :   277.0   Mode  :character   Mode  :character  
##  Mean   :   266.6                                        
##  3rd Qu.:   373.0                                        
##  Max.   :424379.0                                        
##  NA's   :2471                                            
##    WEAPONS          DATE_OF_CLEARANCE   HOUR_FROM        
##  Length:355379      Length:355379      Length:355379     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    HOUR_TO           ADDRESS_X          LONGITUDE_X       LATITUDE_X   
##  Length:355379      Length:355379      Min.   :-84.82   Min.   :39.05  
##  Class :character   Class :character   1st Qu.:-84.57   1st Qu.:39.12  
##  Mode  :character   Mode  :character   Median :-84.52   Median :39.14  
##                                        Mean   :-84.52   Mean   :39.14  
##                                        3rd Qu.:-84.49   3rd Qu.:39.16  
##                                        Max.   :-84.25   Max.   :39.36  
##                                        NA's   :45864    NA's   :45864  
##   VICTIM_AGE        VICTIM_RACE        VICTIM_ETHNICITY  
##  Length:355379      Length:355379      Length:355379     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  VICTIM_GENDER      SUSPECT_AGE        SUSPECT_RACE      
##  Length:355379      Length:355379      Length:355379     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  SUSPECT_ETHNICITY  SUSPECT_GENDER     TOTALNUMBERVICTIMS TOTALSUSPECTS   
##  Length:355379      Length:355379      Min.   :  1.000    Min.   : 1.00   
##  Class :character   Class :character   1st Qu.:  1.000    1st Qu.: 1.00   
##  Mode  :character   Mode  :character   Median :  1.000    Median : 1.00   
##                                        Mean   :  1.433    Mean   : 1.64   
##                                        3rd Qu.:  1.000    3rd Qu.: 2.00   
##                                        Max.   :127.000    Max.   :16.00   
##                                        NA's   :146        NA's   :194149  
##   UCR_GROUP         COMMUNITY_COUNCIL_NEIGHBORHOOD      ZIP           
##  Length:355379      Length:355379                  Min.   :       33  
##  Class :character   Class :character               1st Qu.:    45206  
##  Mode  :character   Mode  :character               Median :    45216  
##                                                    Mean   :    46768  
##                                                    3rd Qu.:    45227  
##                                                    Max.   :531110230  
##                                                    NA's   :184

Data cleaning

From the above summary, we see what columns need to undergo data cleaning. As seen in the glimpse and summary of the dataset, we change the following as seen below:

Columns Total victims, total suspects, latitude, longitude

Replace all na with 0 for total victims and suspects as we do not want to end up omiting criminal records in which no suspects were found or no victims identified. Also, replace with 0 where lat/long is unknown, we will only be using neighbourhood names

crime_data_initial$TOTALNUMBERVICTIMS[is.na(crime_data_initial$TOTALNUMBERVICTIMS)] <- 0
crime_data_initial$TOTALSUSPECTS[is.na(crime_data_initial$TOTALSUSPECTS)] <- 0
crime_data_initial$LATITUDE_X[is.na(crime_data_initial$LATITUDE_X)] <- 0
crime_data_initial$LONGITUDE_X[is.na(crime_data_initial$LONGITUDE_X)] <- 0

Column - Theft code

Column - Theft code Now replace all NA in Theft code with NOT APPLICABLE as we do not want to end up omiting records which are not thefts when we use na.omit

unique(crime_data_initial$THEFT_CODE )
##  [1] NA                                               
##  [2] "23F-THEFT FROM MOTOR VEHICLE"                   
##  [3] "23H-ALL OTHER LARCENY"                          
##  [4] "23D-THEFT FROM BUILDING"                        
##  [5] "24O-MOTOR VEHICLE THEFT"                        
##  [6] "23C-SHOPLIFTING"                                
##  [7] "23G-THEFT OF MOTOR VEHICLE PARTS OR ACCESSORIES"
##  [8] "24I-THEFT OF LICENSE PLATE"                     
##  [9] "23A-POCKET-PICKING"                             
## [10] "23B-PURSE-SNATCHING"                            
## [11] "23E-THEFT FROM COIN-OPERATED MACHINE OR DEVICE"
crime_data_initial$THEFT_CODE[is.na(crime_data_initial$THEFT_CODE)] <- "NOT APPLICABLE"

Columns - Floor, Side and Opening

Columns - floor,side and opening floor,side and opening are not applicable or all types of crimes

#1) FLOOR
#check unique values floor

unique(crime_data_initial$FLOOR)
##  [1] NA                 "2 - FIRST FLOOR"  "4 - OTHER"       
##  [4] "2-FIRST FLOOR"    "3 - SECOND FLOOR" "1 - BASEMENT"    
##  [7] "5 - UNKNOWN"      "3-SECOND FLOOR"   "1-BASEMENT"      
## [10] "4-OTHER"          "5-UNKNOWN"
# multiple rows with same value but different number of spaces, remove unwanted spaces
crime_data_initial$FLOOR <- gsub(" ", "", crime_data_initial$FLOOR, fixed = TRUE)
#change NA to Not applicable
crime_data_initial$FLOOR[is.na(crime_data_initial$FLOOR)] <- "NOT APPLICABLE"


# 2) Side
#check unique values side
unique(crime_data_initial$SIDE)
##  [1] NA            "1 - FRONT"   "3 - REAR"    "1-FRONT"     "2 - SIDE"   
##  [6] "3-REAR"      "2-SIDE"      "5-OTHER"     "6-UNKNOWN"   "4 - ROOF"   
## [11] "6 - UNKNOWN" "5 - OTHER"   "4-ROOF"      "?"
# multiple rows with same value but different number of spaces, remove unwanted spaces
crime_data_initial$SIDE <- gsub(" ", "", crime_data_initial$SIDE, fixed = TRUE)

# Replace ? with "6-UNKNOWN"
crime_data_initial$SIDE[crime_data_initial$SIDE == "?"] <- "6-UNKNOWN"
crime_data_initial$SIDE[is.na(crime_data_initial$SIDE)] <- "NOT APPLICABLE"

# 3) Opening

#check unique values side
unique(crime_data_initial$OPENING)
##  [1] NA             "1 - DOOR"     "3 - GARAGE"   "1-DOOR"      
##  [5] "2 - WINDOW"   "2-WINDOW"     "5-OTHER"      "6 - UNKNOWN" 
##  [9] "5 - OTHER"    "6-UNKNOWN"    "4 - SKYLIGHT" "3-GARAGE"    
## [13] "4-SKYLIGHT"   "?"
# multiple rows with same value but different number of spaces, remove unwanted spaces
crime_data_initial$OPENING <- gsub(" ", "", crime_data_initial$OPENING, fixed = TRUE)
# Replace ? with "6-UNKNOWN"
crime_data_initial$OPENING[crime_data_initial$OPENING == "?"] <- "6-UNKNOWN"


crime_data_initial$OPENING[is.na(crime_data_initial$OPENING)] <- "NOT APPLICABLE"

Columns of Gender

Columns of Gender check unique values of gender

unique(crime_data_initial$SUSPECT_GENDER)
## [1] "MALE"                 "FEMALE"               NA                    
## [4] "UNKNOWN"              "M - MALE"             "NON-PERSON (BUSINESS"
## [7] "F - FEMALE"
unique(crime_data_initial$VICTIM_GENDER)
## [1] "FEMALE"               "MALE"                 NA                    
## [4] "UNKNOWN"              "M - MALE"             "NON-PERSON (BUSINESS"
## [7] "F - FEMALE"


As seen above we have to clean certain values. First we replace na with “UNKNOWN. Then, M - MALE with MALE & F - FEMALE with FEMALE Then replace”NON-PERSON (BUSINESS" with “NON-PERSON/BUSINESS”.

crime_data_initial$SUSPECT_GENDER[is.na(crime_data_initial$SUSPECT_GENDER)] <-
  "UNKNOWN"
crime_data_initial$VICTIM_GENDER[is.na(crime_data_initial$VICTIM_GENDER)] <-
  "UNKNOWN"

#replace M - MALE with MALE & F - FEMALE with FEMALE
crime_data_initial$SUSPECT_GENDER[crime_data_initial$SUSPECT_GENDER == "M - MALE"] <-
  "MALE"
crime_data_initial$VICTIM_GENDER[crime_data_initial$VICTIM_GENDER == "M - MALE"] <-
  "MALE"
crime_data_initial$SUSPECT_GENDER[crime_data_initial$SUSPECT_GENDER == "F - FEMALE"] <-
  "FEMALE"
crime_data_initial$VICTIM_GENDER[crime_data_initial$VICTIM_GENDER == "F - FEMALE"] <-
  "FEMALE"


#replace"NON-PERSON (BUSINESS" with "NON-PERSON/BUSINESS"
crime_data_initial$SUSPECT_GENDER[crime_data_initial$SUSPECT_GENDER == "NON-PERSON (BUSINESS"] <-
  "NON-PERSON/BUSINESS"
crime_data_initial$VICTIM_GENDER[crime_data_initial$VICTIM_GENDER == "NON-PERSON (BUSINESS"] <-
  "NON-PERSON/BUSINESS"

Columns of Race

Columns of Race check unique values of gender

#Columns of Race
#check unique values Race

unique(crime_data_initial$SUSPECT_RACE)
## [1] "BLACK"                NA                     "WHITE"               
## [4] "UNKNOWN"              "ASIAN/PACIFIC ISLAND" "HISPANIC"            
## [7] "ASIAN OR PACIFIC ISL" "AMERICAN INDIAN/ALAS" "AMERICAN IINDIAN/ALA"
unique(crime_data_initial$VICTIM_RACE)
## [1] "BLACK"                NA                     "WHITE"               
## [4] "UNKNOWN"              "ASIAN/PACIFIC ISLAND" "ASIAN OR PACIFIC ISL"
## [7] "AMERICAN INDIAN/ALAS" "AMERICAN IINDIAN/ALA" "HISPANIC"


As seen above we have to clean certain values. First we replace na with “UNKNOWN”.

#replace na with unknown
crime_data_initial$SUSPECT_RACE[is.na(crime_data_initial$SUSPECT_RACE)] <-
  "UNKNOWN"
crime_data_initial$VICTIM_RACE[is.na(crime_data_initial$VICTIM_RACE)] <-
  "UNKNOWN"


Replace spelling errors/typos

crime_data_initial$SUSPECT_RACE[crime_data_initial$SUSPECT_RACE == "ASIAN OR PACIFIC ISL"] <-
  "ASIAN/PACIFIC ISLAND"
crime_data_initial$VICTIM_RACE[crime_data_initial$VICTIM_RACE == "ASIAN OR PACIFIC ISL"] <-
  "ASIAN/PACIFIC ISLAND"

crime_data_initial$SUSPECT_RACE[crime_data_initial$SUSPECT_RACE == "AMERICAN IINDIAN/ALA"] <-
  "AMERICAN INDIAN/ALAS"
crime_data_initial$VICTIM_RACE[crime_data_initial$VICTIM_RACE == "AMERICAN IINDIAN/ALA"] <-
  "AMERICAN INDIAN/ALAS"

Columns of Ethinicity

Columns of Ethinicity check unique values Ethnicity

unique(crime_data_initial$SUSPECT_ETHNICITY)
##  [1] "NOT OF HISPANIC ORIG" NA                     "UNKNOWN"             
##  [4] "AFRICAN AMERICAN"     "WHITE AMERICAN"       "HISPANIC ORIGIN"     
##  [7] "BLACK NOT OF HISPANI" "WHITE HISPANIC ORIGI" "WHITE NOT OF HISPANI"
## [10] "AMERICAN INDIAN"      "BLACK HISPANIC ORIGI" "AFRICAN"
unique(crime_data_initial$VICTIM_ETHNICITY)
##  [1] "NOT OF HISPANIC ORIG" NA                     "UNKNOWN"             
##  [4] "HISPANIC ORIGIN"      "AFRICAN AMERICAN"     "WHITE AMERICAN"      
##  [7] "BLACK NOT OF HISPANI" "WHITE NOT OF HISPANI" "ASIAN INDIAN"        
## [10] "EUROPEAN"             "INDONESIAN"           "AMERICAN INDIAN"     
## [13] "AFRICAN"              "CHINESE"              "BLACK HISPANIC ORIGI"
## [16] "WHITE HISPANIC ORIGI" "PAKASTANI"            "ALASKA NATIVE"

As seen above we have only NAs to clean here. We replace NA with “UNKNOWN”

crime_data_initial$SUSPECT_ETHNICITY[is.na(crime_data_initial$SUSPECT_ETHNICITY)]<-
  "UNKNOWN"
crime_data_initial$VICTIM_ETHNICITY[is.na(crime_data_initial$VICTIM_ETHNICITY)]<-
  "UNKNOWN"

Drop Unrequired columns

Drop 1st column INSTANCEID as it is only a row identifier, DROP day of week as it is not required for analysis

crime_data_initial <- crime_data_initial[,-1]
crime_data_initial<-  subset(crime_data_initial, select= -DAYOFWEEK)

Drop unclean Zipcodes

Drop unclean zipcodes.

crime_data_initial$ZIP[crime_data_initial$ZIP > 99999 ] <- NA

Delete NAs

delete all na records

crime_data_no_na <- na.omit(crime_data_initial)

Convert datatypes

Convert data types found to be incorrect as per glimpse and summary output Delete any records having conversion error resulting in NA coercion this is because there were 326 alphanumeruc incident numbers

crime_data_no_na$INCIDENT_NO  <- as.numeric(crime_data_no_na$INCIDENT_NO) 
crime_data_no_na <- na.omit(crime_data_no_na)

Convert date and time columns

Format all date and time columns

crime_data_no_na$DATE_REPORTED <- mdy_hms(crime_data_no_na$DATE_REPORTED)
crime_data_no_na$DATE_FROM <- mdy_hms(crime_data_no_na$DATE_FROM)
crime_data_no_na$DATE_TO <- mdy_hms(crime_data_no_na$DATE_TO)
crime_data_no_na$DATE_OF_CLEARANCE <-
  mdy_hms(crime_data_no_na$DATE_OF_CLEARANCE)

Rename columns

Rename columns which are not named correctly.

#Rename CLSD to Clearances
names(crime_data_no_na)[5] <- "CLEARANCES"

#Rename VICTIM_AGE to VICTIM_AGE_RANGE
colnames(crime_data_no_na)[colnames(crime_data_no_na) == "VICTIM_AGE"] <-
  "VICTIM_AGE_RANGE"

#Rename SUSPECT_AGE TO SUSPECT_AGE_RANGE
colnames(crime_data_no_na)[colnames(crime_data_no_na) == "SUSPECT_AGE"] <- "SUSPECT_AGE_RANGE"

Mutate columns

crime_data_mutated <-
  mutate(crime_data_no_na,
         REPORTED_DATE  = format(crime_data_no_na$DATE_REPORTED, "%m/%d/%Y") ,
         REPORTED_TIME  = format(crime_data_no_na$DATE_REPORTED, "%H:%M:%S"),
         FROM_DATE = format(crime_data_no_na$DATE_FROM, "%m/%d/%Y") ,
         FROM_TIME = format(crime_data_no_na$DATE_FROM, "%H:%M:%S"),
         TO_DATE = format(crime_data_no_na$DATE_TO, "%m/%d/%Y") ,
         TO_TIME = format(crime_data_no_na$DATE_TO, "%H:%M:%S"),
         CLEARANCE_DATE  = format(crime_data_no_na$DATE_OF_CLEARANCE , "%m/%d/%Y") ,
         CLEARANCE_TIME  = format(crime_data_no_na$DATE_OF_CLEARANCE , "%H:%M:%S"),
         
         
  )

Clear old objects

Clear old objects

rm(crime_data_initial,crime_data_no_na)

Drop unrequired date and time columns

Drop unrequired date columns which we just mutated. Also drop HOUR_FROM and HOUR_TO as this info is already present in FROM_TIME and TO_TIME

crime_data_mutated <-
  subset(crime_data_mutated, select=-c(DATE_REPORTED,DATE_FROM,DATE_TO,
                                       DATE_OF_CLEARANCE,HOUR_FROM,HOUR_TO
  ))

Convert columns to lower case

Convert all column names to lower case

names(crime_data_mutated) <- tolower(names(crime_data_mutated)) 

Verify if any NA values remaining

sum(is.na(crime_data_mutated))
## [1] 0

Clean Dataset

Now we create our final clean dataset. We will view this dataset using the datatable function. Description of few fields which are abbreviated or not commonly known :

  1. FROM_DATE : Date at which incident began
  2. FROM_TIME : Time at which incident began
  3. TO_DATE : Date at which incident ended
  4. TO_TIME : Time at which incident ended
  5. DST : Designated Surveillance territory
  6. BEAT : territory for patrolling from a police officer
  7. UCR : uniform crime report

Now display our final clean dataset. Display only 1st 100 rows to prevent crashing of R studio

dim(crime_data_mutated)
## [1] 337485     40
datatable(head(crime_data_mutated,100))

Exploratory Data Analysis

Exploratory Data Analysis

Now that we have successfully cleant the data, we will start to answer our questions.

Identify neighbourhoods with frequent occurrences of domestic violence.

Using ggplot and geom_col we will create a chart to plot neighbourhoods with frequent occurrences of domestic violence.

crime_data_mutated  %>%
  filter(offense == "DOMESTIC VIOLENCE") %>%
  count(cpd_neighborhood,sort = TRUE) %>%
  filter(n > 250) %>%
  mutate(cpd_neighborhood = reorder(cpd_neighborhood, n)) %>%
  ggplot(aes(cpd_neighborhood, n)) +
  geom_col(fill = "#FFA07A") +
  xlab(NULL)+
  ylab("No. of reports") +
  theme_minimal()+
  coord_flip()

As seen above, the neighbourhoods of Westwood, West & East price hill show high incidents of domestic violence. We need to keep a helpline number and also make people aware of the issues of domestic violence and how they can seek legal help or counselling.

Identify neighbourhoods with frequent occurrences of theft and robbery.

Using ggplot and geom_col we will create another chart to plot neighbourhoods with frequent occurrences of theft and robbery.

crime_data_mutated  %>%
  filter(offense == "ROBBERY" | offense == "THEFT"  | offense ==  "BURGLARY" ) %>%
  count(cpd_neighborhood,sort = TRUE) %>%
  filter(n > 5000) %>%
  mutate(cpd_neighborhood = reorder(cpd_neighborhood, n)) %>%
  ggplot(aes(cpd_neighborhood, n) ) +
  geom_col(fill = "lightblue") +
  xlab(NULL)+
  ylab("No. of Theft incidents") +
  theme_minimal()+
  coord_flip()

As seen above, Westwood, CBD river front have most number of thefts. The high number of thefts in CBD river front may be attributed to it being a tourist area. Also, the interesting thing to note, all the neighbourhoods having a high occurances of domestic violence, are all present in this list.

Crimes as a function of time

We create a line graph using geom_line() to plot the number of occurances of crime by the hour.

crime_data_mutated  %>%
  mutate(hour = format(strptime(from_time,"%H:%M:%S"),'%H') ) %>%
  count(hour) %>%
  ggplot(aes(x = hour, y =  n,group = 1) ) +
  geom_line(colour = "red",size = 5)+
  geom_point(size = 5)+
  xlab("Hour of the day")+
  ylab("No. of  incidents") +
  theme_classic()

As seen in the graph above, 5 - 12 AM has the most occurances of crime. It is better to take precautions while venturing out during these hours. Also, interesting to note is that there is a huge spike of crimes at 12 PM in the noon.

Homicide cases without suspects identified

crime_data_mutated  %>%
  filter(ucr_group == "HOMICIDE" ) %>%
  mutate(hasSuspects = ifelse(totalsuspects>0, "YES" ,"NO") )%>%
  count(hasSuspects) %>%
  ggplot( aes(hasSuspects, n, fill=hasSuspects)) +
  geom_bar(stat = "identity",size = 3)+
  ggtitle("Homicide cases without suspects idetified") +
  xlab("Suspects identified ") +
  ylab("No of cases")

This is an interesting plot, it shows us nearly half of the homicide cases in Cincinnati has no supects identified. There needs to be an improvement in criminal investigation methodology at the Cincinnati Police Department.

Where do most of the crimes take place(RESTAURANT,SCHOOL,HOSPITAL etc.)?

crime_data_mutated  %>%
   count(location,sort = TRUE) %>%
  top_n(8) %>%
  mutate(location = reorder(location, n)) %>%
  ggplot(aes(location, n)) +
  geom_col(fill = "#410f51") +
  xlab(NULL)+
  ylab("No. of crimes") +
  theme_classic()+
  coord_flip()

Even though it looks like most of the crime are occuring on the street. The next 6 bars, all denote different type of homes and areas around places of residence like a parking lot or yard. This shows, compared to any street the home is the most unsafest place to be !

What proportion of crimes are hate crimes ?

crime_data_mutated  %>%
  mutate(hatecrime = ifelse(hate_bias == "N--NO BIAS/NOT APPLICABLE","No","Yes") ) %>%
  count(hatecrime,sort = TRUE) %>%
  ggplot( aes(hatecrime, n, fill=hatecrime)) +
  geom_bar(stat = "identity",size = 3)+
  ggtitle("Proportion of hate crimes") +
  xlab("Hate Crime ? ") +
  ylab("No of cases")

As seen above, most the crimes are not hate crimes. This may be due to the fact that most of the crimes are thefts and robberies which have mostly a financial motive.

Do males commit the most violent crimes?

There is a notion of males having a higher tendency to commit violent crimes like homicide and assaults. Let us see through our data whether this is true.

crime_data_mutated  %>%
  filter(ucr_group == "HOMICIDE" | ucr_group == "AGGRAVATED ASSAULTS") %>%
  filter(suspect_gender == "MALE" | suspect_gender == "FEMALE") %>%
  count(suspect_gender) %>%
  ggplot( aes(suspect_gender, n, fill=suspect_gender)) +
  geom_bar(stat = "identity",size = 3)+
  ggtitle("Violent crime tendency by Gender") +
  xlab("Gender") +
  ylab("No of violent crimes")

As seen above, Males do commit the most violent crimes by a far margin.

How many crimes have resulted in arrests ?

There is a notion of males having a higher tendency to commit violent crimes like homicide and assaults. Let us see through our data whether this is true.

crime_data_mutated  %>%
  mutate(arrests = ifelse(grepl("CLEARED BY ARREST", crime_data_mutated$clearances),"Yes","No")) %>%
  count(arrests) %>%
  ggplot( aes(arrests, n, fill=arrests)) +
  geom_bar(stat = "identity",size = 3)+
  ggtitle("Proportion of arrests") +
  xlab("Arrested ?") +
  ylab("No of cases")+
  theme_classic()

As seen above there are a large number of crimes where no arrests have been made.

Age range and crimes

What is the proportion of crimes commited by different age groups ?

crime_data_mutated  %>%
  count(suspect_age_range)%>%
  filter(!suspect_age_range == "UNKNOWN") %>%
  mutate(suspect_age_range = reorder(suspect_age_range, n)) %>%
  ggplot(aes(x=suspect_age_range, y=n)) + 
  geom_col(fill = "#176770")+
  ggtitle("Age range and Crimes") +
  xlab(" Suspect Age bracket") +
  ylab("No of criminal cases")+
  coord_flip()

As seen above, 18-25 range shows the most criminal tendencies. The crime rate decrease as adults get older than 25. Old people above 70 are the least criminal, but young people below 18 are not !

Summary

Summary

The dataset had a lot of issues and many unclean values. With the help of powerful R packages we were able to clean more than 300,000 rows within seconds. This would not have been possible with other tools like excel that easily. The Exploratory data analysis helped us uncover many new insights. We have learnt where we need to tread cautiously and at what time of the day we should not venture outside. It also showed us that many homicide cases have no suspects yet. We hope justice comes one day to these victims. Overall, this analysis helped us get a glimpse of the criminal underbelly of Cincinnati !