This project focuses on anticipating crime by victim gender, age and race in the city of Cincinnati, in order to raise awareness of personal safety among citizens and to target educational materials.
I will use the City of Cincinnati Police Crime dataset to anticipate types of crime by gender, age and race. The results can guide targeted educational materials across the city and help the citizens of Cincinnati understand which offenses are more prevalent for a given age group and race.
The City of Cincinnati Police Crime dataset has 40 columns and approximately 355k rows (355,379 in the copy imported below). My response variable is “VICTIM_GENDER” and my predictor variables are OFFENSE, VICTIM_AGE and VICTIM_RACE. Since VICTIM_GENDER is a categorical variable, predicting it is a classification problem, which suits the Decision Tree and Random Forest methods planned for the analysis.
To analyze this data, we will use the following R packages:
library(dplyr)
library("rio")
rio supplies import(), which chooses a reader based on the file extension. dplyr supplies the data-manipulation verbs: mutate() adds new variables that are functions of existing variables; select() picks variables based on their names; filter() picks cases based on their values; summarise() reduces multiple values down to a single summary; arrange() changes the ordering of the rows.
These all combine naturally with group_by() which allows you to perform any operation “by group”. dplyr is designed to abstract over how the data is stored.
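As a minimal sketch of how these verbs chain together with group_by(), the example below counts offenses per victim gender. The data frame `toy` and its values are invented for illustration; only the column names are borrowed from the crime dataset.

```r
library(dplyr)

# Hypothetical toy data; the column names mirror the crime dataset,
# but the values are invented for illustration.
toy <- data.frame(
  OFFENSE       = c("THEFT", "ASSAULT", "THEFT", "ROBBERY", "THEFT", "THEFT"),
  VICTIM_GENDER = c("FEMALE", "MALE", "MALE", "MALE", "FEMALE", ""),
  stringsAsFactors = FALSE
)

offense_by_gender <- toy %>%
  filter(VICTIM_GENDER != "") %>%            # drop rows with a blank gender
  group_by(VICTIM_GENDER, OFFENSE) %>%       # one group per gender/offense pair
  summarise(n = n(), .groups = "drop") %>%   # count the rows in each group
  arrange(VICTIM_GENDER, desc(n))            # most frequent offense first

print(offense_by_gender)
```

The same pipeline shape should apply to the full dataset once it is imported, with the real data frame in place of `toy`.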
The data set comes from the Cincinnati Police Crime Data.
Incidents are the records of reported crimes collated by an agency for management purposes. Incidents are typically housed in a Records Management System (RMS) that stores agency-wide data about law enforcement operations. This dataset contains 40 columns and approximately 355k rows (355,379 in the copy used here). It was created on November 15, 2017 and was most recently updated on April 4, 2019.
The steps followed for importing and cleaning the data are shown below:
crime_data <- import("C:/Users/Ankeeta/Desktop/city_of_cincinnati_police_data_initiative_crime_incidents.csv")
glimpse(crime_data)
## Observations: 355,379
## Variables: 40
## $ INSTANCEID <chr> "D6D7D173-E416-4571-AF34-A767AC...
## $ INCIDENT_NO <chr> "159006170", "41103934", "21104...
## $ DATE_REPORTED <chr> "03/16/2015 05:19:00 PM +0000",...
## $ DATE_FROM <chr> "03/16/2015 03:02:00 PM +0000",...
## $ DATE_TO <chr> "03/16/2015 03:05:00 PM +0000",...
## $ CLSD <chr> "J--CLOSED", "J--CLOSED", "D--V...
## $ UCR <int> 802, 810, 1493, 401, 810, 550, ...
## $ DST <chr> "4", "4", "2", "5", "1", "2", "...
## $ BEAT <chr> "2", "3", "3", "4", "3", "3", "...
## $ OFFENSE <chr> "AGGRAVATED MENACING", "ASSAULT...
## $ LOCATION <chr> "48-PARKING LOT", "02-MULTI FAM...
## $ THEFT_CODE <chr> "", "", "", "", "", "", "", "",...
## $ FLOOR <chr> "", "", "", "", "", "2 - FIRST ...
## $ SIDE <chr> "", "", "", "", "", "1 - FRONT"...
## $ OPENING <chr> "", "", "", "", "", "1 - DOOR",...
## $ HATE_BIAS <chr> "N--NO BIAS/NOT APPLICABLE", "N...
## $ DAYOFWEEK <chr> "MONDAY", "THURSDAY", "SATURDAY...
## $ RPT_AREA <chr> "45", "365", "138", "439", "203...
## $ CPD_NEIGHBORHOOD <chr> "WALNUT HILLS", "AVONDALE", "MA...
## $ SNA_NEIGHBORHOOD <chr> "WALNUT HILLS", "AVONDALE", "MA...
## $ WEAPONS <chr> "99 - NONE", "40--PERSONAL WEAP...
## $ DATE_OF_CLEARANCE <chr> "03/28/2015 12:00:00 AM +0000",...
## $ HOUR_FROM <int> 152, 1020, 240, 1535, 1645, 143...
## $ HOUR_TO <int> 155, 1030, 244, 1540, 1712, 144...
## $ ADDRESS_X <chr> "21XX FULTON AV", "34XX READING...
## $ LONGITUDE_X <dbl> -84.49080, -84.49155, -84.39439...
## $ LATITUDE_X <dbl> 39.11996, 39.14167, 39.15156, 3...
## $ VICTIM_AGE <chr> "26-30", "41-50", "UNKNOWN", "1...
## $ VICTIM_RACE <chr> "BLACK", "BLACK", "", "BLACK", ...
## $ VICTIM_ETHNICITY <chr> "NOT OF HISPANIC ORIG", "NOT OF...
## $ VICTIM_GENDER <chr> "FEMALE", "MALE", "", "MALE", "...
## $ SUSPECT_AGE <chr> "51-60", "UNKNOWN", "18-25", "1...
## $ SUSPECT_RACE <chr> "BLACK", "BLACK", "BLACK", "BLA...
## $ SUSPECT_ETHNICITY <chr> "NOT OF HISPANIC ORIG", "NOT OF...
## $ SUSPECT_GENDER <chr> "MALE", "MALE", "FEMALE", "MALE...
## $ TOTALNUMBERVICTIMS <int> 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2...
## $ TOTALSUSPECTS <int> 1, 1, 2, 4, NA, 2, 1, 2, NA, NA...
## $ UCR_GROUP <chr> "PART 2 MINOR", "PART 2 MINOR",...
## $ COMMUNITY_COUNCIL_NEIGHBORHOOD <chr> "WALNUT HILLS", "AVONDALE", "MA...
## $ ZIP <dbl> 2.233473e-319, 2.234115e-319, 2...
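The ZIP column prints values such as 2.233473e-319, which are not plausible ZIP codes. One possible explanation, which is an assumption that the output does not confirm, is that the column was read as a 64-bit integer and then displayed as a double. ZIP codes are identifiers rather than quantities, so reading them as character sidesteps the problem. The sketch below uses base R read.csv() on a small temporary file with invented values, not the real dataset; whether import() accepts an equivalent argument should be checked in the rio documentation.

```r
# Write a tiny stand-in CSV (hypothetical values) to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("OFFENSE,ZIP",
             "THEFT,45206",
             "ASSAULT,45229"), tmp)

# Force ZIP to character; the other columns are still type-guessed.
toy_import <- read.csv(tmp,
                       colClasses = c(ZIP = "character"),
                       stringsAsFactors = FALSE)
str(toy_import)

# Remove the temporary file again.
unlink(tmp)
```

ZIP is not one of the predictors in this analysis, so the issue does not affect the model, but it would matter for any later work by area.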
summary(crime_data)
## INSTANCEID INCIDENT_NO DATE_REPORTED
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## DATE_FROM DATE_TO CLSD UCR
## Length:355379 Length:355379 Length:355379 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 552.0
## Mode :character Mode :character Mode :character Median : 600.0
## Mean : 803.9
## 3rd Qu.: 810.0
## Max. :2761.0
## NA's :71
## DST BEAT OFFENSE
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## LOCATION THEFT_CODE FLOOR
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## SIDE OPENING HATE_BIAS
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## DAYOFWEEK RPT_AREA CPD_NEIGHBORHOOD
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## SNA_NEIGHBORHOOD WEAPONS DATE_OF_CLEARANCE HOUR_FROM
## Length:355379 Length:355379 Length:355379 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 130.0
## Mode :character Mode :character Mode :character Median : 230.0
## Mean : 816.9
## 3rd Qu.:1623.0
## Max. :2359.0
## NA's :7
## HOUR_TO ADDRESS_X LONGITUDE_X LATITUDE_X
## Min. : 0.0 Length:355379 Min. :-84.82 Min. :39.05
## 1st Qu.: 145.0 Class :character 1st Qu.:-84.57 1st Qu.:39.12
## Median : 815.0 Mode :character Median :-84.52 Median :39.14
## Mean : 922.3 Mean :-84.52 Mean :39.14
## 3rd Qu.:1651.0 3rd Qu.:-84.49 3rd Qu.:39.16
## Max. :2359.0 Max. :-84.25 Max. :39.36
## NA's :1316 NA's :45864 NA's :45864
## VICTIM_AGE VICTIM_RACE VICTIM_ETHNICITY
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## VICTIM_GENDER SUSPECT_AGE SUSPECT_RACE
## Length:355379 Length:355379 Length:355379
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## SUSPECT_ETHNICITY SUSPECT_GENDER TOTALNUMBERVICTIMS TOTALSUSPECTS
## Length:355379 Length:355379 Min. : 1.000 Min. : 1.00
## Class :character Class :character 1st Qu.: 1.000 1st Qu.: 1.00
## Mode :character Mode :character Median : 1.000 Median : 1.00
## Mean : 1.433 Mean : 1.64
## 3rd Qu.: 1.000 3rd Qu.: 2.00
## Max. :127.000 Max. :16.00
## NA's :146 NA's :194149
## UCR_GROUP COMMUNITY_COUNCIL_NEIGHBORHOOD ZIP
## Length:355379 Length:355379 Min. :0
## Class :character Class :character 1st Qu.:0
## Mode :character Mode :character Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
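The summary shows NA values in seven columns: UCR (71), HOUR_FROM (7), HOUR_TO (1,316), LONGITUDE_X and LATITUDE_X (45,864 each), TOTALNUMBERVICTIMS (146) and TOTALSUSPECTS (194,149). The character columns report no NA values, but the glimpse() output shows that columns such as VICTIM_AGE and VICTIM_RACE use “UNKNOWN” or empty strings for missing entries. ZIP also needs attention: its values print as numbers near zero, so the column was not read as usable ZIP codes.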
Clean data - It is best practice to clean the data before starting the analysis; otherwise, missing or NA values in the dataset can lead to biased or incorrect results.
Identify the NA values - The R code below gives the total count of NA values in the dataset.
sum(is.na(crime_data))
## [1] 287417
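A single total does not show which columns the NA values come from, and it misses placeholders that are not NA at all. A minimal sketch, on invented toy data, of counting NA values per column with colSums() and counting empty or “UNKNOWN” strings separately; the data frame `toy` and the helper names are hypothetical:

```r
# Hypothetical toy data with the same kinds of gaps as the crime dataset.
toy <- data.frame(
  VICTIM_AGE    = c("26-30", "UNKNOWN", "41-50", "UNKNOWN"),
  VICTIM_GENDER = c("FEMALE", "", "MALE", "MALE"),
  TOTALSUSPECTS = c(1L, NA, 2L, NA),
  stringsAsFactors = FALSE
)

# NA count per column: shows which columns drive the overall total.
na_per_column <- colSums(is.na(toy))

# Blank or "UNKNOWN" strings are not NA, so count them separately.
placeholder_per_column <- sapply(toy, function(x) sum(x %in% c("", "UNKNOWN")))

print(na_per_column)
print(placeholder_per_column)
```

On the real data, a per-column view like this would show how much of the total comes from columns that are not used as predictors, which matters when rows are dropped for having an NA anywhere.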
# Drop every row that has an NA in any column (355,379 rows become 137,065)
crime_data_cleaned <- na.omit(crime_data)
sum(is.na(crime_data_cleaned))
## [1] 0
summary(crime_data_cleaned)
## INSTANCEID INCIDENT_NO DATE_REPORTED
## Length:137065 Length:137065 Length:137065
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## DATE_FROM DATE_TO CLSD UCR
## Length:137065 Length:137065 Length:137065 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 551.0
## Mode :character Mode :character Mode :character Median : 701.0
## Mean : 844.2
## 3rd Qu.: 862.0
## Max. :2761.0
## DST BEAT OFFENSE
## Length:137065 Length:137065 Length:137065
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## LOCATION THEFT_CODE FLOOR
## Length:137065 Length:137065 Length:137065
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## SIDE OPENING HATE_BIAS
## Length:137065 Length:137065 Length:137065
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## DAYOFWEEK RPT_AREA CPD_NEIGHBORHOOD
## Length:137065 Length:137065 Length:137065
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## SNA_NEIGHBORHOOD WEAPONS DATE_OF_CLEARANCE HOUR_FROM
## Length:137065 Length:137065 Length:137065 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 140
## Mode :character Mode :character Mode :character Median : 645
## Mean : 913
## 3rd Qu.:1730
## Max. :2359
## HOUR_TO ADDRESS_X LONGITUDE_X LATITUDE_X
## Min. : 0 Length:137065 Min. :-84.82 Min. :39.05
## 1st Qu.: 170 Class :character 1st Qu.:-84.57 1st Qu.:39.12
## Median :1055 Mode :character Median :-84.52 Median :39.14
## Mean :1027 Mean :-84.52 Mean :39.14
## 3rd Qu.:1820 3rd Qu.:-84.49 3rd Qu.:39.16
## Max. :2359 Max. :-84.26 Max. :39.36
## VICTIM_AGE VICTIM_RACE VICTIM_ETHNICITY
## Length:137065 Length:137065 Length:137065
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## VICTIM_GENDER SUSPECT_AGE SUSPECT_RACE
## Length:137065 Length:137065 Length:137065
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## SUSPECT_ETHNICITY SUSPECT_GENDER TOTALNUMBERVICTIMS TOTALSUSPECTS
## Length:137065 Length:137065 Min. : 1.000 Min. : 1.000
## Class :character Class :character 1st Qu.: 1.000 1st Qu.: 1.000
## Mode :character Mode :character Median : 1.000 Median : 1.000
## Mean : 1.375 Mean : 1.611
## 3rd Qu.: 1.000 3rd Qu.: 2.000
## Max. :15.000 Max. :16.000
## UCR_GROUP COMMUNITY_COUNCIL_NEIGHBORHOOD ZIP
## Length:137065 Length:137065 Min. :0
## Class :character Class :character 1st Qu.:0
## Mode :character Mode :character Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
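# Keep the response and the three predictors (columns 10, 28, 29 and 31), selected by name
crime_data_final <- select(crime_data_cleaned, OFFENSE, VICTIM_AGE, VICTIM_RACE, VICTIM_GENDER)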
summary(crime_data_final)
## OFFENSE VICTIM_AGE VICTIM_RACE
## Length:137065 Length:137065 Length:137065
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## VICTIM_GENDER
## Length:137065
## Class :character
## Mode :character
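attach(crime_data_final)
# Note: these assignments create standalone factor vectors;
# the columns inside crime_data_final are not modified.
VICTIM_AGE <- as.factor(VICTIM_AGE)
VICTIM_RACE <- as.factor(VICTIM_RACE)
VICTIM_GENDER <- as.factor(VICTIM_GENDER)
OFFENSE <- as.factor(OFFENSE)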
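Since the summaries of the training and test sets still report class character for every column, one option is to convert the columns inside the data frame itself, so that any model fitted on it sees factors. A minimal sketch using mutate() with across(), on invented toy data; the names `toy` and `toy_factors` are hypothetical:

```r
library(dplyr)

# Hypothetical toy data with the four columns kept for modelling.
toy <- data.frame(
  OFFENSE       = c("THEFT", "ASSAULT", "THEFT"),
  VICTIM_AGE    = c("26-30", "41-50", "26-30"),
  VICTIM_RACE   = c("BLACK", "WHITE", "BLACK"),
  VICTIM_GENDER = c("FEMALE", "MALE", "MALE"),
  stringsAsFactors = FALSE
)

# Convert every column to a factor inside the data frame.
toy_factors <- toy %>%
  mutate(across(everything(), as.factor))

str(toy_factors)
```

Converting before the train/test split also means both partitions share the same factor levels.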
smp_size <- floor(0.7 * nrow(crime_data_final))
set.seed(25000)
train <- sample(seq_len(nrow(crime_data_final)), size = smp_size)
crime_train_data <- crime_data_final[train,]
crime_test_data <- crime_data_final[-train,]
summary(crime_train_data)
## OFFENSE VICTIM_AGE VICTIM_RACE
## Length:95945 Length:95945 Length:95945
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## VICTIM_GENDER
## Length:95945
## Class :character
## Mode :character
summary(crime_test_data)
## OFFENSE VICTIM_AGE VICTIM_RACE
## Length:41120 Length:41120 Length:41120
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## VICTIM_GENDER
## Length:41120
## Class :character
## Mode :character
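summary() on character columns reports only length and class, so it cannot show whether the 70/30 split kept a similar mix of categories in both partitions. A hedged sketch on invented data that compares the response distribution with table() and prop.table(); the seed and the 70% share mirror the code above, while the data and the names are hypothetical:

```r
set.seed(25000)

# Hypothetical response column with invented values.
toy <- data.frame(
  VICTIM_GENDER = sample(c("FEMALE", "MALE"), size = 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Same 70/30 split logic as in the analysis.
smp_size  <- floor(0.7 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = smp_size)
toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]

# Share of each gender in the two partitions.
train_share <- prop.table(table(toy_train$VICTIM_GENDER))
test_share  <- prop.table(table(toy_test$VICTIM_GENDER))

print(round(train_share, 3))
print(round(test_share, 3))
```

If the shares differ noticeably on the real data, a stratified split may be worth considering.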
From the above dataset, my target is to analyse crime based on gender, age, race and ethnicity. Decision Tree and Random Forest analyses are suitable here because the response variable is categorical. My target is an outcome that shows which crime, age, race and ethnicity are most prevalent among men and women in Cincinnati. For example, I expect outcomes like the ones below:
Male - between age 30-40, race - Black, ethnicity - Black American, most prevalent offense is robbery; or
Female - between age 18-30, race - American, ethnicity - American, most prevalent offense is sexual harassment.
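As a rough sketch of the planned decision tree step, the example below fits a classification tree and evaluates it on a held-out set. It assumes the rpart package, which the analysis above does not name; it is one common R implementation of classification trees. The data, including the relationship between offense and gender, is invented so that the tree has a pattern to learn, and it says nothing about the real results.

```r
library(rpart)

set.seed(25000)
n <- 400

# Hypothetical predictors with invented categories.
toy <- data.frame(
  OFFENSE     = factor(sample(c("THEFT", "ASSAULT", "ROBBERY"), n, replace = TRUE)),
  VICTIM_AGE  = factor(sample(c("18-25", "26-30", "31-40"), n, replace = TRUE)),
  VICTIM_RACE = factor(sample(c("BLACK", "WHITE"), n, replace = TRUE))
)

# Invented relationship: robbery victims are mostly male, others mostly female.
p_male <- ifelse(toy$OFFENSE == "ROBBERY", 0.85, 0.30)
toy$VICTIM_GENDER <- factor(ifelse(runif(n) < p_male, "MALE", "FEMALE"))

# 70/30 train/test split, as in the analysis above.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Classification tree with gender as the response.
tree_fit <- rpart(VICTIM_GENDER ~ OFFENSE + VICTIM_AGE + VICTIM_RACE,
                  data = toy_train, method = "class")

# Predict classes for the held-out rows and tabulate against the truth.
pred      <- predict(tree_fit, newdata = toy_test, type = "class")
confusion <- table(predicted = pred, actual = toy_test$VICTIM_GENDER)
accuracy  <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

Printing `tree_fit` lists the splits, which is where statements such as "offense X is most associated with gender Y" would be read off for the real data.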