Going by the Mid Term expectations, all the questions/components are laid out in the sections below:
Problem Statement:
1. Which kinds of locations are most prone to crime?
2. Is there a certain kind of crime that has more of a racial element to it than others?
3. Is there a tendency of a certain age group to indulge in a specific type of crime?
4. What age and gender are the typical targets of each kind of crime?
5. Which areas in Cincinnati experience the most crime, which the least, and which are the safest?
6. What time of day is most unsafe for a particular crime?
Why should I be interested in this?
Answers to the above questions can tremendously help with the prevention of crime and with increased awareness of crime patterns and precautionary measures. They have the potential to directly and positively impact the crime rate.
I plan to do an in-depth analysis of the Cincinnati crime dataset. I will mainly be interested in descriptive analytics, to understand the trends and patterns in the data that concern my particular problem statement, rather than any forecasts or predictions.
I will primarily use summary statistics, data grouping, joins, and correlation analysis.
The most relevant variables of interest for me are going to be:
OFFENSE
LOCATION_CODE
LOCATION_DESC
VICTIM_AGE
VICTIM_RACE
SUSPECT_AGE
SUSPECT_RACE
SUSPECT_GENDER
ZIP
Since my analysis is not a learning problem, I will not require any supervised or unsupervised learning. As per my current knowledge, summary analysis, data grouping and concatenation, joins, distribution plots, scatterplots, and boxplots should enable my analysis. Extended data visualization and analytical techniques will be required for successful results.
Since the consumers of the analysis can be the public as well as the Police, it can significantly help them break down the patterns: the most frequent locations and times, victim-perpetrator correlation tendencies, and the gender component of crimes.
The consumer/audience can use the analysis to come up with preventive and precautionary measures to reduce the incidence of such crimes. Increased awareness and safety measures can make Cincinnati a better and safer place to live. The analysis may also point toward the root causes of certain crimes.
2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.
2.2 Messages and warnings resulting from loading the package are suppressed.
# Since these packages are usually not installed, the commented code to install them is included below
#install.packages('bit64')
#install.packages("bindrcpp")
#Set Home Directory
setwd("/Users/deepak/University/DataWrangling")
#Load the libraries
library(data.table)
library(tidyr)
library(dplyr)
2.3 Explanation is provided regarding the purpose of each package
data.table : The ‘fread’ function from this package is faster than other methods to read csv files, plus it has stringsAsFactors=FALSE by default.
tidyr : This package provides the ‘separate’ function, used to split columns that contain non-atomic information.
dplyr : tidyr works in conjunction with dplyr, so we will need dplyr for our analysis (e.g., for the pipe operator and the grouping/summarising verbs).
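As a tiny illustration of ‘separate’ (using a made-up value in the same format as the dataset’s date columns, not the dataset itself):
# Toy example only: split one string column into date, time and AM/PM parts
toy <- data.frame(stamp = c("03/16/2015 05:19:00 PM +0000",
                            "06/23/2011 11:12:00 AM +0000"),
                  stringsAsFactors = FALSE)
toy %>% separate(stamp, c("DATE", "TIME", "AM_PM"), sep = " ", extra = "drop")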
Ohio Government - Cincinnati Police Data Initiative
The above data is obtained from the Ohio Government - City of Cincinnati Police Data Initiative and is publicly available.
Original Purpose of Data : The data has been recorded and made public by the Ohio government and the Cincinnati Police. The intention is to ensure transparency of crime data and incidents and to enable the general public to be more aware of each and every crime incident reported in the city.
The intention might also be to facilitate analysts in coming up with insights and suggestions to make better decisions related to crime.
Step 1 : Data import. The csv file is imported using the ‘fread’ function of the data.table package, since it is the fastest option.
df = fread('city_of_cincinnati_police_data_initiative_crime_incidents.csv')
##
Read 31.0% of 355379 rows
Read 53.5% of 355379 rows
Read 76.0% of 355379 rows
Read 90.0% of 355379 rows
Read 355379 rows and 40 (of 40) columns from 0.153 GB file in 00:00:06
Step 2 : Have a look at the dimensions of the data and the column names. Rename the columns if there are any special characters or spaces (a hypothetical renaming sketch follows the output below).
#Learning about dimensions -
dim(df) #355379 Rows and 40 Columns
## [1] 355379 40
names(df) #Get Column Names - All Names look OK
## [1] "INSTANCEID" "INCIDENT_NO"
## [3] "DATE_REPORTED" "DATE_FROM"
## [5] "DATE_TO" "CLSD"
## [7] "UCR" "DST"
## [9] "BEAT" "OFFENSE"
## [11] "LOCATION" "THEFT_CODE"
## [13] "FLOOR" "SIDE"
## [15] "OPENING" "HATE_BIAS"
## [17] "DAYOFWEEK" "RPT_AREA"
## [19] "CPD_NEIGHBORHOOD" "SNA_NEIGHBORHOOD"
## [21] "WEAPONS" "DATE_OF_CLEARANCE"
## [23] "HOUR_FROM" "HOUR_TO"
## [25] "ADDRESS_X" "LONGITUDE_X"
## [27] "LATITUDE_X" "VICTIM_AGE"
## [29] "VICTIM_RACE" "VICTIM_ETHNICITY"
## [31] "VICTIM_GENDER" "SUSPECT_AGE"
## [33] "SUSPECT_RACE" "SUSPECT_ETHNICITY"
## [35] "SUSPECT_GENDER" "TOTALNUMBERVICTIMS"
## [37] "TOTALSUSPECTS" "UCR_GROUP"
## [39] "COMMUNITY_COUNCIL_NEIGHBORHOOD" "ZIP"
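None of the names above contain spaces or special characters, so no renaming is needed here. Had any name needed fixing, a minimal (hypothetical) approach would be data.table’s setnames combined with make.names:
# Hypothetical only -- the names above are already clean, so this stays commented out
# setnames(df, old = names(df), new = make.names(names(df)))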
Step 3: Have a peek at the datatypes of the columns:
# Have a look at the datatypes of the columns
str(df)
## Classes 'data.table' and 'data.frame': 355379 obs. of 40 variables:
## $ INSTANCEID : chr "D6D7D173-E416-4571-AF34-A767ACF810D9" "2A569674-9603-4ED9-82BB-783B843B94C9" "D588CE6C-A9FE-47A5-9C62-DADC7792B7E1" "5B95E5DC-E35D-4F86-B77C-02F25C950329" ...
## $ INCIDENT_NO : chr "159006170" "41103934" "21104327" "129019028" ...
## $ DATE_REPORTED : chr "03/16/2015 05:19:00 PM +0000" "06/23/2011 11:12:00 AM +0000" "10/08/2011 04:15:00 AM +0000" "06/25/2012 03:45:00 PM +0000" ...
## $ DATE_FROM : chr "03/16/2015 03:02:00 PM +0000" "06/23/2011 10:20:00 AM +0000" "10/08/2011 02:40:00 AM +0000" "06/25/2012 03:35:00 PM +0000" ...
## $ DATE_TO : chr "03/16/2015 03:05:00 PM +0000" "06/23/2011 10:30:00 AM +0000" "10/08/2011 02:44:00 AM +0000" "06/25/2012 03:40:00 PM +0000" ...
## $ CLSD : chr "J--CLOSED" "J--CLOSED" "D--VICTIM REFUSED TO COOPERATE" "B--PROSECUTION DECLINED" ...
## $ UCR : int 802 810 1493 401 810 550 2024 301 810 600 ...
## $ DST : chr "4" "4" "2" "5" ...
## $ BEAT : chr "2" "3" "3" "4" ...
## $ OFFENSE : chr "AGGRAVATED MENACING" "ASSAULT" "CRIMINAL DAMAGING/ENDANGERING" "FELONIOUS ASSAULT" ...
## $ LOCATION : chr "48-PARKING LOT" "02-MULTI FAMILY" "27-BUY/SELL/TRADE SHOP" "47-STREET" ...
## $ THEFT_CODE : chr "" "" "" "" ...
## $ FLOOR : chr "" "" "" "" ...
## $ SIDE : chr "" "" "" "" ...
## $ OPENING : chr "" "" "" "" ...
## $ HATE_BIAS : chr "N--NO BIAS/NOT APPLICABLE" "N--NO BIAS/NOT APPLICABLE" "N--NO BIAS/NOT APPLICABLE" "N--NO BIAS/NOT APPLICABLE" ...
## $ DAYOFWEEK : chr "MONDAY" "THURSDAY" "SATURDAY" "MONDAY" ...
## $ RPT_AREA : chr "45" "365" "138" "439" ...
## $ CPD_NEIGHBORHOOD : chr "WALNUT HILLS" "AVONDALE" "MADISONVILLE" "COLLEGE HILL" ...
## $ SNA_NEIGHBORHOOD : chr "WALNUT HILLS" "AVONDALE" "MADISONVILLE" "COLLEGE HILL" ...
## $ WEAPONS : chr "99 - NONE" "40--PERSONAL WEAPONS (HANDS, FEET, TEETH, ETC.)" "99--NONE" "11--FIREARM (TYPE NOT STATED)" ...
## $ DATE_OF_CLEARANCE : chr "03/28/2015 12:00:00 AM +0000" "07/16/2011 12:00:00 AM +0000" "11/08/2011 12:00:00 AM +0000" "07/12/2012 12:00:00 AM +0000" ...
## $ HOUR_FROM : int 152 1020 240 1535 1645 1430 0 122 110 180 ...
## $ HOUR_TO : int 155 1030 244 1540 1712 1445 2359 123 111 2030 ...
## $ ADDRESS_X : chr "21XX FULTON AV" "34XX READING RD" "58XX BRAMBLE AV" "15XX MARLOWE AV" ...
## $ LONGITUDE_X : num -84.5 -84.5 -84.4 -84.5 -84.5 ...
## $ LATITUDE_X : num 39.1 39.1 39.2 39.2 39.1 ...
## $ VICTIM_AGE : chr "26-30" "41-50" "UNKNOWN" "18-25" ...
## $ VICTIM_RACE : chr "BLACK" "BLACK" "" "BLACK" ...
## $ VICTIM_ETHNICITY : chr "NOT OF HISPANIC ORIG" "NOT OF HISPANIC ORIG" "" "NOT OF HISPANIC ORIG" ...
## $ VICTIM_GENDER : chr "FEMALE" "MALE" "" "MALE" ...
## $ SUSPECT_AGE : chr "51-60" "UNKNOWN" "18-25" "18-25" ...
## $ SUSPECT_RACE : chr "BLACK" "BLACK" "BLACK" "BLACK" ...
## $ SUSPECT_ETHNICITY : chr "NOT OF HISPANIC ORIG" "NOT OF HISPANIC ORIG" "NOT OF HISPANIC ORIG" "NOT OF HISPANIC ORIG" ...
## $ SUSPECT_GENDER : chr "MALE" "MALE" "FEMALE" "MALE" ...
## $ TOTALNUMBERVICTIMS : int 1 1 1 2 1 2 1 2 1 2 ...
## $ TOTALSUSPECTS : int 1 1 2 4 NA 2 1 2 NA NA ...
## $ UCR_GROUP : chr "PART 2 MINOR" "PART 2 MINOR" "PART 2 MINOR" "AGGRAVATED ASSAULTS" ...
## $ COMMUNITY_COUNCIL_NEIGHBORHOOD: chr "WALNUT HILLS" "AVONDALE" "MADISONVILLE" "COLLEGE HILL" ...
## $ ZIP :integer64 45206 45219 45227 45224 45214 45227 45207 45202 ...
## - attr(*, ".internal.selfref")=<externalptr>
Step 4: Learning about missing values in the dataframe
na_count <- sapply(df, function(y) sum(is.na(y)))
na_count <- data.frame(na_count)
print(na_count)
## na_count
## INSTANCEID 0
## INCIDENT_NO 0
## DATE_REPORTED 0
## DATE_FROM 0
## DATE_TO 0
## CLSD 0
## UCR 71
## DST 0
## BEAT 0
## OFFENSE 0
## LOCATION 0
## THEFT_CODE 0
## FLOOR 0
## SIDE 0
## OPENING 0
## HATE_BIAS 0
## DAYOFWEEK 0
## RPT_AREA 0
## CPD_NEIGHBORHOOD 0
## SNA_NEIGHBORHOOD 0
## WEAPONS 0
## DATE_OF_CLEARANCE 0
## HOUR_FROM 7
## HOUR_TO 1316
## ADDRESS_X 0
## LONGITUDE_X 45864
## LATITUDE_X 45864
## VICTIM_AGE 0
## VICTIM_RACE 0
## VICTIM_ETHNICITY 0
## VICTIM_GENDER 0
## SUSPECT_AGE 0
## SUSPECT_RACE 0
## SUSPECT_ETHNICITY 0
## SUSPECT_GENDER 0
## TOTALNUMBERVICTIMS 146
## TOTALSUSPECTS 194149
## UCR_GROUP 0
## COMMUNITY_COUNCIL_NEIGHBORHOOD 0
## ZIP 176
Step 5: Look at a summary of all the columns in the dataset. Results hidden to avoid clutter in the markdown document.
summary(df)
Step 6: Convert datatypes. As can be seen above, the datatypes of some of the variables are not congruent with their true meaning. Some datatype conversions below:
#Convert to character datatypes
df$LONGITUDE_X <- as.character(df$LONGITUDE_X)
df$LATITUDE_X <- as.character(df$LATITUDE_X)
df$ZIP <- as.character(df$ZIP)
#Convert to Factor datatypes :
df$CLSD<-as.factor(df$CLSD)
df$DAYOFWEEK<-as.factor(df$DAYOFWEEK)
df$VICTIM_AGE <- as.factor(df$VICTIM_AGE)
df$VICTIM_RACE <- as.factor(df$VICTIM_RACE)
df$SUSPECT_AGE <-as.factor(df$SUSPECT_AGE)
df$SUSPECT_RACE <- as.factor(df$SUSPECT_RACE)
df$UCR <- as.factor(df$UCR)
df$HATE_BIAS <- as.factor(df$HATE_BIAS)
df$OFFENSE <- as.factor(df$OFFENSE)
df$THEFT_CODE <- as.factor(df$THEFT_CODE)
Step 7 Since the THEFT_CODE column is applicable only where the crime type is theft, the other values cannot be considered missing. Adding a factor level called ‘Not Applicable’ and replacing the blank values with it.
df$THEFT_CODE[df$THEFT_CODE == ''] <- NA
# Add the 'Not Applicable' level first, then fill the NA values with it
df$THEFT_CODE <- factor(df$THEFT_CODE, levels = c(levels(df$THEFT_CODE), "Not Applicable"))
df$THEFT_CODE[is.na(df$THEFT_CODE)] <- "Not Applicable"
head(df$THEFT_CODE, n=10)
## [1] Not Applicable Not Applicable
## [3] Not Applicable Not Applicable
## [5] Not Applicable Not Applicable
## [7] Not Applicable Not Applicable
## [9] Not Applicable 23F-THEFT FROM MOTOR VEHICLE
## 12 Levels: 23A-POCKET-PICKING 23B-PURSE-SNATCHING ... Not Applicable
Step 8 The date-related columns in the data include date, time and AM/PM information. Separating each of them into separate date, time and AM/PM columns.
# 'Separate' the DATE_REPORTED to > Date , Time and AM/PM Columns
df<- df %>% separate(DATE_REPORTED, c('DATE_REPORTED', 'TIME_REPORTED' , "REPORTED_AM/PM"), sep=" ",extra="drop")
# Separate DATE_FROM to CRIME_DATE , CRIME START TIME AND AM/PM
df<- df %>% separate(DATE_FROM, c('CRIME_START_DATE', 'CRIME_START_TIME' , "CRIME_START_TIME_AM/PM"), sep=" ",extra="drop")
# Separate DATE_TO to CRIME_END_DATE , CRIME END TIME AND AM/PM
df<- df %>% separate(DATE_TO, c('CRIME_END_DATE', 'CRIME_END_TIME' , "CRIME_END_TIME_AM/PM"), sep=" ",extra="drop")
# 'Separate' the DATE_OF_CLEARANCE to > Date , Time and AM/PM Columns
df<- df %>% separate(DATE_OF_CLEARANCE, c('DATE_CLEARANCE', 'TIME_CLEARANCE' , "CLEARANCE_AM/PM"), sep=" ",extra="drop")
Step 9 Some columns in the data include a code as well as a description. Separating such columns into multiple columns.
#Separate CLSD to CLSD_CODE and CLSD_STATUS
df <- df %>% separate(CLSD , c("CLSD_CODE","CLSD_STATUS") , sep="--" ,extra="drop")
#Separate Location code and location description
df <- df %>% separate(LOCATION , c("LOCATION_CODE","LOCATION_DESC") , sep="-" ,extra="drop")
#Separate Hate bias code to hate bias code and hate bias description
df <- df %>% separate(HATE_BIAS , c("HATE_BIAS_CODE","HATE_BIAS_DESC") , sep="--" ,extra="drop")
#Display the new columns
names(df)
## [1] "INSTANCEID" "INCIDENT_NO"
## [3] "DATE_REPORTED" "TIME_REPORTED"
## [5] "REPORTED_AM/PM" "CRIME_START_DATE"
## [7] "CRIME_START_TIME" "CRIME_START_TIME_AM/PM"
## [9] "CRIME_END_DATE" "CRIME_END_TIME"
## [11] "CRIME_END_TIME_AM/PM" "CLSD_CODE"
## [13] "CLSD_STATUS" "UCR"
## [15] "DST" "BEAT"
## [17] "OFFENSE" "LOCATION_CODE"
## [19] "LOCATION_DESC" "THEFT_CODE"
## [21] "FLOOR" "SIDE"
## [23] "OPENING" "HATE_BIAS_CODE"
## [25] "HATE_BIAS_DESC" "DAYOFWEEK"
## [27] "RPT_AREA" "CPD_NEIGHBORHOOD"
## [29] "SNA_NEIGHBORHOOD" "WEAPONS"
## [31] "DATE_CLEARANCE" "TIME_CLEARANCE"
## [33] "CLEARANCE_AM/PM" "HOUR_FROM"
## [35] "HOUR_TO" "ADDRESS_X"
## [37] "LONGITUDE_X" "LATITUDE_X"
## [39] "VICTIM_AGE" "VICTIM_RACE"
## [41] "VICTIM_ETHNICITY" "VICTIM_GENDER"
## [43] "SUSPECT_AGE" "SUSPECT_RACE"
## [45] "SUSPECT_ETHNICITY" "SUSPECT_GENDER"
## [47] "TOTALNUMBERVICTIMS" "TOTALSUSPECTS"
## [49] "UCR_GROUP" "COMMUNITY_COUNCIL_NEIGHBORHOOD"
## [51] "ZIP"
Step 10 Getting rid of the columns that are not required or have too many NA values (a minimal sketch of this dropping step follows the preview below).
Here are the first 10 columns of the dataset, with 50 rows shown to facilitate viewing in the markdown document.
head(df[,1:10],50)
## INSTANCEID INCIDENT_NO DATE_REPORTED
## 1: D6D7D173-E416-4571-AF34-A767ACF810D9 159006170 03/16/2015
## 2: 2A569674-9603-4ED9-82BB-783B843B94C9 41103934 06/23/2011
## 3: D588CE6C-A9FE-47A5-9C62-DADC7792B7E1 21104327 10/08/2011
## 4: 5B95E5DC-E35D-4F86-B77C-02F25C950329 129019028 06/25/2012
## 5: B908DB49-C51D-442B-ADFD-B6E64696071B 189006615 03/13/2018
## 6: 66F569F5-4C8E-4848-BD42-D0F8DE4D42B0 139012258 05/19/2013
## 7: 3B4450F2-F94B-4BFB-8B0A-3FF04DBD73C4 159022242 07/31/2015
## 8: B095D1B2-9089-4CA4-85B2-A2657FB82793 129025276 09/05/2012
## 9: FD5706CD-7E2F-4CED-8C73-2E3AB142EF8C 159023879 09/03/2015
## 10: 417FB2EC-1405-4F31-802B-2D3D7C9A47C3 149021985 09/03/2014
## 11: A6B8356E-5A44-425A-8076-80BA963D09AC 41006181 09/28/2010
## 12: B73C49C6-C88A-4D1D-BC66-EC84CE699C91 129026855 09/18/2012
## 13: 52B174DA-7433-4C0A-A94C-0A26E7D3713B 129028346 09/30/2012
## 14: A0705DF3-CC5D-4040-A3B8-FA5B98B40BDC 159001664 01/20/2015
## 15: 30EC1F31-F9CD-405A-9EED-8A4CAB43A09A 139024562 09/08/2013
## 16: BAE384CC-3685-4C99-BE13-97E645144D35 149021094 08/25/2014
## 17: 076F6E07-A87C-40F6-BC61-4E1CD9863C8D 159026752 09/28/2015
## 18: 65F46FA4-B029-4E37-8BE8-FEF6C4564486 159021785 08/14/2015
## 19: 78540AB3-16DA-4283-A2ED-A198FEF38B05 139020540 08/03/2013
## 20: AC435DB2-A6AA-4E1C-97B7-3E42B91E4DED 31103812 05/11/2011
## 21: 7F13E55C-02C5-4613-838A-F4CC53F2C04A 129034947 11/30/2012
## 22: 44DEBB7D-24AF-4F5C-91E8-21F46ED1C0D4 31110948 12/18/2011
## 23: 021C8584-E569-4298-823F-C99849457FC3 189031592 12/02/2018
## 24: 16D7F925-3B4C-457B-B725-55BF77D72FA3 21002854 07/05/2010
## 25: 62AC4FA8-CEF4-4D03-9D0E-6C7F3349DBDB 11003402 08/02/2010
## 26: FDDC338A-0E55-46E5-9593-C30C6BD08186 169012098 05/21/2016
## 27: ABAAFF2F-DD21-4293-B222-5505D73D15F3 199000362 01/04/2019
## 28: 42D47D0C-6AA9-4201-ABC6-C0AA33B465F8 169026981 10/15/2016
## 29: 5C1A59BC-8F90-4A5D-8401-0DD9053F008F 159026818 09/29/2015
## 30: 56DD43BD-CC02-4798-B1BE-9184AD778806 139008920 04/17/2013
## 31: 3F6B052A-1704-4E79-AAB3-FC72967B800B 189027601 10/13/2018
## 32: A9970E50-9479-420D-BE98-7AC91451C7FB 139009812 04/25/2013
## 33: EE379C67-A6C1-4AD5-8393-79BE76CA4E5B 149023768 09/20/2014
## 34: C649D1EA-6E89-424F-9ACF-26EC22D2093E 169014266 06/14/2016
## 35: 53C77607-34D4-4636-9C29-2E82AAE46F3B 149003000 02/09/2014
## 36: 77E3E4A2-0283-4F7B-B6A7-02604A156923 129022700 08/13/2012
## 37: FF9EB2F2-B7AE-4FD7-9F9E-C5E0AAD87BB7 159014217 06/04/2015
## 38: B959473F-EFE3-454D-AC3E-1A22F459892A 199003980 02/25/2019
## 39: 7C233970-A7BF-4CA4-B0C1-D14290702BB8 139012372 05/17/2013
## 40: D056DFD6-A658-44B5-932B-4A2AD171F34B 31111196 12/29/2011
## 41: E908C057-67C3-46FB-8CD9-694093F850A5 179055172 12/28/2017
## 42: C459A1AA-F082-4BA0-8AD4-FD8986BC44A2 179035611 08/30/2017
## 43: CC700DC7-16EE-430A-8323-7B7DE442D8C5 159018729 07/18/2015
## 44: 19A7FE67-617C-4EEA-8086-4D4DF8339956 169014631 06/17/2016
## 45: 12ABD22E-F678-448F-8761-DAFD7564159F 169013151 06/01/2016
## 46: 0BBCA81E-A18D-4AB7-8B6D-D040CD8611F5 159015710 06/19/2015
## 47: 855251F0-5239-497A-A6C2-00DAE61CA55A 139013469 05/30/2013
## 48: 7FC4688D-28CD-4C3D-B97C-BC44D5F4BE9A 149010875 05/12/2014
## 49: 33E5B505-018B-4B6F-808B-336850B53A54 149001663 01/21/2014
## 50: 41AB9E45-D98A-4D78-ADE6-3A19DFF9C6EC 129034121 11/23/2012
## INSTANCEID INCIDENT_NO DATE_REPORTED
## TIME_REPORTED REPORTED_AM/PM CRIME_START_DATE CRIME_START_TIME
## 1: 05:19:00 PM 03/16/2015 03:02:00
## 2: 11:12:00 AM 06/23/2011 10:20:00
## 3: 04:15:00 AM 10/08/2011 02:40:00
## 4: 03:45:00 PM 06/25/2012 03:35:00
## 5: 05:12:00 PM 03/13/2018 04:45:00
## 6: 02:45:00 PM 05/19/2013 02:30:00
## 7: 12:00:00 AM 07/24/2015 12:00:00
## 8: 01:22:00 AM 09/05/2012 01:22:00
## 9: 01:22:00 AM 09/03/2015 01:10:00
## 10: 08:54:28 AM 08/31/2014 06:00:00
## 11: 03:28:00 PM 09/25/2010 02:30:00
## 12: 11:20:00 AM 09/16/2012 12:00:00
## 13: 08:10:00 PM 09/30/2012 07:55:00
## 14: 06:56:15 PM 01/20/2015 11:00:00
## 15: 12:30:00 PM 09/08/2013 12:15:00
## 16: 01:06:47 AM 08/24/2014 10:00:00
## 17: 01:30:00 PM 09/27/2015 10:23:00
## 18: 04:40:00 PM 08/13/2015 10:30:00
## 19: 05:45:00 AM 08/03/2013 03:44:00
## 20: 09:10:00 AM 05/11/2011 02:00:00
## 21: 11:34:05 PM 11/30/2012 04:00:00
## 22: 02:40:00 PM 12/18/2011 01:50:00
## 23: 10:30:45 AM 12/02/2018 09:07:36
## 24: 01:40:00 PM 07/05/2010 12:40:00
## 25: 03:00:00 PM 07/26/2010 10:00:00
## 26: 10:04:00 PM 05/21/2016 10:01:00
## 27: 07:11:00 PM 01/04/2019 06:50:00
## 28: 05:50:00 PM 10/15/2016 05:45:00
## 29: 12:17:00 AM 09/28/2015 04:00:00
## 30: 09:18:25 AM 04/16/2013 10:00:00
## 31: 10:44:00 PM 10/13/2018 10:00:00
## 32: 09:00:00 PM 04/25/2013 07:25:00
## 33: 01:00:00 AM 09/20/2014 12:55:00
## 34: 03:28:00 AM 06/14/2016 03:20:00
## 35: 10:47:17 PM 02/09/2014 04:00:13
## 36: 01:55:00 PM 08/13/2012 01:45:00
## 37: 02:20:00 PM 05/16/2015 05:56:00
## 38: 10:52:00 PM 02/19/2019 02:30:00
## 39: 02:10:00 PM 05/14/2013 12:05:00
## 40: 12:40:00 PM 12/29/2011 11:50:00
## 41: 05:24:00 PM 12/28/2017 04:00:00
## 42: 09:31:00 PM 08/30/2017 02:00:00
## 43: 01:22:00 PM 07/18/2015 01:15:00
## 44: 02:26:01 PM 05/27/2016 08:45:58
## 45: 09:30:00 AM 05/31/2016 10:00:00
## 46: 01:35:00 PM 06/03/2015 09:30:00
## 47: 08:35:00 PM 05/30/2013 08:31:00
## 48: 02:33:00 PM 05/11/2014 01:00:00
## 49: 11:11:00 AM 01/20/2014 09:30:00
## 50: 10:40:07 AM 11/22/2012 03:00:00
## TIME_REPORTED REPORTED_AM/PM CRIME_START_DATE CRIME_START_TIME
## CRIME_START_TIME_AM/PM CRIME_END_DATE CRIME_END_TIME
## 1: PM 03/16/2015 03:05:00
## 2: AM 06/23/2011 10:30:00
## 3: AM 10/08/2011 02:44:00
## 4: PM 06/25/2012 03:40:00
## 5: PM 03/13/2018 05:12:00
## 6: PM 05/19/2013 02:45:00
## 7: AM 07/24/2015 11:59:00
## 8: AM 09/05/2012 01:23:00
## 9: AM 09/03/2015 01:11:00
## 10: PM 08/31/2014 08:30:00
## 11: PM 09/25/2010 03:00:00
## 12: PM 09/18/2012 08:50:00
## 13: PM 09/30/2012 08:10:00
## 14: AM 01/20/2015 06:30:00
## 15: PM 09/08/2013 12:20:00
## 16: PM 08/24/2014 10:20:00
## 17: PM 09/28/2015 12:30:00
## 18: AM 08/14/2015 03:00:00
## 19: AM 08/03/2013 03:45:00
## 20: AM 05/11/2011 08:00:00
## 21: PM 11/30/2012 04:21:00
## 22: PM 12/18/2011 01:55:00
## 23: AM 12/02/2018 09:09:37
## 24: PM 07/05/2010 12:45:00
## 25: AM 08/02/2010 10:00:00
## 26: PM 05/21/2016 10:04:00
## 27: PM 01/04/2019 07:03:00
## 28: PM 10/15/2016 05:50:00
## 29: PM 09/29/2015 12:15:00
## 30: PM 04/17/2013 06:30:00
## 31: PM 10/13/2018 09:30:00
## 32: PM 04/25/2013 07:30:00
## 33: AM 09/20/2014 01:00:00
## 34: AM 06/14/2016 03:22:00
## 35: PM 02/09/2014 10:30:00
## 36: PM 08/13/2012 01:50:00
## 37: PM 05/23/2015 07:53:00
## 38: PM 02/19/2019 02:37:00
## 39: AM 05/15/2013 08:00:00
## 40: AM 12/29/2011 12:00:00
## 41: PM 12/28/2017 05:00:00
## 42: PM 08/30/2017 02:20:00
## 43: PM 07/18/2015 01:20:00
## 44: PM 06/15/2016 09:00:39
## 45: PM 06/01/2016 09:00:00
## 46: AM 06/03/2015 04:18:00
## 47: PM 05/30/2013 08:31:00
## 48: PM 05/11/2014 10:00:00
## 49: PM 01/20/2014 11:18:00
## 50: AM 11/22/2012 03:30:00
## CRIME_START_TIME_AM/PM CRIME_END_DATE CRIME_END_TIME
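Below is a minimal sketch of the dropping step. The exact columns to remove are a judgment call; FLOOR, SIDE and OPENING (largely blank) and TOTALSUSPECTS (mostly NA, per the counts in Step 4) are used here purely as illustrative candidates.
# Drop illustrative candidate columns (data.table syntax); adjust the list as needed
drop_cols <- c("FLOOR", "SIDE", "OPENING", "TOTALSUSPECTS")
df[, (drop_cols) := NULL]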
Following are the variables of concern for my analysis; the dataset has 51 features and around 355,000 incidents.
str(df[,c("OFFENSE","LOCATION_DESC","VICTIM_AGE","VICTIM_RACE","SUSPECT_AGE","SUSPECT_RACE","ZIP")])
## Classes 'data.table' and 'data.frame': 355379 obs. of 7 variables:
## $ OFFENSE : Factor w/ 202 levels "","ABDUCTION",..: 15 24 40 82 24 13 69 150 24 175 ...
## $ LOCATION_DESC: chr "PARKING LOT" "MULTI FAMILY" "BUY/SELL/TRADE SHOP" "STREET" ...
## $ VICTIM_AGE : Factor w/ 13 levels "","00","18-25",..: 4 6 13 3 6 3 12 5 6 3 ...
## $ VICTIM_RACE : Factor w/ 9 levels "","AMERICAN IINDIAN/ALA",..: 6 6 1 6 6 9 6 9 6 6 ...
## $ SUSPECT_AGE : Factor w/ 9 levels "18-25","26-30",..: 5 9 1 1 9 9 2 9 9 9 ...
## $ SUSPECT_RACE : Factor w/ 9 levels "","AMERICAN IINDIAN/ALA",..: 6 6 6 6 1 6 6 6 1 1 ...
## $ ZIP : chr "45206" "45219" "45227" "45224" ...
## - attr(*, ".internal.selfref")=<externalptr>
Following are a few key operations to answer the questions laid out in my problem statement (a rough sketch of some of them follows this list):
1. Slice and group the data by crime type.
2. Compute derived columns from the crime start and end date/time columns to find the duration of each crime.
3. Convert the times to 24-hour format so they can be used directly in visualizations.
4. Create a new data frame named ‘THEFT CODE’ that contains the theft code and description, and use it for a separate analysis specific to theft crimes.
5. Produce the key summary statistics related to my analysis in the form of data tables and box plots.
6. Build correlation tables across the various numeric variables.
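A rough sketch of a couple of these operations, using the cleaned columns created above; the parsing details (especially the locale-dependent AM/PM handling) are illustrative and may need refinement.
# 1. Incident counts by crime type
crime_counts <- df %>%
  group_by(OFFENSE) %>%
  summarise(incidents = n()) %>%
  arrange(desc(incidents))

# 2./3. Rebuild full 24-hour timestamps from the separated parts and compute durations
start <- as.POSIXct(paste(df$CRIME_START_DATE, df$CRIME_START_TIME, df$`CRIME_START_TIME_AM/PM`),
                    format = "%m/%d/%Y %I:%M:%S %p")
end   <- as.POSIXct(paste(df$CRIME_END_DATE, df$CRIME_END_TIME, df$`CRIME_END_TIME_AM/PM`),
                    format = "%m/%d/%Y %I:%M:%S %p")
df$CRIME_DURATION_MIN <- as.numeric(difftime(end, start, units = "mins"))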
Box plots : To detect outliers and understand the frequent and median patterns in my analysis of interest.
Histograms : To understand the distribution of various variables in my data.
Line charts : To understand the crime trends of various types over time.
Pie charts / bar graphs : To understand the statuses of various crime types in our data.
Scatter plots : To understand any dependencies and correlations among the variables.
(A minimal sketch of two of these plots follows.)
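A minimal sketch of two of the plots described above, using base R graphics (ggplot2 or similar could equally be used; nothing here is final):
# Bar graph: ten most frequent crime locations
loc_counts <- sort(table(df$LOCATION_DESC), decreasing = TRUE)[1:10]
barplot(loc_counts, las = 2, cex.names = 0.6, ylab = "Incidents",
        main = "Top 10 crime locations")

# Box plot: crime start hour (HOUR_FROM, recorded as HHMM) by day of week
boxplot(HOUR_FROM ~ DAYOFWEEK, data = df, las = 2, ylab = "HOUR_FROM (HHMM)",
        main = "Crime start hour by day of week")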
Visualization, data analysis methods, statistical thinking, and writing more complex functions that make my code more efficient are some of the things that will help me write good code and answer the questions I have.
I plan to use classification to predict the age range of the perpetrator with a supervised learning algorithm, and then measure its accuracy, in case it can provide a starting point for investigation when a case has no leads. I will include this only if there are enough training examples and the model accuracy statistics make sense.
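Purely as an exploratory sketch of how this could be set up (assuming the e1071 package, which is not among the packages loaded above; the predictors and the naive Bayes model are illustrative choices only):
# install.packages('e1071')   # if not already installed
library(e1071)

# Keep only records where the suspect age range is known
model_df <- as.data.frame(df)
model_df <- model_df[model_df$SUSPECT_AGE != "UNKNOWN", ]
model_df$SUSPECT_AGE   <- droplevels(model_df$SUSPECT_AGE)
model_df$LOCATION_CODE <- factor(model_df$LOCATION_CODE)

# 70/30 train-test split
set.seed(42)
train_idx <- sample(nrow(model_df), floor(0.7 * nrow(model_df)))
train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]

# Naive Bayes classifier for the suspect age range
fit  <- naiveBayes(SUSPECT_AGE ~ OFFENSE + LOCATION_CODE + DAYOFWEEK, data = train)
pred <- predict(fit, test)
mean(as.character(pred) == as.character(test$SUSPECT_AGE))   # hold-out accuracy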