Following the midterm expectations, all the questions/components are laid out in the sections below:

1. Introduction

1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?

Problem Statement:
1. Which kinds of locations are most prone to crime?
2. Is there a certain kind of crime that has more of a racial element to it than others?
3. Is there a tendency of a certain age group to indulge in a specific type of crime?
4. What age and gender are the typical targets of each kind of crime?
5. Which areas in Cincinnati experience the most crime, which the least, and which areas are the safest?
6. What time of day is most unsafe for a particular crime?

Why should I be interested in this?
Answers to the above questions can greatly help with crime prevention and with increased awareness of crime patterns and precautionary measures, and they have the potential to directly reduce the crime rate.

1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed)

I plan to do an in-depth analysis of the Cincinnati crime dataset. I will mainly be interested in descriptive analytics, understanding the trends and patterns in the data that concern my problem statement, rather than in any forecasts or predictions.
I will primarily use summary statistics, grouping, joining and correlation analysis to arrive at my conclusions.
The most relevant variables of interest are:
OFFENSE
LOCATION_CODE
LOCATION_DESC
VICTIM_AGE
VICTIM_RACE
SUSPECT_AGE
SUSPECT_RACE
SUSPECT_GENDER
ZIP

1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.

Since my analysis is not a learning problem, I will not require any supervised or unsupervised learning. As per my current knowledge, summary analysis, data grouping and concatenation, joins, distribution plots, scatterplots, and boxplots should enable my analysis. Additional data visualization and analytical techniques may be required for successful results.
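
As a small illustration of the kind of grouping and summarising I have in mind, the sketch below counts incidents by offense type and location. This is a sketch only: the column names refer to the cleaned data frame 'df' built in Section 3, and the exact summaries may change as the analysis evolves.

# Sketch: count incidents per offense type and location with dplyr
# (assumes 'df' is the cleaned crime data frame prepared in Section 3)
crime_by_location <- df %>%
  group_by(OFFENSE, LOCATION_DESC) %>%
  summarise(incidents = n()) %>%
  arrange(desc(incidents))

head(crime_by_location, 10)   # the most frequent offense/location combinations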

1.4 Explain how your analysis will help the consumer of your analysis.

Since the consumers of the analysis can be the public as well as the police, it can significantly help them break down the patterns: the most frequent locations and times, victim-suspect correlation tendencies, and the gender component of crimes.
The consumers/audience can use the analysis to come up with preventive and precautionary measures to reduce the incidence of such crimes. Increased awareness and safety measures can make Cincinnati a better and safer place to live. The analysis may also point toward the root causes of certain crimes.

2. Packages Required

2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.
2.2 Messages and warnings resulting from loading the package are suppressed.

#Since these packages are usually not pre-installed, the install commands are included here (commented out)

#install.packages('bit64')      # provides the integer64 class used by data.table (ZIP is read as integer64)
#install.packages("bindrcpp")   # dplyr back-end dependency

#Set the working directory
setwd("/Users/deepak/University/DataWrangling")

#Load the libraries, suppressing their startup messages (see 2.2)
suppressPackageStartupMessages({
  library(data.table)
  library(tidyr)
  library(dplyr)
})

2.3 Explanation is provided regarding the purpose of each package
data.table : the ‘fread’ function from this package is faster than other ways of reading csv files, and it uses stringsAsFactors = FALSE by default.
tidyr : provides the ‘separate’ function used to split columns that hold non-atomic information.
dplyr : tidyr works in conjunction with dplyr, so dplyr is also needed for the analysis.

3. Data Preparation

3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.

Ohio Government - City of Cincinnati Police Data Initiative

The above data is obtained from the Ohio Government - City of Cincinnati Police Data Initiative and is publicly available.

3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.).

Original Purpose of the Data : The data has been recorded and made public by the Ohio government and the Cincinnati Police. The intention is to ensure transparency around crime data and incidents and to enable the general public to be more aware of every crime incident reported in the city.
The intention might also be to enable analysts to come up with insights and suggestions that support better crime-related decisions.

3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.

Step 1: Data import. The csv file is imported using the ‘fread’ function from the data.table package, since it is the fastest option for a file of this size.

df = fread('city_of_cincinnati_police_data_initiative_crime_incidents.csv')
## 
Read 31.0% of 355379 rows
Read 53.5% of 355379 rows
Read 76.0% of 355379 rows
Read 90.0% of 355379 rows
Read 355379 rows and 40 (of 40) columns from 0.153 GB file in 00:00:06

Step 2: Look at the dimensions of the data and the column names. Rename any columns that contain special characters or spaces (a sketch of how this could be done follows the output below).

#Learn about the dimensions
dim(df)               # 355379 rows and 40 columns
## [1] 355379     40
names(df)             # Get the column names - all names look OK
##  [1] "INSTANCEID"                     "INCIDENT_NO"                   
##  [3] "DATE_REPORTED"                  "DATE_FROM"                     
##  [5] "DATE_TO"                        "CLSD"                          
##  [7] "UCR"                            "DST"                           
##  [9] "BEAT"                           "OFFENSE"                       
## [11] "LOCATION"                       "THEFT_CODE"                    
## [13] "FLOOR"                          "SIDE"                          
## [15] "OPENING"                        "HATE_BIAS"                     
## [17] "DAYOFWEEK"                      "RPT_AREA"                      
## [19] "CPD_NEIGHBORHOOD"               "SNA_NEIGHBORHOOD"              
## [21] "WEAPONS"                        "DATE_OF_CLEARANCE"             
## [23] "HOUR_FROM"                      "HOUR_TO"                       
## [25] "ADDRESS_X"                      "LONGITUDE_X"                   
## [27] "LATITUDE_X"                     "VICTIM_AGE"                    
## [29] "VICTIM_RACE"                    "VICTIM_ETHNICITY"              
## [31] "VICTIM_GENDER"                  "SUSPECT_AGE"                   
## [33] "SUSPECT_RACE"                   "SUSPECT_ETHNICITY"             
## [35] "SUSPECT_GENDER"                 "TOTALNUMBERVICTIMS"            
## [37] "TOTALSUSPECTS"                  "UCR_GROUP"                     
## [39] "COMMUNITY_COUNCIL_NEIGHBORHOOD" "ZIP"

Step 3: Take a peek at the datatypes of the columns:

# Have a look at the datatypes of the columns
str(df)
## Classes 'data.table' and 'data.frame':   355379 obs. of  40 variables:
##  $ INSTANCEID                    : chr  "D6D7D173-E416-4571-AF34-A767ACF810D9" "2A569674-9603-4ED9-82BB-783B843B94C9" "D588CE6C-A9FE-47A5-9C62-DADC7792B7E1" "5B95E5DC-E35D-4F86-B77C-02F25C950329" ...
##  $ INCIDENT_NO                   : chr  "159006170" "41103934" "21104327" "129019028" ...
##  $ DATE_REPORTED                 : chr  "03/16/2015 05:19:00 PM +0000" "06/23/2011 11:12:00 AM +0000" "10/08/2011 04:15:00 AM +0000" "06/25/2012 03:45:00 PM +0000" ...
##  $ DATE_FROM                     : chr  "03/16/2015 03:02:00 PM +0000" "06/23/2011 10:20:00 AM +0000" "10/08/2011 02:40:00 AM +0000" "06/25/2012 03:35:00 PM +0000" ...
##  $ DATE_TO                       : chr  "03/16/2015 03:05:00 PM +0000" "06/23/2011 10:30:00 AM +0000" "10/08/2011 02:44:00 AM +0000" "06/25/2012 03:40:00 PM +0000" ...
##  $ CLSD                          : chr  "J--CLOSED" "J--CLOSED" "D--VICTIM REFUSED TO COOPERATE" "B--PROSECUTION DECLINED" ...
##  $ UCR                           : int  802 810 1493 401 810 550 2024 301 810 600 ...
##  $ DST                           : chr  "4" "4" "2" "5" ...
##  $ BEAT                          : chr  "2" "3" "3" "4" ...
##  $ OFFENSE                       : chr  "AGGRAVATED MENACING" "ASSAULT" "CRIMINAL DAMAGING/ENDANGERING" "FELONIOUS ASSAULT" ...
##  $ LOCATION                      : chr  "48-PARKING LOT" "02-MULTI FAMILY" "27-BUY/SELL/TRADE SHOP" "47-STREET" ...
##  $ THEFT_CODE                    : chr  "" "" "" "" ...
##  $ FLOOR                         : chr  "" "" "" "" ...
##  $ SIDE                          : chr  "" "" "" "" ...
##  $ OPENING                       : chr  "" "" "" "" ...
##  $ HATE_BIAS                     : chr  "N--NO BIAS/NOT APPLICABLE" "N--NO BIAS/NOT APPLICABLE" "N--NO BIAS/NOT APPLICABLE" "N--NO BIAS/NOT APPLICABLE" ...
##  $ DAYOFWEEK                     : chr  "MONDAY" "THURSDAY" "SATURDAY" "MONDAY" ...
##  $ RPT_AREA                      : chr  "45" "365" "138" "439" ...
##  $ CPD_NEIGHBORHOOD              : chr  "WALNUT HILLS" "AVONDALE" "MADISONVILLE" "COLLEGE  HILL" ...
##  $ SNA_NEIGHBORHOOD              : chr  "WALNUT HILLS" "AVONDALE" "MADISONVILLE" "COLLEGE HILL" ...
##  $ WEAPONS                       : chr  "99 - NONE" "40--PERSONAL WEAPONS (HANDS, FEET, TEETH, ETC.)" "99--NONE" "11--FIREARM (TYPE NOT STATED)" ...
##  $ DATE_OF_CLEARANCE             : chr  "03/28/2015 12:00:00 AM +0000" "07/16/2011 12:00:00 AM +0000" "11/08/2011 12:00:00 AM +0000" "07/12/2012 12:00:00 AM +0000" ...
##  $ HOUR_FROM                     : int  152 1020 240 1535 1645 1430 0 122 110 180 ...
##  $ HOUR_TO                       : int  155 1030 244 1540 1712 1445 2359 123 111 2030 ...
##  $ ADDRESS_X                     : chr  "21XX FULTON AV" "34XX READING RD" "58XX BRAMBLE AV" "15XX MARLOWE AV" ...
##  $ LONGITUDE_X                   : num  -84.5 -84.5 -84.4 -84.5 -84.5 ...
##  $ LATITUDE_X                    : num  39.1 39.1 39.2 39.2 39.1 ...
##  $ VICTIM_AGE                    : chr  "26-30" "41-50" "UNKNOWN" "18-25" ...
##  $ VICTIM_RACE                   : chr  "BLACK" "BLACK" "" "BLACK" ...
##  $ VICTIM_ETHNICITY              : chr  "NOT OF HISPANIC ORIG" "NOT OF HISPANIC ORIG" "" "NOT OF HISPANIC ORIG" ...
##  $ VICTIM_GENDER                 : chr  "FEMALE" "MALE" "" "MALE" ...
##  $ SUSPECT_AGE                   : chr  "51-60" "UNKNOWN" "18-25" "18-25" ...
##  $ SUSPECT_RACE                  : chr  "BLACK" "BLACK" "BLACK" "BLACK" ...
##  $ SUSPECT_ETHNICITY             : chr  "NOT OF HISPANIC ORIG" "NOT OF HISPANIC ORIG" "NOT OF HISPANIC ORIG" "NOT OF HISPANIC ORIG" ...
##  $ SUSPECT_GENDER                : chr  "MALE" "MALE" "FEMALE" "MALE" ...
##  $ TOTALNUMBERVICTIMS            : int  1 1 1 2 1 2 1 2 1 2 ...
##  $ TOTALSUSPECTS                 : int  1 1 2 4 NA 2 1 2 NA NA ...
##  $ UCR_GROUP                     : chr  "PART 2 MINOR" "PART 2 MINOR" "PART 2 MINOR" "AGGRAVATED ASSAULTS" ...
##  $ COMMUNITY_COUNCIL_NEIGHBORHOOD: chr  "WALNUT HILLS" "AVONDALE" "MADISONVILLE" "COLLEGE HILL" ...
##  $ ZIP                           :integer64 45206 45219 45227 45224 45214 45227 45207 45202 ... 
##  - attr(*, ".internal.selfref")=<externalptr>

Step 4: Count the missing (NA) values in each column of the data frame

na_count <- sapply(df, function(y) sum(is.na(y)))
na_count <- data.frame(na_count)
print(na_count)
##                                na_count
## INSTANCEID                            0
## INCIDENT_NO                           0
## DATE_REPORTED                         0
## DATE_FROM                             0
## DATE_TO                               0
## CLSD                                  0
## UCR                                  71
## DST                                   0
## BEAT                                  0
## OFFENSE                               0
## LOCATION                              0
## THEFT_CODE                            0
## FLOOR                                 0
## SIDE                                  0
## OPENING                               0
## HATE_BIAS                             0
## DAYOFWEEK                             0
## RPT_AREA                              0
## CPD_NEIGHBORHOOD                      0
## SNA_NEIGHBORHOOD                      0
## WEAPONS                               0
## DATE_OF_CLEARANCE                     0
## HOUR_FROM                             7
## HOUR_TO                            1316
## ADDRESS_X                             0
## LONGITUDE_X                       45864
## LATITUDE_X                        45864
## VICTIM_AGE                            0
## VICTIM_RACE                           0
## VICTIM_ETHNICITY                      0
## VICTIM_GENDER                         0
## SUSPECT_AGE                           0
## SUSPECT_RACE                          0
## SUSPECT_ETHNICITY                     0
## SUSPECT_GENDER                        0
## TOTALNUMBERVICTIMS                  146
## TOTALSUSPECTS                    194149
## UCR_GROUP                             0
## COMMUNITY_COUNCIL_NEIGHBORHOOD        0
## ZIP                                 176

Step 5: Look at the summary of all the columns in the dataset. Results are hidden to avoid clutter in the markdown document.

summary(df)

Step 6: Convert datatypes. As seen above, the datatypes of some variables are not congruent with their true meaning. Some datatype conversions are below:

#Convert to  character datatypes
df$LONGITUDE_X <- as.character(df$LONGITUDE_X)
df$LATITUDE_X <- as.character(df$LATITUDE_X)
df$ZIP <- as.character(df$ZIP)


#Convert to Factor datatypes :
df$CLSD<-as.factor(df$CLSD)
df$DAYOFWEEK<-as.factor(df$DAYOFWEEK)
df$VICTIM_AGE <- as.factor(df$VICTIM_AGE)
df$VICTIM_RACE <- as.factor((df$VICTIM_RACE))
df$SUSPECT_AGE <-as.factor(df$SUSPECT_AGE)
df$SUSPECT_RACE <- as.factor(df$SUSPECT_RACE)
df$UCR <- as.factor(df$UCR)
df$HATE_BIAS <- as.factor(df$HATE_BIAS)
df$OFFENSE <- as.factor(df$OFFENSE)
df$THEFT_CODE <- as.factor(df$THEFT_CODE)

Step 7: Since the THEFT_CODE column is only applicable where the crime type is theft, the blank values elsewhere cannot be considered missing. Adding a factor level called ‘Not Applicable’ and replacing the blank values with it.

# Mark blank entries as NA, add the new level, then fill the NAs with it
df$THEFT_CODE[df$THEFT_CODE == ''] <- NA
df$THEFT_CODE <- factor(df$THEFT_CODE, levels = c(levels(df$THEFT_CODE), "Not Applicable"))
df$THEFT_CODE[is.na(df$THEFT_CODE)] <- "Not Applicable"
head(df$THEFT_CODE, n=10)
##  [1] Not Applicable               Not Applicable              
##  [3] Not Applicable               Not Applicable              
##  [5] Not Applicable               Not Applicable              
##  [7] Not Applicable               Not Applicable              
##  [9] Not Applicable               23F-THEFT FROM MOTOR VEHICLE
## 12 Levels:  23A-POCKET-PICKING 23B-PURSE-SNATCHING ... Not Applicable

Step 8: The date-related columns in the data combine date, time and AM/PM information in a single field. Splitting each of them into separate columns.

# Separate DATE_REPORTED into DATE_REPORTED, TIME_REPORTED and REPORTED_AM/PM
df <- df %>% separate(DATE_REPORTED, c("DATE_REPORTED", "TIME_REPORTED", "REPORTED_AM/PM"), sep = " ", extra = "drop")

# Separate DATE_FROM into CRIME_START_DATE, CRIME_START_TIME and CRIME_START_TIME_AM/PM
df <- df %>% separate(DATE_FROM, c("CRIME_START_DATE", "CRIME_START_TIME", "CRIME_START_TIME_AM/PM"), sep = " ", extra = "drop")

# Separate DATE_TO into CRIME_END_DATE, CRIME_END_TIME and CRIME_END_TIME_AM/PM
df <- df %>% separate(DATE_TO, c("CRIME_END_DATE", "CRIME_END_TIME", "CRIME_END_TIME_AM/PM"), sep = " ", extra = "drop")

# Separate DATE_OF_CLEARANCE into DATE_CLEARANCE, TIME_CLEARANCE and CLEARANCE_AM/PM
df <- df %>% separate(DATE_OF_CLEARANCE, c("DATE_CLEARANCE", "TIME_CLEARANCE", "CLEARANCE_AM/PM"), sep = " ", extra = "drop")

Step 9: Some columns in the data combine a code and a description in a single field. Splitting such columns into separate code and description columns.

#Separate CLSD into CLSD_CODE and CLSD_STATUS
df <- df %>% separate(CLSD, c("CLSD_CODE", "CLSD_STATUS"), sep = "--", extra = "drop")
#Separate LOCATION into LOCATION_CODE and LOCATION_DESC
df <- df %>% separate(LOCATION, c("LOCATION_CODE", "LOCATION_DESC"), sep = "-", extra = "drop")
#Separate HATE_BIAS into HATE_BIAS_CODE and HATE_BIAS_DESC
df <- df %>% separate(HATE_BIAS, c("HATE_BIAS_CODE", "HATE_BIAS_DESC"), sep = "--", extra = "drop")

#Display the new columns
names(df)
##  [1] "INSTANCEID"                     "INCIDENT_NO"                   
##  [3] "DATE_REPORTED"                  "TIME_REPORTED"                 
##  [5] "REPORTED_AM/PM"                 "CRIME_START_DATE"              
##  [7] "CRIME_START_TIME"               "CRIME_START_TIME_AM/PM"        
##  [9] "CRIME_END_DATE"                 "CRIME_END_TIME"                
## [11] "CRIME_END_TIME_AM/PM"           "CLSD_CODE"                     
## [13] "CLSD_STATUS"                    "UCR"                           
## [15] "DST"                            "BEAT"                          
## [17] "OFFENSE"                        "LOCATION_CODE"                 
## [19] "LOCATION_DESC"                  "THEFT_CODE"                    
## [21] "FLOOR"                          "SIDE"                          
## [23] "OPENING"                        "HATE_BIAS_CODE"                
## [25] "HATE_BIAS_DESC"                 "DAYOFWEEK"                     
## [27] "RPT_AREA"                       "CPD_NEIGHBORHOOD"              
## [29] "SNA_NEIGHBORHOOD"               "WEAPONS"                       
## [31] "DATE_CLEARANCE"                 "TIME_CLEARANCE"                
## [33] "CLEARANCE_AM/PM"                "HOUR_FROM"                     
## [35] "HOUR_TO"                        "ADDRESS_X"                     
## [37] "LONGITUDE_X"                    "LATITUDE_X"                    
## [39] "VICTIM_AGE"                     "VICTIM_RACE"                   
## [41] "VICTIM_ETHNICITY"               "VICTIM_GENDER"                 
## [43] "SUSPECT_AGE"                    "SUSPECT_RACE"                  
## [45] "SUSPECT_ETHNICITY"              "SUSPECT_GENDER"                
## [47] "TOTALNUMBERVICTIMS"             "TOTALSUSPECTS"                 
## [49] "UCR_GROUP"                      "COMMUNITY_COUNCIL_NEIGHBORHOOD"
## [51] "ZIP"

Step 10: Get rid of the columns that are not required or that have too many NA values (a sketch of this step is shown below).
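
No code is shown for this step yet. A minimal sketch of how it could be done follows; the column list is only an assumption (FLOOR, SIDE and OPENING appear largely blank and TOTALSUSPECTS has the largest NA count in Step 4) and should be revisited before the final analysis.

# Sketch only (not yet applied): drop columns that are mostly blank or NA
# The columns listed here are an assumption and may change
cols_to_drop <- c("FLOOR", "SIDE", "OPENING", "TOTALSUSPECTS")
df_trimmed <- df %>% select(-one_of(cols_to_drop))
dim(df_trimmed)   # confirms the reduced column count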

3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.

Here are the first 10 columns of the dataset, limited to 50 rows, to facilitate viewing in the markdown document.

head(df[,1:10],50)
##                               INSTANCEID INCIDENT_NO DATE_REPORTED
##  1: D6D7D173-E416-4571-AF34-A767ACF810D9   159006170    03/16/2015
##  2: 2A569674-9603-4ED9-82BB-783B843B94C9    41103934    06/23/2011
##  3: D588CE6C-A9FE-47A5-9C62-DADC7792B7E1    21104327    10/08/2011
##  4: 5B95E5DC-E35D-4F86-B77C-02F25C950329   129019028    06/25/2012
##  5: B908DB49-C51D-442B-ADFD-B6E64696071B   189006615    03/13/2018
##  6: 66F569F5-4C8E-4848-BD42-D0F8DE4D42B0   139012258    05/19/2013
##  7: 3B4450F2-F94B-4BFB-8B0A-3FF04DBD73C4   159022242    07/31/2015
##  8: B095D1B2-9089-4CA4-85B2-A2657FB82793   129025276    09/05/2012
##  9: FD5706CD-7E2F-4CED-8C73-2E3AB142EF8C   159023879    09/03/2015
## 10: 417FB2EC-1405-4F31-802B-2D3D7C9A47C3   149021985    09/03/2014
## 11: A6B8356E-5A44-425A-8076-80BA963D09AC    41006181    09/28/2010
## 12: B73C49C6-C88A-4D1D-BC66-EC84CE699C91   129026855    09/18/2012
## 13: 52B174DA-7433-4C0A-A94C-0A26E7D3713B   129028346    09/30/2012
## 14: A0705DF3-CC5D-4040-A3B8-FA5B98B40BDC   159001664    01/20/2015
## 15: 30EC1F31-F9CD-405A-9EED-8A4CAB43A09A   139024562    09/08/2013
## 16: BAE384CC-3685-4C99-BE13-97E645144D35   149021094    08/25/2014
## 17: 076F6E07-A87C-40F6-BC61-4E1CD9863C8D   159026752    09/28/2015
## 18: 65F46FA4-B029-4E37-8BE8-FEF6C4564486   159021785    08/14/2015
## 19: 78540AB3-16DA-4283-A2ED-A198FEF38B05   139020540    08/03/2013
## 20: AC435DB2-A6AA-4E1C-97B7-3E42B91E4DED    31103812    05/11/2011
## 21: 7F13E55C-02C5-4613-838A-F4CC53F2C04A   129034947    11/30/2012
## 22: 44DEBB7D-24AF-4F5C-91E8-21F46ED1C0D4    31110948    12/18/2011
## 23: 021C8584-E569-4298-823F-C99849457FC3   189031592    12/02/2018
## 24: 16D7F925-3B4C-457B-B725-55BF77D72FA3    21002854    07/05/2010
## 25: 62AC4FA8-CEF4-4D03-9D0E-6C7F3349DBDB    11003402    08/02/2010
## 26: FDDC338A-0E55-46E5-9593-C30C6BD08186   169012098    05/21/2016
## 27: ABAAFF2F-DD21-4293-B222-5505D73D15F3   199000362    01/04/2019
## 28: 42D47D0C-6AA9-4201-ABC6-C0AA33B465F8   169026981    10/15/2016
## 29: 5C1A59BC-8F90-4A5D-8401-0DD9053F008F   159026818    09/29/2015
## 30: 56DD43BD-CC02-4798-B1BE-9184AD778806   139008920    04/17/2013
## 31: 3F6B052A-1704-4E79-AAB3-FC72967B800B   189027601    10/13/2018
## 32: A9970E50-9479-420D-BE98-7AC91451C7FB   139009812    04/25/2013
## 33: EE379C67-A6C1-4AD5-8393-79BE76CA4E5B   149023768    09/20/2014
## 34: C649D1EA-6E89-424F-9ACF-26EC22D2093E   169014266    06/14/2016
## 35: 53C77607-34D4-4636-9C29-2E82AAE46F3B   149003000    02/09/2014
## 36: 77E3E4A2-0283-4F7B-B6A7-02604A156923   129022700    08/13/2012
## 37: FF9EB2F2-B7AE-4FD7-9F9E-C5E0AAD87BB7   159014217    06/04/2015
## 38: B959473F-EFE3-454D-AC3E-1A22F459892A   199003980    02/25/2019
## 39: 7C233970-A7BF-4CA4-B0C1-D14290702BB8   139012372    05/17/2013
## 40: D056DFD6-A658-44B5-932B-4A2AD171F34B    31111196    12/29/2011
## 41: E908C057-67C3-46FB-8CD9-694093F850A5   179055172    12/28/2017
## 42: C459A1AA-F082-4BA0-8AD4-FD8986BC44A2   179035611    08/30/2017
## 43: CC700DC7-16EE-430A-8323-7B7DE442D8C5   159018729    07/18/2015
## 44: 19A7FE67-617C-4EEA-8086-4D4DF8339956   169014631    06/17/2016
## 45: 12ABD22E-F678-448F-8761-DAFD7564159F   169013151    06/01/2016
## 46: 0BBCA81E-A18D-4AB7-8B6D-D040CD8611F5   159015710    06/19/2015
## 47: 855251F0-5239-497A-A6C2-00DAE61CA55A   139013469    05/30/2013
## 48: 7FC4688D-28CD-4C3D-B97C-BC44D5F4BE9A   149010875    05/12/2014
## 49: 33E5B505-018B-4B6F-808B-336850B53A54   149001663    01/21/2014
## 50: 41AB9E45-D98A-4D78-ADE6-3A19DFF9C6EC   129034121    11/23/2012
##                               INSTANCEID INCIDENT_NO DATE_REPORTED
##     TIME_REPORTED REPORTED_AM/PM CRIME_START_DATE CRIME_START_TIME
##  1:      05:19:00             PM       03/16/2015         03:02:00
##  2:      11:12:00             AM       06/23/2011         10:20:00
##  3:      04:15:00             AM       10/08/2011         02:40:00
##  4:      03:45:00             PM       06/25/2012         03:35:00
##  5:      05:12:00             PM       03/13/2018         04:45:00
##  6:      02:45:00             PM       05/19/2013         02:30:00
##  7:      12:00:00             AM       07/24/2015         12:00:00
##  8:      01:22:00             AM       09/05/2012         01:22:00
##  9:      01:22:00             AM       09/03/2015         01:10:00
## 10:      08:54:28             AM       08/31/2014         06:00:00
## 11:      03:28:00             PM       09/25/2010         02:30:00
## 12:      11:20:00             AM       09/16/2012         12:00:00
## 13:      08:10:00             PM       09/30/2012         07:55:00
## 14:      06:56:15             PM       01/20/2015         11:00:00
## 15:      12:30:00             PM       09/08/2013         12:15:00
## 16:      01:06:47             AM       08/24/2014         10:00:00
## 17:      01:30:00             PM       09/27/2015         10:23:00
## 18:      04:40:00             PM       08/13/2015         10:30:00
## 19:      05:45:00             AM       08/03/2013         03:44:00
## 20:      09:10:00             AM       05/11/2011         02:00:00
## 21:      11:34:05             PM       11/30/2012         04:00:00
## 22:      02:40:00             PM       12/18/2011         01:50:00
## 23:      10:30:45             AM       12/02/2018         09:07:36
## 24:      01:40:00             PM       07/05/2010         12:40:00
## 25:      03:00:00             PM       07/26/2010         10:00:00
## 26:      10:04:00             PM       05/21/2016         10:01:00
## 27:      07:11:00             PM       01/04/2019         06:50:00
## 28:      05:50:00             PM       10/15/2016         05:45:00
## 29:      12:17:00             AM       09/28/2015         04:00:00
## 30:      09:18:25             AM       04/16/2013         10:00:00
## 31:      10:44:00             PM       10/13/2018         10:00:00
## 32:      09:00:00             PM       04/25/2013         07:25:00
## 33:      01:00:00             AM       09/20/2014         12:55:00
## 34:      03:28:00             AM       06/14/2016         03:20:00
## 35:      10:47:17             PM       02/09/2014         04:00:13
## 36:      01:55:00             PM       08/13/2012         01:45:00
## 37:      02:20:00             PM       05/16/2015         05:56:00
## 38:      10:52:00             PM       02/19/2019         02:30:00
## 39:      02:10:00             PM       05/14/2013         12:05:00
## 40:      12:40:00             PM       12/29/2011         11:50:00
## 41:      05:24:00             PM       12/28/2017         04:00:00
## 42:      09:31:00             PM       08/30/2017         02:00:00
## 43:      01:22:00             PM       07/18/2015         01:15:00
## 44:      02:26:01             PM       05/27/2016         08:45:58
## 45:      09:30:00             AM       05/31/2016         10:00:00
## 46:      01:35:00             PM       06/03/2015         09:30:00
## 47:      08:35:00             PM       05/30/2013         08:31:00
## 48:      02:33:00             PM       05/11/2014         01:00:00
## 49:      11:11:00             AM       01/20/2014         09:30:00
## 50:      10:40:07             AM       11/22/2012         03:00:00
##     TIME_REPORTED REPORTED_AM/PM CRIME_START_DATE CRIME_START_TIME
##     CRIME_START_TIME_AM/PM CRIME_END_DATE CRIME_END_TIME
##  1:                     PM     03/16/2015       03:05:00
##  2:                     AM     06/23/2011       10:30:00
##  3:                     AM     10/08/2011       02:44:00
##  4:                     PM     06/25/2012       03:40:00
##  5:                     PM     03/13/2018       05:12:00
##  6:                     PM     05/19/2013       02:45:00
##  7:                     AM     07/24/2015       11:59:00
##  8:                     AM     09/05/2012       01:23:00
##  9:                     AM     09/03/2015       01:11:00
## 10:                     PM     08/31/2014       08:30:00
## 11:                     PM     09/25/2010       03:00:00
## 12:                     PM     09/18/2012       08:50:00
## 13:                     PM     09/30/2012       08:10:00
## 14:                     AM     01/20/2015       06:30:00
## 15:                     PM     09/08/2013       12:20:00
## 16:                     PM     08/24/2014       10:20:00
## 17:                     PM     09/28/2015       12:30:00
## 18:                     AM     08/14/2015       03:00:00
## 19:                     AM     08/03/2013       03:45:00
## 20:                     AM     05/11/2011       08:00:00
## 21:                     PM     11/30/2012       04:21:00
## 22:                     PM     12/18/2011       01:55:00
## 23:                     AM     12/02/2018       09:09:37
## 24:                     PM     07/05/2010       12:45:00
## 25:                     AM     08/02/2010       10:00:00
## 26:                     PM     05/21/2016       10:04:00
## 27:                     PM     01/04/2019       07:03:00
## 28:                     PM     10/15/2016       05:50:00
## 29:                     PM     09/29/2015       12:15:00
## 30:                     PM     04/17/2013       06:30:00
## 31:                     PM     10/13/2018       09:30:00
## 32:                     PM     04/25/2013       07:30:00
## 33:                     AM     09/20/2014       01:00:00
## 34:                     AM     06/14/2016       03:22:00
## 35:                     PM     02/09/2014       10:30:00
## 36:                     PM     08/13/2012       01:50:00
## 37:                     PM     05/23/2015       07:53:00
## 38:                     PM     02/19/2019       02:37:00
## 39:                     AM     05/15/2013       08:00:00
## 40:                     AM     12/29/2011       12:00:00
## 41:                     PM     12/28/2017       05:00:00
## 42:                     PM     08/30/2017       02:20:00
## 43:                     PM     07/18/2015       01:20:00
## 44:                     PM     06/15/2016       09:00:39
## 45:                     PM     06/01/2016       09:00:00
## 46:                     AM     06/03/2015       04:18:00
## 47:                     PM     05/30/2013       08:31:00
## 48:                     PM     05/11/2014       10:00:00
## 49:                     PM     01/20/2014       11:18:00
## 50:                     AM     11/22/2012       03:30:00
##     CRIME_START_TIME_AM/PM CRIME_END_DATE CRIME_END_TIME

3.5 provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.

The dataset now has 51 features covering roughly 355,000 incidents. The following are the variables of concern for my analysis:

  1. OFFENSE : the kind of offense committed - a factor with 202 types of crimes
  2. LOCATION_DESC : where the crime was carried out - a character vector
  3. VICTIM_AGE : age range of the victim of the crime - a factor with 13 possible values
  4. VICTIM_RACE : race of the victim - a factor with 9 levels
  5. SUSPECT_AGE : age range of the suspect - a factor with 9 levels
  6. SUSPECT_RACE : race of the suspect - a factor with 9 levels
  7. SUSPECT_GENDER : the gender of the suspect
  8. ZIP : the zip code of the area where the crime was carried out
str(df[,c("OFFENSE","LOCATION_DESC","VICTIM_AGE","VICTIM_RACE","SUSPECT_AGE","SUSPECT_RACE","ZIP")])
## Classes 'data.table' and 'data.frame':   355379 obs. of  7 variables:
##  $ OFFENSE      : Factor w/ 202 levels "","ABDUCTION",..: 15 24 40 82 24 13 69 150 24 175 ...
##  $ LOCATION_DESC: chr  "PARKING LOT" "MULTI FAMILY" "BUY/SELL/TRADE SHOP" "STREET" ...
##  $ VICTIM_AGE   : Factor w/ 13 levels "","00","18-25",..: 4 6 13 3 6 3 12 5 6 3 ...
##  $ VICTIM_RACE  : Factor w/ 9 levels "","AMERICAN IINDIAN/ALA",..: 6 6 1 6 6 9 6 9 6 6 ...
##  $ SUSPECT_AGE  : Factor w/ 9 levels "18-25","26-30",..: 5 9 1 1 9 9 2 9 9 9 ...
##  $ SUSPECT_RACE : Factor w/ 9 levels "","AMERICAN IINDIAN/ALA",..: 6 6 6 6 1 6 6 6 1 1 ...
##  $ ZIP          : chr  "45206" "45219" "45227" "45224" ...
##  - attr(*, ".internal.selfref")=<externalptr>

4. Proposed Exploratory Data Analysis

4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

The following are a few key operations to answer the questions laid out in my problem statement (a rough sketch of items 2 and 3 follows this list):
1. Slice and group the data by crime type.
2. Derive new columns from the crime start and end date/time columns to calculate the duration of each crime.
3. Convert the times to 24-hour format so they can be used directly in visualizations.
4. Create a new data frame of theft codes and descriptions and use it for a separate analysis specific to theft crimes.
5. Present the key summary statistics for my analysis as data tables and box plots.
6. Build correlation tables across the various numeric variables.
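
The sketch below shows one way items 2 and 3 could be derived. It is a sketch only: it assumes base-R date parsing on the separated date/time columns from Step 8, and the exact approach may change.

# Sketch: build proper date-times, crime durations and a 24-hour start hour
# (assumes the separated columns created in Step 8 of Section 3)
df_eda <- df %>%
  mutate(
    crime_start = as.POSIXct(paste(CRIME_START_DATE, CRIME_START_TIME, `CRIME_START_TIME_AM/PM`),
                             format = "%m/%d/%Y %I:%M:%S %p"),
    crime_end   = as.POSIXct(paste(CRIME_END_DATE, CRIME_END_TIME, `CRIME_END_TIME_AM/PM`),
                             format = "%m/%d/%Y %I:%M:%S %p"),
    crime_duration_mins = as.numeric(difftime(crime_end, crime_start, units = "mins")),
    crime_hour_24       = as.integer(format(crime_start, "%H"))
  )

# Example summary: incident count and median duration per crime type
df_eda %>%
  group_by(OFFENSE) %>%
  summarise(incidents = n(),
            median_duration_mins = median(crime_duration_mins, na.rm = TRUE)) %>%
  arrange(desc(incidents))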

4.2 What types of plots and tables will help you to illustrate the findings to your questions?

Box plots : to detect outliers and understand the typical and median patterns in the variables of interest
Histograms : to understand the distributions of values across various variables in the data
Line charts : to understand how crimes of various types trend over time
Pie charts / bar graphs : to understand the statuses of the various crime types in the data
Scatter plots : to understand any dependencies and correlations among the variables
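
For example, a bar graph of incident counts by UCR crime group could look like the sketch below. This is an illustration only; it assumes the ggplot2 package, which would need to be added to the package list in Section 2.

# Sketch: bar graph of incident counts by UCR crime group
# (assumes ggplot2, which is not yet loaded in Section 2)
library(ggplot2)

df %>%
  count(UCR_GROUP) %>%
  ggplot(aes(x = reorder(UCR_GROUP, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "UCR crime group", y = "Number of incidents",
       title = "Incidents by UCR crime group")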

4.3 What do you not know how to do right now that you need to learn to answer your questions?

Visualization techniques, data analysis methods, statistical thinking, and writing more sophisticated functions that make my code more efficient are the main things I still need to learn in order to write good code and answer my questions.

4.4 Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?

I plan to use classification to predict the age range of the suspect with a supervised learning algorithm and then evaluate its accuracy, to see whether it could provide a starting point for investigations in otherwise clueless cases. I will include this only if there are enough training examples and the model's accuracy statistics make sense (a rough, purely illustrative sketch is shown below).
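
If I do pursue this, a possible starting point might look like the sketch below. This is purely illustrative: it assumes the rpart package, a simple 70/30 hold-out split, and a small set of predictor columns chosen only for the example; none of these choices are final.

# Purely illustrative sketch of a possible classification step (assumes rpart)
library(rpart)

# Keep rows where the suspect age range is actually known
model_df <- df %>%
  filter(!SUSPECT_AGE %in% c("", "UNKNOWN")) %>%
  select(SUSPECT_AGE, UCR_GROUP, DAYOFWEEK, VICTIM_AGE, VICTIM_RACE, VICTIM_GENDER) %>%
  mutate(SUSPECT_AGE = droplevels(SUSPECT_AGE))

# Simple 70/30 train/test split
set.seed(123)
train_idx <- sample(seq_len(nrow(model_df)), size = floor(0.7 * nrow(model_df)))
train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]

# Classification tree predicting the suspect age range
fit  <- rpart(SUSPECT_AGE ~ ., data = train, method = "class")
pred <- predict(fit, test, type = "class")
mean(pred == test$SUSPECT_AGE)   # simple hold-out accuracy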