Disclaimer: This is a relatively rough analysis and should not be taken for any political or legislative agenda. To fully and completely understand the underlying effects of crime and gain a 100% accurate assessment of the dataset it will require further work.

import libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2

Load CSV

nypd_complaints_data = read.csv("NYPD_Complaint_Data_Historic.csv")

Remove unneccessary columns for analysis

nypd_complaints_df = nypd_complaints_data[,c(-1,-3,-4,-5,-7,-9,-16,-18,-19,-20)]

nypd_complaints_df = nypd_complaints_df[,-4]

Sort by Burrough

nypd_complaints_df = nypd_complaints_df[order(nypd_complaints_df$BORO_NM),]

Duration of days between incident and report for all available dates

duration = as.Date(nypd_complaints_df$RPT_DT,"%m/%d/%Y")-as.Date(nypd_complaints_df$CMPLNT_FR_DT,"%m/%d/%Y")

averagereportingduration = mean(as.numeric(as.character(duration[!is.na(duration)])))

print(paste("The average duration, for all available dates, between the date of a crime's occurence and it's reporting is",round(averagereportingduration,0),"days.",sep=" "))
## [1] "The average duration, for all available dates, between the date of a crime's occurence and it's reporting is 15 days."

Delay in reporting of each crime. This analysis extrapolates the missing duration values, and it does not create a fully unique list of crimes. This is a relatively rough high level overview of the data.

nypd_complaints_df$Duration = as.numeric(as.character(duration))
nypd_complaints_df$KYCD = nypd_complaints_data$KY_CD
nypd_complaints_df$PoliceDesc = nypd_complaints_data$PD_DESC
crimedurationdf = nypd_complaints_df[,c("OFNS_DESC","KYCD","PoliceDesc","Duration")]
crimedurationdf = crimedurationdf[order(crimedurationdf$OFNS_DESC),]
crimedurationdf = crimedurationdf %>% fill(Duration)
missingvals = which(crimedurationdf$OFNS_DESC=="")
crimedurationdf$OFNS_DESC = as.character(crimedurationdf$OFNS_DESC)
crimedurationdf[missingvals,"OFNS_DESC"] = list(as.character(crimedurationdf[missingvals,"PoliceDesc"]))
crimedurationdf[which(crimedurationdf$OFNS_DESC==""),"OFNS_DESC"] = "Other"
options(warn = -1)
averagedurationbycrime = aggregate(crimedurationdf,by = list(crimedurationdf$OFNS_DESC),FUN=mean)
averagedurationbycrime = averagedurationbycrime[,c(-2,-3,-4)]

Delays greater than 100

print(subset(averagedurationbycrime,Duration>100))
##                                         Group.1 Duration
## 94                                       FRAUDS 104.8310
## 120             KIDNAPPING AND RELATED OFFENSES 442.0000
## 124 LARCENY,GRAND BY ACQUIRING LOST CREDIT CARD 170.0769
## 204             MURDER & NON-NEGL. MANSLAUGHTER 203.3359
## 241                                        RAPE 251.3853
## 281                                  SEX CRIMES 147.5966
## 297                                 THEFT-FRAUD 174.8864
print(paste("The crime that has the longest delay between reporting and occurrence is",subset(averagedurationbycrime,Duration == max(Duration))[1],"with an average delay of",subset(averagedurationbycrime,Duration == max(Duration))[,2],"days"))
## [1] "The crime that has the longest delay between reporting and occurrence is KIDNAPPING AND RELATED OFFENSES with an average delay of 442 days"

Crime by Burrough: Broad overview. Using the tail function, I saw that most entries with their Jurisdiction containing “N.Y. Police Dept” belonged to Staten Island; thus I filled all NA Burrough entries with “Staten Island” as over 90% of NA fields had a coinciding jurisidiction of “N.Y. Police Dept”.

burroughdf = nypd_complaints_df[,c("BORO_NM","LAW_CAT_CD")]
missingburroughval = which(burroughdf$BORO_NM=="")
burroughdf[missingburroughval,"BORO_NM"] = "STATEN ISLAND"
frequencytablebroad = table(burroughdf$LAW_CAT_CD)

print(frequencytablebroad)
## 
##      FELONY MISDEMEANOR   VIOLATION 
##     1717265     3180644      682126
par("mar")
## [1] 5.1 4.1 4.1 2.1
par(mar=c(11,11,11,11))
getOption("scipen")
## [1] 0
opt <- options("scipen" = 20)
getOption("scipen")
## [1] 20

Crime by Burrough: Analysis of frequency of occurrence for Felonies, Misdemeanors, and Violations

freqtableburr = table(burroughdf)
freqtableburr = as.data.frame(freqtableburr)
freqtableburr = subset(freqtableburr,BORO_NM != "")

print(freqtableburr)
##          BORO_NM  LAW_CAT_CD   Freq
## 2          BRONX      FELONY 325145
## 3       BROOKLYN      FELONY 541841
## 4      MANHATTAN      FELONY 422189
## 5         QUEENS      FELONY 370347
## 6  STATEN ISLAND      FELONY  57743
## 8          BRONX MISDEMEANOR 737972
## 9       BROOKLYN MISDEMEANOR 919218
## 10     MANHATTAN MISDEMEANOR 771325
## 11        QUEENS MISDEMEANOR 593282
## 12 STATEN ISLAND MISDEMEANOR 158847
## 14         BRONX   VIOLATION 146530
## 15      BROOKLYN   VIOLATION 205844
## 16     MANHATTAN   VIOLATION 138246
## 17        QUEENS   VIOLATION 141992
## 18 STATEN ISLAND   VIOLATION  49514
par("mar")
## [1] 5.1 4.1 4.1 2.1
par(mar=c(7,7,7,7))

Percentage of Crimes Completed

completiondf = subset(nypd_complaints_data,CRM_ATPT_CPTD_CD =="COMPLETED")
completionrate = (length(completiondf$CRM_ATPT_CPTD_CD)/length(nypd_complaints_data$CRM_ATPT_CPTD_CD))*100

print(paste("Of the total number of attempted crimes,the completion rate is",round(completionrate,2),"%"))
## [1] "Of the total number of attempted crimes,the completion rate is 98.28 %"

Granular View: Crimes with the greatest delay between reporting and occurrence by Burrough

granburroughdf = burroughdf
granburroughdf$Offense = nypd_complaints_df$OFNS_DESC
granburroughdf$LAW_CAT_CD = nypd_complaints_df$LAW_CAT_CD
granmissingvals = which(granburroughdf$OFNS_DESC=="")
granburroughdf$Offense = as.character(granburroughdf$Offense)
granburroughdf[granmissingvals,"Offense"] = list(as.character(granburroughdf[granmissingvals,"LAW_CAT_CD"]))
granburroughdf = granburroughdf[,-2]

topitems = unlist(subset(averagedurationbycrime,Duration>100)[1],use.names = FALSE)
granburroughdftopitems = granburroughdf[which(granburroughdf$Offense %in% topitems),]
freqgranburrough = table(granburroughdftopitems)
freqgranburrough = as.data.frame(freqgranburrough)
freqgranburrough = subset(freqgranburrough, BORO_NM != "")
freqgranburrough = subset(freqgranburrough, Offense != "")
freqgranburrough = subset(freqgranburrough, Offense != "KIDNAPPING AND RELATED OFFENSES")

Conclusion: Of the 5 Burroughs, Brooklyn seems to experience the greatest frequency of crime. In addition, general sex crimes, theft, and fraud have occurred at an exceptionally high frequency for the period of 2006-2016. Moreover, of the 5 burroughs, Staten Island seems to experience the lowest frequency of crime, especially when considering the big movers.