NYPD Data Analysis

A Short Statistical Summary on New York Crime

First, we load the R packages needed for the analysis:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Source

I read in the historic shooting incident data reported by the NYPD, which is available as a CSV file through Data.gov. Here is the code:

url_in <- "https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD"
NYPD_Data <- read_csv(url_in)

## Rows: 27312 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (12): OCCUR_DATE, BORO, LOC_OF_OCCUR_DESC, LOC_CLASSFCTN_DESC, LOCATION...
## dbl   (7): INCIDENT_KEY, PRECINCT, JURISDICTION_CODE, X_COORD_CD, Y_COORD_CD...
## lgl   (1): STATISTICAL_MURDER_FLAG
## time  (1): OCCUR_TIME
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
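The column specification above shows that OCCUR_DATE is read as character text rather than as a date. One way to handle this is to declare the column types up front. The following is only a minimal sketch: it assumes the dates are written month/day/year (for example 01/31/2006), the name `shooting_spec` is hypothetical, and the three sample rows are made up purely so the snippet runs on its own.

```r
library(readr)

# Made-up rows that mimic three of the real columns (illustration only).
sample_csv <- I("OCCUR_DATE,OCCUR_TIME,BORO
01/31/2006,23:15:00,BRONX
07/04/2015,02:30:00,BROOKLYN
12/25/2022,14:05:00,QUEENS")

# Declare the column types explicitly.
# Assumption: dates use the month/day/year layout.
shooting_spec <- cols(
  OCCUR_DATE = col_date(format = "%m/%d/%Y"),
  OCCUR_TIME = col_time(format = "%H:%M:%S"),
  BORO       = col_character()
)

sample_data <- read_csv(sample_csv, col_types = shooting_spec)
class(sample_data$OCCUR_DATE)
```

With the real file, the same `col_types` argument could be passed to `read_csv(url_in, ...)`; columns that are not named in `cols()` are still guessed.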

Visualization 1

Occurrence of crime over the course of the day by precinct.

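# Keep only the time of day, borough, and precinct of each incident
NYPD_TimePlace <- NYPD_Data %>% select(OCCUR_TIME, BORO, PRECINCT)
# One point per incident: precinct on the x-axis, time of day on the y-axis
ggplot(NYPD_TimePlace, aes(x = PRECINCT, y = OCCUR_TIME, color = BORO)) +
  geom_point(size = 0.2)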

(Bronx = Red, Brooklyn = Yellow, Manhattan = Green, Queens = Blue, Staten Island = Pink)

Analysis and Observation: The plot shows when incidents occur over the course of the day in each precinct between 2006 and 2022. One notable observation is that crime is sparsest from sunrise to noon. We can also see that Manhattan, especially precincts 1 through 20, has far fewer recorded incidents than the Bronx, which suggests it is safer.
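The time-of-day pattern can also be checked with numbers instead of by eye. Below is a minimal sketch that counts incidents per hour. The helper name `count_by_hour` is hypothetical, the sample rows are made up, and it assumes OCCUR_TIME is a time column, as the column specification reports.

```r
library(dplyr)
library(readr)

# Count incidents per hour of the day.
# Assumption: OCCUR_TIME is a time (hms) column, stored as seconds since midnight.
count_by_hour <- function(data) {
  data %>%
    mutate(hour = as.numeric(OCCUR_TIME) %/% 3600) %>%  # seconds -> hour 0..23
    count(hour, name = "incidents")
}

# Made-up rows, for illustration only
toy <- tibble(
  OCCUR_TIME = parse_time(c("01:15:00", "01:45:00", "13:05:00", "23:59:00"))
)
hourly_toy <- count_by_hour(toy)
hourly_toy
```

On the real data the call would be `count_by_hour(NYPD_Data)`; sorting the result by `incidents` would show which hours are quietest.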

Visualization 2

Frequency of Crime by Day from 2006 to 2022

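# Count incidents per date (OCCUR_DATE is read as character, so each distinct date string is one group)
NYPD_ByDate <- NYPD_Data %>% group_by(OCCUR_DATE) %>% summarise(frequency = n())
# One point per date: date on the x-axis, number of incidents on the y-axis
ggplot(NYPD_ByDate, aes(x = OCCUR_DATE, y = frequency)) + geom_point(size = 0.2)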

Analysis and Observation: Here I have a less-than-pretty scatter plot of the frequency of crime on every day between 2006 and 2022. I don’t have a very sophisticated grasp of R yet, so I was unable to label the dates on the x-axis. However, it is still observable that in the middle of the plot, which I read as roughly 2011 to 2017, there are noticeably more days with over 10 reported incidents.
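A note on the x-axis: because OCCUR_DATE is read as character, ggplot treats every date string as a separate category and orders them alphabetically, so the left-to-right position of a point may not correspond to its year. A minimal sketch of one possible fix is to parse the dates before counting. It assumes the dates are month/day/year text; the helper name `count_by_date` is hypothetical and the sample rows are made up.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Parse the text dates, then count incidents per calendar day.
# Assumption: OCCUR_DATE is month/day/year text such as "01/31/2006".
count_by_date <- function(data) {
  data %>%
    mutate(OCCUR_DATE = mdy(OCCUR_DATE)) %>%
    count(OCCUR_DATE, name = "frequency")
}

# Made-up rows, for illustration only
toy <- tibble(OCCUR_DATE = c("12/25/2022", "01/31/2006", "01/31/2006"))
daily_toy <- count_by_date(toy)

# With a real Date column, scale_x_date() can place one labelled tick per year
daily_plot <- ggplot(daily_toy, aes(x = OCCUR_DATE, y = frequency)) +
  geom_point(size = 0.2) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")
```

On the real data, `count_by_date(NYPD_Data)` would take the place of the toy table, and printing `daily_plot` would draw the chart in chronological order.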

Model

Daily frequency of crime compared against the normal distribution

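# Plot the daily counts against the quantiles of a theoretical normal distribution
qqnorm(NYPD_ByDate$frequency)
# Add a reference line through the first and third quartiles
qqline(NYPD_ByDate$frequency, col = "red")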

Analysis and Observation: I have plotted the daily frequency of crime against the quantiles of a theoretical normal distribution using the qqnorm and qqline functions. The graph is thick tailed at the beginning, which means a low number of reported incidents on a given day, such as 1, is far from unexpected. However, it is thin tailed at the end, so a large number of reported incidents, such as 30, is far from expected under the normal distribution. Such occurrences suggest unusual circumstances.
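To make the plot easier to read, it may help to see what qqnorm and qqline compute. The sketch below rebuilds both by hand: sorted observations are paired with standard normal quantiles, and the reference line passes through the first and third quartiles. The vector `daily_counts` holds made-up values standing in for NYPD_ByDate$frequency, and the other variable names are illustrative.

```r
# Made-up daily counts standing in for NYPD_ByDate$frequency (illustration only)
daily_counts <- c(1, 1, 2, 2, 3, 3, 4, 5, 7, 12)

n <- length(daily_counts)
theoretical <- qnorm(ppoints(n))   # standard normal quantiles at evenly spaced probabilities
observed    <- sort(daily_counts)  # sample quantiles, smallest to largest

# qqline() draws the line through the first- and third-quartile pairs
probs     <- c(0.25, 0.75)
sample_q  <- quantile(daily_counts, probs, names = FALSE)
slope     <- diff(sample_q) / diff(qnorm(probs))
intercept <- sample_q[1] - slope * qnorm(probs[1])

# Positive values: the observation is larger than the reference line predicts
residual_from_line <- observed - (intercept + slope * theoretical)
```

Points with large positive values at the right end of `residual_from_line` are days with more incidents than a normal distribution fitted through the quartiles would predict.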

Admission of Bias

It is important to know that my research and conclusions may contain many flaws. I am not a very practiced data scientist, so my cleaning and understanding of the data are not comprehensive. Furthermore, the data I have collected from the source may itself have some flaws. I recommend cross-checking my findings against any other published research available on the same subject.