Project_1_NYPD_STOPS

Author

Crewe Mellott

My Project

For my project, I used a dataset of police stops in NYC to look at how the duration of a stop relates to the percentage of stops that end in an arrest in each borough. I found the dataset in the class Google Drive; it was originally published on NYC.gov.

Loading Libraries and Dataset

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr, ggplot2, and readr are already attached by library(tidyverse) above
setwd("C:/Users/crewe/Data_110")
nypd_stops <- read_csv("NYPD-Stop-Question-Frisk-2020.csv")
Rows: 9544 Columns: 83
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (70): STOP_FRISK_DATE, MONTH2, DAY2, STOP_WAS_INITIATED, RECORD_STATUS_...
dbl  (12): STOP_ID, YEAR2, ISSUING_OFFICER_COMMAND_CODE, SUPERVISING_OFFICER...
time  (1): STOP_FRISK_TIME

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning and Organizing Data

# Convert arrest flag to numeric for calculation
nypd_data <- nypd_stops |>
  mutate(
    Arrests = ifelse(SUSPECT_ARRESTED_FLAG == "Y", 1, 0), # I had Copilot help me with this because I couldn't find a website that worked
    STOP_LOCATION_BORO_NAME = as.factor(STOP_LOCATION_BORO_NAME)
  )

# Calculate arrest percentage by exact stop duration and borough
arrest_by_duration_boro <- nypd_data |>
  group_by(STOP_DURATION_MINUTES, STOP_LOCATION_BORO_NAME) |>
  summarise(Arrest_Percentage = mean(Arrests, na.rm = TRUE) * 100) |>
  ungroup()
`summarise()` has grouped output by 'STOP_DURATION_MINUTES'. You can override
using the `.groups` argument.
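The message above is informational, not an error. As a minimal sketch, assuming toy data in place of nypd_data (the values below are made up), passing `.groups = "drop"` to summarise() returns an ungrouped result directly, so the message goes away and the trailing ungroup() is no longer needed:

```r
library(dplyr)

# Hypothetical toy data standing in for nypd_data (values are made up)
toy_stops <- data.frame(
  STOP_DURATION_MINUTES   = c(5, 5, 10, 10),
  STOP_LOCATION_BORO_NAME = c("BRONX", "BRONX", "QUEENS", "QUEENS"),
  Arrests                 = c(1, 0, 1, 1)
)

# .groups = "drop" returns an ungrouped tibble: no message, no ungroup() needed
toy_summary <- toy_stops |>
  group_by(STOP_DURATION_MINUTES, STOP_LOCATION_BORO_NAME) |>
  summarise(Arrest_Percentage = mean(Arrests, na.rm = TRUE) * 100, .groups = "drop")

print(toy_summary)
```

The result should be the same as with the ungroup() version; only the message changes.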

Visualizations

# Create a faceted bar chart of arrest percentage by stop duration
ggplot(arrest_by_duration_boro, aes(x = STOP_DURATION_MINUTES, y = Arrest_Percentage, fill = STOP_LOCATION_BORO_NAME)) +
  geom_col() +
  scale_fill_manual(values = c("#FF9999", "#99CCFF", "#99FF99", "#FFCC66", "#CC99FF")) +  # 5 different colors
  labs(
    title = "Percentage of Arrests by Stop Duration (Minutes) for Each Borough",
    x = "Stop Duration (Minutes)",
    y = "Arrest Percentage (%)",
    caption = "Source: NYPD Stop, Question, and Frisk Dataset (2020)"
  ) +
  theme_classic() +
  facet_wrap(~ STOP_LOCATION_BORO_NAME, scales = "free_y") +   # create separate graphs, one per borough
  theme(
    plot.title = element_text(face = "bold", size = 12),
    axis.title = element_text(size = 12),
    strip.text = element_text(face = "bold", size = 12),
    legend.position = "none"  #gets rid of the legend since each graph is already labeled
  )

Essay:

To clean the dataset, I first imported it using read_csv() and checked for missing or invalid values in the columns I was going to use. I converted SUSPECT_ARRESTED_FLAG from “Y” and “N” into numeric values (1 and 0) so I could calculate percentages. I also made sure STOP_DURATION_MINUTES was in numeric form and converted STOP_LOCATION_BORO_NAME into a factor so it could be used as a categorical variable in the graph.
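The check for missing or invalid values described above could look something like the sketch below. It uses a small made-up data frame rather than the real file, so the counts are illustrative only, and the real columns may contain other codes than the ones shown here:

```r
library(dplyr)

# Hypothetical toy data; the real flag and duration columns may hold other values
toy_stops <- data.frame(
  SUSPECT_ARRESTED_FLAG = c("Y", "N", NA, "N"),
  STOP_DURATION_MINUTES = c(5, NA, 12, 3)
)

# Count missing values in each column used in the analysis
missing_counts <- toy_stops |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Tabulate the flag to spot anything other than "Y" or "N"
flag_counts <- toy_stops |>
  count(SUSPECT_ARRESTED_FLAG)

print(missing_counts)
print(flag_counts)
```

One thing worth keeping in mind: ifelse(SUSPECT_ARRESTED_FLAG == "Y", 1, 0) returns NA when the flag is missing, which is why mean(Arrests, na.rm = TRUE) is used when calculating the percentage.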

The visualization shows the percentage of arrests at each exact stop duration, separated into different graphs for each borough. This allows us to compare how arrest rates change as stop time increases. In some boroughs, longer stops seem to result in higher arrest percentages, which may suggest more serious or escalated situations.

One challenge was that plotting every exact minute made the graph busy. I would have liked to group durations (e.g., into 5-minute ranges) to make patterns easier to see, but I wasn't sure how to do it.
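One possible way to do the grouping is base R's cut(), which turns a numeric column into labeled ranges. This is only a sketch under assumptions: the data frame is made up, the Duration_Bin column name is hypothetical, and the breaks stop at 15 minutes just to keep the example small.

```r
library(dplyr)

# Hypothetical toy data standing in for nypd_data (values are made up)
toy_stops <- data.frame(
  STOP_DURATION_MINUTES   = c(1, 4, 5, 7, 12, 14),
  STOP_LOCATION_BORO_NAME = rep("BRONX", 6),
  Arrests                 = c(0, 1, 0, 1, 1, 1)
)

binned <- toy_stops |>
  mutate(
    # right = FALSE makes each bin include its lower edge: [0,5), [5,10), [10,15)
    Duration_Bin = cut(STOP_DURATION_MINUTES, breaks = seq(0, 15, by = 5), right = FALSE)
  ) |>
  group_by(Duration_Bin, STOP_LOCATION_BORO_NAME) |>
  summarise(Arrest_Percentage = mean(Arrests, na.rm = TRUE) * 100, .groups = "drop")

print(binned)
```

With the real data, the breaks would need to extend past the longest stop, since cut() returns NA for any value outside the breaks. Duration_Bin could then replace STOP_DURATION_MINUTES on the x-axis of the graph.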

One thing that surprised me was that the Bronx looked like it had one of the lowest, if not the lowest, percentages of arrests per stop. It might be because more overall stops there lead to a lower percentage, but given the Bronx's reputation I thought it would be at least similar to the other boroughs.
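A way to check the idea about stop counts would be to put the number of stops next to the arrest percentage for each borough. The sketch below uses made-up numbers, so it says nothing about the real boroughs, and the Total_Stops and Total_Arrests column names are hypothetical:

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the real NYPD counts)
toy_stops <- data.frame(
  STOP_LOCATION_BORO_NAME = c("BRONX", "BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS"),
  Arrests                 = c(1, 0, 0, 0, 1, 0)
)

# One row per borough: how many stops, how many arrests, and the arrest percentage
boro_summary <- toy_stops |>
  group_by(STOP_LOCATION_BORO_NAME) |>
  summarise(
    Total_Stops       = n(),
    Total_Arrests     = sum(Arrests, na.rm = TRUE),
    Arrest_Percentage = mean(Arrests, na.rm = TRUE) * 100,
    .groups = "drop"
  )

print(boro_summary)
```

If the real Bronx rows showed many stops but a low Arrest_Percentage, that would fit the explanation above; if the stop counts were similar across boroughs, something else would have to explain the difference.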