Data Ingestion via API

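library(tidyverse)  # dplyr, tidyr, stringr, ggplot2 (used throughout the report)
library(knitr)      # kable()
endpoint <- "https://data.cityofnewyork.us/resource/833y-fsy8.json"
resp <- httr::GET(endpoint, query = list("$limit" = 30000, "$order" = "occur_date DESC"))
httr::stop_for_status(resp)  # stop early if the request did not succeed
shooting_data <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)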

I pulled the NYPD shooting incident data directly from NYC Open Data by sending an httr::GET() request to the endpoint “https://data.cityofnewyork.us/resource/833y-fsy8.json”. The query asks for up to 30,000 records, sorted from newest to oldest by occur_date. The dataset covers incidents from 2006-01-01 to 2024-12-31.
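As a rough check on the date range stated above, the timestamps can be trimmed to plain dates and passed to range(). This is a minimal base R sketch on a made-up two-row data frame (`toy` is hypothetical, not the real API response), assuming occur_date arrives as ISO-style text such as 2024-12-31T00:00:00.000.

```r
# Hypothetical stand-in for the API response; only occur_date is modelled.
toy <- data.frame(
  occur_date = c("2024-12-31T00:00:00.000", "2006-01-01T00:00:00.000"),
  stringsAsFactors = FALSE
)

# Keep the first 10 characters (YYYY-MM-DD) and convert to Date.
toy$occur_day <- as.Date(substr(toy$occur_date, 1, 10))

# Earliest and latest incident dates in the sample.
date_range <- range(toy$occur_day)
print(date_range)
```

On the real data, the same two lines applied to shooting_data$occur_date should give the range quoted in the paragraph above.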

Cleaning Data

Removing NA rows in perp_race

colSums(is.na(shooting_data))
##                incident_key                  occur_date 
##                           0                           0 
##                  occur_time                        boro 
##                           0                           0 
##           loc_of_occur_desc                    precinct 
##                       25596                           0 
##           jurisdiction_code          loc_classfctn_desc 
##                           2                       25596 
##               location_desc     statistical_murder_flag 
##                       14977                           0 
##              perp_age_group                    perp_sex 
##                        9344                        9310 
##                   perp_race               vic_age_group 
##                        9310                           0 
##                     vic_sex                    vic_race 
##                           0                           0 
##                  x_coord_cd                  y_coord_cd 
##                           0                           0 
##                    latitude                   longitude 
##                          97                          97 
##        geocoded_column.type geocoded_column.coordinates 
##                          97                           0
sum(is.na(shooting_data$perp_race))
## [1] 9310
shooting_clean<-shooting_data %>% filter(
  !is.na(perp_race) &
  !(perp_race %in% c("(NULL)","UNKNOWN","(null)")))
sum(is.na(shooting_clean$perp_race))
## [1] 0

First, I checked the number of NA values in each column. After considering which column I wanted to focus on, I picked perp_race. I then checked the number of NA values in perp_race, which was 9,310, and removed those rows along with rows holding the placeholder values “(NULL)”, “UNKNOWN”, and “(null)”. Lastly, I checked the column again to confirm that no NA values remained.
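The reason for filtering on both is.na() and a list of placeholder strings is that is.na() only catches true missing values. A minimal base R sketch on made-up values (the vector below is hypothetical, not taken from the dataset):

```r
# Made-up values mixing a true NA with placeholder strings.
perp_race <- c("BLACK", NA, "(null)", "UNKNOWN", "WHITE", "(NULL)")

# is.na() only flags the true NA, not the placeholder text.
n_true_na <- sum(is.na(perp_race))

# Drop true NAs and the placeholder strings with one logical mask.
placeholders <- c("(NULL)", "UNKNOWN", "(null)")
keep <- !is.na(perp_race) & !(perp_race %in% placeholders)
perp_race_clean <- perp_race[keep]
print(perp_race_clean)
```

Here only one of the four unusable entries is a true NA, so checking sum(is.na(...)) alone would understate how many rows carry no usable race value.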

Making perp_race values lowercase

shooting_clean<-shooting_clean %>% mutate(
  perp_race=stringr::str_to_lower(perp_race))

I cleaned the perp_race column by making every entry lowercase to prevent race duplication caused by capitalization differences. There are now 6 distinct race groups.
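A small sketch of why lowercasing matters for counting distinct groups. The labels below are made up for illustration, and base R's tolower() stands in for stringr::str_to_lower() so the snippet runs without extra packages.

```r
# Made-up labels that differ only in capitalization.
race_raw <- c("BLACK", "Black", "black", "WHITE HISPANIC", "White Hispanic")

# Distinct labels as typed versus after lowercasing.
n_before <- length(unique(race_raw))
n_after  <- length(unique(tolower(race_raw)))

print(c(before = n_before, after = n_after))
```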

Creating time_of_day column

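shooting_clean <- shooting_data %>% separate(
  col = occur_time,
  into = c("Hour", "Minute", "Second"),
  sep = ":"
)
# separate() returns character columns, so convert Hour before comparing
shooting_clean <- shooting_clean %>% mutate(
  time_of_day = case_when(
    as.integer(Hour) < 12 ~ "Morning",     # 00:00 to 11:59
    as.integer(Hour) < 18 ~ "Afternoon",   # 12:00 to 17:59
    as.integer(Hour) >= 18 ~ "Night"       # 18:00 to 23:59
  ))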

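To create the time_of_day column, I first split occur_time into separate Hour, Minute, and Second columns. Then I grouped the Hour values into Morning (before 12:00), Afternoon (12:00 to 17:59), and Night (18:00 onward). The counts for each group are Morning 12,222, Afternoon 5,439, and Night 12,083. Because the separate() call starts again from shooting_data rather than shooting_clean, the perp_race filtering and lowercasing from the previous steps are not carried into this version of shooting_clean, so these counts cover all 29,744 records.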
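The binning rule can be sketched in base R as below. `label_time_of_day` is a hypothetical helper written for this illustration, the sample times are made up, and the sketch assumes occur_time is zero-padded "HH:MM:SS" text.

```r
# Hypothetical helper: maps "HH:MM:SS" strings to the three labels.
label_time_of_day <- function(occur_time) {
  hour <- as.integer(substr(occur_time, 1, 2))  # numeric hour, 0 to 23
  ifelse(hour < 12, "Morning",
         ifelse(hour < 18, "Afternoon", "Night"))
}

# Made-up times sitting on and around the bin boundaries.
times <- c("00:30:00", "09:15:00", "12:00:00", "17:59:00", "18:00:00", "23:45:00")
labels <- label_time_of_day(times)
print(table(labels))

# Why convert first: comparing text with a number is done as text.
text_cmp <- "9" < 12              # FALSE, because "9" sorts after "12"
num_cmp  <- as.integer("9") < 12  # TRUE
print(c(text = text_cmp, numeric = num_cmp))
```

Note that 00:30 is labelled Morning under this rule, which is worth keeping in mind when reading the group counts.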

Insights

Time of Day

colnames(shooting_clean)
##  [1] "incident_key"                "occur_date"                 
##  [3] "Hour"                        "Minute"                     
##  [5] "Second"                      "boro"                       
##  [7] "loc_of_occur_desc"           "precinct"                   
##  [9] "jurisdiction_code"           "loc_classfctn_desc"         
## [11] "location_desc"               "statistical_murder_flag"    
## [13] "perp_age_group"              "perp_sex"                   
## [15] "perp_race"                   "vic_age_group"              
## [17] "vic_sex"                     "vic_race"                   
## [19] "x_coord_cd"                  "y_coord_cd"                 
## [21] "latitude"                    "longitude"                  
## [23] "geocoded_column.type"        "geocoded_column.coordinates"
## [25] "time_of_day"
shooting_clean %>% count(time_of_day)%>% arrange(desc(n))
##   time_of_day     n
## 1     Morning 12222
## 2       Night 12083
## 3   Afternoon  5439
shooting_clean %>% count(time_of_day,boro) %>% arrange(desc(n))
##    time_of_day          boro    n
## 1        Night      BROOKLYN 4793
## 2      Morning      BROOKLYN 4554
## 3        Night         BRONX 3761
## 4      Morning         BRONX 3517
## 5    Afternoon      BROOKLYN 2338
## 6      Morning        QUEENS 2072
## 7      Morning     MANHATTAN 1709
## 8        Night     MANHATTAN 1648
## 9        Night        QUEENS 1575
## 10   Afternoon         BRONX 1556
## 11   Afternoon        QUEENS  779
## 12   Afternoon     MANHATTAN  620
## 13     Morning STATEN ISLAND  370
## 14       Night STATEN ISLAND  306
## 15   Afternoon STATEN ISLAND  146
time_summary <- shooting_clean %>%
  filter(!is.na(time_of_day)) %>%
  count(time_of_day, name = "n") %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  arrange(desc(n))
time_summary
##   time_of_day     n  pct
## 1     Morning 12222 41.1
## 2       Night 12083 40.6
## 3   Afternoon  5439 18.3

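To see when shootings most often occur, I counted incidents in the Morning, Afternoon, and Night groups and arranged them from highest to lowest. Morning has the largest share (12,222 incidents; 41.1%), narrowly ahead of Night (12,083; 40.6%). Since Morning covers every hour before 12:00, it also includes the hours just after midnight.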
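The percentage column is each count divided by the total, times 100, rounded to one decimal. A short base R sketch using the three counts reported in the table above:

```r
# Counts taken from the time_of_day table above.
n <- c(Morning = 12222, Night = 12083, Afternoon = 5439)

total <- sum(n)                   # all labelled incidents
pct <- round(100 * n / total, 1)  # percentage share per group

print(total)
print(pct)
```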

Sex of Perpetrator

colnames(shooting_clean)
##  [1] "incident_key"                "occur_date"                 
##  [3] "Hour"                        "Minute"                     
##  [5] "Second"                      "boro"                       
##  [7] "loc_of_occur_desc"           "precinct"                   
##  [9] "jurisdiction_code"           "loc_classfctn_desc"         
## [11] "location_desc"               "statistical_murder_flag"    
## [13] "perp_age_group"              "perp_sex"                   
## [15] "perp_race"                   "vic_age_group"              
## [17] "vic_sex"                     "vic_race"                   
## [19] "x_coord_cd"                  "y_coord_cd"                 
## [21] "latitude"                    "longitude"                  
## [23] "geocoded_column.type"        "geocoded_column.coordinates"
## [25] "time_of_day"
shooting_clean_sex <- shooting_clean %>%
  filter(!is.na(perp_sex),
         !(perp_sex %in% c("U","(null)")))

shooting_clean_sex %>% count(perp_sex,boro)%>% arrange(desc(n))
##    perp_sex          boro    n
## 1         M      BROOKLYN 5971
## 2         M         BRONX 5279
## 3         M        QUEENS 2502
## 4         M     MANHATTAN 2484
## 5         M STATEN ISLAND  609
## 6         F      BROOKLYN  146
## 7         F         BRONX  134
## 8         F     MANHATTAN   87
## 9         F        QUEENS   79
## 10        F STATEN ISLAND   15
male_by_boro <- shooting_clean_sex %>%
  filter(perp_sex == "M", !is.na(boro)) %>%
  count(boro, name = "n") %>%
  arrange(desc(n)) %>%
  mutate(boro = str_to_title(boro))
male_by_boro
##            boro    n
## 1      Brooklyn 5971
## 2         Bronx 5279
## 3        Queens 2502
## 4     Manhattan 2484
## 5 Staten Island  609

I cleaned the data by removing rows where perpetrator sex was missing or coded as “U” or “(null)”, then counted shootings by sex and borough. Next, I focused on incidents with a male perpetrator and sorted the counts by borough. Brooklyn has the most (5,971 incidents; 35.4% of the 16,845 male-perpetrator incidents).
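The 35.4% figure does not appear in the printed output, so here is one way it can be reproduced from the male_by_boro counts above. This is a base R sketch using prop.table(); the report itself may have computed it differently.

```r
# Male-perpetrator counts by borough, copied from the table above.
male_n <- c(Brooklyn = 5971, Bronx = 5279, Queens = 2502,
            Manhattan = 2484, `Staten Island` = 609)

male_total <- sum(male_n)                          # all male-perpetrator incidents
male_share <- round(100 * prop.table(male_n), 1)   # percentage share per borough

print(male_total)
print(male_share["Brooklyn"])
```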

Tables & Graphs

Table (kable)

shooting_top <- shooting_clean %>%
  filter(!is.na(perp_sex), !(perp_sex %in% c("U", "(null)"))) %>%
  mutate(
    occur_date = as.Date(str_remove(occur_date, "T.*")),
    perp_sex = case_when(
      perp_sex == "M" ~ "Male",
      perp_sex == "F" ~ "Female",
      TRUE ~ perp_sex)) %>%
  select(occur_date, boro, time_of_day, perp_sex, perp_race) %>%
  head(10)
shooting_top
##    occur_date      boro time_of_day perp_sex      perp_race
## 1  2024-12-31  BROOKLYN       Night     Male          BLACK
## 2  2024-12-31  BROOKLYN       Night     Male          BLACK
## 3  2024-12-30     BRONX   Afternoon     Male          BLACK
## 4  2024-12-30  BROOKLYN       Night     Male          BLACK
## 5  2024-12-30     BRONX       Night     Male          BLACK
## 6  2024-12-29     BRONX   Afternoon     Male          BLACK
## 7  2024-12-28 MANHATTAN       Night     Male          BLACK
## 8  2024-12-28 MANHATTAN       Night   Female          BLACK
## 9  2024-12-27     BRONX       Night     Male BLACK HISPANIC
## 10 2024-12-27     BRONX       Night     Male BLACK HISPANIC
top_sex <- shooting_top %>% count(perp_sex, sort = TRUE) %>% slice(1)
kable(shooting_top) 
| occur_date | boro      | time_of_day | perp_sex | perp_race      |
|:-----------|:----------|:------------|:---------|:---------------|
| 2024-12-31 | BROOKLYN  | Night       | Male     | BLACK          |
| 2024-12-31 | BROOKLYN  | Night       | Male     | BLACK          |
| 2024-12-30 | BRONX     | Afternoon   | Male     | BLACK          |
| 2024-12-30 | BROOKLYN  | Night       | Male     | BLACK          |
| 2024-12-30 | BRONX     | Night       | Male     | BLACK          |
| 2024-12-29 | BRONX     | Afternoon   | Male     | BLACK          |
| 2024-12-28 | MANHATTAN | Night       | Male     | BLACK          |
| 2024-12-28 | MANHATTAN | Night       | Female   | BLACK          |
| 2024-12-27 | BRONX     | Night       | Male     | BLACK HISPANIC |
| 2024-12-27 | BRONX     | Night       | Male     | BLACK HISPANIC |

I removed rows missing perpetrator sex, converted occur_date to dates without the time stamp, recoded sex labels to “Male” and “Female,” selected only the key columns, and displayed the first 10 rows. Because top_sex is computed from shooting_top, it reflects only these 10 most recent rows, where the most common perpetrator sex is Male (9 of 10).
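The date trimming and sex recoding can be sketched in base R on a made-up two-row data frame (`toy` and `sex_labels` are hypothetical names). A named lookup vector stands in for case_when() here; unlike the TRUE ~ perp_sex fallback in the report's code, any code missing from the lookup would become NA.

```r
# Hypothetical two-row sample in the same shape as the API fields.
toy <- data.frame(
  occur_date = c("2024-12-31T00:00:00.000", "2024-12-28T00:00:00.000"),
  perp_sex = c("M", "F"),
  stringsAsFactors = FALSE
)

# Strip everything from "T" onward, then convert to Date.
toy$occur_date <- as.Date(sub("T.*", "", toy$occur_date))

# Recode single-letter codes through a named lookup vector.
sex_labels <- c(M = "Male", F = "Female")
toy$perp_sex <- unname(sex_labels[toy$perp_sex])

print(toy)
```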

Graphs (ggplot2)

Time of Day Plot
shooting_time<- shooting_clean %>% 
  group_by(time_of_day,boro) %>% 
  summarize(total=n())
## `summarise()` has grouped output by 'time_of_day'. You can override using the
## `.groups` argument.
ggplot(shooting_time, aes(x = time_of_day, y = total, fill = time_of_day)) +
  geom_col() +
  labs(title = "Time of Shootings in NYC",
       x = "Time of Day", y = "Number of Shootings",fill="Time of Day") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(size = 17, family = "Georgia", face = "bold"),
        axis.title.x = element_text(size = 12, family = "Georgia"),
        axis.title.y = element_text(size = 12, family = "Georgia"))

I grouped shootings by time of day and borough, counted incidents, and created a bar chart of total shootings by time of day. The fewest shooting incidents occur in the Afternoon (5,439).
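Since shooting_time has one row per time of day and borough, geom_col() stacks the borough rows, and each bar's height is the sum over boroughs. A base R sketch that rebuilds those bar heights from the borough-level counts printed in the Insights section (the data frame name `counts` is made up for this illustration):

```r
# Borough-level counts copied from the time_of_day x boro table above.
counts <- data.frame(
  time_of_day = rep(c("Night", "Morning", "Afternoon"), each = 5),
  boro = rep(c("BROOKLYN", "BRONX", "MANHATTAN", "QUEENS", "STATEN ISLAND"), times = 3),
  n = c(4793, 3761, 1648, 1575, 306,   # Night
        4554, 3517, 1709, 2072, 370,   # Morning
        2338, 1556, 620, 779, 146)     # Afternoon
)

# Sum over boroughs within each time of day: the stacked bar heights.
bar_heights <- tapply(counts$n, counts$time_of_day, sum)
print(bar_heights)

# Label of the shortest bar.
lowest <- names(which.min(bar_heights))
print(lowest)
```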

Sex of Perpetrator Plot
shooting_clean_perp_sex<- shooting_clean_sex %>% 
  group_by(perp_sex,boro) %>% 
  summarize(total=n())
## `summarise()` has grouped output by 'perp_sex'. You can override using the
## `.groups` argument.
shooting_clean_perp_sex <- shooting_clean_perp_sex %>%
  mutate(
    perp_sex = factor(perp_sex, levels = c("F","M"),
                      labels = c("Female","Male")))

ggplot(shooting_clean_perp_sex, aes(x = perp_sex, y = total, fill = perp_sex)) +
  geom_col() +
  facet_wrap(~ boro) +
  labs(
    title = "Shootings by Sex of Perpetrator (Faceted by Borough)",
    x = "Perpetrator Sex", y = "Number of Shootings", fill = "Perpetrator Sex"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title   = element_text(size = 17, family = "sans", face = "bold"),
    axis.title.x = element_text(size = 12, family = "sans"),
    axis.title.y = element_text(size = 12, family = "sans")
  )

I grouped the data by perpetrator sex and borough, counted incidents, and recoded sex labels to “Female” and “Male.” Then I plotted a faceted bar chart showing shootings by perpetrator sex for each borough. The borough with the fewest shootings is Staten Island (624 incidents with a recorded perpetrator sex).
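The 624 figure is the sum of the male and female counts for Staten Island. One way to check it is to cross-tabulate the sex by borough counts printed earlier and total each column. This is a base R sketch using xtabs(); the name `counts` is made up for this illustration.

```r
# Counts copied from the perp_sex x boro table in the Insights section.
counts <- data.frame(
  perp_sex = rep(c("M", "F"), each = 5),
  boro = rep(c("BROOKLYN", "BRONX", "QUEENS", "MANHATTAN", "STATEN ISLAND"), times = 2),
  n = c(5971, 5279, 2502, 2484, 609,   # M
        146, 134, 79, 87, 15)          # F
)

# Sex-by-borough contingency table weighted by the counts.
sex_by_boro <- xtabs(n ~ perp_sex + boro, data = counts)

# Column totals give incidents per borough across both sexes.
boro_totals <- colSums(sex_by_boro)
print(boro_totals)

# Borough with the smallest total.
smallest <- names(which.min(boro_totals))
print(smallest)
```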

Reflection

I think learning how to create an R Markdown document is going to be really helpful when I start working with my thesis dataset. It keeps my code and notes in a clear, easy-to-follow flow, so I can see exactly how each step of the analysis was done. When I come back to it later, it’ll work like a built-in guide to remind me what I did and make it easy to pick up right where I left off or share the workflow with others. This setup will help me stay organized and keep everything reproducible.