endpoint <- "https://data.cityofnewyork.us/resource/833y-fsy8.json"
resp <- httr::GET(endpoint, query = list("$limit" = 30000, "$order" = "occur_date DESC"))
shooting_data <- jsonlite::fromJSON(httr::content(resp, as = "text"), flatten = TRUE)
I pulled the NYPD shooting incident data directly from NYC Open Data by calling httr::GET() on the endpoint “https://data.cityofnewyork.us/resource/833y-fsy8.json”. The request asks for up to 30,000 records, sorted from newest to oldest by occur_date. The dataset covers incidents from 2006-01-01 to 2024-12-31.
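As a rough illustration of what flatten = TRUE does in the call above, here is a minimal sketch that parses a small made-up JSON string instead of the live endpoint. The field values are invented; only the field names mirror the dataset. Nested objects are spread into dotted column names, which is presumably where columns such as geocoded_column.type come from.

```r
library(jsonlite)

# Hypothetical sample shaped like the API response; the values are invented.
sample_json <- '[
  {"incident_key": "1", "geocoded_column": {"type": "Point", "coordinates": [-73.9, 40.7]}},
  {"incident_key": "2", "geocoded_column": {"type": "Point", "coordinates": [-73.8, 40.6]}}
]'

# flatten = TRUE turns the nested geocoded_column object into top-level columns
flat <- fromJSON(sample_json, flatten = TRUE)

# Nested fields appear as parent.child column names
names(flat)
```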
perp_race
colSums(is.na(shooting_data))
##                incident_key                  occur_date
##                           0                           0
##                  occur_time                        boro
##                           0                           0
##           loc_of_occur_desc                    precinct
##                       25596                           0
##           jurisdiction_code          loc_classfctn_desc
##                           2                       25596
##               location_desc     statistical_murder_flag
##                       14977                           0
##              perp_age_group                    perp_sex
##                        9344                        9310
##                   perp_race               vic_age_group
##                        9310                           0
##                     vic_sex                    vic_race
##                           0                           0
##                  x_coord_cd                  y_coord_cd
##                           0                           0
##                    latitude                   longitude
##                          97                          97
##        geocoded_column.type geocoded_column.coordinates
##                          97                           0
sum(is.na(shooting_data$perp_race))
## [1] 9310
shooting_clean<-shooting_data %>% filter(
!is.na(perp_race) &
!(perp_race %in% c("(NULL)","UNKNOWN","(null)")))
sum(is.na(shooting_clean$perp_race))
## [1] 0
First, I checked the number of NA values in each column. After considering which column to focus on, I picked perp_race. Then I checked the number of NA values in the perp_race column, which was 9310, and removed those rows along with rows coded “(NULL)”, “(null)”, or “UNKNOWN”. Lastly, I checked the column again after removal to confirm it was successful.
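The difference between true NA values and placeholder strings can be sketched on a toy vector (the values below are invented for illustration). is.na() does not catch strings such as "(null)" or "UNKNOWN", which is why the filter also needs an explicit %in% check.

```r
# Toy vector (invented values) mixing real NAs with placeholder strings
perp_race <- c("BLACK", NA, "(null)", "UNKNOWN", "WHITE", "(NULL)", NA)

# is.na() only counts true NA values
n_na <- sum(is.na(perp_race))

# Placeholder strings need an explicit membership check
placeholders <- c("(NULL)", "UNKNOWN", "(null)")
n_placeholder <- sum(perp_race %in% placeholders)

# Keep entries that are neither NA nor a placeholder
kept <- perp_race[!is.na(perp_race) & !(perp_race %in% placeholders)]

n_na           # 2
n_placeholder  # 3
kept           # "BLACK" "WHITE"
```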
perp_race values lowercase
shooting_clean <- shooting_clean %>% mutate(
  perp_race = stringr::str_to_lower(perp_race))
I cleaned the perp_race column by making every entry lowercase to prevent race duplication caused by capitalization differences. There are now 6 distinct race groups.
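A small sketch of why lowercasing matters, using invented labels. It uses base R's tolower() as a stand-in for stringr::str_to_lower(), on the assumption that the labels are plain ASCII text.

```r
# Invented labels that differ only by capitalization
race <- c("BLACK", "Black", "black", "WHITE HISPANIC", "white hispanic")

# Before lowercasing, each spelling counts as its own group
n_before <- length(unique(race))

# After lowercasing, spellings that differ only by case collapse together
n_after <- length(unique(tolower(race)))

n_before  # 5
n_after   # 2
```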
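time_of_day column
# Note: this starts again from shooting_data, so the perp_race filter and
# lowercasing above are not carried into shooting_clean.
shooting_clean <- shooting_data %>% separate(
  col = occur_time,
  into = c("Hour", "Minute", "Second"),
  sep = ":",
  convert = TRUE  # parse Hour, Minute, Second as integers so Hour < 12 is numeric
)
shooting_clean <- shooting_clean %>% mutate(
  time_of_day = case_when(
    Hour < 12 ~ "Morning",    # 00:00 to 11:59
    Hour < 18 ~ "Afternoon",  # 12:00 to 17:59
    Hour >= 18 ~ "Night"      # 18:00 to 23:59
  ))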
To create the time_of_day column, I first split occur_time into separate Hour, Minute, and Second columns. Then I grouped the Hour values into Morning (before 12:00), Afternoon (12:00 to 17:59), and Night (18:00 onward). The counts for each group are Morning 12222, Afternoon 5439, and Night 12083, reflecting the times when shootings occurred in each group.
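The same binning can also be sketched with base R's cut(), which makes the interval boundaries explicit. This is an alternative to the case_when() rules above, not the method used in this report, and the sample times are invented.

```r
# Toy times in the same "HH:MM:SS" format as occur_time (invented values)
occur_time <- c("00:15:00", "09:30:00", "12:00:00", "17:59:00", "18:00:00", "23:45:00")

# Extract the hour as an integer so comparisons are numeric, not lexical
hour <- as.integer(substr(occur_time, 1, 2))

# Same cut points as the case_when() rules: before 12, 12 to 17, 18 and later
time_of_day <- cut(
  hour,
  breaks = c(0, 12, 18, 24),
  labels = c("Morning", "Afternoon", "Night"),
  right = FALSE  # intervals are [0, 12), [12, 18), [18, 24)
)

table(time_of_day)
```

With right = FALSE, an incident at exactly 12:00 falls in Afternoon and one at exactly 18:00 falls in Night, matching the case_when() boundaries.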
colnames(shooting_clean)
## [1] "incident_key" "occur_date"
## [3] "Hour" "Minute"
## [5] "Second" "boro"
## [7] "loc_of_occur_desc" "precinct"
## [9] "jurisdiction_code" "loc_classfctn_desc"
## [11] "location_desc" "statistical_murder_flag"
## [13] "perp_age_group" "perp_sex"
## [15] "perp_race" "vic_age_group"
## [17] "vic_sex" "vic_race"
## [19] "x_coord_cd" "y_coord_cd"
## [21] "latitude" "longitude"
## [23] "geocoded_column.type" "geocoded_column.coordinates"
## [25] "time_of_day"
shooting_clean %>% count(time_of_day)%>% arrange(desc(n))
## time_of_day n
## 1 Morning 12222
## 2 Night 12083
## 3 Afternoon 5439
shooting_clean %>% count(time_of_day,boro) %>% arrange(desc(n))
## time_of_day boro n
## 1 Night BROOKLYN 4793
## 2 Morning BROOKLYN 4554
## 3 Night BRONX 3761
## 4 Morning BRONX 3517
## 5 Afternoon BROOKLYN 2338
## 6 Morning QUEENS 2072
## 7 Morning MANHATTAN 1709
## 8 Night MANHATTAN 1648
## 9 Night QUEENS 1575
## 10 Afternoon BRONX 1556
## 11 Afternoon QUEENS 779
## 12 Afternoon MANHATTAN 620
## 13 Morning STATEN ISLAND 370
## 14 Night STATEN ISLAND 306
## 15 Afternoon STATEN ISLAND 146
time_summary <- shooting_clean %>%
filter(!is.na(time_of_day)) %>%
count(time_of_day, name = "n") %>%
mutate(pct = round(100 * n / sum(n), 1)) %>%
arrange(desc(n))
time_summary
## time_of_day n pct
## 1 Morning 12222 41.1
## 2 Night 12083 40.6
## 3 Afternoon 5439 18.3
To see when shootings most often occur, I counted incidents in the Morning, Afternoon, and Night and arranged them from highest to lowest. Morning has the most incidents (12222 cases; 41.1%), narrowly ahead of Night (12083 cases; 40.6%).
shooting_clean_sex <- shooting_clean %>%
filter(!is.na(perp_sex),
!(perp_sex %in% c("U","(null)")))
shooting_clean_sex %>% count(perp_sex,boro)%>% arrange(desc(n))
## perp_sex boro n
## 1 M BROOKLYN 5971
## 2 M BRONX 5279
## 3 M QUEENS 2502
## 4 M MANHATTAN 2484
## 5 M STATEN ISLAND 609
## 6 F BROOKLYN 146
## 7 F BRONX 134
## 8 F MANHATTAN 87
## 9 F QUEENS 79
## 10 F STATEN ISLAND 15
male_by_boro <- shooting_clean_sex %>%
filter(perp_sex == "M", !is.na(boro)) %>%
count(boro, name = "n") %>%
arrange(desc(n)) %>%
mutate(boro = str_to_title(boro))
male_by_boro
## boro n
## 1 Brooklyn 5971
## 2 Bronx 5279
## 3 Queens 2502
## 4 Manhattan 2484
## 5 Staten Island 609
I cleaned the data by removing rows where perpetrator sex was missing or unknown (“U” or “(null)”), then counted shootings by sex and borough. Then I focused on male-perpetrator incidents by borough and sorted the counts. Brooklyn has the most male-perpetrator incidents (5971 cases; 35.4% of the 16845 male-perpetrator incidents).
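The Brooklyn percentage is not computed in the code shown, so here is one way it could be reproduced. This sketch rebuilds the table from the printed counts above rather than from the live data frame.

```r
# Male-perpetrator counts by borough, copied from the output above
male_by_boro <- data.frame(
  boro = c("Brooklyn", "Bronx", "Queens", "Manhattan", "Staten Island"),
  n    = c(5971, 5279, 2502, 2484, 609)
)

# Share of all male-perpetrator incidents, in percent
male_by_boro$pct <- round(100 * male_by_boro$n / sum(male_by_boro$n), 1)

male_by_boro
```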
shooting_top <- shooting_clean %>%
  filter(!is.na(perp_sex), !(perp_sex %in% c("U", "(null)"))) %>%
  mutate(
    occur_date = as.Date(str_remove(occur_date, "T.*")),
    perp_sex = case_when(
      perp_sex == "M" ~ "Male",
      perp_sex == "F" ~ "Female",
      TRUE ~ perp_sex)) %>%
  select(occur_date, boro, time_of_day, perp_sex, perp_race) %>%
  head(10)
shooting_top
## occur_date boro time_of_day perp_sex perp_race
## 1 2024-12-31 BROOKLYN Night Male BLACK
## 2 2024-12-31 BROOKLYN Night Male BLACK
## 3 2024-12-30 BRONX Afternoon Male BLACK
## 4 2024-12-30 BROOKLYN Night Male BLACK
## 5 2024-12-30 BRONX Night Male BLACK
## 6 2024-12-29 BRONX Afternoon Male BLACK
## 7 2024-12-28 MANHATTAN Night Male BLACK
## 8 2024-12-28 MANHATTAN Night Female BLACK
## 9 2024-12-27 BRONX Night Male BLACK HISPANIC
## 10 2024-12-27 BRONX Night Male BLACK HISPANIC
top_sex <- shooting_top %>% count(perp_sex, sort = TRUE) %>% slice(1)
kable(shooting_top)
occur_date | boro | time_of_day | perp_sex | perp_race |
---|---|---|---|---|
2024-12-31 | BROOKLYN | Night | Male | BLACK |
2024-12-31 | BROOKLYN | Night | Male | BLACK |
2024-12-30 | BRONX | Afternoon | Male | BLACK |
2024-12-30 | BROOKLYN | Night | Male | BLACK |
2024-12-30 | BRONX | Night | Male | BLACK |
2024-12-29 | BRONX | Afternoon | Male | BLACK |
2024-12-28 | MANHATTAN | Night | Male | BLACK |
2024-12-28 | MANHATTAN | Night | Female | BLACK |
2024-12-27 | BRONX | Night | Male | BLACK HISPANIC |
2024-12-27 | BRONX | Night | Male | BLACK HISPANIC |
I removed rows with missing or unknown perpetrator sex, converted occur_date to dates without the time stamp, recoded sex labels to “Male” and “Female,” selected only the key columns, and displayed the first 10 rows. Among these 10 most recent rows, the most common perpetrator sex is Male (9 of 10).
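The date conversion step can be sketched in isolation. This version uses base R's sub() in place of stringr's str_remove(), and the two timestamps follow the format the API returns for occur_date.

```r
# Example timestamps in the format returned for occur_date
raw <- c("2024-12-31T00:00:00.000", "2006-01-01T00:00:00.000")

# Remove everything from the "T" onward, then parse the rest as a Date
dates <- as.Date(sub("T.*$", "", raw))

dates
class(dates)  # "Date"
```

Storing the column as a Date rather than a string means that sorting and date arithmetic use calendar order.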
shooting_time<- shooting_clean %>%
group_by(time_of_day,boro) %>%
summarize(total=n())
## `summarise()` has grouped output by 'time_of_day'. You can override using the
## `.groups` argument.
ggplot(shooting_time, aes(x = time_of_day, y = total, fill = time_of_day)) +
geom_col() +
labs(title = "Time of Shootings in NYC",
x = "Time of Day", y = "Number of Shootings",fill="Time of Day") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(size = 17, family = "Georgia", face = "bold"),
axis.title.x = element_text(size = 12, family = "Georgia"),
axis.title.y = element_text(size = 12, family = "Georgia"))
I grouped shootings by time of day and borough, counted incidents, and created a bar chart of total shootings by time of day. The fewest shooting incidents occur in the Afternoon (5439).
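The summarise() message in the output above appears because the result is still grouped by time_of_day. One way to avoid it is the .groups argument, sketched below on invented toy data with the same two grouping columns.

```r
library(dplyr)

# Toy data (invented) with the same two grouping columns
toy <- data.frame(
  time_of_day = c("Morning", "Morning", "Night", "Night", "Night"),
  boro        = c("BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS")
)

# .groups = "drop" returns an ungrouped result and suppresses the message
toy_counts <- toy %>%
  group_by(time_of_day, boro) %>%
  summarize(total = n(), .groups = "drop")

toy_counts
```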
shooting_clean_perp_sex<- shooting_clean_sex %>%
group_by(perp_sex,boro) %>%
summarize(total=n())
## `summarise()` has grouped output by 'perp_sex'. You can override using the
## `.groups` argument.
shooting_clean_perp_sex <- shooting_clean_perp_sex %>%
mutate(
perp_sex = factor(perp_sex, levels = c("F","M"),
labels = c("Female","Male")))
ggplot(shooting_clean_perp_sex, aes(x = perp_sex, y = total, fill = perp_sex)) +
geom_col() +
facet_wrap(~ boro) +
labs(
title = "Shootings by Sex of Perpetrator (Faceted by Borough)",
x = "Perpetrator Sex", y = "Number of Shootings", fill = "Perpetrator Sex"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(size = 17, family = "sans", face = "bold"),
axis.title.x = element_text(size = 12, family = "sans"),
axis.title.y = element_text(size = 12, family = "sans")
)
I grouped the data by perpetrator sex and borough, counted incidents, and recoded sex labels to “Female” and “Male.” Then I plotted a faceted bar chart showing shootings by perpetrator sex for each borough. Among incidents with a known perpetrator sex, the borough with the fewest shootings is Staten Island (624 incidents: 609 male and 15 female).
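The Staten Island total can be checked by summing the male and female counts within each borough. This sketch uses the counts printed earlier in this report rather than the live data frame.

```r
# Counts by perpetrator sex and borough, copied from the earlier output
by_sex_boro <- data.frame(
  perp_sex = rep(c("M", "F"), each = 5),
  boro     = rep(c("BROOKLYN", "BRONX", "QUEENS", "MANHATTAN", "STATEN ISLAND"), times = 2),
  n        = c(5971, 5279, 2502, 2484, 609,   # male
               146, 134, 79, 87, 15)          # female
)

# Sum male and female counts within each borough
boro_totals <- tapply(by_sex_boro$n, by_sex_boro$boro, sum)

sort(boro_totals)
names(which.min(boro_totals))  # "STATEN ISLAND"
```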
I think learning how to create an R Markdown document is going to be really helpful when I start working with my thesis dataset. It keeps my code and notes in a clear, easy-to-follow flow, so I can see exactly how each step of the analysis was done. When I come back to it later, it’ll work like a built-in guide to remind me what I did and make it easy to pick up right where I left off or share the workflow with others. This setup will help me stay organized and keep everything reproducible.