Loading the Dataset

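library(httr)      # GET(), content()
library(jsonlite)  # fromJSON()
library(dplyr)     # mutate(), filter(), count(), %>%
library(tidyr)     # separate()
library(ggplot2)   # plots
library(knitr)     # kable()

endpoint <- "https://data.cityofnewyork.us/resource/833y-fsy8.json"

# Request up to 30,000 records, newest incidents first
resp <- GET(endpoint, query = list(
  "$limit" = 30000,
  "$order" = "occur_date DESC"
  ))

shooting_data <- jsonlite::fromJSON(content(resp, as = "text"), flatten = TRUE)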

This code pulls the public dataset from the NYC Open Data API and loads it into R as a data frame. Now I can start working with it.
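To see what the parsing step hands back, here is a minimal sketch that runs fromJSON() on a tiny hand-written payload instead of the live API. The two records and their values are invented for illustration, and the timestamp layout is my assumption about how the API formats occur_date; only the field names come from the analysis.

```r
library(jsonlite)

# Hypothetical two-record payload: values are made up, field names match the dataset
sample_json <- '[
  {"occur_date": "2024-01-01T00:00:00.000", "occur_time": "21:15:00", "perp_sex": "M"},
  {"occur_date": "2024-01-02T00:00:00.000", "occur_time": "03:40:00", "perp_sex": "(null)"}
]'

# A JSON array of objects becomes a data frame with one row per object
sample_df <- fromJSON(sample_json, flatten = TRUE)

# Every column arrives as character, so dates and times still need parsing later
str(sample_df)
```

With the real request, calling httr::stop_for_status(resp) before parsing would be one way to stop early if the download failed.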

Data Cleaning

# Convert the literal string "(null)" to NA in all four columns at once
shooting_data_new <- shooting_data %>%
  mutate(
    across(
      c(perp_age_group, location_desc, perp_sex, perp_race),
      ~ na_if(.x, "(null)")
    )
  )

sum(is.na(
  shooting_data_new$perp_age_group)
  )
## [1] 10972
shooting_data_new <- shooting_data_new %>% select(1:16)
shooting_data_2 <- shooting_data_new %>% filter(!is.na(perp_age_group))

First, I converted every "(null)" string in perp_age_group, location_desc, perp_sex, and perp_race to NA so that R recognizes those values as missing. I then removed the rows where perp_age_group is NA, since we don't have enough information about the perpetrator in those incidents.
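The same two steps can be sketched on a small made-up table. The four rows below are hypothetical and only illustrate how na_if() and filter() interact; they are not taken from the dataset.

```r
library(dplyr)

# Hypothetical rows: one "(null)" string and one true NA in each column
toy <- tibble(
  perp_age_group = c("18-24", "(null)", NA, "25-44"),
  perp_sex       = c("M", "(null)", NA, "F")
)

# Step 1: turn the "(null)" placeholder into a real NA
toy_clean <- toy %>%
  mutate(across(c(perp_age_group, perp_sex), ~ na_if(.x, "(null)")))

# Both the placeholder and the original NA now count as missing
n_missing <- sum(is.na(toy_clean$perp_age_group))

# Step 2: keep only rows with a known perpetrator age group
toy_kept <- toy_clean %>% filter(!is.na(perp_age_group))
```

Without step 1, the "(null)" row would pass the is.na() filter and stay in the data.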

shooting_data_2 <- shooting_data_2 %>% separate(
  col = occur_time,
  into = c("Hour","Minute","Second"),
  sep = ":",
)

shooting_data_2 <- shooting_data_2 %>% mutate(Hour = as.numeric(Hour))
# Bucket each hour into a time-of-day label
shooting_data_2 <- shooting_data_2 %>%
  mutate(
    time_of_day = case_when(
      Hour >= 3 & Hour < 12  ~ "Morning",
      Hour >= 12 & Hour < 18 ~ "Afternoon",
      Hour >= 18 | Hour < 3 ~ "Night"
    )
  )

shooting_data_clean <- shooting_data_2 %>% select(1:19)

Next, I split the occur_time column into Hour, Minute, and Second columns. I used the Hour column to create a time_of_day column that labels each shooting as morning (3:00 to 11:59), afternoon (12:00 to 17:59), or night (18:00 to 2:59).
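To double-check the bucket boundaries, here is a small sketch that applies the same case_when() rules to a handful of hours sitting right at the edges. The hours are chosen by hand for the check and are not rows from the dataset.

```r
library(dplyr)

# Hours on either side of each boundary (3, 12, and 18)
edge_hours <- tibble(Hour = c(0, 2, 3, 11, 12, 17, 18, 23))

bucketed <- edge_hours %>%
  mutate(
    time_of_day = case_when(
      Hour >= 3 & Hour < 12  ~ "Morning",
      Hour >= 12 & Hour < 18 ~ "Afternoon",
      Hour >= 18 | Hour < 3  ~ "Night"
    )
  )

bucketed
```

Every hour from 0 to 23 falls into exactly one bucket, so time_of_day should only be NA when Hour itself is missing.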

shooting_data_clean <- shooting_data_clean %>% separate(
  col = occur_date,
  into = c("Year","Month","Day"),
  sep = "-",
)
shooting_data_clean$Day <- sub("T.*", "", shooting_data_clean$Day)

Finally, I split the occur_date column into Year, Month, and Day columns, and stripped the trailing timestamp from Day, so that I can run the yearly insights and graphs below.
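An alternative to splitting the string by hand is to parse it as a date first. This is only a sketch in base R: the two example values are hypothetical, and I am assuming the date always occupies the first ten characters, before the "T".

```r
# Hypothetical occur_date values in the assumed "YYYY-MM-DDT..." layout
occur_date <- c("2006-07-04T00:00:00.000", "2024-12-31T00:00:00.000")

# Keep the date portion and convert it to a Date object
parsed <- as.Date(substr(occur_date, 1, 10))

# Pull the components out as integers rather than strings
date_parts <- data.frame(
  Year  = as.integer(format(parsed, "%Y")),
  Month = as.integer(format(parsed, "%m")),
  Day   = as.integer(format(parsed, "%d"))
)

date_parts
```

The trade-off is that Year becomes numeric here, whereas separate() leaves it as a character column.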

Insights

Insight 1

shooting_data_clean %>% count(perp_sex)
##   perp_sex     n
## 1        F   461
## 2        M 16845
## 3        U  1466

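This insight shows us that, among incidents with a known perpetrator age group, men committed roughly 36.5 times as many shootings as women (16,845 versus 461). Another 1,466 incidents list the perpetrator's sex as unknown.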
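The ratio can be checked directly from the counts printed above. This sketch just re-enters the three numbers from the output by hand.

```r
# Counts copied from the count(perp_sex) output
perp_sex_counts <- c(F = 461, M = 16845, U = 1466)

# How many male-perpetrator shootings there are per female-perpetrator shooting
male_to_female <- perp_sex_counts[["M"]] / perp_sex_counts[["F"]]

# Share of all retained incidents with a male perpetrator
male_share <- perp_sex_counts[["M"]] / sum(perp_sex_counts)

round(male_to_female, 1)
round(100 * male_share, 1)
```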

Insight 2

shooting_data_clean %>% count(perp_sex,vic_sex) %>% arrange(desc(n))
##   perp_sex vic_sex     n
## 1        M       M 15008
## 2        M       F  1830
## 3        U       M  1353
## 4        F       M   380
## 5        U       F   112
## 6        F       F    80
## 7        M       U     7
## 8        F       U     1
## 9        U       U     1

From 2006 to 2024, males have been the most common perpetrators as well as the most common victims. Female perpetrators shot male victims far more often than female victims (380 versus 80).
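To compare the groups on the same footing, the counts can be turned into shares within each perpetrator sex. This is a sketch that re-enters the table above by hand; the share column is my own addition and is not part of the original output.

```r
library(dplyr)

# Counts copied from the count(perp_sex, vic_sex) output
pair_counts <- tribble(
  ~perp_sex, ~vic_sex, ~n,
  "M", "M", 15008,
  "M", "F",  1830,
  "U", "M",  1353,
  "F", "M",   380,
  "U", "F",   112,
  "F", "F",    80,
  "M", "U",     7,
  "F", "U",     1,
  "U", "U",     1
)

# Share of each perpetrator group's shootings that had a male victim
male_victim_share <- pair_counts %>%
  group_by(perp_sex) %>%
  mutate(share = n / sum(n)) %>%
  ungroup() %>%
  filter(vic_sex == "M")

male_victim_share
```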

Tables and Graphs

Graph 1

shooting_by_time_of_day <- shooting_data_clean %>% 
  group_by(time_of_day) %>% 
  summarize(total = n())

ggplot(shooting_by_time_of_day, aes(x = time_of_day, y = total)) +
  geom_bar(stat = 'identity', fill = 'steelblue') +
  labs(title = "Frequency of Shootings in NYC by Time of Day",
       x = "Time of Day",
       y = "Total") +
  theme(
    plot.title = element_text(size=15, family = "serif", face = "bold")
  )

This graph shows us that most shootings occur at night.
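Two details may be worth keeping in mind when reading this graph. A character column is plotted in alphabetical order (Afternoon, Morning, Night), and the three buckets do not cover the same number of hours. The sketch below shows one way to put the labels in chronological order with a factor; the sample labels are hypothetical, and the hour widths are simply derived from the case_when() rules used earlier.

```r
# Chronological order for the x-axis
time_levels <- c("Morning", "Afternoon", "Night")

# Hours covered by each bucket: 3-11, 12-17, and 18-2
bucket_hours <- c(Morning = 9, Afternoon = 6, Night = 9)

# Hypothetical sample of labels, in arbitrary order
sample_labels <- c("Night", "Morning", "Afternoon", "Night")

# A factor with explicit levels keeps the chosen order when plotted
ordered_labels <- factor(sample_labels, levels = time_levels)

levels(ordered_labels)
```

Applying the same factor() call to time_of_day before plotting should reorder the bars, and dividing each total by bucket_hours would give a per-hour comparison.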

Table 1

# Count the shootings flagged as murders in each year
murders_summary <- shooting_data_clean %>%
  filter(statistical_murder_flag == TRUE) %>%
  count(Year, name = "total")

murders_summary %>% kable(caption = "Gun Murders per Year, NYC: 2006-2024")
Gun Murders per Year, NYC: 2006-2024
Year total
2006 378
2007 243
2008 240
2009 234
2010 216
2011 221
2012 169
2013 143
2014 159
2015 186
2016 147
2017 129
2018 147
2019 122
2020 214
2021 274
2022 243
2023 188
2024 175

This table gives us a rundown of how many shootings resulted in murder each year from 2006 to 2024.

Graph 2

ggplot(murders_summary, aes(x = Year, y = total, group = 1))+
  geom_line(color = 'red', linewidth = 1) +
  labs(title = "Gun Murders in NYC per Year",
       x = "Year",
       y = "Murders by Gun") +
  theme(
    plot.title = element_text(size=20, family = "serif", face = "bold")
  )

This graph shows us the trend line of how many shootings resulted in murder each year from 2006-2024.

The average number of shootings that resulted in murder in NYC is about 201.5 per year.
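The average can be reproduced from the yearly totals in Table 1. This sketch re-enters the 19 values by hand rather than reading them from murders_summary.

```r
# Yearly totals copied from Table 1, 2006 through 2024
murders <- c(378, 243, 240, 234, 216, 221, 169, 143, 159, 186,
             147, 129, 147, 122, 214, 274, 243, 188, 175)
names(murders) <- 2006:2024

# Mean number of gun murders per year
avg_murders <- mean(murders)

# Years with the highest and lowest totals
peak_year <- names(murders)[which.max(murders)]
low_year  <- names(murders)[which.min(murders)]

round(avg_murders, 1)
```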

Reflection

I could see this workflow helping me in my thesis research because it seems to be a good tool for creating a coherent and comprehensive document to look back on and follow. Additionally, it seems easy to share my thought process/analyses with my mentor, and an accessible way for her to collaborate on my code if need be. While I still have to figure out the best way to utilize this kind of workflow, I can definitely see it having benefits to keep everything organized for myself and my data, and for reproducibility purposes.

NYC Shooting Data