Homework: R Script → R Markdown Report

Loading the Dataset

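library(httr)      # GET(), content()
library(jsonlite)  # fromJSON()
library(dplyr)     # mutate(), filter(), count(), %>%
library(tidyr)     # separate()
library(ggplot2)   # plots
library(knitr)     # kable()

endpoint <- "https://data.cityofnewyork.us/resource/833y-fsy8.json"

# Request up to 30,000 rows, most recent incidents first
resp <- GET(endpoint, query = list(
  "$limit" = 30000,
  "$order" = "occur_date DESC"
))

shooting_data <- jsonlite::fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)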

This code pulls the public dataset from the NYC Open Data API and loads it into R as a data frame, so I can start working with it.
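To see what fromJSON() hands back, here is a minimal sketch that parses a two-record JSON string shaped like the API response. The field names are taken from columns used later in this report; the record values are invented for illustration, and the exact layout of the real response is an assumption.

```r
library(jsonlite)

# Hypothetical two-record payload; the values are invented for illustration
json_text <- '[
  {"occur_date": "2024-12-31T00:00:00.000", "occur_time": "21:15:00", "perp_sex": "M"},
  {"occur_date": "2024-12-30T00:00:00.000", "occur_time": "02:40:00", "perp_sex": "(null)"}
]'

# A JSON array of objects is simplified into a data frame, one row per object
toy <- fromJSON(json_text, flatten = TRUE)

# All three columns are character here, so later steps have to convert types
str(toy)
```

This is why the cleaning steps below convert Hour with as.numeric() before comparing it to numbers.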

Data Cleaning

# Convert the literal string "(null)" to NA in all four columns in one step
shooting_data_new <- shooting_data %>%
  mutate(
    across(
      c(perp_age_group, location_desc, perp_sex, perp_race),
      ~ na_if(.x, "(null)")
    )
  )

sum(is.na(
  shooting_data_new$perp_age_group)
  )

## [1] 10972

shooting_data_new <- shooting_data_new %>% select(1:16)
shooting_data_2 <- shooting_data_new %>% filter(!is.na(perp_age_group))

First, I converted every (null) value in the perpetrator and location columns to NA so that R recognizes them as missing. I then dropped the 10,972 rows with a missing perp_age_group, since those records do not carry enough information about the perpetrator to analyze.
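The effect of this step is easier to see on a tiny example. The sketch below uses an invented four-row data frame, not the real dataset, and assumes dplyr 1.0 or later for across().

```r
library(dplyr)

# Toy data; the values are invented for illustration
toy <- data.frame(
  perp_age_group = c("18-24", "(null)", "25-44", NA),
  perp_sex       = c("M", "(null)", "F", "M"),
  stringsAsFactors = FALSE
)

# Turn the literal string "(null)" into a real NA in every column
toy_clean <- toy %>%
  mutate(across(everything(), ~ na_if(.x, "(null)")))

# Counts both the converted "(null)" and the NA that was already there
n_missing <- sum(is.na(toy_clean$perp_age_group))

# Keep only the rows where the age group is known
toy_kept <- toy_clean %>% filter(!is.na(perp_age_group))

n_missing
nrow(toy_kept)
```

Without the na_if() step, filter(!is.na(...)) would keep the "(null)" row, because a string is not a missing value.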

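# Split "HH:MM:SS" into three character columns
shooting_data_2 <- shooting_data_2 %>% separate(
  col = occur_time,
  into = c("Hour", "Minute", "Second"),
  sep = ":"
)

# Convert Hour to a number so it can be compared against the bin edges
shooting_data_2 <- shooting_data_2 %>% mutate(Hour = as.numeric(Hour))

# Bin the hour: Morning = 03:00-11:59, Afternoon = 12:00-17:59, Night = 18:00-02:59
shooting_data_2 <- shooting_data_2 %>%
  mutate(
    time_of_day = case_when(
      Hour >= 3 & Hour < 12  ~ "Morning",
      Hour >= 12 & Hour < 18 ~ "Afternoon",
      Hour >= 18 | Hour < 3  ~ "Night"
    )
  )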

shooting_data_clean <- shooting_data_2 %>% select(1:19)

Next, I split the occur_time column into Hour, Minute, and Second columns. I used the Hour column to create a time_of_day column that labels each shooting as Morning (03:00 to 11:59), Afternoon (12:00 to 17:59), or Night (18:00 to 02:59).
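To check how the bin edges behave, here is a small base R sketch of the same rule applied to the boundary hours. The helper name label_time_of_day is hypothetical and only used for this illustration; the report itself uses case_when().

```r
# Hypothetical helper mirroring the case_when() rule above.
# Night wraps around midnight: hours 18-23 and 0-2.
label_time_of_day <- function(hour) {
  ifelse(hour >= 3 & hour < 12, "Morning",
    ifelse(hour >= 12 & hour < 18, "Afternoon", "Night"))
}

# Hours on either side of each bin edge
edge_hours <- c(0, 2, 3, 11, 12, 17, 18, 23)

data.frame(hour = edge_hours, time_of_day = label_time_of_day(edge_hours))
```

Under this rule the Morning and Night bins each cover nine hours and the Afternoon bin covers six.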

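# Split the date-time string on "-" into Year, Month, and Day
shooting_data_clean <- shooting_data_clean %>% separate(
  col = occur_date,
  into = c("Year", "Month", "Day"),
  sep = "-"
)
# Day still carries the time suffix (everything from "T" on), so strip it
shooting_data_clean$Day <- sub("T.*", "", shooting_data_clean$Day)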

Finally, I split the occur_date column into Year, Month, and Day columns, which the yearly table and graph below rely on.
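An alternative to splitting the string by hand is to parse it as a date first. The sketch below is base R only and assumes the timestamps look like "2024-12-31T00:00:00.000", which is inferred from the sub("T.*", ...) step above rather than confirmed against the API documentation.

```r
# Assumed timestamp format; the two values are invented for illustration
x <- c("2024-12-31T00:00:00.000", "2006-01-01T00:00:00.000")

# Keep the first 10 characters ("YYYY-MM-DD") and parse them as a Date
d <- as.Date(substr(x, 1, 10), format = "%Y-%m-%d")

# Pull the pieces back out as zero-padded strings
parts <- data.frame(
  Year  = format(d, "%Y"),
  Month = format(d, "%m"),
  Day   = format(d, "%d"),
  stringsAsFactors = FALSE
)

parts
```

Keeping a real Date column around also makes it possible to sort or compute differences between incidents later.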

Insights

Insight 1

shooting_data_clean %>% count(perp_sex)

##   perp_sex     n
## 1        F   461
## 2        M 16845
## 3        U  1466

This insight shows us that, among shootings with a recorded perpetrator sex, men committed more than 36 times as many shootings as women (16,845 versus 461).
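The male-to-female ratio can be checked directly from the counts printed above. This is a small arithmetic sketch in base R; the variable names are only for illustration.

```r
# Counts copied from the count(perp_sex) output above
perp_counts <- c(F = 461, M = 16845, U = 1466)

# How many shootings by male perpetrators per shooting by a female perpetrator
male_to_female <- perp_counts[["M"]] / perp_counts[["F"]]
round(male_to_female, 1)  # 36.5

# Each category as a percentage of all rows in the cleaned data
share_pct <- round(100 * perp_counts / sum(perp_counts), 1)
share_pct
```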

Insight 2

shooting_data_clean %>% count(perp_sex,vic_sex) %>% arrange(desc(n))

##   perp_sex vic_sex     n
## 1        M       M 15008
## 2        M       F  1830
## 3        U       M  1353
## 4        F       M   380
## 5        U       F   112
## 6        F       F    80
## 7        M       U     7
## 8        F       U     1
## 9        U       U     1

Across the 19 years of data (2006-2024), males were both the most common perpetrators and the most common victims. Female perpetrators shot male victims far more often (380) than female victims (80).
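Raw counts are dominated by the large male group, so it can help to look at row shares instead: for each perpetrator sex, what fraction of victims fall in each category. A minimal base R sketch using the counts printed above:

```r
# Counts copied from the count(perp_sex, vic_sex) output above
counts <- matrix(
  c(  80,   380, 1,
    1830, 15008, 7,
     112,  1353, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(perp_sex = c("F", "M", "U"), vic_sex = c("F", "M", "U"))
)

# margin = 1 divides each row by its row total, so every row sums to 1
row_share <- prop.table(counts, margin = 1)

round(row_share, 3)
```

Read across a row to see the victim mix for that perpetrator category.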

Tables and Graphs

Graph 1

shooting_by_time_of_day <- shooting_data_clean %>% 
  group_by(time_of_day) %>% 
  summarize(total = n())

ggplot(shooting_by_time_of_day, aes(x = time_of_day, y = total)) +
  geom_bar(stat = 'identity', fill = 'steelblue') +
  labs(title = "Frequency of Shootings in NYC by Time of Day",
       x = "Time of Day",
       y = "Total") +
  theme(
    plot.title = element_text(size = 15, family = "serif", face = "bold")
  )

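This graph shows us that shootings occur most often at night. The Night and Morning bins each cover nine hours while the Afternoon bin covers six, so the totals are not directly comparable on a per-hour basis.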
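Because time_of_day is a character column, ggplot2 places the bars in alphabetical order (Afternoon, Morning, Night). One way to get chronological order is to convert the column to a factor with explicit levels before plotting. The sketch below shows just that step in base R on invented values.

```r
# Invented values for illustration
time_of_day <- c("Night", "Morning", "Afternoon", "Night")

# Default factor levels are alphabetical
default_levels <- levels(factor(time_of_day))
default_levels

# Explicit levels fix the order that tables and plot axes will use
ordered_tod <- factor(time_of_day, levels = c("Morning", "Afternoon", "Night"))

levels(ordered_tod)
table(ordered_tod)
```

Applying the same factor() call to the time_of_day column before the ggplot() call should put the bars in Morning, Afternoon, Night order.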

Table 1

murders_per_year <- shooting_data_clean %>% 
  filter(statistical_murder_flag == TRUE) %>%
  group_by(Year)
murders_summary <- murders_per_year %>% 
  count(Year,statistical_murder_flag)

murders_summary <- murders_summary %>% rename(total = n)

murders_summary %>% select(1,3) %>% kable(caption = "Gun Murders per Year, NYC: 2006-2024")

Gun Murders per Year, NYC: 2006-2024
Year	total
2006	378
2007	243
2008	240
2009	234
2010	216
2011	221
2012	169
2013	143
2014	159
2015	186
2016	147
2017	129
2018	147
2019	122
2020	214
2021	274
2022	243
2023	188
2024	175

This table gives us a rundown of how many shootings resulted in murder each year from 2006 to 2024, among incidents with a recorded perpetrator age group.

Graph 2

ggplot(murders_summary, aes(x = Year, y = total, group = 1)) +
  geom_line(color = 'red', linewidth = 1) +
  labs(title = "Gun Murders in NYC per Year",
       x = "Year",
       y = "Murders by Gun") +
  theme(
    plot.title = element_text(size = 20, family = "serif", face = "bold")
  )

This graph shows us the trend in how many shootings resulted in murder each year from 2006 to 2024. The line falls from 378 in 2006 to a low of 122 in 2019, then rises to 274 in 2021.

The average number of gun murders in NYC over this period is about 201.5 per year (3,828 murders across 19 years).
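The yearly average can be recomputed from the values printed in Table 1. This base R sketch copies the totals by hand, so it is a check on the arithmetic rather than a replacement for computing the mean from murders_summary.

```r
# Yearly totals copied from Table 1
murders <- c(
  `2006` = 378, `2007` = 243, `2008` = 240, `2009` = 234, `2010` = 216,
  `2011` = 221, `2012` = 169, `2013` = 143, `2014` = 159, `2015` = 186,
  `2016` = 147, `2017` = 129, `2018` = 147, `2019` = 122, `2020` = 214,
  `2021` = 274, `2022` = 243, `2023` = 188, `2024` = 175
)

total_murders <- sum(murders)    # 3828
mean_per_year <- mean(murders)   # total divided by 19 years

round(mean_per_year, 1)

# Years with the fewest and the most gun murders in the table
names(murders)[which.min(murders)]
names(murders)[which.max(murders)]
```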

Reflection

I could see this workflow helping me in my thesis research because it seems to be a good tool for creating a coherent and comprehensive document to look back on and follow. Additionally, it seems easy to share my thought process/analyses with my mentor, and an accessible way for her to collaborate on my code if need be. While I still have to figure out the best way to utilize this kind of workflow, I can definitely see it having benefits to keep everything organized for myself and my data, and for reproducibility purposes.

NYC Shooting Data