Introduction

The NYC OpenData NYPD Shooting Incident (Historic) dataset provides comprehensive data on shooting incidents recorded by the New York Police Department from 2006 to the prior year. It details incidents where firearms were discharged and injuries occurred, including: date/time, location (borough, precinct, coordinates), victim/perpetrator demographics (age, sex, race), injury severity (fatal/non-fatal), crime classification, and contextual factors (e.g. gang-related, domestic violence). This dataset empowers analysts and policymakers to analyze gun violence trends and inform data-driven public safety strategies.

Research Questions

This report aims to answer the following questions:

  1. What is the overall trend in shooting incidents?
  2. Which boroughs and neighborhoods experience the highest number of shootings?
  3. During what times do most shootings occur?
  4. What insights can be gained about the victims in these incidents?
  5. Is the number of fatalities projected to increase or decrease over the next decade?
  6. What is the probability of a shooting resulting in a fatality?

Environment Setup

To begin, let’s import the various packages that we will need to conduct our analysis, and set a seed value so that our models are reproducible.

# Import required libraries
library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(hms)
library(caret)
library(sf)
library(viridis)
library(knitr)
library(wordcloud)
library(forecast)
library(fastDummies)
library(randomForest)

# Set seed for reproducibility
set.seed(123)

Load Data

Next, download the latest version of the dataset from the NYC Open Data API and read it into R as a data frame.

# Load dataset into a dataframe
shootings <- read_csv("https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD")

Inspect Data

Before performing any analysis, we need to understand the structure, data types, and sample values of the dataset. This is a critical early step in any data science project.

# Inspect dataframe
glimpse(shootings)
## Rows: 28,562
## Columns: 21
## $ INCIDENT_KEY            <dbl> 231974218, 177934247, 255028563, 25384540, 726…
## $ OCCUR_DATE              <chr> "08/09/2021", "04/07/2018", "12/02/2022", "11/…
## $ OCCUR_TIME              <time> 01:06:00, 19:48:00, 22:57:00, 01:50:00, 01:58…
## $ BORO                    <chr> "BRONX", "BROOKLYN", "BRONX", "BROOKLYN", "BRO…
## $ LOC_OF_OCCUR_DESC       <chr> NA, NA, "OUTSIDE", NA, NA, NA, NA, NA, NA, NA,…
## $ PRECINCT                <dbl> 40, 79, 47, 66, 46, 42, 71, 69, 75, 69, 40, 42…
## $ JURISDICTION_CODE       <dbl> 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0…
## $ LOC_CLASSFCTN_DESC      <chr> NA, NA, "STREET", NA, NA, NA, NA, NA, NA, NA, …
## $ LOCATION_DESC           <chr> NA, NA, "GROCERY/BODEGA", "PVT HOUSE", "MULTI …
## $ STATISTICAL_MURDER_FLAG <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, F…
## $ PERP_AGE_GROUP          <chr> NA, "25-44", "(null)", "UNKNOWN", "25-44", "18…
## $ PERP_SEX                <chr> NA, "M", "(null)", "U", "M", "M", NA, NA, "M",…
## $ PERP_RACE               <chr> NA, "WHITE HISPANIC", "(null)", "UNKNOWN", "BL…
## $ VIC_AGE_GROUP           <chr> "18-24", "25-44", "25-44", "18-24", "<18", "18…
## $ VIC_SEX                 <chr> "M", "M", "M", "M", "F", "M", "M", "M", "M", "…
## $ VIC_RACE                <chr> "BLACK", "BLACK", "BLACK", "BLACK", "BLACK", "…
## $ X_COORD_CD              <dbl> 1006343.0, 1000082.9, 1020691.0, 985107.3, 100…
## $ Y_COORD_CD              <dbl> 234270.0, 189064.7, 257125.0, 173349.8, 247502…
## $ Latitude                <dbl> 40.80967, 40.68561, 40.87235, 40.64249, 40.845…
## $ Longitude               <dbl> -73.92019, -73.94291, -73.86823, -73.99691, -7…
## $ Lon_Lat                 <chr> "POINT (-73.92019278899994 40.80967347200004)"…

The dataset is comprised of 28,562 observations of 21 variables. A quick review of the glimpse() output reveals several issues that need to be addressed - there appear to be a large number of missing values, which could skew our analysis, and some of the data types need to be modified to make features easier to work with.

Data Cleaning & Transformation

Cleaning the dataset is critical for ensuring the integrity and reliability of our analysis. First, we should check to see if there are any duplicate records in the dataset that should be removed. Let’s filter out duplicates based on the INCIDENT_KEY values in this feature are all meant to be unique.

# Remove duplicate records
shootings_clean <- shootings %>%
  distinct(INCIDENT_KEY, .keep_all = TRUE)

# Calculate number of duplicate records removed
num_duplicates <- nrow(shootings) - nrow(shootings_clean)
cat("Number of duplicate records removed:", num_duplicates, "\n")
## Number of duplicate records removed: 6168

In addition to removing duplicates, we need to handle all of the missing values and other anomalies. For example, there are values in the PERP_AGE_GROUP and VIC_AGE_GROUP columns that appear to be data entry typos. Since we cannot assume the intended values we will treat these anomalies like all other pieces of missing information.

# Remove age group anomalies
shootings_clean <- shootings_clean %>%
  mutate(
    PERP_AGE_GROUP = case_when(
      PERP_AGE_GROUP %in% c("1020", "1028", "224", "940") ~ NA_character_,
      TRUE ~ PERP_AGE_GROUP
    ),
    VIC_AGE_GROUP = case_when(
      VIC_AGE_GROUP %in% c("1022") ~ NA_character_,
      TRUE ~ VIC_AGE_GROUP
    )
  )

# Standardize missing values as NA
shootings_clean <- shootings_clean %>%
  mutate(across(where(is.character), ~ na_if(., "(null)"))) %>%
  mutate(across(where(is.character), ~ na_if(., "UNKNOWN")))

# Replace "U" factors in VIC_SEX and PERP_SEX with NA
shootings_clean <- shootings_clean %>%
  mutate(
    PERP_SEX = na_if(PERP_SEX, "U"),
    VIC_SEX = na_if(VIC_SEX, "U")
  )

I also want to make a few simple string manipulations to optimize the readability of our visualizations.

# Rename Latitude and Longitude to uppercase for consistency
shootings_clean <- shootings_clean %>%
  rename(
    LATITUDE = Latitude,
    LONGITUDE = Longitude
  )

# Simplify racial group names for improved presentation
race_recode <- c(
  "WHITE HISPANIC" = "HISPANIC",
  "BLACK HISPANIC" = "HISPANIC",
  "ASIAN / PACIFIC ISLANDER" = "AAPI",
  "AMERICAN INDIAN/ALASKAN NATIVE" = "NATIVE AMERICAN"
)

# Apply changes to both PERP_RACE and VIC_RACE
shootings_clean <- shootings_clean %>%
  mutate(
    PERP_RACE = recode(PERP_RACE, !!!race_recode),
    VIC_RACE = recode(VIC_RACE, !!!race_recode)
  )

At this point, no further string manipulation is required so we can convert the chr features to more suitable data types.

# Convert character features
shootings_clean <- shootings_clean %>%
  mutate(OCCUR_DATE = mdy(OCCUR_DATE),
         BORO = as.factor(BORO),
         PERP_AGE_GROUP = as.factor(PERP_AGE_GROUP),
         PERP_SEX = as.factor(PERP_SEX),
         PERP_RACE = as.factor(PERP_RACE),
         VIC_AGE_GROUP = as.factor(VIC_AGE_GROUP),
         VIC_SEX = as.factor(VIC_SEX),
         VIC_RACE = as.factor(VIC_RACE)
  )

# Rename STATISTICAL_MURDER_FLAG to FATAL and convert to factor
shootings_clean <- shootings_clean %>%
  rename(FATAL = STATISTICAL_MURDER_FLAG) %>%
  mutate(FATAL = factor(FATAL, levels = c(FALSE, TRUE))
  )

I also want to create some new features.

# Use OCCUR_TIME to create a new feature called HOUR
shootings_clean <- shootings_clean %>%
  mutate(HOUR = (hour(OCCUR_TIME)))

# Use OCCUR_DATE to YEAR and MONTH features
shootings_clean <- shootings_clean %>%
  mutate(YEAR = factor(year(OCCUR_DATE)),
         MONTH = factor(month(OCCUR_DATE, label = TRUE)),
         DAY = factor(wday(OCCUR_DATE, label = TRUE, abbr = TRUE))
  )

Finally, we can remove the features that will not play a role in this report moving forward. Note, I am removing all features related to the perpetrators due to an overwhelming number of NA values in the dataset. Chances are, this data is missing because most of the perpetrators were never apprehended, but this is only an assumption and I have no evidence to back up this claim.

# Delete unnecessary features
shootings_clean <- shootings_clean %>%
  select(-c(LOC_OF_OCCUR_DESC, JURISDICTION_CODE, LOC_CLASSFCTN_DESC, LOCATION_DESC, 
            X_COORD_CD, Y_COORD_CD, Lon_Lat, PERP_AGE_GROUP, PERP_RACE, PERP_SEX)
  )

Now, if we call summary() we can see that the data has been successfully modified. There are a very small number of NA values remaining in features related to victims but they should not impede our analysis and we can always remove them later.

# Inspect modified dataframe
summary(shootings_clean)
##   INCIDENT_KEY         OCCUR_DATE          OCCUR_TIME      
##  Min.   :  9953245   Min.   :2006-01-01   Length:22394     
##  1st Qu.: 66170757   1st Qu.:2009-09-26   Class1:hms       
##  Median : 93164820   Median :2013-10-18   Class2:difftime  
##  Mean   :127667067   Mean   :2014-06-16   Mode  :numeric   
##  3rd Qu.:201852795   3rd Qu.:2019-08-31                    
##  Max.   :279758069   Max.   :2023-12-29                    
##                                                            
##             BORO         PRECINCT        FATAL       VIC_AGE_GROUP VIC_SEX     
##  BRONX        :6335   Min.   :  1.00   FALSE:18495   <18  : 2170   F   : 1756  
##  BROOKLYN     :9144   1st Qu.: 44.00   TRUE : 3899   18-24: 8145   M   :20630  
##  MANHATTAN    :2909   Median : 69.00                 25-44:10385   NA's:    8  
##  QUEENS       :3360   Mean   : 65.95                 45-64: 1497               
##  STATEN ISLAND: 646   3rd Qu.: 81.00                 65+  :  154               
##                       Max.   :123.00                 NA's :   43               
##                                                                                
##             VIC_RACE        LATITUDE       LONGITUDE           HOUR      
##  AAPI           :  318   Min.   :40.51   Min.   :-74.25   Min.   : 0.00  
##  BLACK          :16296   1st Qu.:40.67   1st Qu.:-73.94   1st Qu.: 3.00  
##  HISPANIC       : 5156   Median :40.70   Median :-73.92   Median :15.00  
##  NATIVE AMERICAN:    8   Mean   :40.74   Mean   :-73.91   Mean   :12.27  
##  WHITE          :  565   3rd Qu.:40.82   3rd Qu.:-73.88   3rd Qu.:20.00  
##  NA's           :   51   Max.   :40.91   Max.   :-73.70   Max.   :23.00  
##                          NA's   :43      NA's   :43                      
##       YEAR           MONTH       DAY      
##  2006   : 1566   Jul    :2612   Sun:4388  
##  2021   : 1562   Aug    :2548   Mon:3121  
##  2020   : 1531   Jun    :2314   Tue:2660  
##  2008   : 1520   Sep    :2092   Wed:2511  
##  2011   : 1509   May    :2055   Thu:2499  
##  2010   : 1473   Oct    :1867   Fri:3015  
##  (Other):13233   (Other):8906   Sat:4200

Visualize Data

Question 1: What is the overall trend of shooting incidents?

The total number of shootings shows a general decline from 2006 to 2017 followed by a significant increase in 2020 and 2021, before dropping again. The recent surge coincides with the impact of the COVID-19 pandemic, social unrest, and changes in policing strategies. In just three years, the lowest (754 in 2018) and highest (1,562 in 2021) amounts of shootings were recorded.

# Visualize number of shooting incidents by year
shootings_clean %>%
  ggplot(aes(x = YEAR, fill = FATAL)) +
  geom_bar(position = "stack") +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "lightcoral")) +
  labs(title = "Shooting Incidents by Year",
       x = "Years",
       y = "Total Number of Shootings",
       fill = "Fatal") +
  geom_text(stat = 'count', aes(label = ..count..), vjust = 1.5, 
            col = 'white', size = 3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Although the total number of shootings per year fluctuates, it is interesting to note that the proportion of fatal incidents is more or less consistent. We can visualize these trends using regression analysis:

# Summarize data by year for total shootings
annual_shootings <- shootings_clean %>%
  group_by(YEAR) %>%
  summarise(total_count = n())

# Summarize data by year for fatal shootings
annual_fatal_shootings <- shootings_clean %>%
  filter(FATAL == TRUE) %>%
  group_by(YEAR) %>%
  summarise(fatal_count = n())

# Summarize data by year for non-fatal shootings
annual_non_fatal_shootings <- shootings_clean %>%
  filter(FATAL == FALSE) %>%
  group_by(YEAR) %>%
  summarise(non_fatal_count = n())

# Convert YEAR to numeric for plotting
annual_shootings$YEAR <- as.numeric(as.character(annual_shootings$YEAR))
annual_fatal_shootings$YEAR <- as.numeric(as.character(annual_fatal_shootings$YEAR))
annual_non_fatal_shootings$YEAR <- as.numeric(as.character(annual_non_fatal_shootings$YEAR))

# Fit linear models
model_total <- lm(total_count ~ YEAR, data = annual_shootings)
model_fatal <- lm(fatal_count ~ YEAR, data = annual_fatal_shootings)
model_non_fatal <- lm(non_fatal_count ~ YEAR, data = annual_non_fatal_shootings)

# Combine datasets for plotting
combined_data <- bind_rows(
  annual_shootings %>% rename(count = total_count) %>% mutate(type = "Total Shootings"),
  annual_fatal_shootings %>% rename(count = fatal_count) %>% mutate(type = "Fatal Shootings"),
  annual_non_fatal_shootings %>% rename(count = non_fatal_count) %>% mutate(type = "Non-Fatal Shootings")
)

# Plot data with regression lines
ggplot(combined_data, aes(x = YEAR, y = count, color = type, fill = type)) +
  geom_line(size = 1, linetype = "dashed") +
  geom_smooth(method = "lm", se = TRUE, alpha = 0.3) +
  labs(title = "Trends in Total, Fatal, and Non-Fatal Shootings",
       x = "Year",
       y = "Number of Shootings",
       color = "Type of Shooting",
       fill = "Type of Shooting") +
  scale_color_manual(values = c("Total Shootings" = "darkred", "Fatal Shootings" = "darkblue", "Non-Fatal Shootings" = "darkgreen")) +
  scale_fill_manual(values = c("Total Shootings" = "lightcoral", "Fatal Shootings" = "lightblue", "Non-Fatal Shootings" = "lightgreen")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Print summary of total shootings model
summary(model_total)
## 
## Call:
## lm(formula = total_count ~ YEAR, data = annual_shootings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -398.59 -158.51   16.92  109.90  487.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 53920.48   23772.26   2.268   0.0375 *
## YEAR          -26.15      11.80  -2.216   0.0415 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 259.7 on 16 degrees of freedom
## Multiple R-squared:  0.2348, Adjusted R-squared:  0.187 
## F-statistic:  4.91 on 1 and 16 DF,  p-value: 0.04155
# Print summary of fatal shootings model
summary(model_fatal)
## 
## Call:
## lm(formula = fatal_count ~ YEAR, data = annual_fatal_shootings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.852 -48.111   7.144  38.249  99.162 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 11303.637   4830.449   2.340   0.0326 *
## YEAR           -5.504      2.398  -2.295   0.0356 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.78 on 16 degrees of freedom
## Multiple R-squared:  0.2477, Adjusted R-squared:  0.2007 
## F-statistic: 5.268 on 1 and 16 DF,  p-value: 0.03558
# Print summary of non-fatal shootings model
summary(model_non_fatal)
## 
## Call:
## lm(formula = non_fatal_count ~ YEAR, data = annual_non_fatal_shootings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -341.24 -105.39    4.31   88.20  388.69 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 42616.842  19385.695   2.198   0.0430 *
## YEAR          -20.645      9.623  -2.145   0.0476 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 211.8 on 16 degrees of freedom
## Multiple R-squared:  0.2234, Adjusted R-squared:  0.1749 
## F-statistic: 4.603 on 1 and 16 DF,  p-value: 0.04761

The regression analysis confirms a statistically significant decrease in shootings, with total incidents reducing by approximately 26 per year. While fatal shootings also show a downward trend, their rate of decline is smaller, at roughly five incidents per year. This suggests that non-fatal shootings are the primary driver of the overall decreasing trend, as they are declining at a rate of around 21 per year.

Question 2: What boroughs and neighborhoods had the most shootings?

Brooklyn recorded the most shooting incidents overall (9,144), by some measure, followed by The Bronx (6,335). Combined, these two boroughs account for roughly two-thirds of all reported cases with Queens (3,360) and Manhattan (2,909) reporting relatively similar numbers. Staten Island (646) recorded the fewest incidents, with less than 1,000 - this is in stark contrast to the almost 10,000 shootings reported in Brooklyn.

# Visualize shootings by borough
shootings_clean %>%
  ggplot(aes(x = BORO, fill = BORO)) +
  geom_bar() +
  geom_text(stat = 'count', aes(label = ..count..), 
            vjust = 1.5, col = 'white', size = 3) +
  labs(title = "Shooting Incidents by Borough",
       fill = "Borough",
       x = "NYC Boroughs",
       y = "Total Number of Shootings")

Interestingly, if we look specifically at the number of fatal shootings in each borough, Staten Island actually has the highest proportion of murder cases, if only by a very small margin:

# Calculate number of fatal and non-fatal shootings in each borough
shooting_counts <- shootings_clean %>%
  group_by(BORO, FATAL) %>%
  summarise(count = n()) %>%
  spread(FATAL, count, fill = 0) %>%
  rename(fatal_count = `TRUE`, non_fatal_count = `FALSE`)

# Calculate total shootings and proportions
shooting_proportions <- shooting_counts %>%
  mutate(total = fatal_count + non_fatal_count,
         fatal_proportion = fatal_count / total,
         non_fatal_proportion = non_fatal_count / total)

# Select and display relevant columns
shooting_proportions %>%
  select(BORO, fatal_count, non_fatal_count, fatal_proportion, non_fatal_proportion)
## # A tibble: 5 × 5
## # Groups:   BORO [5]
##   BORO         fatal_count non_fatal_count fatal_proportion non_fatal_proportion
##   <fct>              <dbl>           <dbl>            <dbl>                <dbl>
## 1 BRONX               1064            5271            0.168                0.832
## 2 BROOKLYN            1634            7510            0.179                0.821
## 3 MANHATTAN            467            2442            0.161                0.839
## 4 QUEENS               605            2755            0.180                0.820
## 5 STATEN ISLA…         129             517            0.200                0.800

We can also inspect the PRECINCT feature to drill a little deeper and identify the specific neighborhoods that recorded the most shooting incidents. To do this we can pull in the latest census tracts from NYC OpenData and perform spatial analysis.

# Load census tract data and convert to spatial data frame
census_tract <- read_csv("https://data.cityofnewyork.us/api/views/63ge-mke6/rows.csv?accessType=DOWNLOAD")
census_tract_sf <- st_as_sf(census_tract, wkt = "the_geom", crs = 4326)

# Identify top ten precincts with most shootings
top_10_precincts <- shootings_clean %>%
  group_by(PRECINCT, BORO) %>%
  summarise(shooting_count = n(), .groups = 'drop') %>%
  arrange(desc(shooting_count)) %>%
  slice_head(n = 10)

# Print top ten precincts
print(top_10_precincts)
## # A tibble: 10 × 3
##    PRECINCT BORO     shooting_count
##       <dbl> <fct>             <int>
##  1       75 BROOKLYN           1285
##  2       73 BROOKLYN           1215
##  3       67 BROOKLYN           1037
##  4       79 BROOKLYN            828
##  5       44 BRONX               793
##  6       47 BRONX               786
##  7       40 BRONX               767
##  8       46 BRONX               708
##  9       77 BROOKLYN            695
## 10       81 BROOKLYN            671
# Filter shootings data to include only incidents in top ten precincts
top_10_shootings <- shootings_clean %>%
  filter(PRECINCT %in% top_10_precincts$PRECINCT) %>%
  filter(!is.na(LONGITUDE) & !is.na(LATITUDE))

# Convert filtered shootings data to spatial data frame
top_10_shootings_sf <- st_as_sf(top_10_shootings, coords = c("LONGITUDE", "LATITUDE"), 
                                crs = 4326, agr = "constant")

# Plot top ten precincts with most shootings on a map
ggplot() +
  geom_sf(data = census_tract_sf, fill = "lightgrey", color = "white") +
  geom_sf(data = top_10_shootings_sf, color = "steelblue", size = 1, alpha = 0.4) +
  labs(title = "Top Ten Precincts with Most Shootings in NYC") +
  theme_minimal()

By looking more closely at the data we can see that almost 40% of the reported shootings in Brooklyn happened in just four precincts, and this proportion jumps to almost 50% for The Bronx. This shows that incidents are highly concentrated within specific neighborhoods.

Question 3: When did most shootings take place?

Shooting incidents exhibit clear temporal patterns, peaking during summer months (June, July, and August) and declining in winter. This seasonal trend may be influenced by factors like weather, holidays, and social activities.

# Group dataset by month and summarize number of incidents
monthly_incidents <- shootings_clean %>%
  group_by(MONTH) %>%
  summarise(INCIDENTS = n()) %>%
  mutate(Month_Num = as.numeric(MONTH))

# Fit quadratic model to number of incidents by month
quadratic_model_month <- lm(INCIDENTS ~ poly(Month_Num, 2), data = monthly_incidents)
summary(quadratic_model_month)
## 
## Call:
## lm(formula = INCIDENTS ~ poly(Month_Num, 2), data = monthly_incidents)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -269.63 -235.06  -59.81  167.54  427.18 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1866.17      81.77  22.822 2.83e-09 ***
## poly(Month_Num, 2)1   587.63     283.27   2.074  0.06787 .  
## poly(Month_Num, 2)2 -1155.92     283.27  -4.081  0.00276 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 283.3 on 9 degrees of freedom
## Multiple R-squared:  0.6996, Adjusted R-squared:  0.6328 
## F-statistic: 10.48 on 2 and 9 DF,  p-value: 0.004466
# Plot distribution of shootings by month
ggplot(monthly_incidents, aes(x = Month_Num, y = INCIDENTS)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "darkred") +
  scale_x_continuous(breaks = 1:12, labels = levels(monthly_incidents$MONTH)) +
  theme_minimal() +
  labs(
    title = "Distribution of Shootings by Month",
    x = "Month",
    y = "Number of Incidents")

Shootings are most frequent during late night and early morning hours, particularly around 11pm to 1am, with a decrease from 5:00 to 8:00, followed by a gradual increase throughout the day.

# Group dataset by hour day and summarize number of incidents
hourly_incidents <- shootings_clean %>%
  group_by(HOUR) %>%
  summarize(INCIDENTS = n())

# Plot number of shooting incidents by time of day
ggplot(hourly_incidents, aes(x = HOUR, y = INCIDENTS)) +
  geom_line(size = 1.2, color = "steelblue", alpha = 0.4) +
  geom_point(fill = "white", size = 2, stroke = 1.5, shape = 21) +
  labs(title = "Shooting Incidents by Time of Day",
       x = "Hour",
       y = "Total Number of Shootings") +
  scale_x_continuous(
    breaks = 0:23,
    labels = function(x) sprintf("%02d:00", x)
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Incidents typically peak on weekends, with a noticeable dip midweek, especially on Wednesday and Thursday.

# Group dataset by day of the week and summarize the number of incidents
daily_incidents <- shootings_clean %>%
  group_by(DAY) %>%
  summarise(Incidents = n()) %>%
  mutate(Day_Num = as.numeric(DAY))

# Fit a quadratic model to number of incidents by day
quadratic_model <- lm(Incidents ~ poly(Day_Num, 2), data = daily_incidents)

# Plot distribution of shootings by day of the week
ggplot(daily_incidents, aes(x = Day_Num, y = Incidents)) +
  geom_bar(stat = "identity", fill = "lightgreen", alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "purple") +
  scale_x_continuous(breaks = 1:7, labels = levels(daily_incidents$DAY)) +
  theme_minimal() +
  labs(
    title = "Distribution of Shootings by Day",
    x = "Day of the Week",
    y = "Number of Incidents")

In summary, shootings are most likely in summer, late at night, and on weekends. These patterns suggest a link to increased outdoor activities, nightlife, and social dynamics. Understanding these trends can help law enforcement and community programs focus resources and preventive measures during peak times to effectively reduce shootings.

Question 4: What can we learn about the people killed in these incidents?

The dataset includes demographic features that allow us to explore the age, race, and sex dynamics of the shooting incidents. So with this in mind, let’s build up a profile of a “typical” murder victim.

# Filter dataset for murder cases
murder_cases <- shootings_clean %>%
  filter(FATAL == "TRUE")

# Create summary table of victim profiles
victim_profile_table <- murder_cases %>%
  group_by(VIC_AGE_GROUP, VIC_SEX, VIC_RACE) %>%
  summarise(Incident_Count = n(), .groups = 'drop') %>%  
  arrange(desc(Incident_Count))

# Print victim profile table to view summary
print(victim_profile_table)
## # A tibble: 40 × 4
##    VIC_AGE_GROUP VIC_SEX VIC_RACE Incident_Count
##    <fct>         <fct>   <fct>             <int>
##  1 25-44         M       BLACK              1454
##  2 18-24         M       BLACK               815
##  3 25-44         M       HISPANIC            390
##  4 18-24         M       HISPANIC            264
##  5 45-64         M       BLACK               196
##  6 <18           M       BLACK               155
##  7 25-44         F       BLACK                88
##  8 45-64         M       HISPANIC             57
##  9 25-44         M       WHITE                50
## 10 <18           M       HISPANIC             43
## # ℹ 30 more rows

The data clearly indicates that most murder victims are young Black men aged between 18 and 44. This can also be visualized as a simple word cloud.

# Combine victim characteristics into a single vector
victim_characteristics <- c(murder_cases$VIC_AGE_GROUP, murder_cases$VIC_SEX, 
                            murder_cases$VIC_RACE)

# Create frequency table of combined victim characteristics
characteristic_freq <- table(victim_characteristics)

# Generate word cloud to visualize the frequency of victim characteristics
wordcloud(names(characteristic_freq), 
          freq = characteristic_freq, 
          scale = c(3, 0.5), 
          min.freq = 1, 
          max.words = 100, 
          colors = brewer.pal(8, "Dark2"), 
          random.order = FALSE)

Question 5: Is the number of fatalities expected to increase or decrease over the next ten years?

We know exactly when each shooting took place over an 18 year period. With so much temporal data at our fingerprints, we can use time series forecasting to predict the expected trajectory of fatal shootings. This type of analysis is critical for public safety planning and law enforcement, and could help to shape intervention strategies and policies.

# Group data by month and summarize number of fatal shootings
murder_cases <- shootings_clean %>%
  group_by(YearMonth = floor_date(OCCUR_DATE, "month")) %>%
  summarize(Murders = sum(FATAL == "TRUE", na.rm = TRUE)) %>%
  ungroup()

# Plot monthly fatal shootings with a trend line
ggplot(murder_cases, aes(x = YearMonth, y = Murders)) +
  geom_line(color = "steelblue", size = 1) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "plum", 
              se = FALSE) +
  labs(title = "Monthly Fatal Shootings",
       x = "Year-Month",
       y = "Number of Fatal Shootings") +
  theme_minimal() +
  scale_x_date(date_labels = "%Y-%m", date_breaks = "1 year") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Convert summarized data into a time series object
murders_ts <- ts(murder_cases$Murders, 
                 start = c(year(min(murder_cases$YearMonth)), 
                           month(min(murder_cases$YearMonth))), 
                 frequency = 12)

# Decompose time series to analyze its components
ddata <- decompose(murders_ts, "multiplicative")
plot(ddata)

# Fit model to time series data
model <- auto.arima(murders_ts)
print(model)
## Series: murders_ts 
## ARIMA(5,1,2)(2,0,0)[12] 
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ar5      ma1     ma2    sar1
##       1.0558  -0.1573  -0.2178  -0.050  -0.0776  -1.6767  0.7848  0.0777
## s.e.  0.1140   0.1059   0.1013   0.101   0.0869   0.0937  0.0817  0.0799
##         sar2
##       0.0322
## s.e.  0.0857
## 
## sigma^2 = 31.65:  log likelihood = -672.91
## AIC=1365.81   AICc=1366.89   BIC=1399.52
# Plot residuals of model to check for patterns
plot.ts(model$residuals)

# Forecast next 10 years with 95% confidence intervals
forecasted <- forecast(model, level = c(95), h = 10*12)
plot(forecasted, main = "10-Year Forecast of Fatal Shootings in NYC")

# Perform Ljung-Box test on residuals to check for autocorrelation
Box.test(model$resid, lag = 5, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  model$resid
## X-squared = 2.6073, df = 5, p-value = 0.7602
Box.test(model$resid, lag = 10, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  model$resid
## X-squared = 6.6264, df = 10, p-value = 0.7602
Box.test(model$resid, lag = 15, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  model$resid
## X-squared = 9.601, df = 15, p-value = 0.8441

The Autoregressive integrated moving average (ARIMA) model forecasts that the number of shooting-related fatalities in NYC is expected to remain stable or slightly decrease over the next decade. The model’s coefficients and fit, including a sigma^2 of 31.65 and a log likelihood of -672.91, indicate a well-fitted model, with AIC, AICc, and BIC values supporting its quality. Residual analysis through the Box-Ljung test shows p-values above 0.76, suggesting that the residuals are not significantly different from white noise, indicating effective pattern capture.

The forecasted trajectory, with stable central forecast lines and confidence intervals, suggests a low likelihood of a significant increase in fatalities. Although the confidence levels widen, reflecting long-term prediction uncertainty, the overall outlook remains positive. This stability provides valuable insights for strategic planning and resource allocation by the NYPD, emphasizing the importance of maintaining current efforts to ensure that the projected stability is realized. In summary, no significant increase in fatalities is anticipated, offering a positive outlook for future trends.

Question 6: What is the likelihood that a shooting is fatal?

To address this question we can employ statistical models to predict the fatality of shooting incidents in New York City. By considering various demographic and temporal features, such as borough, age group, sex, race, time of occurrence, and year, we can try to assess their impact on the likelihood of an incident being fatal. A quick baseline analysis reveals that approximately 21% of all shooting incidents resulted in a fatality.

# Display breakdown of fatal shootings
table(shootings_clean$FATAL)
## 
## FALSE  TRUE 
## 18495  3899

Initially, a Generalized Linear Model (GLM) was used to predict fatal shootings. The GLM achieved a high accuracy with perfect sensitivity, meaning it correctly identified all “FALSE” cases. However, this high sensitivity came at the cost of specificity, which was extremely low, indicating a significant bias towards predicting the majority class (“FALSE”). The low Kappa statistic suggests that the model’s predictions were only slightly better than random choice, highlighting its inability to effectively capture the minority class. This underscored the need for more advanced models that can better handle class imbalance and capture complex relationships.

# Filter out unnecessary features
dataset <- shootings_clean %>%
  select(-c(INCIDENT_KEY, LATITUDE, LONGITUDE, HOUR, HOUR, YEAR, MONTH, DAY)) %>%
  na.omit()

# Split data into training and testing sets
train_indices <- createDataPartition(y = dataset$FATAL, p = 0.8, list = FALSE)
train_data <- dataset[train_indices, ]
test_data <- dataset[-train_indices, ]

# Fit Generalized Linear Model
model <- train(FATAL ~., data = train_data, method = "glm", 
               trControl = trainControl(method = "cv", number = 3), 
               family = "binomial")

# Display model summary
summary(model)
## 
## Call:
## NULL
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.476e+00  2.922e-01  -5.049 4.43e-07 ***
## OCCUR_DATE                -2.424e-05  1.027e-05  -2.360   0.0183 *  
## OCCUR_TIME                 7.911e-07  6.593e-07   1.200   0.2302    
## BOROBROOKLYN               1.625e-01  1.121e-01   1.449   0.1474    
## BOROMANHATTAN             -8.052e-02  9.541e-02  -0.844   0.3987    
## BOROQUEENS                 2.385e-01  2.209e-01   1.080   0.2801    
## `BOROSTATEN ISLAND`        4.658e-01  2.822e-01   1.651   0.0988 .  
## PRECINCT                  -2.483e-03  3.379e-03  -0.735   0.4624    
## `VIC_AGE_GROUP18-24`       3.919e-01  8.598e-02   4.558 5.18e-06 ***
## `VIC_AGE_GROUP25-44`       7.878e-01  8.346e-02   9.440  < 2e-16 ***
## `VIC_AGE_GROUP45-64`       9.361e-01  1.050e-01   8.912  < 2e-16 ***
## `VIC_AGE_GROUP65+`         1.560e+00  2.038e-01   7.656 1.92e-14 ***
## VIC_SEXM                   4.238e-02  7.492e-02   0.566   0.5716    
## VIC_RACEBLACK             -3.185e-01  1.509e-01  -2.111   0.0348 *  
## VIC_RACEHISPANIC          -3.551e-01  1.557e-01  -2.281   0.0226 *  
## `VIC_RACENATIVE AMERICAN` -1.121e+01  1.439e+02  -0.078   0.9379    
## VIC_RACEWHITE              2.763e-02  1.854e-01   0.149   0.8815    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16519  on 17849  degrees of freedom
## Residual deviance: 16282  on 17833  degrees of freedom
## AIC: 16316
## 
## Number of Fisher Scoring iterations: 11
# Make predictions on test set
preds <- predict(model, newdata = test_data)

# Evaluate model
confusion_matrix <- confusionMatrix(preds, test_data$FATAL)
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  3684  778
##      TRUE      0    0
##                                           
##                Accuracy : 0.8256          
##                  95% CI : (0.8142, 0.8367)
##     No Information Rate : 0.8256          
##     P-Value [Acc > NIR] : 0.5096          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8256          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8256          
##          Detection Rate : 0.8256          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : FALSE           
## 
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix$table)) / sum(confusion_matrix$table)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.83"

Subsequently, a Random Forest model was employed, offering a more balanced approach. The Random Forest model had a lower accuracy compared to the GLM, with a much lower sensitivity, indicating moderate effectiveness in identifying “FALSE” cases. Its specificity was considerably higher than the GLM, showing improved ability to identify “TRUE” cases, though still not optimal. The Kappa statistic was slightly higher, indicating better agreement between predicted and actual classifications. The Random Forest model provided a more balanced prediction between the two classes, though there was still room for improvement.

# Filter out unnecessary features
dataset <- shootings_clean %>%
  select(-c(INCIDENT_KEY, LATITUDE, LONGITUDE, OCCUR_DATE, OCCUR_TIME)) %>%
  na.omit()

# Convert FATAL to a factor with correct levels
dataset$FATAL <- factor(dataset$FATAL, levels = c("FALSE", "TRUE"))

# Check class distribution
class_distribution <- table(dataset$FATAL)

# One-hot encode categorical variables
dataset_numeric <- dummy_cols(dataset, 
                              select_columns = c("BORO", "PRECINCT", "VIC_AGE_GROUP", 
                                                 "VIC_SEX", "VIC_RACE", "HOUR",
                                                 "YEAR", "MONTH", "DAY"), 
                              remove_first_dummy = TRUE, 
                              remove_selected_columns = TRUE)

# Handle missing values by removing rows with missing values
dataset_numeric <- na.omit(dataset_numeric)

# Undersample majority class
majority_class <- dataset_numeric[dataset_numeric$FATAL == "FALSE", ]
minority_class <- dataset_numeric[dataset_numeric$FATAL == "TRUE", ]

# Combine undersampled majority class with minority class
if (nrow(minority_class) > 0) {
  majority_class_sample <- majority_class[sample(nrow(majority_class), 
                                                 nrow(minority_class)), ]
  dataset_balanced <- rbind(majority_class_sample, minority_class)
} else {
  stop("Minority class is empty or has very few records.")
}

# Ensure dataset is not empty
if (nrow(dataset_balanced) == 0) {
  stop("Balanced dataset is empty.")
}

# Split data into training and testing sets
train_indices <- createDataPartition(y = dataset_balanced$FATAL, 
                                     p = 0.8, list = FALSE)
train_data <- dataset_balanced[train_indices, ]
test_data <- dataset_balanced[-train_indices, ]

# Proceed with model training if no missing values
if (sum(is.na(train_data)) == 0 && sum(is.na(test_data)) == 0) {
  tune_grid <- expand.grid(.mtry = c(2, 3, 4))
  rf_tuned_model <- train(FATAL ~ ., data = train_data, method = "rf", 
                          trControl = trainControl(method = "cv", number = 3), 
                          tuneGrid = tune_grid)
  rf_tuned_preds <- predict(rf_tuned_model, newdata = test_data)
  rf_tuned_confusion_matrix <- confusionMatrix(rf_tuned_preds, test_data$FATAL)
  print(rf_tuned_confusion_matrix)
} else {
  stop("Missing values detected in train or test data.")
}
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   422  340
##      TRUE    356  438
##                                           
##                Accuracy : 0.5527          
##                  95% CI : (0.5276, 0.5776)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.769e-05       
##                                           
##                   Kappa : 0.1054          
##                                           
##  Mcnemar's Test P-Value : 0.5696          
##                                           
##             Sensitivity : 0.5424          
##             Specificity : 0.5630          
##          Pos Pred Value : 0.5538          
##          Neg Pred Value : 0.5516          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2712          
##    Detection Prevalence : 0.4897          
##       Balanced Accuracy : 0.5527          
##                                           
##        'Positive' Class : FALSE           
## 

In conclusion, while the Random Forest model offers a more balanced approach than the GLM, its overall performance suggests that further tuning, feature engineering, or exploration of different algorithms could enhance its effectiveness. At this point, we cannot confidently answer the question regarding the likelihood of a shooting being fatal due to the models’ sub-optimal performance. However, this analysis has certainly highlighted the importance of using advanced models that can handle class imbalance and capture complex interactions effectively, suggesting that further experimentation is advisable.

Conclusions

In this report, I have attempted to provide a comprehensive analysis of the NYPD Shooting Incident Data (Historic) dataset. Through my analysis, I have been able to draw the following conclusions:

  1. Geographic Disparities: The data reveals significant geographic disparities in shooting incidents across New York City. Brooklyn and The Bronx are the most affected boroughs, accounting for the majority of incidents. This suggests a need for targeted interventions in these areas to address the underlying causes of gun violence.
  2. Temporal Trends: There is a clear temporal pattern in shooting incidents, with a general decline from 2006 to 2017, followed by a sharp increase from 2020 to 2022. This recent surge may be linked to external factors such as the COVID-19 pandemic and social unrest, highlighting the importance of understanding and addressing these influences. The data shows that the number of shootings is trending downwards overall, and this is being driven primarily by a decrease in the number of non-fatal incidents.
  3. Seasonal & Daily Patterns: Shooting incidents peak during the summer months and late-night hours, indicating potential correlations with increased social activities and gatherings, as well as the “cover” of darkness. This information can be used to inform policing strategies and community outreach efforts during high-risk periods.
  4. Demographic Insights: Black males aged 18-44 are statistically the most likely people to be the victims of gun violence. These findings underscore the need for community-based interventions that address the specific needs and challenges faced by this demographic group.
  5. Data Limitations: A significant amount of missing data, particularly regarding perpetrator information, poses challenges to the analysis. Efforts to improve data collection and reporting practices are essential for more accurate and comprehensive insights.
  6. Statistical Modeling: This report employed both time series forecasting and predictive modeling to assess shooting incidents. The time series analysis forecasts that the number of shooting-related fatalities is expected to remain stable or slightly decrease over the next decade. However, I was unable to reliably predict if a shooting is likely to be fatal or not. It appears that far more advanced modeling is required to generate meaningful predictions.

Overall, this analysis highlights the complex interplay of geographic, temporal, and demographic factors in gun violence in New York City. It underscores the importance of data-driven approaches to inform policy decisions and intervention strategies aimed at reducing shooting incidents and improving public safety.

Addressing Potential Biases

There are many potential biases that could impact our analysis of this dataset:

Addressing these biases requires careful consideration of the dataset’s limitations, potential confounding factors, and the context in which the data was collected and recorded.

Session Info

# Display R session information
sessionInfo()
## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] randomForest_4.7-1.2 fastDummies_1.7.5    forecast_8.23.0     
##  [4] wordcloud_2.6        RColorBrewer_1.1-3   knitr_1.50          
##  [7] viridis_0.6.5        viridisLite_0.4.2    sf_1.0-20           
## [10] caret_7.0-1          lattice_0.22-6       hms_1.1.3           
## [13] lubridate_1.9.4      forcats_1.0.0        stringr_1.5.1       
## [16] dplyr_1.1.4          purrr_1.0.4          readr_2.1.5         
## [19] tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.1       
## [22] tidyverse_2.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3            pROC_1.18.5          gridExtra_2.3       
##  [4] rlang_1.1.5          magrittr_2.0.3       tseries_0.10-58     
##  [7] e1071_1.7-16         compiler_4.4.3       mgcv_1.9-1          
## [10] vctrs_0.6.5          reshape2_1.4.4       quadprog_1.5-8      
## [13] crayon_1.5.3         pkgconfig_2.0.3      fastmap_1.2.0       
## [16] labeling_0.4.3       utf8_1.2.4           rmarkdown_2.29      
## [19] prodlim_2024.06.25   tzdb_0.5.0           bit_4.6.0           
## [22] xfun_0.51            cachem_1.1.0         jsonlite_2.0.0      
## [25] recipes_1.2.1        parallel_4.4.3       R6_2.6.1            
## [28] bslib_0.9.0          stringi_1.8.7        parallelly_1.43.0   
## [31] rpart_4.1.24         lmtest_0.9-40        jquerylib_0.1.4     
## [34] Rcpp_1.0.14          iterators_1.0.14     future.apply_1.11.3 
## [37] zoo_1.8-13           Matrix_1.7-2         splines_4.4.3       
## [40] nnet_7.3-20          timechange_0.3.0     tidyselect_1.2.1    
## [43] rstudioapi_0.17.1    yaml_2.3.10          timeDate_4041.110   
## [46] codetools_0.2-20     curl_6.2.2           listenv_0.9.1       
## [49] plyr_1.8.9           quantmod_0.4.26      withr_3.0.2         
## [52] urca_1.3-4           evaluate_1.0.3       future_1.34.0       
## [55] survival_3.8-3       units_0.8-7          proxy_0.4-27        
## [58] xts_0.14.1           pillar_1.10.1        KernSmooth_2.23-26  
## [61] foreach_1.5.2        stats4_4.4.3         generics_0.1.3      
## [64] TTR_0.24.4           vroom_1.6.5          munsell_0.5.1       
## [67] scales_1.3.0         globals_0.16.3       class_7.3-23        
## [70] glue_1.8.0           tools_4.4.3          data.table_1.17.0   
## [73] ModelMetrics_1.2.2.2 gower_1.0.2          grid_4.4.3          
## [76] ipred_0.9-15         colorspace_2.1-1     nlme_3.1-167        
## [79] fracdiff_1.5-3       cli_3.6.4            lava_1.8.1          
## [82] gtable_0.3.6         sass_0.4.9           digest_0.6.37       
## [85] classInt_0.4-11      farver_2.1.2         htmltools_0.5.8.1   
## [88] lifecycle_1.0.4      hardhat_1.4.1        bit64_4.6.0-1       
## [91] MASS_7.3-64