The NYC OpenData NYPD Shooting Incident Data (Historic) dataset provides comprehensive data on shooting incidents recorded by the New York Police Department from 2006 through the end of the previous calendar year. It details incidents in which firearms were discharged and injuries occurred, including: date/time, location (borough, precinct, coordinates), victim/perpetrator demographics (age, sex, race), injury severity (fatal/non-fatal), crime classification, and contextual factors (e.g. gang-related, domestic violence). This dataset empowers analysts and policymakers to analyze gun violence trends and inform data-driven public safety strategies.
This report aims to answer the following questions:
- How has the number of shooting incidents in New York City changed over time?
- Where are shootings most likely to occur?
- When are shootings most likely to occur?
- Who are the typical victims of fatal shootings?
- What trajectory can we expect fatal shootings to follow in the future?
- Can we predict whether a given shooting incident will be fatal?
To begin, let’s import the various packages that we will need to conduct our analysis, and set a seed value so that our models are reproducible.
# Import required libraries
library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(hms)
library(caret)
library(sf)
library(viridis)
library(knitr)
library(wordcloud)
library(forecast)
library(fastDummies)
library(randomForest)
# Set seed for reproducibility
set.seed(123)
Next, download the latest version of the dataset from the NYC Open Data API and read it into R as a data frame.
# Load dataset into a dataframe
shootings <- read_csv("https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD")
Before performing any analysis, we need to understand the structure, data types, and sample values of the dataset. This is a critical early step in any data science project.
# Inspect dataframe
glimpse(shootings)
## Rows: 28,562
## Columns: 21
## $ INCIDENT_KEY <dbl> 231974218, 177934247, 255028563, 25384540, 726…
## $ OCCUR_DATE <chr> "08/09/2021", "04/07/2018", "12/02/2022", "11/…
## $ OCCUR_TIME <time> 01:06:00, 19:48:00, 22:57:00, 01:50:00, 01:58…
## $ BORO <chr> "BRONX", "BROOKLYN", "BRONX", "BROOKLYN", "BRO…
## $ LOC_OF_OCCUR_DESC <chr> NA, NA, "OUTSIDE", NA, NA, NA, NA, NA, NA, NA,…
## $ PRECINCT <dbl> 40, 79, 47, 66, 46, 42, 71, 69, 75, 69, 40, 42…
## $ JURISDICTION_CODE <dbl> 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0…
## $ LOC_CLASSFCTN_DESC <chr> NA, NA, "STREET", NA, NA, NA, NA, NA, NA, NA, …
## $ LOCATION_DESC <chr> NA, NA, "GROCERY/BODEGA", "PVT HOUSE", "MULTI …
## $ STATISTICAL_MURDER_FLAG <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, F…
## $ PERP_AGE_GROUP <chr> NA, "25-44", "(null)", "UNKNOWN", "25-44", "18…
## $ PERP_SEX <chr> NA, "M", "(null)", "U", "M", "M", NA, NA, "M",…
## $ PERP_RACE <chr> NA, "WHITE HISPANIC", "(null)", "UNKNOWN", "BL…
## $ VIC_AGE_GROUP <chr> "18-24", "25-44", "25-44", "18-24", "<18", "18…
## $ VIC_SEX <chr> "M", "M", "M", "M", "F", "M", "M", "M", "M", "…
## $ VIC_RACE <chr> "BLACK", "BLACK", "BLACK", "BLACK", "BLACK", "…
## $ X_COORD_CD <dbl> 1006343.0, 1000082.9, 1020691.0, 985107.3, 100…
## $ Y_COORD_CD <dbl> 234270.0, 189064.7, 257125.0, 173349.8, 247502…
## $ Latitude <dbl> 40.80967, 40.68561, 40.87235, 40.64249, 40.845…
## $ Longitude <dbl> -73.92019, -73.94291, -73.86823, -73.99691, -7…
## $ Lon_Lat <chr> "POINT (-73.92019278899994 40.80967347200004)"…
The dataset comprises 28,562 observations of 21 variables. A quick review of the glimpse() output reveals several issues that need to be addressed: there appear to be a large number of missing values, which could skew our analysis, and some of the data types need to be modified to make the features easier to work with.
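For instance, a quick tally of the missing values in each column gives a sense of the scale of the problem (a minimal check against the raw shootings data frame):
# Count missing values in each column
colSums(is.na(shootings))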
Cleaning the dataset is critical for ensuring the integrity and reliability of our analysis. First, we should check whether there are any duplicate records in the dataset that should be removed. Let's filter out duplicates based on INCIDENT_KEY, since the values in this feature are all meant to be unique.
# Remove duplicate records
shootings_clean <- shootings %>%
distinct(INCIDENT_KEY, .keep_all = TRUE)
# Calculate number of duplicate records removed
num_duplicates <- nrow(shootings) - nrow(shootings_clean)
cat("Number of duplicate records removed:", num_duplicates, "\n")
## Number of duplicate records removed: 6168
In addition to removing duplicates, we need to handle all of the missing values and other anomalies. For example, there are values in the PERP_AGE_GROUP and VIC_AGE_GROUP columns that appear to be data entry typos. Since we cannot assume the intended values, we will treat these anomalies like all other pieces of missing information.
# Remove age group anomalies
shootings_clean <- shootings_clean %>%
mutate(
PERP_AGE_GROUP = case_when(
PERP_AGE_GROUP %in% c("1020", "1028", "224", "940") ~ NA_character_,
TRUE ~ PERP_AGE_GROUP
),
VIC_AGE_GROUP = case_when(
VIC_AGE_GROUP %in% c("1022") ~ NA_character_,
TRUE ~ VIC_AGE_GROUP
)
)
# Standardize missing values as NA
shootings_clean <- shootings_clean %>%
mutate(across(where(is.character), ~ na_if(., "(null)"))) %>%
mutate(across(where(is.character), ~ na_if(., "UNKNOWN")))
# Replace "U" factors in VIC_SEX and PERP_SEX with NA
shootings_clean <- shootings_clean %>%
mutate(
PERP_SEX = na_if(PERP_SEX, "U"),
VIC_SEX = na_if(VIC_SEX, "U")
)
I also want to make a few simple string manipulations to improve the readability of our visualizations.
# Rename Latitude and Longitude to uppercase for consistency
shootings_clean <- shootings_clean %>%
rename(
LATITUDE = Latitude,
LONGITUDE = Longitude
)
# Simplify racial group names for improved presentation
race_recode <- c(
"WHITE HISPANIC" = "HISPANIC",
"BLACK HISPANIC" = "HISPANIC",
"ASIAN / PACIFIC ISLANDER" = "AAPI",
"AMERICAN INDIAN/ALASKAN NATIVE" = "NATIVE AMERICAN"
)
# Apply changes to both PERP_RACE and VIC_RACE
shootings_clean <- shootings_clean %>%
mutate(
PERP_RACE = recode(PERP_RACE, !!!race_recode),
VIC_RACE = recode(VIC_RACE, !!!race_recode)
)
At this point, no further string manipulation is required, so we can convert the chr features to more suitable data types.
# Convert character features
shootings_clean <- shootings_clean %>%
mutate(OCCUR_DATE = mdy(OCCUR_DATE),
BORO = as.factor(BORO),
PERP_AGE_GROUP = as.factor(PERP_AGE_GROUP),
PERP_SEX = as.factor(PERP_SEX),
PERP_RACE = as.factor(PERP_RACE),
VIC_AGE_GROUP = as.factor(VIC_AGE_GROUP),
VIC_SEX = as.factor(VIC_SEX),
VIC_RACE = as.factor(VIC_RACE)
)
# Rename STATISTICAL_MURDER_FLAG to FATAL and convert to factor
shootings_clean <- shootings_clean %>%
rename(FATAL = STATISTICAL_MURDER_FLAG) %>%
mutate(FATAL = factor(FATAL, levels = c(FALSE, TRUE))
)
I also want to create some new features.
# Use OCCUR_TIME to create a new feature called HOUR
shootings_clean <- shootings_clean %>%
mutate(HOUR = hour(OCCUR_TIME))
# Use OCCUR_DATE to create YEAR, MONTH, and DAY features
shootings_clean <- shootings_clean %>%
mutate(YEAR = factor(year(OCCUR_DATE)),
MONTH = factor(month(OCCUR_DATE, label = TRUE)),
DAY = factor(wday(OCCUR_DATE, label = TRUE, abbr = TRUE))
)
Finally, we can remove the features that will not play a role in this report moving forward. Note that I am removing all features related to the perpetrators due to the overwhelming number of NA values in those columns. Chances are, this data is missing because most perpetrators were never apprehended, but this is only an assumption and I have no evidence to back up the claim.
# Delete unnecessary features
shootings_clean <- shootings_clean %>%
select(-c(LOC_OF_OCCUR_DESC, JURISDICTION_CODE, LOC_CLASSFCTN_DESC, LOCATION_DESC,
X_COORD_CD, Y_COORD_CD, Lon_Lat, PERP_AGE_GROUP, PERP_RACE, PERP_SEX)
)
Now, if we call summary(), we can see that the data has been successfully modified. A very small number of NA values remain in features related to victims, but they should not impede our analysis and we can always remove them later.
# Inspect modified dataframe
summary(shootings_clean)
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME
## Min. : 9953245 Min. :2006-01-01 Length:22394
## 1st Qu.: 66170757 1st Qu.:2009-09-26 Class1:hms
## Median : 93164820 Median :2013-10-18 Class2:difftime
## Mean :127667067 Mean :2014-06-16 Mode :numeric
## 3rd Qu.:201852795 3rd Qu.:2019-08-31
## Max. :279758069 Max. :2023-12-29
##
## BORO PRECINCT FATAL VIC_AGE_GROUP VIC_SEX
## BRONX :6335 Min. : 1.00 FALSE:18495 <18 : 2170 F : 1756
## BROOKLYN :9144 1st Qu.: 44.00 TRUE : 3899 18-24: 8145 M :20630
## MANHATTAN :2909 Median : 69.00 25-44:10385 NA's: 8
## QUEENS :3360 Mean : 65.95 45-64: 1497
## STATEN ISLAND: 646 3rd Qu.: 81.00 65+ : 154
## Max. :123.00 NA's : 43
##
## VIC_RACE LATITUDE LONGITUDE HOUR
## AAPI : 318 Min. :40.51 Min. :-74.25 Min. : 0.00
## BLACK :16296 1st Qu.:40.67 1st Qu.:-73.94 1st Qu.: 3.00
## HISPANIC : 5156 Median :40.70 Median :-73.92 Median :15.00
## NATIVE AMERICAN: 8 Mean :40.74 Mean :-73.91 Mean :12.27
## WHITE : 565 3rd Qu.:40.82 3rd Qu.:-73.88 3rd Qu.:20.00
## NA's : 51 Max. :40.91 Max. :-73.70 Max. :23.00
## NA's :43 NA's :43
## YEAR MONTH DAY
## 2006 : 1566 Jul :2612 Sun:4388
## 2021 : 1562 Aug :2548 Mon:3121
## 2020 : 1531 Jun :2314 Tue:2660
## 2008 : 1520 Sep :2092 Wed:2511
## 2011 : 1509 May :2055 Thu:2499
## 2010 : 1473 Oct :1867 Fri:3015
## (Other):13233 (Other):8906 Sat:4200
The total number of shootings shows a general decline from 2006 through the late 2010s, followed by a significant increase in 2020 and 2021 before dropping again. The recent surge coincides with the impact of the COVID-19 pandemic, social unrest, and changes in policing strategies. In the span of just three years, the record low (754 in 2018) and a near-record high (1,562 in 2021, just shy of 2006's 1,566) were both recorded.
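We can verify this swing directly by tabulating the annual totals for the years in question (a quick check against the cleaned data):
# Tabulate annual totals for the recent swing years
shootings_clean %>%
  count(YEAR, name = "incidents") %>%
  filter(YEAR %in% c("2018", "2019", "2020", "2021"))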
# Visualize number of shooting incidents by year
shootings_clean %>%
ggplot(aes(x = YEAR, fill = FATAL)) +
geom_bar(position = "stack") +
scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "lightcoral")) +
labs(title = "Shooting Incidents by Year",
x = "Years",
y = "Total Number of Shootings",
fill = "Fatal") +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = 1.5,
col = 'white', size = 3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Although the total number of shootings per year fluctuates, it is interesting to note that the proportion of fatal incidents is more or less consistent. We can visualize these trends using regression analysis:
# Summarize data by year for total shootings
annual_shootings <- shootings_clean %>%
group_by(YEAR) %>%
summarise(total_count = n())
# Summarize data by year for fatal shootings
annual_fatal_shootings <- shootings_clean %>%
filter(FATAL == TRUE) %>%
group_by(YEAR) %>%
summarise(fatal_count = n())
# Summarize data by year for non-fatal shootings
annual_non_fatal_shootings <- shootings_clean %>%
filter(FATAL == FALSE) %>%
group_by(YEAR) %>%
summarise(non_fatal_count = n())
# Convert YEAR to numeric for plotting
annual_shootings$YEAR <- as.numeric(as.character(annual_shootings$YEAR))
annual_fatal_shootings$YEAR <- as.numeric(as.character(annual_fatal_shootings$YEAR))
annual_non_fatal_shootings$YEAR <- as.numeric(as.character(annual_non_fatal_shootings$YEAR))
# Fit linear models
model_total <- lm(total_count ~ YEAR, data = annual_shootings)
model_fatal <- lm(fatal_count ~ YEAR, data = annual_fatal_shootings)
model_non_fatal <- lm(non_fatal_count ~ YEAR, data = annual_non_fatal_shootings)
# Combine datasets for plotting
combined_data <- bind_rows(
annual_shootings %>% rename(count = total_count) %>% mutate(type = "Total Shootings"),
annual_fatal_shootings %>% rename(count = fatal_count) %>% mutate(type = "Fatal Shootings"),
annual_non_fatal_shootings %>% rename(count = non_fatal_count) %>% mutate(type = "Non-Fatal Shootings")
)
# Plot data with regression lines
ggplot(combined_data, aes(x = YEAR, y = count, color = type, fill = type)) +
geom_line(size = 1, linetype = "dashed") +
geom_smooth(method = "lm", se = TRUE, alpha = 0.3) +
labs(title = "Trends in Total, Fatal, and Non-Fatal Shootings",
x = "Year",
y = "Number of Shootings",
color = "Type of Shooting",
fill = "Type of Shooting") +
scale_color_manual(values = c("Total Shootings" = "darkred", "Fatal Shootings" = "darkblue", "Non-Fatal Shootings" = "darkgreen")) +
scale_fill_manual(values = c("Total Shootings" = "lightcoral", "Fatal Shootings" = "lightblue", "Non-Fatal Shootings" = "lightgreen")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
# Print summary of total shootings model
summary(model_total)
##
## Call:
## lm(formula = total_count ~ YEAR, data = annual_shootings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -398.59 -158.51 16.92 109.90 487.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53920.48 23772.26 2.268 0.0375 *
## YEAR -26.15 11.80 -2.216 0.0415 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 259.7 on 16 degrees of freedom
## Multiple R-squared: 0.2348, Adjusted R-squared: 0.187
## F-statistic: 4.91 on 1 and 16 DF, p-value: 0.04155
# Print summary of fatal shootings model
summary(model_fatal)
##
## Call:
## lm(formula = fatal_count ~ YEAR, data = annual_fatal_shootings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.852 -48.111 7.144 38.249 99.162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11303.637 4830.449 2.340 0.0326 *
## YEAR -5.504 2.398 -2.295 0.0356 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.78 on 16 degrees of freedom
## Multiple R-squared: 0.2477, Adjusted R-squared: 0.2007
## F-statistic: 5.268 on 1 and 16 DF, p-value: 0.03558
# Print summary of non-fatal shootings model
summary(model_non_fatal)
##
## Call:
## lm(formula = non_fatal_count ~ YEAR, data = annual_non_fatal_shootings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -341.24 -105.39 4.31 88.20 388.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42616.842 19385.695 2.198 0.0430 *
## YEAR -20.645 9.623 -2.145 0.0476 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 211.8 on 16 degrees of freedom
## Multiple R-squared: 0.2234, Adjusted R-squared: 0.1749
## F-statistic: 4.603 on 1 and 16 DF, p-value: 0.04761
The regression analysis confirms a statistically significant decrease in shootings, with total incidents declining by approximately 26 per year. While fatal shootings also show a downward trend, their rate of decline is smaller, at roughly five incidents per year. This suggests that non-fatal shootings, which are declining at a rate of around 21 per year, are the primary driver of the overall decreasing trend.
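These annual-change estimates, along with their 95% confidence intervals, can be pulled straight from the fitted models (a minimal sketch reusing the three lm objects above):
# Extract YEAR slopes and 95% confidence intervals from each model
rbind(
  Total       = c(coef(model_total)["YEAR"], confint(model_total)["YEAR", ]),
  Fatal       = c(coef(model_fatal)["YEAR"], confint(model_fatal)["YEAR", ]),
  `Non-Fatal` = c(coef(model_non_fatal)["YEAR"], confint(model_non_fatal)["YEAR", ])
)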
Brooklyn recorded the most shooting incidents overall (9,144) by some margin, followed by The Bronx (6,335). Combined, these two boroughs account for roughly two-thirds of all reported cases, while Queens (3,360) and Manhattan (2,909) report relatively similar numbers. Staten Island (646) recorded by far the fewest incidents, a stark contrast to the more than 9,000 shootings reported in Brooklyn.
# Visualize shootings by borough
shootings_clean %>%
ggplot(aes(x = BORO, fill = BORO)) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count)),
vjust = 1.5, col = 'white', size = 3) +
labs(title = "Shooting Incidents by Borough",
fill = "Borough",
x = "NYC Boroughs",
y = "Total Number of Shootings")
Interestingly, if we look specifically at the number of fatal shootings in each borough, Staten Island actually has the highest proportion of murder cases, if only by a very small margin:
# Calculate number of fatal and non-fatal shootings in each borough
shooting_counts <- shootings_clean %>%
group_by(BORO, FATAL) %>%
summarise(count = n()) %>%
spread(FATAL, count, fill = 0) %>%
rename(fatal_count = `TRUE`, non_fatal_count = `FALSE`)
# Calculate total shootings and proportions
shooting_proportions <- shooting_counts %>%
mutate(total = fatal_count + non_fatal_count,
fatal_proportion = fatal_count / total,
non_fatal_proportion = non_fatal_count / total)
# Select and display relevant columns
shooting_proportions %>%
select(BORO, fatal_count, non_fatal_count, fatal_proportion, non_fatal_proportion)
## # A tibble: 5 × 5
## # Groups: BORO [5]
## BORO fatal_count non_fatal_count fatal_proportion non_fatal_proportion
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 BRONX 1064 5271 0.168 0.832
## 2 BROOKLYN 1634 7510 0.179 0.821
## 3 MANHATTAN 467 2442 0.161 0.839
## 4 QUEENS 605 2755 0.180 0.820
## 5 STATEN ISLA… 129 517 0.200 0.800
We can also inspect the PRECINCT feature to drill a little deeper and identify the specific neighborhoods that recorded the most shooting incidents. To do this, we can pull in the latest census tracts from NYC OpenData and perform a spatial analysis.
# Load census tract data and convert to spatial data frame
census_tract <- read_csv("https://data.cityofnewyork.us/api/views/63ge-mke6/rows.csv?accessType=DOWNLOAD")
census_tract_sf <- st_as_sf(census_tract, wkt = "the_geom", crs = 4326)
# Identify top ten precincts with most shootings
top_10_precincts <- shootings_clean %>%
group_by(PRECINCT, BORO) %>%
summarise(shooting_count = n(), .groups = 'drop') %>%
arrange(desc(shooting_count)) %>%
slice_head(n = 10)
# Print top ten precincts
print(top_10_precincts)
## # A tibble: 10 × 3
## PRECINCT BORO shooting_count
## <dbl> <fct> <int>
## 1 75 BROOKLYN 1285
## 2 73 BROOKLYN 1215
## 3 67 BROOKLYN 1037
## 4 79 BROOKLYN 828
## 5 44 BRONX 793
## 6 47 BRONX 786
## 7 40 BRONX 767
## 8 46 BRONX 708
## 9 77 BROOKLYN 695
## 10 81 BROOKLYN 671
# Filter shootings data to include only incidents in top ten precincts
top_10_shootings <- shootings_clean %>%
filter(PRECINCT %in% top_10_precincts$PRECINCT) %>%
filter(!is.na(LONGITUDE) & !is.na(LATITUDE))
# Convert filtered shootings data to spatial data frame
top_10_shootings_sf <- st_as_sf(top_10_shootings, coords = c("LONGITUDE", "LATITUDE"),
crs = 4326, agr = "constant")
# Plot top ten precincts with most shootings on a map
ggplot() +
geom_sf(data = census_tract_sf, fill = "lightgrey", color = "white") +
geom_sf(data = top_10_shootings_sf, color = "steelblue", size = 1, alpha = 0.4) +
labs(title = "Top Ten Precincts with Most Shootings in NYC") +
theme_minimal()
Looking more closely at the data, we can see that the four busiest Brooklyn precincts account for nearly half of all reported shootings in that borough (4,365 of 9,144), and the four Bronx precincts in the top ten account for a nearly identical share (3,054 of 6,335). This shows that incidents are highly concentrated within specific neighborhoods.
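These concentration figures can be computed directly; the sketch below tallies the share of each borough's shootings that occurred in its four busiest precincts.
# Share of each borough's shootings occurring in its four busiest precincts
shootings_clean %>%
  count(BORO, PRECINCT, name = "shooting_count") %>%
  group_by(BORO) %>%
  slice_max(shooting_count, n = 4) %>%
  summarise(top_four_total = sum(shooting_count)) %>%
  inner_join(count(shootings_clean, BORO, name = "borough_total"), by = "BORO") %>%
  mutate(share = top_four_total / borough_total)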
Shooting incidents exhibit clear temporal patterns, peaking during summer months (June, July, and August) and declining in winter. This seasonal trend may be influenced by factors like weather, holidays, and social activities.
# Group dataset by month and summarize number of incidents
monthly_incidents <- shootings_clean %>%
group_by(MONTH) %>%
summarise(INCIDENTS = n()) %>%
mutate(Month_Num = as.numeric(MONTH))
# Fit quadratic model to number of incidents by month
quadratic_model_month <- lm(INCIDENTS ~ poly(Month_Num, 2), data = monthly_incidents)
summary(quadratic_model_month)
##
## Call:
## lm(formula = INCIDENTS ~ poly(Month_Num, 2), data = monthly_incidents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -269.63 -235.06 -59.81 167.54 427.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1866.17 81.77 22.822 2.83e-09 ***
## poly(Month_Num, 2)1 587.63 283.27 2.074 0.06787 .
## poly(Month_Num, 2)2 -1155.92 283.27 -4.081 0.00276 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 283.3 on 9 degrees of freedom
## Multiple R-squared: 0.6996, Adjusted R-squared: 0.6328
## F-statistic: 10.48 on 2 and 9 DF, p-value: 0.004466
# Plot distribution of shootings by month
ggplot(monthly_incidents, aes(x = Month_Num, y = INCIDENTS)) +
geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "darkred") +
scale_x_continuous(breaks = 1:12, labels = levels(monthly_incidents$MONTH)) +
theme_minimal() +
labs(
title = "Distribution of Shootings by Month",
x = "Month",
y = "Number of Incidents")
Shootings are most frequent during the late night and early morning hours, particularly between 11pm and 1am. Incidents bottom out in the early morning, roughly between 5:00 and 8:00, then climb gradually throughout the day.
# Group dataset by hour of day and summarize number of incidents
hourly_incidents <- shootings_clean %>%
group_by(HOUR) %>%
summarize(INCIDENTS = n())
# Plot number of shooting incidents by time of day
ggplot(hourly_incidents, aes(x = HOUR, y = INCIDENTS)) +
geom_line(size = 1.2, color = "steelblue", alpha = 0.4) +
geom_point(fill = "white", size = 2, stroke = 1.5, shape = 21) +
labs(title = "Shooting Incidents by Time of Day",
x = "Hour",
y = "Total Number of Shootings") +
scale_x_continuous(
breaks = 0:23,
labels = function(x) sprintf("%02d:00", x)
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Incidents typically peak on weekends, with a noticeable dip midweek, especially on Wednesday and Thursday.
# Group dataset by day of the week and summarize the number of incidents
daily_incidents <- shootings_clean %>%
group_by(DAY) %>%
summarise(Incidents = n()) %>%
mutate(Day_Num = as.numeric(DAY))
# Fit a quadratic model to number of incidents by day
quadratic_model <- lm(Incidents ~ poly(Day_Num, 2), data = daily_incidents)
# Plot distribution of shootings by day of the week
ggplot(daily_incidents, aes(x = Day_Num, y = Incidents)) +
geom_bar(stat = "identity", fill = "lightgreen", alpha = 0.7) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "purple") +
scale_x_continuous(breaks = 1:7, labels = levels(daily_incidents$DAY)) +
theme_minimal() +
labs(
title = "Distribution of Shootings by Day",
x = "Day of the Week",
y = "Number of Incidents")
In summary, shootings are most likely in summer, late at night, and on weekends. These patterns suggest a link to increased outdoor activities, nightlife, and social dynamics. Understanding these trends can help law enforcement and community programs focus resources and preventive measures during peak times to effectively reduce shootings.
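One way to view these temporal effects together is a day-by-hour heatmap; a minimal sketch using the HOUR and DAY features created earlier:
# Heatmap of shooting incidents by day of week and hour of day
shootings_clean %>%
  count(DAY, HOUR) %>%
  ggplot(aes(x = HOUR, y = DAY, fill = n)) +
  geom_tile() +
  scale_fill_viridis(name = "Incidents") +
  labs(title = "Shooting Incidents by Day and Hour",
       x = "Hour of Day", y = "Day of Week") +
  theme_minimal()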
The dataset includes demographic features that allow us to explore the age, race, and sex dynamics of the shooting incidents. So with this in mind, let’s build up a profile of a “typical” murder victim.
# Filter dataset for murder cases
murder_cases <- shootings_clean %>%
filter(FATAL == "TRUE")
# Create summary table of victim profiles
victim_profile_table <- murder_cases %>%
group_by(VIC_AGE_GROUP, VIC_SEX, VIC_RACE) %>%
summarise(Incident_Count = n(), .groups = 'drop') %>%
arrange(desc(Incident_Count))
# Print victim profile table to view summary
print(victim_profile_table)
## # A tibble: 40 × 4
## VIC_AGE_GROUP VIC_SEX VIC_RACE Incident_Count
## <fct> <fct> <fct> <int>
## 1 25-44 M BLACK 1454
## 2 18-24 M BLACK 815
## 3 25-44 M HISPANIC 390
## 4 18-24 M HISPANIC 264
## 5 45-64 M BLACK 196
## 6 <18 M BLACK 155
## 7 25-44 F BLACK 88
## 8 45-64 M HISPANIC 57
## 9 25-44 M WHITE 50
## 10 <18 M HISPANIC 43
## # ℹ 30 more rows
The data clearly indicates that most murder victims are young Black men aged between 18 and 44. This can also be visualized as a simple word cloud.
# Combine victim characteristics into a single vector
victim_characteristics <- c(murder_cases$VIC_AGE_GROUP, murder_cases$VIC_SEX,
murder_cases$VIC_RACE)
# Create frequency table of combined victim characteristics
characteristic_freq <- table(victim_characteristics)
# Generate word cloud to visualize the frequency of victim characteristics
wordcloud(names(characteristic_freq),
freq = characteristic_freq,
scale = c(3, 0.5),
min.freq = 1,
max.words = 100,
colors = brewer.pal(8, "Dark2"),
random.order = FALSE)
We know exactly when each shooting took place over an 18-year period. With so much temporal data at our fingertips, we can use time series forecasting to predict the expected trajectory of fatal shootings. This type of analysis is critical for public safety planning and law enforcement, and could help to shape intervention strategies and policies.
# Group data by month and summarize number of fatal shootings
murder_cases <- shootings_clean %>%
group_by(YearMonth = floor_date(OCCUR_DATE, "month")) %>%
summarize(Murders = sum(FATAL == "TRUE", na.rm = TRUE)) %>%
ungroup()
# Plot monthly fatal shootings with a trend line
ggplot(murder_cases, aes(x = YearMonth, y = Murders)) +
geom_line(color = "steelblue", size = 1) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "plum",
se = FALSE) +
labs(title = "Monthly Fatal Shootings",
x = "Year-Month",
y = "Number of Fatal Shootings") +
theme_minimal() +
scale_x_date(date_labels = "%Y-%m", date_breaks = "1 year") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Convert summarized data into a time series object
murders_ts <- ts(murder_cases$Murders,
start = c(year(min(murder_cases$YearMonth)),
month(min(murder_cases$YearMonth))),
frequency = 12)
# Decompose time series to analyze its components
ddata <- decompose(murders_ts, "multiplicative")
plot(ddata)
# Fit model to time series data
model <- auto.arima(murders_ts)
print(model)
## Series: murders_ts
## ARIMA(5,1,2)(2,0,0)[12]
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 sar1
## 1.0558 -0.1573 -0.2178 -0.050 -0.0776 -1.6767 0.7848 0.0777
## s.e. 0.1140 0.1059 0.1013 0.101 0.0869 0.0937 0.0817 0.0799
## sar2
## 0.0322
## s.e. 0.0857
##
## sigma^2 = 31.65: log likelihood = -672.91
## AIC=1365.81 AICc=1366.89 BIC=1399.52
# Plot residuals of model to check for patterns
plot.ts(model$residuals)
# Forecast next 10 years with 95% confidence intervals
forecasted <- forecast(model, level = c(95), h = 10*12)
plot(forecasted, main = "10-Year Forecast of Fatal Shootings in NYC")
# Perform Ljung-Box test on residuals to check for autocorrelation
Box.test(model$resid, lag = 5, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: model$resid
## X-squared = 2.6073, df = 5, p-value = 0.7602
Box.test(model$resid, lag = 10, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: model$resid
## X-squared = 6.6264, df = 10, p-value = 0.7602
Box.test(model$resid, lag = 15, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: model$resid
## X-squared = 9.601, df = 15, p-value = 0.8441
The autoregressive integrated moving average (ARIMA) model forecasts that the number of shooting-related fatalities in NYC will remain stable or decrease slightly over the next decade. The model's coefficients and fit statistics (a sigma^2 of 31.65 and a log likelihood of -672.91), together with the AIC, AICc, and BIC values, indicate a reasonable fit. Residual analysis via the Ljung-Box test yields p-values above 0.76 at every lag tested, suggesting that the residuals are indistinguishable from white noise and that the model has captured the underlying patterns effectively.
The forecasted trajectory, with a stable central forecast line, suggests a low likelihood of a significant increase in fatalities. Although the confidence intervals widen over time, reflecting long-term prediction uncertainty, the overall outlook remains positive. This stability provides valuable insight for strategic planning and resource allocation by the NYPD, emphasizing the importance of maintaining current efforts so that the projected stability is realized.
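As a convenience, the forecast package bundles these diagnostics: checkresiduals() plots the residuals and their ACF and runs a Ljung-Box test in a single call, while accuracy() reports training-set error measures (a minimal sketch using the fitted model above):
# One-call residual diagnostics for the fitted ARIMA model
checkresiduals(model)
# Training-set accuracy measures (ME, RMSE, MAE, etc.)
accuracy(model)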
To address this question, we can employ statistical models to predict the fatality of shooting incidents in New York City. By considering various demographic and temporal features, such as borough, age group, sex, race, time of occurrence, and year, we can try to assess their impact on the likelihood of an incident being fatal. A quick baseline analysis reveals that approximately 17% of all shooting incidents resulted in a fatality.
# Display breakdown of fatal shootings
table(shootings_clean$FATAL)
##
## FALSE TRUE
## 18495 3899
Initially, a Generalized Linear Model (GLM) was used to predict fatal shootings. The GLM achieved high accuracy with perfect sensitivity, meaning it correctly identified all "FALSE" cases. However, this came at the cost of specificity, which was zero, indicating a total bias towards predicting the majority class ("FALSE"). The Kappa statistic of 0 shows that the model's predictions were no better than chance, highlighting its inability to capture the minority class at all. This underscores the need for models that can better handle class imbalance and capture complex relationships.
# Filter out unnecessary features
dataset <- shootings_clean %>%
select(-c(INCIDENT_KEY, LATITUDE, LONGITUDE, HOUR, YEAR, MONTH, DAY)) %>%
na.omit()
# Split data into training and testing sets
train_indices <- createDataPartition(y = dataset$FATAL, p = 0.8, list = FALSE)
train_data <- dataset[train_indices, ]
test_data <- dataset[-train_indices, ]
# Fit Generalized Linear Model
model <- train(FATAL ~., data = train_data, method = "glm",
trControl = trainControl(method = "cv", number = 3),
family = "binomial")
# Display model summary
summary(model)
##
## Call:
## NULL
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.476e+00 2.922e-01 -5.049 4.43e-07 ***
## OCCUR_DATE -2.424e-05 1.027e-05 -2.360 0.0183 *
## OCCUR_TIME 7.911e-07 6.593e-07 1.200 0.2302
## BOROBROOKLYN 1.625e-01 1.121e-01 1.449 0.1474
## BOROMANHATTAN -8.052e-02 9.541e-02 -0.844 0.3987
## BOROQUEENS 2.385e-01 2.209e-01 1.080 0.2801
## `BOROSTATEN ISLAND` 4.658e-01 2.822e-01 1.651 0.0988 .
## PRECINCT -2.483e-03 3.379e-03 -0.735 0.4624
## `VIC_AGE_GROUP18-24` 3.919e-01 8.598e-02 4.558 5.18e-06 ***
## `VIC_AGE_GROUP25-44` 7.878e-01 8.346e-02 9.440 < 2e-16 ***
## `VIC_AGE_GROUP45-64` 9.361e-01 1.050e-01 8.912 < 2e-16 ***
## `VIC_AGE_GROUP65+` 1.560e+00 2.038e-01 7.656 1.92e-14 ***
## VIC_SEXM 4.238e-02 7.492e-02 0.566 0.5716
## VIC_RACEBLACK -3.185e-01 1.509e-01 -2.111 0.0348 *
## VIC_RACEHISPANIC -3.551e-01 1.557e-01 -2.281 0.0226 *
## `VIC_RACENATIVE AMERICAN` -1.121e+01 1.439e+02 -0.078 0.9379
## VIC_RACEWHITE 2.763e-02 1.854e-01 0.149 0.8815
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16519 on 17849 degrees of freedom
## Residual deviance: 16282 on 17833 degrees of freedom
## AIC: 16316
##
## Number of Fisher Scoring iterations: 11
# Make predictions on test set
preds <- predict(model, newdata = test_data)
# Evaluate model
confusion_matrix <- confusionMatrix(preds, test_data$FATAL)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 3684 778
## TRUE 0 0
##
## Accuracy : 0.8256
## 95% CI : (0.8142, 0.8367)
## No Information Rate : 0.8256
## P-Value [Acc > NIR] : 0.5096
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8256
## Neg Pred Value : NaN
## Prevalence : 0.8256
## Detection Rate : 0.8256
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : FALSE
##
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix$table)) / sum(confusion_matrix$table)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.83"
Subsequently, a Random Forest model was employed, offering a more balanced approach. The Random Forest model had a lower accuracy compared to the GLM, with a much lower sensitivity, indicating moderate effectiveness in identifying “FALSE” cases. Its specificity was considerably higher than the GLM, showing improved ability to identify “TRUE” cases, though still not optimal. The Kappa statistic was slightly higher, indicating better agreement between predicted and actual classifications. The Random Forest model provided a more balanced prediction between the two classes, though there was still room for improvement.
# Filter out unnecessary features
dataset <- shootings_clean %>%
select(-c(INCIDENT_KEY, LATITUDE, LONGITUDE, OCCUR_DATE, OCCUR_TIME)) %>%
na.omit()
# Convert FATAL to a factor with correct levels
dataset$FATAL <- factor(dataset$FATAL, levels = c("FALSE", "TRUE"))
# Check class distribution
class_distribution <- table(dataset$FATAL)
# One-hot encode categorical variables
dataset_numeric <- dummy_cols(dataset,
select_columns = c("BORO", "PRECINCT", "VIC_AGE_GROUP",
"VIC_SEX", "VIC_RACE", "HOUR",
"YEAR", "MONTH", "DAY"),
remove_first_dummy = TRUE,
remove_selected_columns = TRUE)
# Handle missing values by removing rows with missing values
dataset_numeric <- na.omit(dataset_numeric)
# Undersample majority class
majority_class <- dataset_numeric[dataset_numeric$FATAL == "FALSE", ]
minority_class <- dataset_numeric[dataset_numeric$FATAL == "TRUE", ]
# Combine undersampled majority class with minority class
if (nrow(minority_class) > 0) {
majority_class_sample <- majority_class[sample(nrow(majority_class),
nrow(minority_class)), ]
dataset_balanced <- rbind(majority_class_sample, minority_class)
} else {
stop("Minority class is empty or has very few records.")
}
# Ensure dataset is not empty
if (nrow(dataset_balanced) == 0) {
stop("Balanced dataset is empty.")
}
# Split data into training and testing sets
train_indices <- createDataPartition(y = dataset_balanced$FATAL,
p = 0.8, list = FALSE)
train_data <- dataset_balanced[train_indices, ]
test_data <- dataset_balanced[-train_indices, ]
# Proceed with model training if no missing values
if (sum(is.na(train_data)) == 0 && sum(is.na(test_data)) == 0) {
tune_grid <- expand.grid(.mtry = c(2, 3, 4))
rf_tuned_model <- train(FATAL ~ ., data = train_data, method = "rf",
trControl = trainControl(method = "cv", number = 3),
tuneGrid = tune_grid)
rf_tuned_preds <- predict(rf_tuned_model, newdata = test_data)
rf_tuned_confusion_matrix <- confusionMatrix(rf_tuned_preds, test_data$FATAL)
print(rf_tuned_confusion_matrix)
} else {
stop("Missing values detected in train or test data.")
}
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 422 340
## TRUE 356 438
##
## Accuracy : 0.5527
## 95% CI : (0.5276, 0.5776)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 1.769e-05
##
## Kappa : 0.1054
##
## Mcnemar's Test P-Value : 0.5696
##
## Sensitivity : 0.5424
## Specificity : 0.5630
## Pos Pred Value : 0.5538
## Neg Pred Value : 0.5516
## Prevalence : 0.5000
## Detection Rate : 0.2712
## Detection Prevalence : 0.4897
## Balanced Accuracy : 0.5527
##
## 'Positive' Class : FALSE
##
In conclusion, while the Random Forest model offers a more balanced approach than the GLM, its overall performance suggests that further tuning, feature engineering, or exploration of different algorithms could enhance its effectiveness. At this point, we cannot confidently answer the question regarding the likelihood of a shooting being fatal due to the models’ sub-optimal performance. However, this analysis has certainly highlighted the importance of using advanced models that can handle class imbalance and capture complex interactions effectively, suggesting that further experimentation is advisable.
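One inexpensive next experiment would be to keep the full, imbalanced dataset and weight the minority class instead of undersampling. The sketch below uses randomForest's classwt argument; the weight of roughly 4.7 mirrors the inverse class frequency and is illustrative only:
# Hedged sketch: weight the minority class rather than undersampling
rf_weighted <- randomForest(
  x = dataset_numeric %>% select(-FATAL),
  y = dataset_numeric$FATAL,
  classwt = c("FALSE" = 1, "TRUE" = 4.7),  # ~ inverse class frequency (18495 / 3899)
  ntree = 500
)
print(rf_weighted)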
In this report, I have attempted to provide a comprehensive analysis of the NYPD Shooting Incident Data (Historic) dataset. Through my analysis, I have been able to draw the following conclusions:
- Shooting incidents declined steadily from 2006 through the late 2010s, spiked sharply in 2020 and 2021, and have since fallen again.
- Incidents are heavily concentrated geographically: Brooklyn and The Bronx account for roughly two-thirds of all shootings, and within each of these boroughs a handful of precincts account for nearly half of the borough's incidents.
- Shootings peak in the summer months, late at night, and on weekends.
- The large majority of murder victims are young Black men aged 18 to 44.
- Time series forecasting suggests fatal shootings will remain stable or decline slightly over the next decade.
- Our classification models could not reliably predict whether an individual incident would be fatal.
Overall, this analysis highlights the complex interplay of geographic, temporal, and demographic factors in gun violence in New York City. It underscores the importance of data-driven approaches to inform policy decisions and intervention strategies aimed at reducing shooting incidents and improving public safety.
There are many potential biases that could impact our analysis of this dataset, from how incidents are reported and recorded by the NYPD, to the large volume of missing perpetrator information, to my own assumptions during cleaning (for example, treating anomalous age values as missing). Addressing these biases requires careful consideration of the dataset's limitations, potential confounding factors, and the context in which the data was collected and recorded.
# Display R session information
sessionInfo()
## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] randomForest_4.7-1.2 fastDummies_1.7.5 forecast_8.23.0
## [4] wordcloud_2.6 RColorBrewer_1.1-3 knitr_1.50
## [7] viridis_0.6.5 viridisLite_0.4.2 sf_1.0-20
## [10] caret_7.0-1 lattice_0.22-6 hms_1.1.3
## [13] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
## [16] dplyr_1.1.4 purrr_1.0.4 readr_2.1.5
## [19] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
## [22] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 pROC_1.18.5 gridExtra_2.3
## [4] rlang_1.1.5 magrittr_2.0.3 tseries_0.10-58
## [7] e1071_1.7-16 compiler_4.4.3 mgcv_1.9-1
## [10] vctrs_0.6.5 reshape2_1.4.4 quadprog_1.5-8
## [13] crayon_1.5.3 pkgconfig_2.0.3 fastmap_1.2.0
## [16] labeling_0.4.3 utf8_1.2.4 rmarkdown_2.29
## [19] prodlim_2024.06.25 tzdb_0.5.0 bit_4.6.0
## [22] xfun_0.51 cachem_1.1.0 jsonlite_2.0.0
## [25] recipes_1.2.1 parallel_4.4.3 R6_2.6.1
## [28] bslib_0.9.0 stringi_1.8.7 parallelly_1.43.0
## [31] rpart_4.1.24 lmtest_0.9-40 jquerylib_0.1.4
## [34] Rcpp_1.0.14 iterators_1.0.14 future.apply_1.11.3
## [37] zoo_1.8-13 Matrix_1.7-2 splines_4.4.3
## [40] nnet_7.3-20 timechange_0.3.0 tidyselect_1.2.1
## [43] rstudioapi_0.17.1 yaml_2.3.10 timeDate_4041.110
## [46] codetools_0.2-20 curl_6.2.2 listenv_0.9.1
## [49] plyr_1.8.9 quantmod_0.4.26 withr_3.0.2
## [52] urca_1.3-4 evaluate_1.0.3 future_1.34.0
## [55] survival_3.8-3 units_0.8-7 proxy_0.4-27
## [58] xts_0.14.1 pillar_1.10.1 KernSmooth_2.23-26
## [61] foreach_1.5.2 stats4_4.4.3 generics_0.1.3
## [64] TTR_0.24.4 vroom_1.6.5 munsell_0.5.1
## [67] scales_1.3.0 globals_0.16.3 class_7.3-23
## [70] glue_1.8.0 tools_4.4.3 data.table_1.17.0
## [73] ModelMetrics_1.2.2.2 gower_1.0.2 grid_4.4.3
## [76] ipred_0.9-15 colorspace_2.1-1 nlme_3.1-167
## [79] fracdiff_1.5-3 cli_3.6.4 lava_1.8.1
## [82] gtable_0.3.6 sass_0.4.9 digest_0.6.37
## [85] classInt_0.4-11 farver_2.1.2 htmltools_0.5.8.1
## [88] lifecycle_1.0.4 hardhat_1.4.1 bit64_4.6.0-1
## [91] MASS_7.3-64