## Strict Liability Conviction Data, 1988 – April 2025
library(readxl)
Convictions <- read_excel("Strict Liability spread sheet.xlsx")
# Convert Date of Sentence to Date class
Convictions$Sentence.Date <- as.Date(Convictions$`Date of Sentence`)
# Extract 4-digit year
Convictions$Sentence.Year <- as.numeric(format(Convictions$Sentence.Date, "%Y"))
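Excel date columns sometimes import as text, so it is worth a quick check that the conversion worked. A minimal sketch, using only the columns created above:
# Count rows where the original date exists but failed to parse
sum(!is.na(Convictions$`Date of Sentence`) & is.na(Convictions$Sentence.Date))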
library(dplyr)
library(ggplot2)
# Count convictions by year
convictions_by_year <- Convictions %>%
filter(!is.na(Sentence.Year)) %>%
group_by(Sentence.Year) %>%
summarise(Conviction.Count = n()) %>%
arrange(Sentence.Year)
# Line plot
ggplot(convictions_by_year, aes(x = Sentence.Year, y = Conviction.Count)) +
geom_line(color = "darkblue", linewidth = 1.2) +
geom_point(size = 2) +
geom_text(aes(label = Conviction.Count), vjust = -0.7, size = 3.5) + # add count labels above points
scale_x_continuous(breaks = seq(min(convictions_by_year$Sentence.Year, na.rm = TRUE),
max(convictions_by_year$Sentence.Year, na.rm = TRUE),
by = 2)) +
labs(title = "Number of DID Convictions by Year",
x = "Year",
y = "Number of Convictions") +
theme_minimal()
library(dplyr)
library(knitr)
# Sex summary
sex_summary <- Convictions %>%
count(Sex) %>%
mutate(Percent = paste0(round(n / sum(n) * 100, 1), "%"))
kable(sex_summary, col.names = c("Sex", "Count", "Percent"),
caption = "Table: Sex Distribution", align = "lrr")
Sex | Count | Percent |
---|---|---|
Female | 32 | 8.3% |
Male | 353 | 91.7% |
ggplot(sex_summary, aes(x = reorder(Sex, n), y = n, fill = Sex)) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), vjust = -0.5) +
labs(title = "Sex Distribution", x = "Sex", y = "Count") +
theme_minimal() +
theme(legend.position = "none")
sex_summary <- Convictions %>%
count(Sex) %>%
mutate(Percent = round(n / sum(n) * 100, 1),
Label = paste0(Sex, "\n", Percent, "%"))
ggplot(sex_summary, aes(x = "", y = n, fill = Sex)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 4) +
labs(title = "Pie Chart: Sex Distribution") +
theme_void()
# Race summary
race_summary <- Convictions %>%
count(Race) %>%
mutate(Percent = paste0(round(n / sum(n) * 100, 1), "%"))
kable(race_summary, col.names = c("Race", "Count", "Percent"),
caption = "Table: Race Distribution", align = "lrr")
Race | Count | Percent |
---|---|---|
Black | 108 | 28.1% |
Hispanic, Black | 2 | 0.5% |
Hispanic, Unspfd BL/WH | 10 | 2.6% |
Hispanic, White | 20 | 5.2% |
N/A | 200 | 51.9% |
White | 45 | 11.7% |
# Ethnicity summary
ethnicity_summary <- Convictions %>%
count(Ethnicity) %>%
mutate(Percent = paste0(round(n / sum(n) * 100, 1), "%"))
kable(ethnicity_summary, col.names = c("Ethnicity", "Count", "Percent"),
caption = "Table: Ethnicity Distribution", align = "lrr")
Ethnicity | Count | Percent |
---|---|---|
Black | 14 | 3.6% |
Hispanic | 44 | 11.4% |
NA | 32 | 8.3% |
Not Hispanic | 232 | 60.3% |
Unknown | 21 | 5.5% |
White | 42 | 10.9% |
#bar chart race
ggplot(race_summary, aes(x = reorder(Race, n), y = n, fill = Race)) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), vjust = -0.5) +
labs(title = "Race Distribution", x = "Race", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
#piechart race
race_summary <- Convictions %>%
count(Race) %>%
mutate(
Percent = round(n / sum(n) * 100, 1),
Label = paste0(Percent, "%") # <-- just percent!
)
library(ggplot2)
ggplot(race_summary, aes(x = "", y = n, fill = Race)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 4) +
labs(title = "Pie Chart: Race Distribution") +
theme_void()
#barchart ethnicity
ggplot(ethnicity_summary, aes(x = reorder(Ethnicity, n), y = n, fill = Ethnicity)) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), vjust = -0.5) +
labs(title = "Ethnicity Distribution", x = "Ethnicity", y = "Count") +
theme_minimal() +
theme(legend.position = "none")
# 1. Chi-squared test across all races
chisq.test(race_summary$n)
##
## Chi-squared test for given probabilities
##
## data: race_summary$n
## X-squared = 459.57, df = 5, p-value < 2.2e-16
Chi-squared test of convictions by race: a chi-squared test of equal proportions was conducted to assess whether convictions were equally distributed across racial groups. The result was statistically significant, X²(5) = 459.57, p < .001. This indicates that convictions are not equally distributed among racial groups: some groups are overrepresented and others underrepresented relative to what would be expected under equal proportions.
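To see which groups drive the overall result, the standardized residuals from the test object can be inspected. A minimal sketch using the race_summary counts from above (positive residuals mean a group is overrepresented relative to equal proportions):
# Which groups are over- or underrepresented relative to equal proportions?
race_chisq <- chisq.test(race_summary$n)
data.frame(Race = race_summary$Race,
           Observed = race_chisq$observed,
           Expected = round(race_chisq$expected, 1),
           Std_Residual = round(race_chisq$stdres, 2))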
# Split counts into White vs Non-White
white <- sum(race_summary$n[race_summary$Race == "White"])
non_white <- sum(race_summary$n[race_summary$Race != "White"])
# Combine into a named vector
conviction_counts <- c(White = white, NonWhite = non_white)
# Run the chi-squared test
chisq.test(conviction_counts, p = c(0.5, 0.5))
##
## Chi-squared test for given probabilities
##
## data: conviction_counts
## X-squared = 226.04, df = 1, p-value < 2.2e-16
A chi-squared goodness-of-fit test was conducted to examine whether convictions were equally distributed between White and Non-White individuals. The result was statistically significant:
X²(1) = 226.04, p < .001
This indicates that Non-White individuals account for a far larger share of convictions than White individuals. Two caveats: the test assumes a 50/50 split rather than each group's share of the population, and the large "N/A" race group (over half of records) is counted as Non-White here, so the result should be interpreted cautiously.
# ---- Table 3: County Distribution ----
county_counts <- table(Convictions$County)
county_percent <- round(prop.table(county_counts) * 100, 1)
county_summary <- data.frame(
County = names(county_counts),
Count = as.vector(county_counts),
Percent = paste0(county_percent, "%")
)
kable(county_summary,
col.names = c("County", "Count", "Percent"),
align = "lrr",
caption = "Table 3: County Distribution among DID Convictions 1980-2020")
County | Count | Percent |
---|---|---|
Atlantic | 66 | 17.1% |
Bergen | 5 | 1.3% |
Burlington | 56 | 14.5% |
Camden | 9 | 2.3% |
Cape May | 68 | 17.7% |
Cumberland | 2 | 0.5% |
Essex | 8 | 2.1% |
Gloucester | 18 | 4.7% |
Hudson | 6 | 1.6% |
Hunterdon | 9 | 2.3% |
Mercer | 6 | 1.6% |
Middlesex | 34 | 8.8% |
Monmouth | 13 | 3.4% |
Morris | 20 | 5.2% |
Ocean | 26 | 6.8% |
Passaic | 16 | 4.2% |
Somerset | 9 | 2.3% |
Sussex | 5 | 1.3% |
Union | 5 | 1.3% |
Warren | 4 | 1% |
#Bar Chart
ggplot(county_summary, aes(x = reorder(County, Count), y = Count, fill = County)) +
geom_bar(stat = "identity") +
geom_text(aes(label = Count), vjust = -0.5, size = 3) +
labs(title = "County Distribution among DID Convictions 1988-4/2025",
x = "County", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
# Sentencing Disparities by Race
## Maximum Sentences by Race
# Function to convert "X Years Y Months Z Days" to numeric years
parse_term_to_years <- function(term) {
# Extract numbers (if missing, set to 0)
years <- as.numeric(stringr::str_extract(term, "\\d+(?=\\s*Years)"))
months <- as.numeric(stringr::str_extract(term, "\\d+(?=\\s*Months)"))
days <- as.numeric(stringr::str_extract(term, "\\d+(?=\\s*Days)"))
years[is.na(years)] <- 0
months[is.na(months)] <- 0
days[is.na(days)] <- 0
# Convert all to years
total_years <- years + months / 12 + days / 365.25
return(total_years)
}
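A quick spot-check of the parser on a few made-up term strings (the exact wording in the spreadsheet may differ, so adjust as needed):
# Spot-check the parser on hypothetical example strings
parse_term_to_years(c("10 Years", "7 Years 6 Months", "30 Days"))
# Should return roughly 10, 7.5, and 0.08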
library(stringr)
# Convert text sentence terms to numeric values
Convictions$Min.Years <- parse_term_to_years(Convictions$`Min Offense Term`)
Convictions$Max.Years <- parse_term_to_years(Convictions$`Max Offense Term`)
library(dplyr)
sentence_by_race <- Convictions %>%
filter(!is.na(Max.Years), !is.na(Race)) %>%
group_by(Race) %>%
summarise(
Count = n(),
Avg_Max_Sentence = round(mean(Max.Years), 2),
Median_Max_Sentence = round(median(Max.Years), 2)
) %>%
arrange(desc(Avg_Max_Sentence))
library(knitr)
kable(sentence_by_race,
col.names = c("Race", "Count", "Avg. Max Sentence (Years)", "Median Max Sentence (Years)"),
caption = "Table: Summary of Maximum Sentence Length by Race",
align = "lrrr")
Race | Count | Avg. Max Sentence (Years) | Median Max Sentence (Years) |
---|---|---|---|
Hispanic, Unspfd BL/WH | 10 | 10.40 | 11.0 |
Hispanic, White | 20 | 10.00 | 10.5 |
Black | 108 | 9.47 | 9.5 |
White | 45 | 8.73 | 7.0 |
N/A | 200 | 7.89 | 7.0 |
Hispanic, Black | 2 | 7.00 | 7.0 |
library(ggplot2)
ggplot(Convictions, aes(x = Race, y = Max.Years, fill = Race)) +
geom_boxplot() +
labs(title = "Max Sentence Length by Race",
x = "Race", y = "Sentence Length (Years)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
What is clear here is that Black and brown defendants are receiving longer maximum sentences, on average.
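A formal test could back this up. A minimal sketch using a Kruskal-Wallis test, chosen here because sentence lengths are skewed; this is exploratory and does not adjust for other charges or case characteristics:
# Do maximum sentence lengths differ across racial groups?
kruskal.test(Max.Years ~ Race,
             data = subset(Convictions, !is.na(Max.Years) & !is.na(Race)))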
## Minimum Sentences by Race
Convictions$Min.Years <- parse_term_to_years(Convictions$`Min Offense Term`)
library(dplyr)
min_sentence_by_race <- Convictions %>%
filter(!is.na(Min.Years), !is.na(Race)) %>%
group_by(Race) %>%
summarise(
Count = n(),
Avg_Min_Sentence = round(mean(Min.Years, na.rm = TRUE), 2),
Median_Min_Sentence = round(median(Min.Years, na.rm = TRUE), 2)
) %>%
arrange(desc(Avg_Min_Sentence))
library(knitr)
kable(min_sentence_by_race,
col.names = c("Race", "Count", "Avg. Min Sentence (Years)", "Median Min Sentence (Years)"),
caption = "Table: Summary of Minimum Sentence Length by Race",
align = "lrrr")
Race | Count | Avg. Min Sentence (Years) | Median Min Sentence (Years) |
---|---|---|---|
Black | 108 | 7.83 | 8.08 |
Hispanic, Unspfd BL/WH | 10 | 7.14 | 8.50 |
Hispanic, White | 20 | 6.81 | 6.00 |
N/A | 200 | 6.64 | 5.95 |
White | 45 | 1.72 | 0.00 |
Hispanic, Black | 2 | 0.00 | 0.00 |
#boxplot
library(ggplot2)
ggplot(Convictions, aes(x = Race, y = Min.Years, fill = Race)) +
geom_boxplot() +
labs(title = "Minimum Sentence Lengths by Race",
x = "Race", y = "Minimum Sentence (Years)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
And White defendants are receiving shorter minimum sentences. The caveat is that this dataset doesn't tell us what else these people were charged with, so one could argue that the defendants with longer maximum sentences simply had additional charges. We could ask the agency that holds the data for more detail on these cases (namely, what else they were charged with). The older dataset (through 2019) does include other charges, but when I counted the DID convictions in this dataset from 1988–2019 I found 188, which is far more than the 61 observations in that other dataset. I even went through the student's previously cleaned and merged data (data_df), and they only found 66. The remaining options are to go through each offense type by hand, looking for other ways DID may have been written, so we can find the other ~120 cases, or to request more data.
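One way to make the eyeballing more systematic: search the offense descriptions in the older dataset for strings that could indicate a drug-induced death charge. A rough sketch, assuming the older file is loaded as older_data with an `Offense Type` column (both names are placeholders, and the search terms, including the statute citation 2C:35-9, would need to be checked against the actual wording):
library(stringr)
# Flag offense descriptions that might be drug-induced death charges
candidates <- older_data %>%
  filter(str_detect(str_to_lower(`Offense Type`),
                    "drug[ -]induced|strict liab|2c:35-9"))
table(candidates$`Offense Type`)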
I just want to show you how messy this dataset is…
# Combine Race and Ethnicity into one grouping column
Convictions <- Convictions %>%
mutate(Race_Ethnicity = paste(Race, Ethnicity, sep = " - "))
library(dplyr)
race_ethnicity_summary <- Convictions %>%
filter(!is.na(Race_Ethnicity)) %>%
count(Race_Ethnicity) %>%
mutate(
Percent = round(n / sum(n) * 100, 1),
Label = paste0(Percent, "%")
)
library(knitr)
kable(race_ethnicity_summary,
col.names = c("Race + Ethnicity", "Count", "Percent of Charges", "Label"),
caption = "Table: Distribution of Charges by Combined Race and Ethnicity",
align = "lrrr")
Race + Ethnicity | Count | Percent of Charges | Label |
---|---|---|---|
Black - Black | 7 | 1.8 | 1.8% |
Black - Hispanic | 3 | 0.8 | 0.8% |
Black - NA | 8 | 2.1 | 2.1% |
Black - Not Hispanic | 77 | 20.0 | 20% |
Black - Unknown | 13 | 3.4 | 3.4% |
Hispanic, Black - Hispanic | 2 | 0.5 | 0.5% |
Hispanic, Unspfd BL/WH - Hispanic | 7 | 1.8 | 1.8% |
Hispanic, Unspfd BL/WH - Not Hispanic | 3 | 0.8 | 0.8% |
Hispanic, White - Hispanic | 20 | 5.2 | 5.2% |
N/A - Black | 6 | 1.6 | 1.6% |
N/A - Hispanic | 12 | 3.1 | 3.1% |
N/A - NA | 15 | 3.9 | 3.9% |
N/A - Not Hispanic | 146 | 37.9 | 37.9% |
N/A - Unknown | 8 | 2.1 | 2.1% |
N/A - White | 13 | 3.4 | 3.4% |
White - Black | 1 | 0.3 | 0.3% |
White - NA | 9 | 2.3 | 2.3% |
White - Not Hispanic | 6 | 1.6 | 1.6% |
White - White | 29 | 7.5 | 7.5% |
# Arrests, 2017–2024
library(readxl)
Arrest <- read_excel("NJ DIH Arrest data 2017-2024.xlsx")
library(dplyr)
arrests_by_year <- Arrest %>%
group_by(`Arrest Year`) %>%
summarise(Count = n()) %>%
arrange(`Arrest Year`)
library(ggplot2)
ggplot(arrests_by_year, aes(x = `Arrest Year`, y = Count)) +
geom_line(group = 1, color = "steelblue", linewidth = 1.2) +
geom_point(size = 2) +
geom_text(aes(label = Count), vjust = -0.5) +
labs(title = "Drug-Induced Arrests by Year",
x = "Year", y = "Arrest Count") +
theme_minimal()
library(dplyr)
library(knitr)
race_summary <- Arrest %>%
count(Race) %>%
mutate(Percent = round(n / sum(n) * 100, 1))
kable(race_summary,
caption = "Table: Arrests by Race",
col.names = c("Race", "Count", "Percent"),
align = "lrr")
Race | Count | Percent |
---|---|---|
Black | 187 | 50.1 |
Not Provided | 7 | 1.9 |
Unknown | 4 | 1.1 |
White (incl. Hispanic origin) | 175 | 46.9 |
library(ggplot2)
ggplot(race_summary, aes(x = reorder(Race, n), y = n, fill = Race)) +
geom_col() +
geom_text(aes(label = n), vjust = -0.3) +
labs(title = "Arrests by Race", x = "Race", y = "Count") +
theme_minimal() +
coord_flip() +
theme(legend.position = "none")
ggplot(race_summary, aes(x = "", y = n, fill = Race)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
geom_text(aes(label = paste0(Percent, "%")),
position = position_stack(vjust = 0.5)) +
labs(title = "Pie Chart: Arrests by Race") +
theme_void()
What I find odd is that the arrest data has very few "Unknown" entries for race, yet a large share of the conviction data has race recorded as unknown ("N/A").
This also suggests to me that there are many more arrests of Black people in cases where there is not enough evidence to convict.
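One rough check is to compare the White share of arrests with the White share of convictions. This is only a sketch: the race labels differ across the two files, and the large "N/A" group in the conviction data is dropped here, which could easily distort the comparison:
# White share of arrests vs. convictions (two-sample proportion test)
white_arrests <- sum(Arrest$Race == "White (incl. Hispanic origin)", na.rm = TRUE)
total_arrests <- sum(!is.na(Arrest$Race))
white_convictions <- sum(Convictions$Race == "White", na.rm = TRUE)
total_convictions <- sum(!is.na(Convictions$Race) & Convictions$Race != "N/A")
prop.test(c(white_arrests, white_convictions), c(total_arrests, total_convictions))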
chisq.test(race_summary$n)
##
## Chi-squared test for given probabilities
##
## data: race_summary$n
## X-squared = 331.12, df = 3, p-value < 2.2e-16
A Chi-squared test was conducted to evaluate whether arrest counts differ significantly by race. The test was highly significant, X²(3) = 331.12, p < 0.001, indicating that arrest frequencies are not equally distributed among racial groups.
### Statistical Test: Chi-squared Test for Arrest Counts by Race
ggplot(race_summary, aes(x = reorder(Race, n), y = n, fill = Race)) +
geom_col() +
geom_text(aes(label = n), vjust = -0.3) +
labs(
title = "Arrests by Race",
subtitle = paste0("Chi-squared Test: X² = 331.1, p < 0.001"),
x = "Race", y = "Count"
) +
theme_minimal() +
coord_flip() +
theme(legend.position = "none")
## Comparing Whites to Non-Whites
# Create two categories using sum() for safety.
# Note: the arrest data labels White as "White (incl. Hispanic origin)",
# so the filter must match that label exactly; matching on "White" alone
# returns a count of zero and inflates the test statistic.
white <- sum(race_summary$n[race_summary$Race == "White (incl. Hispanic origin)"])
non_white <- sum(race_summary$n[race_summary$Race != "White (incl. Hispanic origin)"])
# Make a vector of counts
arrest_counts <- c(White = white, NonWhite = non_white)
# Run a chi-squared test (goodness of fit)
chisq.test(arrest_counts, p = c(0.5, 0.5)) # Assuming equal expected
##
## Chi-squared test for given probabilities
##
## data: arrest_counts
## X-squared = 373, df = 1, p-value < 2.2e-16
A chi-squared goodness-of-fit test was set up to compare arrest counts between White and Non-White individuals under an assumed 50/50 split. Note, however, that the X²(1) = 373 shown above is an artifact of the label mismatch described in the code comment (the White count was zero). With the corrected label, the White and Non-White counts are 175 and 198, which is much closer to an even split, so this test should be rerun before drawing conclusions. A comparison against each group's share of the state population would be more informative in any case.
I could try to do more here and control for the population size of each race.
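A sketch of what that could look like: arrest rates per 100,000 by race. The population figures below are placeholders (NA), not real Census numbers; they would need to be replaced with actual NJ population estimates by race before interpreting anything:
# Placeholder populations by race -- replace NA with real NJ estimates
nj_pop_by_race <- c("Black" = NA_real_,
                    "White (incl. Hispanic origin)" = NA_real_)
race_summary %>%
  mutate(Population = nj_pop_by_race[Race],
         Arrests_per_100k = n / Population * 100000)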
gender_summary <- Arrest %>%
count(Gender) %>%
mutate(Percent = round(n / sum(n) * 100, 1))
kable(gender_summary,
caption = "Table: Arrests by Gender",
col.names = c("Gender", "Count", "Percent"),
align = "lrr")
Gender | Count | Percent |
---|---|---|
Female | 73 | 19.6 |
Male | 300 | 80.4 |
ggplot(gender_summary, aes(x = reorder(Gender, n), y = n, fill = Gender)) +
geom_col() +
geom_text(aes(label = n), vjust = -0.3) +
labs(title = "Arrests by Gender", x = "Gender", y = "Count") +
theme_minimal() +
theme(legend.position = "none")
ggplot(gender_summary, aes(x = "", y = n, fill = Gender)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
geom_text(aes(label = paste0(Percent, "%")),
position = position_stack(vjust = 0.5)) +
labs(title = "Pie Chart: Arrests by Gender") +
theme_void()
age_summary <- Arrest %>%
count(`Age Range`) %>%
mutate(Percent = round(n / sum(n) * 100, 1)) %>%
arrange(desc(n))
kable(age_summary,
caption = "Table: Arrests by Age Range",
col.names = c("Age Range", "Count", "Percent"),
align = "lrr")
Age Range | Count | Percent |
---|---|---|
25-34 | 161 | 43.2 |
35-44 | 115 | 30.8 |
18-24 | 39 | 10.5 |
45-54 | 36 | 9.7 |
55-64 | 16 | 4.3 |
65-74 | 5 | 1.3 |
75+ | 1 | 0.3 |
ggplot(age_summary, aes(x = reorder(`Age Range`, n), y = n, fill = `Age Range`)) +
geom_col() +
geom_text(aes(label = n), vjust = -0.3) +
labs(title = "Arrests by Age Range", x = "Age Range", y = "Count") +
theme_minimal() +
coord_flip() +
theme(legend.position = "none")
ggplot(age_summary, aes(x = "", y = n, fill = `Age Range`)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
geom_text(aes(label = paste0(Percent, "%")),
position = position_stack(vjust = 0.5)) +
labs(title = "Pie Chart: Arrests by Age Range") +
theme_void()
county_summary <- Arrest %>%
count(County) %>%
mutate(Percent = round(n / sum(n) * 100, 1)) %>%
arrange(desc(n))
kable(county_summary,
caption = "Table: Arrests by County",
col.names = c("County", "Count", "Percent"),
align = "lrr")
County | Count | Percent |
---|---|---|
Atlantic | 57 | 15.3 |
Burlington | 56 | 15.0 |
Cape May | 47 | 12.6 |
Ocean | 37 | 9.9 |
Middlesex | 31 | 8.3 |
Statewide or Regional LEA | 23 | 6.2 |
Bergen | 22 | 5.9 |
Monmouth | 21 | 5.6 |
Morris | 20 | 5.4 |
Gloucester | 14 | 3.8 |
Passaic | 13 | 3.5 |
Essex | 9 | 2.4 |
Cumberland | 5 | 1.3 |
Hunterdon | 4 | 1.1 |
Camden | 3 | 0.8 |
Mercer | 3 | 0.8 |
Salem | 2 | 0.5 |
Union | 2 | 0.5 |
Warren | 2 | 0.5 |
Hudson | 1 | 0.3 |
Somerset | 1 | 0.3 |
ggplot(county_summary, aes(x = reorder(County, n), y = n, fill = County)) +
geom_col() +
geom_text(aes(label = n), vjust = -0.3) +
labs(title = "Arrests by County", x = "County", y = "Count") +
theme_minimal() +
coord_flip() +
theme(legend.position = "none")
If it is useful, we could look at race by county to see whether there are any county-level racial disparities.
# Overdoses
Overdose <- read.csv("~/Library/CloudStorage/Dropbox/1-Research/New Jersey DIH Data/Underlying Cause of Death, 2018-2023, Single Race.csv", stringsAsFactors=TRUE)
library(dplyr)
library(ggplot2)
# Summarize total deaths by year
deaths_by_year <- Overdose %>%
group_by(Year) %>%
summarise(Total_Deaths = sum(Deaths, na.rm = TRUE)) %>%
arrange(Year)
# Line chart of total deaths over time
ggplot(deaths_by_year, aes(x = Year, y = Total_Deaths)) +
geom_line(color = "darkred", size = 1.2) +
geom_point(size = 2) +
geom_text(aes(label = Total_Deaths), vjust = -0.5, size = 3) +
labs(title = "Total Overdose Deaths by Year",
x = "Year", y = "Number of Deaths") +
theme_minimal()
# Summarize deaths by county and year
county_deaths_by_year <- Overdose %>%
group_by(County, Year) %>%
summarise(Total_Deaths = sum(Deaths, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
# Multi-line chart: different color per county
ggplot(county_deaths_by_year, aes(x = Year, y = Total_Deaths, color = County)) +
geom_line(linewidth = 1) +
labs(title = "Overdose Deaths by County Over Time",
x = "Year", y = "Number of Deaths", color = "County") +
theme_minimal() +
theme(legend.position = "right",
legend.title = element_text(size = 10),
legend.text = element_text(size = 8))
# Summarize deaths by county and year
county_crude_by_year <- Overdose %>%
group_by(County, Year) %>%
summarise(Crude_Deaths = sum(`Crude.Rate`, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
# Multi-line chart: different color per county
ggplot(county_crude_by_year, aes(x = Year, y = Crude_Deaths, color = County)) +
geom_line(linewidth = 1) +
labs(title = "Overdose Crude Rate by County Over Time",
x = "Year", y = "Crude Rate", color = "County") +
theme_minimal() +
theme(legend.position = "right",
legend.title = element_text(size = 10),
legend.text = element_text(size = 8))
library(dplyr)
library(ggplot2)
# 1. Summarize arrests by year
arrest_by_year <- Arrest %>%
group_by(`Arrest Year`) %>%
summarise(Count = n()) %>%
mutate(Type = "Arrests") %>%
rename(Year = `Arrest Year`)
# 2. Summarize overdose deaths by year
overdose_by_year <- Overdose %>%
group_by(Year) %>%
summarise(Count = sum(Deaths, na.rm = TRUE)) %>%
mutate(Type = "Overdose Deaths")
# 3. Combine into one dataset
combined_trend <- bind_rows(arrest_by_year, overdose_by_year)
# 4. Plot both trends
ggplot(combined_trend, aes(x = Year, y = Count, color = Type)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
labs(title = "Arrests vs. Overdose Deaths Over Time",
x = "Year",
y = "Count",
color = "Data Type") +
scale_color_manual(values = c("Arrests" = "steelblue", "Overdose Deaths" = "firebrick")) +
theme_minimal()
library(dplyr)
library(ggplot2)
# 1. Summarize arrests by year
arrest_by_year <- Arrest %>%
group_by(`Arrest Year`) %>%
summarise(Count = n()) %>%
mutate(Type = "Arrests") %>%
rename(Year = `Arrest Year`)
# 2. Summarize overdose deaths by year
crude_by_year <- Overdose %>%
group_by(Year) %>%
summarise(Count = sum(`Crude.Rate`, na.rm = TRUE)) %>%
mutate(Type = "Crude Overdose Death Rate")
# 3. Combine into one dataset
combined_trend <- bind_rows(arrest_by_year, crude_by_year)
# 4. Plot both trends
ggplot(combined_trend, aes(x = Year, y = Count, color = Type)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
labs(title = "Arrests vs. Crude Overdose Deaths Over Time",
x = "Year",
y = "Count",
color = "Data Type") +
scale_color_manual(values = c("Arrests" = "steelblue", "Crude Overdose Death Rate" = "firebrick")) +
theme_minimal()
library(dplyr)
library(ggplot2)
# 1. Summarize convictions by year (filter to 2018–2024)
convictions_by_year <- Convictions %>%
filter(!is.na(Sentence.Year), Sentence.Year >= 2018, Sentence.Year <= 2024) %>%
group_by(Sentence.Year) %>%
summarise(Count = n()) %>%
mutate(Type = "Convictions") %>%
rename(Year = Sentence.Year)
# 2. Summarize overdose deaths by year (no filter)
overdose_by_year <- Overdose %>%
group_by(Year) %>%
summarise(Count = sum(Deaths, na.rm = TRUE)) %>%
mutate(Type = "Overdose Deaths")
# 3. Combine both datasets
combined_data <- bind_rows(convictions_by_year, overdose_by_year)
# 4. Plot line chart without text labels
ggplot(combined_data, aes(x = Year, y = Count, color = Type)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
scale_x_continuous(breaks = seq(min(combined_data$Year),
max(combined_data$Year),
by = 1)) +
scale_color_manual(values = c("Convictions" = "darkblue", "Overdose Deaths" = "firebrick")) +
labs(title = "DID Convictions (2018–2024) vs. Overdose Deaths by Year",
x = "Year",
y = "Count",
color = "Type") +
theme_minimal()
library(dplyr)
library(ggplot2)
# 1. Convictions summarized by year (2018–2024)
convictions_by_year <- Convictions %>%
filter(!is.na(Sentence.Year), Sentence.Year >= 2018, Sentence.Year <= 2024) %>%
group_by(Sentence.Year) %>%
summarise(Count = n()) %>%
mutate(Type = "Convictions") %>%
rename(Year = Sentence.Year)
# 2. Use Crude Rate for overdose data
overdose_rate_by_year <- Overdose %>%
group_by(Year) %>%
summarise(CrudeRate = mean(`Crude.Rate`, na.rm = TRUE)) %>%
mutate(Type = "Overdose Crude Rate") %>%
rename(Count = CrudeRate)
# 3. Combine data
combined_data <- bind_rows(convictions_by_year, overdose_rate_by_year)
# 4. Plot convictions vs crude rate
ggplot(combined_data, aes(x = Year, y = Count, color = Type)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
scale_x_continuous(breaks = seq(min(combined_data$Year),
max(combined_data$Year),
by = 1)) +
scale_color_manual(values = c("Convictions" = "darkblue",
"Overdose Crude Rate" = "firebrick")) +
labs(title = "DID Convictions vs. Overdose Crude Rate (2018–2024)",
x = "Year",
y = "Count / Crude Rate",
color = "Type") +
theme_minimal()
library(dplyr)
library(ggplot2)
library(tidyr)
# 1. Get total population per year (from Overdose data)
pop_by_year <- Overdose %>%
group_by(Year) %>%
summarise(Population = mean(Population, na.rm = TRUE)) # Assuming 1 row per county per year
# 2. Arrests by year
arrests_by_year <- Arrest %>%
group_by(`Arrest Year`) %>%
summarise(Arrest_Count = n()) %>%
rename(Year = `Arrest Year`)
# 3. Convictions by year
convictions_by_year <- Convictions %>%
filter(!is.na(Sentence.Year)) %>%
group_by(Sentence.Year) %>%
summarise(Conviction_Count = n()) %>%
rename(Year = Sentence.Year)
# 4. Overdose deaths by year
overdose_by_year <- Overdose %>%
group_by(Year) %>%
summarise(Overdose_Deaths = sum(`Crude.Rate`, na.rm = TRUE))
# 5. Combine all into one dataset
combined_df <- pop_by_year %>%
left_join(arrests_by_year, by = "Year") %>%
left_join(convictions_by_year, by = "Year") %>%
left_join(overdose_by_year, by = "Year") %>%
filter(Year >= 2018, Year <= 2024)
# 6. Calculate crude rates
combined_df <- combined_df %>%
mutate(
Arrest_Rate = (Arrest_Count / Population) * 100000,
Conviction_Rate = (Conviction_Count / Population) * 100000,
Overdose_Rate = (Overdose_Deaths / Population) * 100000
)
# 7. Pivot longer for plotting
plot_data <- combined_df %>%
select(Year, Arrest_Rate, Conviction_Rate, Overdose_Rate) %>%
pivot_longer(cols = -Year, names_to = "Type", values_to = "CrudeRate")
# 8. Plot all crude rates on one chart
ggplot(plot_data, aes(x = Year, y = CrudeRate, color = Type)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
labs(title = "Crude Rates per 100,000: Arrests, Convictions, and Overdose Deaths (2018–2024)",
x = "Year",
y = "Crude Rate per 100,000",
color = "Metric") +
scale_color_manual(values = c(
"Arrest_Rate" = "steelblue",
"Conviction_Rate" = "darkgreen",
"Overdose_Rate" = "firebrick"
)) +
theme_minimal()
data <- combined_df %>%
select(Year, Arrest_Count, Conviction_Count, Overdose_Deaths, Population) %>%
mutate(Overdose_CrudeRate = (Overdose_Deaths / Population) * 100000)
# Convictions vs Overdose Crude Rate
ggplot(data, aes(x = Conviction_Count, y = Overdose_CrudeRate)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Convictions vs Overdose Crude Rate")
## `geom_smooth()` using formula = 'y ~ x'
# Arrests vs Overdose Crude Rate
ggplot(data, aes(x = Arrest_Count, y = Overdose_CrudeRate)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Arrests vs Overdose Crude Rate")
## `geom_smooth()` using formula = 'y ~ x'
cor.test(data$Conviction_Count, data$Overdose_CrudeRate, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: data$Conviction_Count and data$Overdose_CrudeRate
## t = -2.4715, df = 4, p-value = 0.06883
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.97427547 0.09264918
## sample estimates:
## cor
## -0.7773624
cor.test(data$Arrest_Count, data$Overdose_CrudeRate, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: data$Arrest_Count and data$Overdose_CrudeRate
## t = 2.884, df = 4, p-value = 0.04483
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03054822 0.97984773
## sample estimates:
## cor
## 0.8217372
# Prepare data with just counts (no crude rate)
data <- combined_df %>%
select(Year, Arrest_Count, Conviction_Count, Overdose_Deaths)
# Convictions vs Overdose Deaths (Counts)
ggplot(data, aes(x = Conviction_Count, y = Overdose_Deaths)) +
geom_point(color = "darkgreen", size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(
title = "Convictions vs Overdose Deaths (Raw Counts)",
x = "Convictions",
y = "Overdose Deaths"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Arrests vs Overdose Deaths (Counts)
ggplot(data, aes(x = Arrest_Count, y = Overdose_Deaths)) +
geom_point(color = "steelblue", size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(
title = "Arrests vs Overdose Deaths (Raw Counts)",
x = "Arrests",
y = "Overdose Deaths"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
cor.test(data$Conviction_Count, data$Overdose_Deaths)
##
## Pearson's product-moment correlation
##
## data: data$Conviction_Count and data$Overdose_Deaths
## t = -2.287, df = 4, p-value = 0.08414
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9710784 0.1511317
## sample estimates:
## cor
## -0.7527572
Correlation coefficient (r = -0.753): this indicates a negative correlation between convictions and overdose deaths; as convictions go up, overdose deaths tend to go down. The relationship is not statistically significant (p = 0.084), which is above the conventional 0.05 threshold.
Confidence interval: [-0.97, 0.15]. The wide range reflects the uncertainty that comes with a very small sample (only 6 yearly data points), and the interval includes 0, so we cannot rule out no relationship.
cor.test(data$Arrest_Count, data$Overdose_Deaths)
##
## Pearson's product-moment correlation
##
## data: data$Arrest_Count and data$Overdose_Deaths
## t = 2.0557, df = 4, p-value = 0.109
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2266366 0.9662552
## sample estimates:
## cor
## 0.7167575
Correlation coefficient (r = 0.717): a moderate positive correlation; as arrests increase, overdose deaths tend to increase as well.
p-value = 0.109: not statistically significant at the 0.05 level, so the observed correlation could be due to chance, especially given the small sample size.
Confidence interval: [-0.227, 0.966]. The interval is very wide and crosses 0, so we cannot be confident that a positive (or any) relationship exists; such wide intervals are typical of small datasets.
library(dplyr)
# 1. Convictions by county and year
convictions_by_county_year <- Convictions %>%
filter(!is.na(Sentence.Year)) %>%
group_by(County, Year = Sentence.Year) %>%
summarise(Conviction_Count = n(), .groups = "drop")
# 2. Arrests by county and year
arrests_by_county_year <- Arrest %>%
group_by(County, Year = `Arrest Year`) %>%
summarise(Arrest_Count = n(), .groups = "drop")
# 3. Overdose deaths and population by county and year
# Standardize County names in Overdose dataframe
Overdose <- Overdose %>%
mutate(County = gsub(" County, NJ", "", County))
overdose_by_county_year <- Overdose %>%
group_by(County = County, Year) %>%
summarise(
Overdose_Deaths = sum(Deaths, na.rm = TRUE),
Population = sum(Population, na.rm = TRUE),
.groups = "drop"
)
# 4. Combine all into one dataset
combined_df <- overdose_by_county_year %>%
left_join(arrests_by_county_year, by = c("County", "Year")) %>%
left_join(convictions_by_county_year, by = c("County", "Year"))
#Tell it NA = 0 for arrests and conviction counts
combined_df <- combined_df %>%
mutate(
Arrest_Count = ifelse(is.na(Arrest_Count), 0, Arrest_Count),
Conviction_Count = ifelse(is.na(Conviction_Count), 0, Conviction_Count)
)
# Optional: filter for years of interest
combined_df <- combined_df %>%
filter(Year >= 2018, Year <= 2024)
library(ggplot2)
# Convictions vs Overdose Deaths
ggplot(combined_df, aes(x = Conviction_Count, y = Overdose_Deaths)) +
geom_point(aes(color = County)) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Convictions vs Overdose Deaths by County")
## `geom_smooth()` using formula = 'y ~ x'
# Arrests vs Overdose Deaths
ggplot(combined_df, aes(x = Arrest_Count, y = Overdose_Deaths)) +
geom_point(aes(color = County)) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Arrests vs Overdose Deaths by County")
## `geom_smooth()` using formula = 'y ~ x'
# Load the library (if needed)
library(MASS)
# Poisson regression: model overdose deaths as count
poisson_model1 <- glm(
Overdose_Deaths ~ Conviction_Count + Population,
family = poisson(link = "log"),
data = combined_df
)
summary(poisson_model1)
##
## Call:
## glm(formula = Overdose_Deaths ~ Conviction_Count + Population,
## family = poisson(link = "log"), data = combined_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.979e+00 1.909e-02 208.38 <2e-16 ***
## Conviction_Count 2.410e-02 2.408e-03 10.01 <2e-16 ***
## Population 1.822e-06 2.948e-08 61.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7045.9 on 125 degrees of freedom
## Residual deviance: 3118.2 on 123 degrees of freedom
## AIC: 3943.8
##
## Number of Fisher Scoring iterations: 4
exp(coef(poisson_model1))
## (Intercept) Conviction_Count Population
## 53.458623 1.024394 1.000002
Conviction_Count IRR = 1.024: each additional conviction is associated with a 2.4% increase in expected overdose deaths, controlling for population. The coefficient is statistically significant (p < 0.001).
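For reporting, it can help to pair the rate ratios with confidence intervals. A minimal sketch using profile-likelihood intervals (confint() can be slow on larger models):
# Incidence rate ratios with 95% CIs for the Poisson model
exp(cbind(IRR = coef(poisson_model1), confint(poisson_model1)))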
# Overdispersion test
overdispersion <- sum(residuals(poisson_model1, type = "pearson")^2) / df.residual(poisson_model1)
overdispersion
## [1] 25.82402
BUT the dispersion statistic of about 25.8 shows the Poisson model is extremely overdispersed, so a negative binomial model is fit next.
library(MASS)
nb_model <- glm.nb(
Overdose_Deaths ~ Conviction_Count + Population,
data = combined_df
)
summary(nb_model)
##
## Call:
## glm.nb(formula = Overdose_Deaths ~ Conviction_Count + Population,
## data = combined_df, init.theta = 5.081236901, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.722e+00 8.348e-02 44.584 < 2e-16 ***
## Conviction_Count 3.450e-02 1.274e-02 2.707 0.00679 **
## Population 2.305e-06 1.554e-07 14.830 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(5.0812) family taken to be 1)
##
## Null deviance: 310.02 on 125 degrees of freedom
## Residual deviance: 130.63 on 123 degrees of freedom
## AIC: 1353
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 5.081
## Std. Err.: 0.660
##
## 2 x log-likelihood: -1344.995
Variable | Estimate | z value | p-value | Interpretation |
---|---|---|---|---|
(Intercept) | 3.722 | 44.58 | < 0.001 *** | Baseline log count of deaths when predictors are 0. |
Conviction_Count | 0.0345 | 2.71 | 0.0068 ** | Statistically significant. For each additional conviction, expected overdose deaths increase by ~3.5%, holding population constant. |
Population | 0.000002305 | 14.83 | < 0.001 *** | Significant and positive: larger population means more deaths, as expected. |
Residual deviance = 130.63 on 123 df: reasonably close to the degrees of freedom, suggesting a decent fit. Null deviance = 310.02: much higher than the residual deviance, so the predictors explain a good share of the variation. AIC = 1353: useful for comparing with other models (lower is better). Theta (the negative binomial dispersion parameter) = 5.08: a finite theta reflects the extra-Poisson variation found earlier, confirming that the negative binomial was a better choice than the Poisson.
# Compare models by AIC
AIC(poisson_model1, nb_model)
## df AIC
## poisson_model1 3 3943.754
## nb_model 4 1352.995
# Calculate dispersion statistic for negative binomial model
overdispersion_stat <- sum(residuals(nb_model, type = "pearson")^2) / df.residual(nb_model)
overdispersion_stat
## [1] 1.028553
The dispersion statistic is ≈ 1, indicating no remaining overdispersion (the negative binomial model fits well).
# Lagged model: Convictions -> Overdose deaths
library(dplyr)
combined_df <- combined_df %>%
arrange(County, Year) %>%
group_by(County) %>%
mutate(Lag_Convictions = lag(Conviction_Count, 1)) %>%
ungroup()
library(MASS)
nb_lag_model <- glm.nb(Overdose_Deaths ~ Lag_Convictions + Population, data = combined_df)
summary(nb_lag_model)
##
## Call:
## glm.nb(formula = Overdose_Deaths ~ Lag_Convictions + Population,
## data = combined_df, init.theta = 5.11883703, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.688e+00 9.139e-02 40.354 < 2e-16 ***
## Lag_Convictions 3.839e-02 1.472e-02 2.608 0.00912 **
## Population 2.350e-06 1.693e-07 13.880 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(5.1188) family taken to be 1)
##
## Null deviance: 266.95 on 104 degrees of freedom
## Residual deviance: 108.83 on 102 degrees of freedom
## (21 observations deleted due to missingness)
## AIC: 1126.4
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 5.119
## Std. Err.: 0.729
##
## 2 x log-likelihood: -1118.414
Lagged convictions estimate: 0.03839. A 1-unit increase in convictions last year is associated with a ~3.9% increase in expected overdose deaths this year (exp(0.03839) ≈ 1.039). The p-value of 0.00912 is significant at the 1% level, so the association is unlikely to be due to chance. Interpretation: more convictions in the previous year are associated with more overdose deaths in the current year, after adjusting for population. This may suggest that convictions are a response to underlying drug problems rather than a deterrent.
AIC = 1126.4: use this to compare with other models (lower is better). The residual deviance (108.83) is substantially lower than the null deviance (266.95), indicating improved fit. Theta = 5.12: the negative binomial handles the dispersion reasonably well (far less residual overdispersion than the Poisson).
# Model: Arrests -> Overdose deaths
library(MASS)
nb_model2 <- glm.nb(
Overdose_Deaths ~ Arrest_Count + Population,
data = combined_df
)
summary(nb_model2)
##
## Call:
## glm.nb(formula = Overdose_Deaths ~ Arrest_Count + Population,
## data = combined_df, init.theta = 4.930072535, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.756e+00 8.293e-02 45.29 <2e-16 ***
## Arrest_Count 2.461e-02 1.183e-02 2.08 0.0375 *
## Population 2.238e-06 1.575e-07 14.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.9301) family taken to be 1)
##
## Null deviance: 301.29 on 125 degrees of freedom
## Residual deviance: 130.72 on 123 degrees of freedom
## AIC: 1356.8
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.930
## Std. Err.: 0.639
##
## 2 x log-likelihood: -1348.783
Interpretation of Arrest_Count: the exponentiated coefficient is exp(0.02461) ≈ 1.025, meaning each additional arrest is associated with a ~2.5% increase in expected overdose deaths, holding population constant. More arrests are not associated with fewer deaths; instead, the model suggests a positive association.
# Lagged model: Arrests -> Overdose deaths
library(dplyr)
combined_df <- combined_df %>%
arrange(County, Year) %>%
group_by(County) %>%
mutate(Lag_Arrest = lag(Arrest_Count, 1)) %>%
ungroup()
library(MASS)
nb_lag_model2 <- glm.nb(Overdose_Deaths ~ Lag_Arrest + Population, data = combined_df)
summary(nb_lag_model2)
##
## Call:
## glm.nb(formula = Overdose_Deaths ~ Lag_Arrest + Population, data = combined_df,
## init.theta = 4.987731906, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.720e+00 9.056e-02 41.083 <2e-16 ***
## Lag_Arrest 2.637e-02 1.235e-02 2.135 0.0328 *
## Population 2.271e-06 1.711e-07 13.272 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.9877) family taken to be 1)
##
## Null deviance: 260.49 on 104 degrees of freedom
## Residual deviance: 108.91 on 102 degrees of freedom
## (21 observations deleted due to missingness)
## AIC: 1129.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.988
## Std. Err.: 0.709
##
## 2 x log-likelihood: -1121.137
Lag_Arrest estimate: 0.02637 (exponentiated: exp(0.02637) ≈ 1.027). Each additional arrest in the previous year is associated with a ~2.7% increase in expected overdose deaths the following year, controlling for population. The p-value of 0.0328 is statistically significant at the 5% level. This is consistent with the prior models: lagged enforcement actions (arrests) do not appear to reduce overdose deaths; in fact, they are positively associated with them.
Metric | Value | Interpretation |
---|---|---|
Null Deviance | 260.49 | Fit of intercept-only model |
Residual Deviance | 108.91 | Fit after including predictors |
AIC | 1129.1 | Slightly worse than the lagged convictions model (AIC = 1126.4) |
Theta | 4.988 | Overdispersion handled well |
A general template for interpreting these county-level panel estimates: “A unit-level change in X is associated with a % change in Y, within that unit, over time, controlling for [covariates].”
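To push the interpretation toward within-county change, one option is a county fixed-effects version of the negative binomial model. This is only a sketch, not the model reported above; with roughly six years per county, the fixed-effects estimates will be noisy:
# County fixed effects: identify the association from within-county
# changes over time rather than differences between counties
nb_fe <- glm.nb(Overdose_Deaths ~ Conviction_Count + Population + factor(County),
                data = combined_df)
summary(nb_fe)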
The following write-up of the findings is suitable for a report, paper, or policy brief. It treats the data as county-level panel data and keeps the interpretation at that level to avoid the ecological fallacy.
## Findings: Drug Enforcement and Overdose Deaths (County-Level Analysis)
We analyzed the relationship between drug enforcement and overdose mortality at the county-year level using New Jersey data (a 2018–2023 county-year panel). Enforcement was measured using conviction counts and arrest counts, and overdose mortality was modeled with negative binomial regression to account for overdispersed count data. All models controlled for county population, and we examined both contemporaneous (non-lagged) and lagged effects to explore temporal relationships.
Non-lagged models (same-year effects). A 1-unit increase in convictions within the same year was associated with a 3.5% increase in overdose deaths (IRR = 1.035, p = 0.007), and a 1-unit increase in arrests within the same year was associated with a 2.5% increase in overdose deaths (IRR = 1.025, p = 0.038). Both associations were statistically significant after adjusting for population. Interpretation: higher levels of enforcement activity in a given year were positively associated with overdose deaths in the same year. These associations likely reflect underlying county-level drug trends, in which enforcement and mortality rise in parallel, rather than enforcement directly causing increased mortality.
Lagged models (one-year delayed effects). A 1-unit increase in convictions in the previous year was associated with a 3.9% increase in overdose deaths the following year (IRR = 1.039, p = 0.009), and a 1-unit increase in arrests in the previous year was associated with a 2.7% increase the following year (IRR = 1.027, p = 0.033). Interpretation: increased enforcement activity in a given year does not appear to lead to reductions in overdose mortality the following year. Both arrests and convictions were positively associated with future deaths, which may reflect enforcement targeting areas with worsening drug problems, or structural conditions that drive both arrests and mortality.
General considerations. These results are based on county-year aggregate data and do not imply causal effects at the individual level. The positive associations between enforcement and overdose deaths likely reflect shared underlying conditions (e.g., high community-level drug use) rather than enforcement causing more deaths, and they do not account for potential confounding from unmeasured county-level characteristics (e.g., treatment access, fentanyl prevalence).
Policy implications. These findings challenge the assumption that criminal legal enforcement reduces overdose mortality. They underscore the importance of exploring non-punitive, public-health-oriented interventions such as harm reduction programs, expanded access to treatment and naloxone, and community-based support systems.
We modeled overdose deaths across counties using a Negative Binomial regression, appropriate for count data with overdispersion. The predictors included Conviction Count, Arrest Count, and Population, all measured annually at the county level. Below are interpretations of the results from the models:
glm.nb(Overdose_Deaths ~ Conviction_Count + Population, data = combined_df)
Note: this model reflects concurrent associations, not causality.
glm.nb(Overdose_Deaths ~ Lag_Convictions + Population, data = combined_df)
This strengthens the evidence that increased convictions may not reduce overdose deaths over time, and may even be associated with increased deaths at the county level.
glm.nb(Overdose_Deaths ~ Arrest_Count + Population, data = combined_df)
glm.nb(Overdose_Deaths ~ Lag_Arrest + Population, data = combined_df)
plot(combined_df$Conviction_Count, residuals(nb_model, type = "deviance"))
Interpretation of the residual plot. The x-axis is Conviction_Count, the y-axis shows deviance residuals from the negative binomial model (nb_model), and each point is one county-year observation. The goal is to check for patterns or structure that might signal misfit (e.g., nonlinearity).
What looks good: residuals are mostly centered around 0, as expected; there is no obvious curved pattern, which suggests the linearity assumption on the log scale may be acceptable; and there is no major funnel shape (increasing spread) that would suggest heteroskedasticity.
What to note: there is a large mass of zeros (Conviction_Count = 0), which is expected if most counties had no convictions in some years. The residual spread appears fairly even at non-zero counts, but those points are sparse, so nonlinearity there is harder to detect.
Conclusion: no major red flags in terms of functional form. However, because so many values of Conviction_Count are zero or near zero, the model's estimate of that effect rests mostly on a few non-zero data points. A spline, or categorized conviction counts (e.g., 0 vs. 1–5 vs. 6+), could be considered if theory supports a threshold or as a robustness check.
# Running a spline model as a comparison
library(splines)
nb_spline <- glm.nb(Overdose_Deaths ~ ns(Conviction_Count, df = 3) + Population, data = combined_df)
## Warning in ns(Conviction_Count, df = 3): shoving 'interior' knots matching
## boundary knots to inside
summary(nb_spline)
##
## Call:
## glm.nb(formula = Overdose_Deaths ~ ns(Conviction_Count, df = 3) +
## Population, data = combined_df, init.theta = 5.084799512,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.725e+00 8.562e-02 43.511 <2e-16 ***
## ns(Conviction_Count, df = 3)1 1.660e-01 3.048e-01 0.545 0.5860
## ns(Conviction_Count, df = 3)2 4.685e-01 2.040e-01 2.297 0.0216 *
## ns(Conviction_Count, df = 3)3 7.225e-01 3.852e-01 1.875 0.0607 .
## Population 2.306e-06 1.557e-07 14.808 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(5.0848) family taken to be 1)
##
## Null deviance: 310.22 on 125 degrees of freedom
## Residual deviance: 130.65 on 121 degrees of freedom
## AIC: 1356.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 5.085
## Std. Err.: 0.661
##
## 2 x log-likelihood: -1344.921
Interpreting the spline model. The warning "shoving 'interior' knots matching boundary knots to inside" simply means that the spline's internal knots were originally placed at the boundaries of the Conviction_Count values; because splines need interior knots inside the range (not at the edges), R adjusted them slightly. No action is needed unless the distribution of Conviction_Count is extremely skewed or has very few unique values.
The model fits a natural spline for Conviction_Count with df = 3 (three degrees of freedom: flexible but not too wiggly), and Population remains highly significant (p < .001), as expected. Among the spline terms, the second is statistically significant (p = 0.0216) and the third is marginal (p = 0.0607).
Together, this suggests that the relationship between Conviction_Count and Overdose_Deaths is not strictly linear; there may be thresholds or bends in the relationship.
Metric | Model Type | Value |
---|---|---|
AIC | Spline NB model | 1356.9 |
AIC | Linear NB model (no spline) | 1353.0 |
The spline model fits slightly worse by AIC, though only by about 4 points, so the added flexibility does not clearly buy a better fit.
There may be some non-linearity in the conviction–overdose relationship, but it is not strong enough for the spline to decisively outperform the linear version.
If the goal is interpretability and parsimony, the simpler linear model is appropriate; if the goal is exploratory insight, plotting the spline fit can help reveal where the non-linear patterns lie, as sketched below.
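A sketch of that plot: predict from the spline model over a grid of conviction counts, holding population at its mean, and draw the fitted curve.
# Visualize the fitted spline: predicted deaths across conviction counts,
# with population held at its mean
pred_grid <- data.frame(
  Conviction_Count = seq(min(combined_df$Conviction_Count, na.rm = TRUE),
                         max(combined_df$Conviction_Count, na.rm = TRUE),
                         length.out = 100),
  Population = mean(combined_df$Population, na.rm = TRUE)
)
pred_grid$Predicted_Deaths <- predict(nb_spline, newdata = pred_grid, type = "response")
ggplot(pred_grid, aes(x = Conviction_Count, y = Predicted_Deaths)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +
  labs(title = "Predicted Overdose Deaths Across Conviction Counts (Spline Fit)",
       x = "Conviction Count", y = "Predicted Overdose Deaths") +
  theme_minimal()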
Why the independence assumption may be violated: overdose trends within a county are correlated over time (counties with high deaths one year tend to have high deaths the next year). This autocorrelation, or intra-group correlation, violates the independence assumption. The consequences are that standard errors may be underestimated and p-values may be too small, making results look more significant than they are. One remedy is cluster-robust standard errors, which adjust the standard errors to account for repeated measures within counties.
library(sandwich)
library(lmtest)
coeftest(nb_model, vcov = vcovCL, cluster = ~County)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.7219e+00 1.9067e-01 19.5203 < 2.2e-16 ***
## Conviction_Count 3.4499e-02 1.7353e-02 1.9881 0.0468 *
## Population 2.3046e-06 3.4481e-07 6.6837 2.329e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of the negative binomial model with cluster-robust standard errors:
Outcome variable: Overdose_Deaths, modeled as a count. Predictors: Conviction_Count (the key independent variable) and Population (a control for county size).
Variable | Estimate | p-value | Interpretation |
---|---|---|---|
Conviction_Count | 0.0345 | 0.0468 | Statistically significant (p < 0.05). Each additional conviction is associated with a ~3.5% increase in expected overdose deaths, holding population constant (exp(0.0345) ≈ 1.035). |
Population | 2.3046e-06 | <0.001 | Highly significant. Larger populations are associated with higher expected overdose deaths, as expected. |
Intercept | 3.7219 | <0.001 | Baseline log-count of overdose deaths when other predictors are zero (not meaningful in isolation). |
Note that clustering the standard errors by county widens the uncertainty: the Conviction_Count p-value rises from 0.007 to 0.047 but remains significant at the 5% level.
Why cluster the standard errors by county? In this dataset each county appears multiple times, once for each year. Most regression models (like glm.nb) assume independence of observations; when that assumption is violated, standard errors are too small and p-values overstate significance. Clustering standard errors at the county level tells the model that observations within a county may be correlated and adjusts the standard errors accordingly, while still treating different counties as independent.
An analogy: imagine testing students in a classroom. Their scores are likely to be more similar to each other than to students in another class (same teacher, same materials), so treating all the scores as fully independent would be misleading. The same idea applies here, with counties as the "clusters."
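To carry the clustered uncertainty through to the rate-ratio scale, a minimal sketch using normal-approximation intervals built from the clustered standard errors:
# Cluster-robust 95% confidence intervals on the IRR scale
cl_vcov <- vcovCL(nb_model, cluster = ~County)
cl_se <- sqrt(diag(cl_vcov))
est <- coef(nb_model)
exp(cbind(IRR = est,
          Lower = est - 1.96 * cl_se,
          Upper = est + 1.96 * cl_se))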