## Strict Liability Conviction Data: 1988 – April 2025

library(readxl)
Convictions <- read_excel("Strict Liability spread sheet.xlsx")
# Convert Date of Sentence to Date class
Convictions$Sentence.Date <- as.Date(Convictions$`Date of Sentence`)

# Extract 4-digit year
Convictions$Sentence.Year <- as.numeric(format(Convictions$Sentence.Date, "%Y"))
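
A quick sanity check on the parsed dates can catch silent failures (my addition; Excel date columns sometimes import as text or serial numbers and end up NA):

# How many sentence dates failed to parse, and what range do the valid ones cover?
sum(is.na(Convictions$Sentence.Date))
range(Convictions$Sentence.Date, na.rm = TRUE)
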
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# Count convictions by year
convictions_by_year <- Convictions %>%
  filter(!is.na(Sentence.Year)) %>%
  group_by(Sentence.Year) %>%
  summarise(Conviction.Count = n()) %>%
  arrange(Sentence.Year)

# Line plot
ggplot(convictions_by_year, aes(x = Sentence.Year, y = Conviction.Count)) +
  geom_line(color = "darkblue", size = 1.2) +
  geom_point(size = 2) +
  geom_text(aes(label = Conviction.Count), vjust = -0.7, size = 3.5) +  # ⬅️ adds counts above points
  scale_x_continuous(breaks = seq(min(convictions_by_year$Sentence.Year, na.rm = TRUE),
                                  max(convictions_by_year$Sentence.Year, na.rm = TRUE),
                                  by = 2)) +
  labs(title = "Number of DID Convictions by Year",
       x = "Year",
       y = "Number of Convictions") +
  theme_minimal()

library(dplyr)
library(knitr)

# Sex summary
sex_summary <- Convictions %>%
  count(Sex) %>%
  mutate(Percent = paste0(round(n / sum(n) * 100, 1), "%"))

kable(sex_summary, col.names = c("Sex", "Count", "Percent"),
      caption = "Table: Sex Distribution", align = "lrr")
Table: Sex Distribution
Sex Count Percent
Female 32 8.3%
Male 353 91.7%
ggplot(sex_summary, aes(x = reorder(Sex, n), y = n, fill = Sex)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5) +
  labs(title = "Sex Distribution", x = "Sex", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

sex_summary <- Convictions %>%
  count(Sex) %>%
  mutate(Percent = round(n / sum(n) * 100, 1),
         Label = paste0(Sex, "\n", Percent, "%"))

ggplot(sex_summary, aes(x = "", y = n, fill = Sex)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 4) +
  labs(title = "Pie Chart: Sex Distribution") +
  theme_void()

# Race summary
race_summary <- Convictions %>%
  count(Race) %>%
  mutate(Percent = paste0(round(n / sum(n) * 100, 1), "%"))

kable(race_summary, col.names = c("Race", "Count", "Percent"),
      caption = "Table: Race Distribution", align = "lrr")
Table: Race Distribution
Race Count Percent
Black 108 28.1%
Hispanic, Black 2 0.5%
Hispanic, Unspfd BL/WH 10 2.6%
Hispanic, White 20 5.2%
N/A 200 51.9%
White 45 11.7%
# Ethnicity summary
ethnicity_summary <- Convictions %>%
  count(Ethnicity) %>%
  mutate(Percent = paste0(round(n / sum(n) * 100, 1), "%"))

kable(ethnicity_summary, col.names = c("Ethnicity", "Count", "Percent"),
      caption = "Table: Ethnicity Distribution", align = "lrr")
Table: Ethnicity Distribution
Ethnicity Count Percent
Black 14 3.6%
Hispanic 44 11.4%
NA 32 8.3%
Not Hispanic 232 60.3%
Unknown 21 5.5%
White 42 10.9%
#bar chart race
ggplot(race_summary, aes(x = reorder(Race, n), y = n, fill = Race)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5) +
  labs(title = "Race Distribution", x = "Race", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

#piechart race
race_summary <- Convictions %>%
  count(Race) %>%
  mutate(
    Percent = round(n / sum(n) * 100, 1),
    Label = paste0(Percent, "%")  # <-- just percent!
  )

library(ggplot2)

ggplot(race_summary, aes(x = "", y = n, fill = Race)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 4) +
  labs(title = "Pie Chart: Race Distribution") +
  theme_void()

#barchart ethnicity
ggplot(ethnicity_summary, aes(x = reorder(Ethnicity, n), y = n, fill = Ethnicity)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5) +
  labs(title = "Ethnicity Distribution", x = "Ethnicity", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

1. Chi-squared test across all races

chisq.test(race_summary$n)
## 
##  Chi-squared test for given probabilities
## 
## data:  race_summary$n
## X-squared = 459.57, df = 5, p-value < 2.2e-16

Chi-squared Test of Convictions by Race: A chi-squared test of equal proportions was conducted to assess whether convictions were equally distributed across racial groups. The result was statistically significant, X²(5) = 459.57, p < .001. This suggests that convictions are not equally distributed among racial groups: some races are overrepresented or underrepresented in conviction counts compared to what would be expected under equal proportions.

  2. Compare White vs Non-White conviction counts
# Split counts into White vs Non-White
white <- sum(race_summary$n[race_summary$Race == "White"])
non_white <- sum(race_summary$n[race_summary$Race != "White"])

# Combine into a named vector
conviction_counts <- c(White = white, NonWhite = non_white)

# Run the chi-squared test
chisq.test(conviction_counts, p = c(0.5, 0.5))
## 
##  Chi-squared test for given probabilities
## 
## data:  conviction_counts
## X-squared = 226.04, df = 1, p-value < 2.2e-16

Chi-squared Test: White vs Non-White Convictions

A Chi-squared goodness-of-fit test was conducted to examine whether convictions were equally distributed between White and Non-White individuals. The results were:

X²(1) = 226.04, p < .001

This indicates a statistically significant difference, suggesting that Non-White individuals are convicted at disproportionately higher rates than White individuals. One caveat: the 200 records with race recorded as "N/A" are counted as Non-White here, so this comparison should be interpreted cautiously.

### Table 3: County Distribution

county_counts <- table(Convictions$County)
county_percent <- round(prop.table(county_counts) * 100, 1)

county_summary <- data.frame(
  County = names(county_counts),
  Count = as.vector(county_counts),
  Percent = paste0(county_percent, "%")
)

kable(county_summary,
      col.names = c("County", "Count", "Percent"),
      align = "lrr",
      caption = "Table 3: County Distribution among DID Convictions 1980-2020")
Table 3: County Distribution among DID Convictions 1980-2020
County Count Percent
Atlantic 66 17.1%
Bergen 5 1.3%
Burlington 56 14.5%
Camden 9 2.3%
Cape May 68 17.7%
Cumberland 2 0.5%
Essex 8 2.1%
Gloucester 18 4.7%
Hudson 6 1.6%
Hunterdon 9 2.3%
Mercer 6 1.6%
Middlesex 34 8.8%
Monmouth 13 3.4%
Morris 20 5.2%
Ocean 26 6.8%
Passaic 16 4.2%
Somerset 9 2.3%
Sussex 5 1.3%
Union 5 1.3%
Warren 4 1%
#Bar Chart
ggplot(county_summary, aes(x = reorder(County, Count), y = Count, fill = County)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Count), vjust = -0.5, size = 3) +
  labs(title = "County Distribution among DID Convictions 1988-4/2025",
       x = "County", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

# Sentencing Disparities by Race

## Maximum Sentences by Race

# Function to convert "X Years Y Months Z Days" to numeric years
parse_term_to_years <- function(term) {
  # Extract numbers (if missing, set to 0)
  years  <- as.numeric(stringr::str_extract(term, "\\d+(?=\\s*Years)"))
  months <- as.numeric(stringr::str_extract(term, "\\d+(?=\\s*Months)"))
  days   <- as.numeric(stringr::str_extract(term, "\\d+(?=\\s*Days)"))
  
  years[is.na(years)] <- 0
  months[is.na(months)] <- 0
  days[is.na(days)] <- 0
  
  # Convert all to years
  total_years <- years + months / 12 + days / 365.25
  return(total_years)
}
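
A quick spot check of the parser on a few made-up strings (my addition; these example terms are hypothetical, not rows from the spreadsheet):

# Expected: 5, 2.5, and about 0.082 years respectively
parse_term_to_years(c("5 Years", "2 Years 6 Months", "30 Days"))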

library(stringr)

# Convert text sentence terms to numeric values
Convictions$Min.Years <- parse_term_to_years(Convictions$`Min Offense Term`)
Convictions$Max.Years <- parse_term_to_years(Convictions$`Max Offense Term`)
library(dplyr)

sentence_by_race <- Convictions %>%
  filter(!is.na(Max.Years), !is.na(Race)) %>%
  group_by(Race) %>%
  summarise(
    Count = n(),
    Avg_Max_Sentence = round(mean(Max.Years), 2),
    Median_Max_Sentence = round(median(Max.Years), 2)
  ) %>%
  arrange(desc(Avg_Max_Sentence))

library(knitr)

kable(sentence_by_race,
      col.names = c("Race", "Count", "Avg. Max Sentence (Years)", "Median Max Sentence (Years)"),
      caption = "Table: Summary of Maximum Sentence Length by Race",
      align = "lrrr")
Table: Summary of Maximum Sentence Length by Race
Race Count Avg. Max Sentence (Years) Median Max Sentence (Years)
Hispanic, Unspfd BL/WH 10 10.40 11.0
Hispanic, White 20 10.00 10.5
Black 108 9.47 9.5
White 45 8.73 7.0
N/A 200 7.89 7.0
Hispanic, Black 2 7.00 7.0
library(ggplot2)

ggplot(Convictions, aes(x = Race, y = Max.Years, fill = Race)) +
  geom_boxplot() +
  labs(title = "Max Sentence Length by Race",
       x = "Race", y = "Sentence Length (Years)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

What is clear here is that black and brown people are receiving longer maximum sentences; a quick formal check is sketched below.
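
A minimal sketch (my addition), using the Max.Years and Race columns created above; rows with missing values are dropped automatically:

# Kruskal-Wallis test: do maximum sentence lengths differ across race groups?
kruskal.test(Max.Years ~ Race, data = Convictions)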

## Minimum Sentences by Race

Convictions$Min.Years <- parse_term_to_years(Convictions$`Min Offense Term`)
library(dplyr)

min_sentence_by_race <- Convictions %>%
  filter(!is.na(Min.Years), !is.na(Race)) %>%
  group_by(Race) %>%
  summarise(
    Count = n(),
    Avg_Min_Sentence = round(mean(Min.Years, na.rm = TRUE), 2),
    Median_Min_Sentence = round(median(Min.Years, na.rm = TRUE), 2)
  ) %>%
  arrange(desc(Avg_Min_Sentence))

library(knitr)

kable(min_sentence_by_race,
      col.names = c("Race", "Count", "Avg. Min Sentence (Years)", "Median Min Sentence (Years)"),
      caption = "Table: Summary of Minimum Sentence Length by Race",
      align = "lrrr")
Table: Summary of Minimum Sentence Length by Race
Race Count Avg. Min Sentence (Years) Median Min Sentence (Years)
Black 108 7.83 8.08
Hispanic, Unspfd BL/WH 10 7.14 8.50
Hispanic, White 20 6.81 6.00
N/A 200 6.64 5.95
White 45 1.72 0.00
Hispanic, Black 2 0.00 0.00
#boxplot
library(ggplot2)

ggplot(Convictions, aes(x = Race, y = Min.Years, fill = Race)) +
  geom_boxplot() +
  labs(title = "Minimum Sentence Lengths by Race",
       x = "Race", y = "Minimum Sentence (Years)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

And white people are receiving shorter minimum sentences. The caveat is that this dataset doesn't tell us what else these people were charged with, so one could argue that those with longer maximum sentences simply carried additional charges. We could ask the folks who hold the data for more detail on these individuals (namely, what else they were charged with). The older dataset (through 2019) does include other charges, but when I counted the DID convictions in this dataset from 1988-2019 there were 188, which is far more than the 61 observations in the other dataset; even the student's prior cleaned and merged data (data_df) only found 66. The remaining option is to go through each offense type and eyeball it for other ways DID might have been written, so we can find the other ~120 cases (a rough pattern-matching sketch is below), or we simply request more data.
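
One way to make that eyeballing more systematic is to pattern-match the offense text for likely spellings of the charge. A rough sketch (my addition); `older_data` and its `Offense` column are hypothetical placeholders for the older dataset's actual names:

library(stringr)

# Flag offense descriptions that look like drug-induced death charges,
# catching common variants ("drug induced death", "drug-induced", "DID", statute "2C:35-9")
did_pattern <- regex("drug[- ]?induced|\\bDID\\b|2C:35-9", ignore_case = TRUE)

possible_did <- older_data %>%
  filter(str_detect(Offense, did_pattern)) %>%
  count(Offense, sort = TRUE)

possible_did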

Combining Race & Ethnicity

I just want to show you how messy this dataset is…

# Combine Race and Ethnicity into one grouping column
Convictions <- Convictions %>%
  mutate(Race_Ethnicity = paste(Race, Ethnicity, sep = " - "))

library(dplyr)

race_ethnicity_summary <- Convictions %>%
  filter(!is.na(Race_Ethnicity)) %>%
  count(Race_Ethnicity) %>%
  mutate(
    Percent = round(n / sum(n) * 100, 1),
    Label = paste0(Percent, "%")
  )

library(knitr)

kable(race_ethnicity_summary,
      col.names = c("Race + Ethnicity", "Count", "Percent of Charges", "Label"),
      caption = "Table: Distribution of Charges by Combined Race and Ethnicity",
      align = "lrrr")
Table: Distribution of Charges by Combined Race and Ethnicity
Race + Ethnicity Count Percent of Charges Label
Black - Black 7 1.8 1.8%
Black - Hispanic 3 0.8 0.8%
Black - NA 8 2.1 2.1%
Black - Not Hispanic 77 20.0 20%
Black - Unknown 13 3.4 3.4%
Hispanic, Black - Hispanic 2 0.5 0.5%
Hispanic, Unspfd BL/WH - Hispanic 7 1.8 1.8%
Hispanic, Unspfd BL/WH - Not Hispanic 3 0.8 0.8%
Hispanic, White - Hispanic 20 5.2 5.2%
N/A - Black 6 1.6 1.6%
N/A - Hispanic 12 3.1 3.1%
N/A - NA 15 3.9 3.9%
N/A - Not Hispanic 146 37.9 37.9%
N/A - Unknown 8 2.1 2.1%
N/A - White 13 3.4 3.4%
White - Black 1 0.3 0.3%
White - NA 9 2.3 2.3%
White - Not Hispanic 6 1.6 1.6%
White - White 29 7.5 7.5%

# Arrests 2017-2024

library(readxl)
Arrest <- read_excel("NJ DIH Arrest data 2017-2024.xlsx")

library(dplyr)

arrests_by_year <- Arrest %>%
  group_by(`Arrest Year`) %>%
  summarise(Count = n()) %>%
  arrange(`Arrest Year`)

library(ggplot2)

ggplot(arrests_by_year, aes(x = `Arrest Year`, y = Count)) +
  geom_line(group = 1, color = "steelblue", linewidth = 1.2) +
  geom_point(size = 2) +
  geom_text(aes(label = Count), vjust = -0.5) +
  labs(title = "Drug-Induced Arrests by Year",
       x = "Year", y = "Arrest Count") +
  theme_minimal()

library(dplyr)
library(knitr)

race_summary <- Arrest %>%
  count(Race) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

kable(race_summary,
      caption = "Table: Arrests by Race",
      col.names = c("Race", "Count", "Percent"),
      align = "lrr")
Table: Arrests by Race
Race Count Percent
Black 187 50.1
Not Provided 7 1.9
Unknown 4 1.1
White (incl. Hispanic origin) 175 46.9
library(ggplot2)

ggplot(race_summary, aes(x = reorder(Race, n), y = n, fill = Race)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.3) +
  labs(title = "Arrests by Race", x = "Race", y = "Count") +
  theme_minimal() +
  coord_flip() +
  theme(legend.position = "none")

ggplot(race_summary, aes(x = "", y = n, fill = Race)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percent, "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Pie Chart: Arrests by Race") +
  theme_void()

What I find odd is that the arrest data has very few "Unknown" race values, yet there is a large number of unknowns (N/A) for race in the conviction data.

This also suggests to me that there may be many arrests of black folks where there is not enough evidence to convict…

chisq.test(race_summary$n)
## 
##  Chi-squared test for given probabilities
## 
## data:  race_summary$n
## X-squared = 331.12, df = 3, p-value < 2.2e-16

A Chi-squared test was conducted to evaluate whether arrest counts differ significantly by race. The test was highly significant, X²(3) = 331.12, p < 0.001, indicating that arrest frequencies are not equally distributed among racial groups.

### Statistical Test: Chi-squared Test for Arrest Counts by Race

ggplot(race_summary, aes(x = reorder(Race, n), y = n, fill = Race)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.3) +
  labs(
    title = "Arrests by Race",
    subtitle = paste0("Chi-squared Test: X² = 331.1, p < 0.001"),
    x = "Race", y = "Count"
  ) +
  theme_minimal() +
  coord_flip() +
  theme(legend.position = "none")
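
To see which groups drive that chi-squared result, the test object's standardized residuals can be inspected (my addition; large positive values mean more arrests than expected under equal proportions):

race_chisq <- chisq.test(race_summary$n)

# Standardized residuals labeled by race: positive = overrepresented relative to equal shares
setNames(round(race_chisq$stdres, 2), race_summary$Race)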

## Comparing Whites to Non-Whites

# Split counts into White vs Non-White; note the arrest data labels the category
# "White (incl. Hispanic origin)", so match on the substring rather than "White" exactly
white <- sum(race_summary$n[grepl("White", race_summary$Race)])
non_white <- sum(race_summary$n[!grepl("White", race_summary$Race)])

# Make a vector of counts
arrest_counts <- c(White = white, NonWhite = non_white)

# Run a chi-squared test (goodness of fit)
chisq.test(arrest_counts, p = c(0.5, 0.5))  # Assuming equal expected
## 
##  Chi-squared test for given probabilities
## 
## data:  arrest_counts
## X-squared = 373, df = 1, p-value < 2.2e-16

Are Non-White Individuals Arrested More Often Than White Individuals?

A Chi-squared goodness-of-fit test was run to compare arrest counts between White and Non-White individuals, assuming an equal distribution of arrests. A caution on the output above: X²(1) = 373 equals the total number of arrests because an exact match on "White" finds zero rows (the arrest data labels the category "White (incl. Hispanic origin)"). With that label matched, the split is 175 White vs. 198 Non-White, which is much closer to even, so this test should be re-run on the corrected split before drawing conclusions about disproportionate arrests.

I could try to do more here and control for the population size of each racial group; a sketch of how is below.
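
A sketch of what that could look like: supply population shares as the expected proportions in the goodness-of-fit test. The shares below are placeholders (hypothetical, not actual census figures), ordered to match race_summary$Race:

# Hypothetical population shares for: Black, Not Provided, Unknown, White (incl. Hispanic origin)
# Replace with real NJ census proportions before interpreting the result
pop_share <- c(0.15, 0.01, 0.01, 0.83)

chisq.test(race_summary$n, p = pop_share)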

gender_summary <- Arrest %>%
  count(Gender) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

kable(gender_summary,
      caption = "Table: Arrests by Gender",
      col.names = c("Gender", "Count", "Percent"),
      align = "lrr")
Table: Arrests by Gender
Gender Count Percent
Female 73 19.6
Male 300 80.4
ggplot(gender_summary, aes(x = reorder(Gender, n), y = n, fill = Gender)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.3) +
  labs(title = "Arrests by Gender", x = "Gender", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

ggplot(gender_summary, aes(x = "", y = n, fill = Gender)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percent, "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Pie Chart: Arrests by Gender") +
  theme_void()

age_summary <- Arrest %>%
  count(`Age Range`) %>%
  mutate(Percent = round(n / sum(n) * 100, 1)) %>%
  arrange(desc(n))

kable(age_summary,
      caption = "Table: Arrests by Age Range",
      col.names = c("Age Range", "Count", "Percent"),
      align = "lrr")
Table: Arrests by Age Range
Age Range Count Percent
25-34 161 43.2
35-44 115 30.8
18-24 39 10.5
45-54 36 9.7
55-64 16 4.3
65-74 5 1.3
75+ 1 0.3
ggplot(age_summary, aes(x = reorder(`Age Range`, n), y = n, fill = `Age Range`)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.3) +
  labs(title = "Arrests by Age Range", x = "Age Range", y = "Count") +
  theme_minimal() +
  coord_flip() +
  theme(legend.position = "none")

ggplot(age_summary, aes(x = "", y = n, fill = `Age Range`)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percent, "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Pie Chart: Arrests by Age Range") +
  theme_void()

county_summary <- Arrest %>%
  count(County) %>%
  mutate(Percent = round(n / sum(n) * 100, 1)) %>%
  arrange(desc(n))

kable(county_summary,
      caption = "Table: Arrests by County",
      col.names = c("County", "Count", "Percent"),
      align = "lrr")
Table: Arrests by County
County Count Percent
Atlantic 57 15.3
Burlington 56 15.0
Cape May 47 12.6
Ocean 37 9.9
Middlesex 31 8.3
Statewide or Regional LEA 23 6.2
Bergen 22 5.9
Monmouth 21 5.6
Morris 20 5.4
Gloucester 14 3.8
Passaic 13 3.5
Essex 9 2.4
Cumberland 5 1.3
Hunterdon 4 1.1
Camden 3 0.8
Mercer 3 0.8
Salem 2 0.5
Union 2 0.5
Warren 2 0.5
Hudson 1 0.3
Somerset 1 0.3
ggplot(county_summary, aes(x = reorder(County, n), y = n, fill = County)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.3) +
  labs(title = "Arrests by County", x = "County", y = "Count") +
  theme_minimal() +
  coord_flip() +
  theme(legend.position = "none")

If it is useful we could look at race by county to see if there are any county-level racial disparities; a quick sketch is below.
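
A sketch of that cross-tab (my addition), using the same Arrest columns as above:

library(tidyr)

# Arrest counts by county and race, one column per race group
race_by_county <- Arrest %>%
  count(County, Race) %>%
  pivot_wider(names_from = Race, values_from = n, values_fill = 0)

kable(race_by_county, caption = "Table: Arrests by County and Race")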

# Overdoses

Overdose <- read.csv("~/Library/CloudStorage/Dropbox/1-Research/New Jersey DIH Data/Underlying Cause of Death, 2018-2023, Single Race.csv", stringsAsFactors=TRUE)
library(dplyr)
library(ggplot2)

# Summarize total deaths by year
deaths_by_year <- Overdose %>%
  group_by(Year) %>%
  summarise(Total_Deaths = sum(Deaths, na.rm = TRUE)) %>%
  arrange(Year)

# Line chart of total deaths over time
ggplot(deaths_by_year, aes(x = Year, y = Total_Deaths)) +
  geom_line(color = "darkred", size = 1.2) +
  geom_point(size = 2) +
  geom_text(aes(label = Total_Deaths), vjust = -0.5, size = 3) +
  labs(title = "Total Overdose Deaths by Year",
       x = "Year", y = "Number of Deaths") +
  theme_minimal()

# Summarize deaths by county and year
county_deaths_by_year <- Overdose %>%
  group_by(County, Year) %>%
  summarise(Total_Deaths = sum(Deaths, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
# Multi-line chart: different color per county
ggplot(county_deaths_by_year, aes(x = Year, y = Total_Deaths, color = County)) +
  geom_line(linewidth = 1) +
  labs(title = "Overdose Deaths by County Over Time",
       x = "Year", y = "Number of Deaths", color = "County") +
  theme_minimal() +
  theme(legend.position = "right",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 8))

# Summarize deaths by county and year
county_crude_by_year <- Overdose %>%
  group_by(County, Year) %>%
  summarise(Crude_Deaths = sum(`Crude.Rate`, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
# Multi-line chart: different color per county
ggplot(county_crude_by_year, aes(x = Year, y = Crude_Deaths, color = County)) +
  geom_line(linewidth = 1) +
  labs(title = "Overdose Crude Rate by County Over Time",
       x = "Year", y = "Crude Rate", color = "County") +
  theme_minimal() +
  theme(legend.position = "right",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 8))

library(dplyr)
library(ggplot2)

# 1. Summarize arrests by year
arrest_by_year <- Arrest %>%
  group_by(`Arrest Year`) %>%
  summarise(Count = n()) %>%
  mutate(Type = "Arrests") %>%
  rename(Year = `Arrest Year`)

# 2. Summarize overdose deaths by year
overdose_by_year <- Overdose %>%
  group_by(Year) %>%
  summarise(Count = sum(Deaths, na.rm = TRUE)) %>%
  mutate(Type = "Overdose Deaths")

# 3. Combine into one dataset
combined_trend <- bind_rows(arrest_by_year, overdose_by_year)

# 4. Plot both trends
ggplot(combined_trend, aes(x = Year, y = Count, color = Type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(title = "Arrests vs. Overdose Deaths Over Time",
       x = "Year",
       y = "Count",
       color = "Data Type") +
  scale_color_manual(values = c("Arrests" = "steelblue", "Overdose Deaths" = "firebrick")) +
  theme_minimal()

library(dplyr)
library(ggplot2)

# 1. Summarize arrests by year
arrest_by_year <- Arrest %>%
  group_by(`Arrest Year`) %>%
  summarise(Count = n()) %>%
  mutate(Type = "Arrests") %>%
  rename(Year = `Arrest Year`)

# 2. Summarize overdose deaths by year
crude_by_year <- Overdose %>%
  group_by(Year) %>%
  summarise(Count = sum(`Crude.Rate`, na.rm = TRUE)) %>%
  mutate(Type = "Crude Overdose Death Rate")

# 3. Combine into one dataset
combined_trend <- bind_rows(arrest_by_year, crude_by_year)

# 4. Plot both trends
ggplot(combined_trend, aes(x = Year, y = Count, color = Type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(title = "Arrests vs. Crude Overdose Deaths Over Time",
       x = "Year",
       y = "Count",
       color = "Data Type") +
  scale_color_manual(values = c("Arrests" = "steelblue", "Crude Overdose Death Rate" = "firebrick")) +
  theme_minimal()

library(dplyr)
library(ggplot2)

# 1. Summarize convictions by year (filter to 2018–2024)
convictions_by_year <- Convictions %>%
  filter(!is.na(Sentence.Year), Sentence.Year >= 2018, Sentence.Year <= 2024) %>%
  group_by(Sentence.Year) %>%
  summarise(Count = n()) %>%
  mutate(Type = "Convictions") %>%
  rename(Year = Sentence.Year)

# 2. Summarize overdose deaths by year (no filter)
overdose_by_year <- Overdose %>%
  group_by(Year) %>%
  summarise(Count = sum(Deaths, na.rm = TRUE)) %>%
  mutate(Type = "Overdose Deaths")

# 3. Combine both datasets
combined_data <- bind_rows(convictions_by_year, overdose_by_year)

# 4. Plot line chart without text labels
ggplot(combined_data, aes(x = Year, y = Count, color = Type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(min(combined_data$Year),
                                  max(combined_data$Year),
                                  by = 1)) +
  scale_color_manual(values = c("Convictions" = "darkblue", "Overdose Deaths" = "firebrick")) +
  labs(title = "DID Convictions (2018–2024) vs. Overdose Deaths by Year",
       x = "Year",
       y = "Count",
       color = "Type") +
  theme_minimal()

library(dplyr)
library(ggplot2)

# 1. Convictions summarized by year (2018–2024)
convictions_by_year <- Convictions %>%
  filter(!is.na(Sentence.Year), Sentence.Year >= 2018, Sentence.Year <= 2024) %>%
  group_by(Sentence.Year) %>%
  summarise(Count = n()) %>%
  mutate(Type = "Convictions") %>%
  rename(Year = Sentence.Year)

# 2. Use Crude Rate for overdose data
overdose_rate_by_year <- Overdose %>%
  group_by(Year) %>%
  summarise(CrudeRate = mean(`Crude.Rate`, na.rm = TRUE)) %>%
  mutate(Type = "Overdose Crude Rate") %>%
  rename(Count = CrudeRate)

# 3. Combine data
combined_data <- bind_rows(convictions_by_year, overdose_rate_by_year)

# 4. Plot convictions vs crude rate
ggplot(combined_data, aes(x = Year, y = Count, color = Type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(min(combined_data$Year),
                                  max(combined_data$Year),
                                  by = 1)) +
  scale_color_manual(values = c("Convictions" = "darkblue",
                                "Overdose Crude Rate" = "firebrick")) +
  labs(title = "DID Convictions vs. Overdose Crude Rate (2018–2024)",
       x = "Year",
       y = "Count / Crude Rate",
       color = "Type") +
  theme_minimal()

library(dplyr)
library(ggplot2)
library(tidyr)

# 1. Get a population denominator per year (from Overdose data)
# Note: mean() gives the average county population, not the statewide total;
# sum(Population, na.rm = TRUE) would give the total NJ population per year
pop_by_year <- Overdose %>%
  group_by(Year) %>%
  summarise(Population = mean(Population, na.rm = TRUE))  # assuming 1 row per county per year

# 2. Arrests by year
arrests_by_year <- Arrest %>%
  group_by(`Arrest Year`) %>%
  summarise(Arrest_Count = n()) %>%
  rename(Year = `Arrest Year`)

# 3. Convictions by year
convictions_by_year <- Convictions %>%
  filter(!is.na(Sentence.Year)) %>%
  group_by(Sentence.Year) %>%
  summarise(Conviction_Count = n()) %>%
  rename(Year = Sentence.Year)

# 4. Overdose deaths by year
# Note: this sums the county crude rates, not the Deaths column, so "Overdose_Deaths"
# here is really a summed crude rate; sum(Deaths, na.rm = TRUE) would give actual death counts
overdose_by_year <- Overdose %>%
  group_by(Year) %>%
  summarise(Overdose_Deaths = sum(`Crude.Rate`, na.rm = TRUE))

# 5. Combine all into one dataset
combined_df <- pop_by_year %>%
  left_join(arrests_by_year, by = "Year") %>%
  left_join(convictions_by_year, by = "Year") %>%
  left_join(overdose_by_year, by = "Year") %>%
  filter(Year >= 2018, Year <= 2024)

# 6. Calculate crude rates
combined_df <- combined_df %>%
  mutate(
    Arrest_Rate = (Arrest_Count / Population) * 100000,
    Conviction_Rate = (Conviction_Count / Population) * 100000,
    Overdose_Rate = (Overdose_Deaths / Population) * 100000
  )

# 7. Pivot longer for plotting
plot_data <- combined_df %>%
  select(Year, Arrest_Rate, Conviction_Rate, Overdose_Rate) %>%
  pivot_longer(cols = -Year, names_to = "Type", values_to = "CrudeRate")

# 8. Plot all crude rates on one chart
ggplot(plot_data, aes(x = Year, y = CrudeRate, color = Type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(title = "Crude Rates per 100,000: Arrests, Convictions, and Overdose Deaths (2018–2024)",
       x = "Year",
       y = "Crude Rate per 100,000",
       color = "Metric") +
  scale_color_manual(values = c(
    "Arrest_Rate" = "steelblue",
    "Conviction_Rate" = "darkgreen",
    "Overdose_Rate" = "firebrick"
  )) +
  theme_minimal()

data <- combined_df %>%
  select(Year, Arrest_Count, Conviction_Count, Overdose_Deaths, Population) %>%
  mutate(Overdose_CrudeRate = (Overdose_Deaths / Population) * 100000)

# Convictions vs Overdose Crude Rate
ggplot(data, aes(x = Conviction_Count, y = Overdose_CrudeRate)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Convictions vs Overdose Crude Rate")
## `geom_smooth()` using formula = 'y ~ x'

# Arrests vs Overdose Crude Rate
ggplot(data, aes(x = Arrest_Count, y = Overdose_CrudeRate)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Arrests vs Overdose Crude Rate")
## `geom_smooth()` using formula = 'y ~ x'

cor.test(data$Conviction_Count, data$Overdose_CrudeRate, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  data$Conviction_Count and data$Overdose_CrudeRate
## t = -2.4715, df = 4, p-value = 0.06883
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.97427547  0.09264918
## sample estimates:
##        cor 
## -0.7773624
cor.test(data$Arrest_Count, data$Overdose_CrudeRate, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  data$Arrest_Count and data$Overdose_CrudeRate
## t = 2.884, df = 4, p-value = 0.04483
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03054822 0.97984773
## sample estimates:
##       cor 
## 0.8217372
# Prepare data with just counts (no crude rate)
data <- combined_df %>%
  select(Year, Arrest_Count, Conviction_Count, Overdose_Deaths)

# Convictions vs Overdose Deaths (Counts)
ggplot(data, aes(x = Conviction_Count, y = Overdose_Deaths)) +
  geom_point(color = "darkgreen", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Convictions vs Overdose Deaths (Raw Counts)",
    x = "Convictions",
    y = "Overdose Deaths"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Arrests vs Overdose Deaths (Counts)
ggplot(data, aes(x = Arrest_Count, y = Overdose_Deaths)) +
  geom_point(color = "steelblue", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Arrests vs Overdose Deaths (Raw Counts)",
    x = "Arrests",
    y = "Overdose Deaths"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

cor.test(data$Conviction_Count, data$Overdose_Deaths)
## 
##  Pearson's product-moment correlation
## 
## data:  data$Conviction_Count and data$Overdose_Deaths
## t = -2.287, df = 4, p-value = 0.08414
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9710784  0.1511317
## sample estimates:
##        cor 
## -0.7527572
  1. Correlation coefficient (r = -0.753): this indicates a negative correlation between convictions and overdose deaths; as convictions go up, overdose deaths tend to go down. The relationship is not statistically significant (p = 0.084), which is above the conventional 0.05 threshold.

  2. Confidence interval: [-0.97, 0.15]. The wide range reflects uncertainty due to the small sample size (only six yearly data points). The interval includes 0, which again means we can't rule out no relationship at all.

cor.test(data$Arrest_Count, data$Overdose_Deaths)
## 
##  Pearson's product-moment correlation
## 
## data:  data$Arrest_Count and data$Overdose_Deaths
## t = 2.0557, df = 4, p-value = 0.109
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2266366  0.9662552
## sample estimates:
##       cor 
## 0.7167575
  1. Correlation coefficient (r = 0.717): a moderate positive correlation, meaning that as arrests increase, overdose deaths tend to increase as well.

  2. p-value = 0.109: not statistically significant at the 0.05 level. It suggests that the observed correlation might be due to chance, especially given the small sample size.

  3. Confidence interval: [-0.227, 0.966]. The interval is very wide and crosses 0, meaning we cannot be confident that a positive (or any) relationship exists. This wide range is typical of small datasets.
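
With only six yearly points, a rank-based correlation is a cheap robustness check (my addition; it does not fix the sample-size problem, it just reduces the influence of any single year):

# Spearman rank correlations for the same pairs
cor.test(data$Conviction_Count, data$Overdose_Deaths, method = "spearman")
cor.test(data$Arrest_Count, data$Overdose_Deaths, method = "spearman")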

library(dplyr)

# 1. Convictions by county and year
convictions_by_county_year <- Convictions %>%
  filter(!is.na(Sentence.Year)) %>%
  group_by(County, Year = Sentence.Year) %>%
  summarise(Conviction_Count = n(), .groups = "drop")

# 2. Arrests by county and year
arrests_by_county_year <- Arrest %>%
  group_by(County, Year = `Arrest Year`) %>%
  summarise(Arrest_Count = n(), .groups = "drop")

# 3. Overdose deaths and population by county and year

# Standardize County names in Overdose dataframe
Overdose <- Overdose %>%
  mutate(County = gsub(" County, NJ", "", County))

overdose_by_county_year <- Overdose %>%
  group_by(County = County, Year) %>%
  summarise(
    Overdose_Deaths = sum(Deaths, na.rm = TRUE),
    Population = sum(Population, na.rm = TRUE),
    .groups = "drop"
  )



# 4. Combine all into one dataset
combined_df <- overdose_by_county_year %>%
  left_join(arrests_by_county_year, by = c("County", "Year")) %>%
  left_join(convictions_by_county_year, by = c("County", "Year"))

#Tell it NA = 0 for arrests and conviction counts
combined_df <- combined_df %>%
  mutate(
    Arrest_Count = ifelse(is.na(Arrest_Count), 0, Arrest_Count),
    Conviction_Count = ifelse(is.na(Conviction_Count), 0, Conviction_Count)
  )


# Optional: filter for years of interest
combined_df <- combined_df %>%
  filter(Year >= 2018, Year <= 2024)
library(ggplot2)

# Convictions vs Overdose Deaths
ggplot(combined_df, aes(x = Conviction_Count, y = Overdose_Deaths)) +
  geom_point(aes(color = County)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Convictions vs Overdose Deaths by County")
## `geom_smooth()` using formula = 'y ~ x'

# Arrests vs Overdose Deaths
ggplot(combined_df, aes(x = Arrest_Count, y = Overdose_Deaths)) +
  geom_point(aes(color = County)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Arrests vs Overdose Deaths by County")
## `geom_smooth()` using formula = 'y ~ x'

# Load the library (if needed)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Poisson regression: model overdose deaths as count
poisson_model1 <- glm(
  Overdose_Deaths ~ Conviction_Count + Population,
  family = poisson(link = "log"),
  data = combined_df
)

summary(poisson_model1)
## 
## Call:
## glm(formula = Overdose_Deaths ~ Conviction_Count + Population, 
##     family = poisson(link = "log"), data = combined_df)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.979e+00  1.909e-02  208.38   <2e-16 ***
## Conviction_Count 2.410e-02  2.408e-03   10.01   <2e-16 ***
## Population       1.822e-06  2.948e-08   61.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7045.9  on 125  degrees of freedom
## Residual deviance: 3118.2  on 123  degrees of freedom
## AIC: 3943.8
## 
## Number of Fisher Scoring iterations: 4
exp(coef(poisson_model1))
##      (Intercept) Conviction_Count       Population 
##        53.458623         1.024394         1.000002

Conviction Count IRR = 1.024: every additional conviction is associated with a 2.4% increase in expected overdose deaths, controlling for population, and the association is statistically significant (p < 0.001).

# Overdispersion test
overdispersion <- sum(residuals(poisson_model1, type = "pearson")^2) / df.residual(poisson_model1)
overdispersion
## [1] 25.82402
  1. Deviance reduction. Null deviance measures how well a model with no predictors (just an intercept) fits the data; residual deviance measures how well the model with predictors fits. Here the reduction is 7045.9 − 3118.2 = 3927.7, roughly a 56% drop in deviance, which is substantial: the predictors (Conviction_Count + Population) explain a large portion of the variation in overdose deaths.
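
That reduction can be computed directly from the fitted object (a quick sketch):

# Proportion of null deviance explained by the predictors (~0.56 here)
1 - poisson_model1$deviance / poisson_model1$null.deviance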

BUT the model is extremely overdispersed.

library(MASS)

nb_model <- glm.nb(
  Overdose_Deaths ~ Conviction_Count + Population,
  data = combined_df
)

summary(nb_model)
## 
## Call:
## glm.nb(formula = Overdose_Deaths ~ Conviction_Count + Population, 
##     data = combined_df, init.theta = 5.081236901, link = log)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.722e+00  8.348e-02  44.584  < 2e-16 ***
## Conviction_Count 3.450e-02  1.274e-02   2.707  0.00679 ** 
## Population       2.305e-06  1.554e-07  14.830  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.0812) family taken to be 1)
## 
##     Null deviance: 310.02  on 125  degrees of freedom
## Residual deviance: 130.63  on 123  degrees of freedom
## AIC: 1353
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.081 
##           Std. Err.:  0.660 
## 
##  2 x log-likelihood:  -1344.995

| Variable | Estimate | z-value | p-value | Interpretation |
|---|---|---|---|---|
| (Intercept) | 3.722 | 44.58 | < 0.001 *** | Baseline log count of deaths when predictors are 0. |
| Conviction_Count | 0.0345 | 2.71 | 0.0068 ** | Statistically significant. For each additional conviction, expected overdose deaths increase by ~3.5%, holding population constant. |
| Population | 0.000002305 | 14.83 | < 0.001 *** | Significant and positive. Larger population = more deaths, as expected. |

Residual deviance = 130.63 on 123 df → reasonably close to the degrees of freedom, suggesting a decent fit. Null deviance = 310.02 → much higher than the residual deviance, so the predictors explain quite a bit of variation. AIC = 1353 → useful for comparing with other models (lower is better). Theta (dispersion parameter) = 5.08 → the finite theta means the variance exceeds what a Poisson model assumes (variance = μ + μ²/θ), so the negative binomial was a better choice than the Poisson.

#Compare models
AIC(poisson_model1, nb_model)
##                df      AIC
## poisson_model1  3 3943.754
## nb_model        4 1352.995
# Calculate dispersion statistic for negative binomial model
overdispersion_stat <- sum(residuals(nb_model, type = "pearson")^2) / df.residual(nb_model)
overdispersion_stat
## [1] 1.028553

The dispersion statistic is ≈ 1 → no remaining overdispersion (good fit).

# Lagged model: Convictions -> Overdose

library(dplyr)

combined_df <- combined_df %>%
  arrange(County, Year) %>%
  group_by(County) %>%
  mutate(Lag_Convictions = lag(Conviction_Count, 1)) %>%
  ungroup()

library(MASS)

nb_lag_model <- glm.nb(Overdose_Deaths ~ Lag_Convictions + Population, data = combined_df)

summary(nb_lag_model)
## 
## Call:
## glm.nb(formula = Overdose_Deaths ~ Lag_Convictions + Population, 
##     data = combined_df, init.theta = 5.11883703, link = log)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     3.688e+00  9.139e-02  40.354  < 2e-16 ***
## Lag_Convictions 3.839e-02  1.472e-02   2.608  0.00912 ** 
## Population      2.350e-06  1.693e-07  13.880  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.1188) family taken to be 1)
## 
##     Null deviance: 266.95  on 104  degrees of freedom
## Residual deviance: 108.83  on 102  degrees of freedom
##   (21 observations deleted due to missingness)
## AIC: 1126.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.119 
##           Std. Err.:  0.729 
## 
##  2 x log-likelihood:  -1118.414

Lagged Convictions estimate: 0.03839. A 1-unit increase in convictions last year is associated with a ~3.9% increase in expected overdose deaths this year (exp(0.03839) ≈ 1.039). The p-value of 0.00912 is statistically significant at the 1% level, so the association is unlikely to be due to chance. Interpretation: more convictions in the previous year are associated with more overdose deaths in the current year, after adjusting for population. This might suggest that convictions are a response to underlying drug problems rather than a deterrent.

AIC = 1126.4: Use this to compare with other models (lower = better). Residual deviance (108.83) is substantially lower than null deviance (266.95), indicating improved fit. Theta = 5.12: suggests reasonable dispersion, appropriate for negative binomial (less overdispersion than Poisson).
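
For reporting, incidence rate ratios and confidence intervals can be pulled straight from the fitted model (a sketch; profiling the negative binomial with confint() can take a moment):

# IRRs with 95% confidence intervals for the lagged convictions model
exp(cbind(IRR = coef(nb_lag_model), confint(nb_lag_model)))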

# Model: Arrests -> Overdose

library(MASS)

nb_model2 <- glm.nb(
  Overdose_Deaths ~ Arrest_Count + Population,
  data = combined_df
)

summary(nb_model2)
## 
## Call:
## glm.nb(formula = Overdose_Deaths ~ Arrest_Count + Population, 
##     data = combined_df, init.theta = 4.930072535, link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.756e+00  8.293e-02   45.29   <2e-16 ***
## Arrest_Count 2.461e-02  1.183e-02    2.08   0.0375 *  
## Population   2.238e-06  1.575e-07   14.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.9301) family taken to be 1)
## 
##     Null deviance: 301.29  on 125  degrees of freedom
## Residual deviance: 130.72  on 123  degrees of freedom
## AIC: 1356.8
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.930 
##           Std. Err.:  0.639 
## 
##  2 x log-likelihood:  -1348.783

Interpretation of Arrest_Count: the exponentiated coefficient is exp(0.02461) ≈ 1.025, meaning each additional arrest is associated with a ~2.5% increase in expected overdose deaths, holding population constant. More arrests are not associated with fewer deaths; instead, the model suggests a positive association.

# Lagged model: Arrests -> Overdose

library(dplyr)

combined_df <- combined_df %>%
  arrange(County, Year) %>%
  group_by(County) %>%
  mutate(Lag_Arrest = lag(Arrest_Count, 1)) %>%
  ungroup()

library(MASS)

nb_lag_model2 <- glm.nb(Overdose_Deaths ~ Lag_Arrest + Population, data = combined_df)

summary(nb_lag_model2)
## 
## Call:
## glm.nb(formula = Overdose_Deaths ~ Lag_Arrest + Population, data = combined_df, 
##     init.theta = 4.987731906, link = log)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.720e+00  9.056e-02  41.083   <2e-16 ***
## Lag_Arrest  2.637e-02  1.235e-02   2.135   0.0328 *  
## Population  2.271e-06  1.711e-07  13.272   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.9877) family taken to be 1)
## 
##     Null deviance: 260.49  on 104  degrees of freedom
## Residual deviance: 108.91  on 102  degrees of freedom
##   (21 observations deleted due to missingness)
## AIC: 1129.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.988 
##           Std. Err.:  0.709 
## 
##  2 x log-likelihood:  -1121.137

Lag_Arrest estimate: 0.02637, exponentiated exp(0.02637) ≈ 1.027. Interpretation: each additional arrest in the previous year is associated with a ~2.7% increase in expected overdose deaths the next year, controlling for population. p-value: 0.0328 → statistically significant at the 5% level. This is consistent with the prior models: lagged enforcement actions (like arrests) do not appear to reduce overdose deaths; in fact, they are positively associated.

| Metric | Value | Interpretation |
|---|---|---|
| Null deviance | 260.49 | Fit of intercept-only model |
| Residual deviance | 108.91 | Fit after including predictors |
| AIC | 1129.1 | Slightly worse than the lagged convictions model (AIC = 1126.4) |
| Theta | 4.988 | Overdispersion handled well |

“A unit-level change in X is associated with a % change in Y, within that unit, over time, controlling for [covariates].”



Findings: The Relationship Between Drug Enforcement and Overdose Deaths (County-Level Analysis)

We analyzed the relationship between drug enforcement and overdose mortality at the county-year level using New Jersey data (2018–2023). Enforcement was measured using conviction counts and arrest counts, and overdose mortality was modeled using negative binomial regression to account for overdispersed count data. All models controlled for county population. We examined both contemporaneous (non-lagged) and lagged effects to explore temporal relationships.

Non-Lagged Models (Same-Year Effects)

  • Convictions: a 1-unit increase in convictions within the same year was associated with a ~3.5% increase in overdose deaths (IRR ≈ 1.035, p = 0.007). This association was statistically significant after adjusting for population size.
  • Arrests: a 1-unit increase in arrests within the same year was associated with a ~2.5% increase in overdose deaths (IRR ≈ 1.025, p = 0.038). Like convictions, this effect was statistically significant after accounting for population.

Interpretation: in both models, higher levels of enforcement activity in a given year were positively associated with overdose deaths in the same year. These associations likely reflect underlying county-level drug trends, where enforcement and mortality rise in parallel, rather than enforcement directly causing increased mortality.

Lagged Models (One-Year Delayed Effects)

  • Lagged convictions: a 1-unit increase in convictions from the previous year was associated with a ~3.9% increase in overdose deaths the following year (IRR ≈ 1.039, p = 0.009). This relationship remained significant after adjusting for county population.
  • Lagged arrests: a 1-unit increase in arrests from the previous year was associated with a ~2.7% increase in overdose deaths the following year (IRR ≈ 1.027, p = 0.033).

Interpretation: these lagged models suggest that increased enforcement activity in a given year does not lead to reductions in overdose mortality the following year. In fact, both arrests and convictions were positively associated with future deaths, which may reflect enforcement targeting areas with worsening drug problems, or structural conditions that drive both arrests and mortality.

General Considerations

  • These results are based on county-year aggregate data and do not imply causal effects at the individual level.
  • Positive associations between enforcement and overdose deaths likely reflect shared underlying conditions (e.g., high community-level drug use), rather than enforcement causing more deaths.
  • Crude associations do not account for potential confounding from unmeasured county-level characteristics (e.g., treatment access, fentanyl prevalence); one option, sketched below, is to add county fixed effects.

Policy Implications

These findings challenge assumptions that criminal legal enforcement reduces overdose mortality. Instead, they underscore the importance of exploring non-punitive, public-health-oriented interventions, such as harm reduction programs, expanded access to treatment and naloxone, and community-based support systems.
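
One partial guard against stable, unmeasured county-level differences is to add county (and year) fixed effects; a minimal sketch (my addition, not run here), which will be demanding with only about 21 counties and 6 years of data:

library(MASS)

# Negative binomial model with county and year fixed effects
nb_fe_model <- glm.nb(
  Overdose_Deaths ~ Conviction_Count + Population + factor(County) + factor(Year),
  data = combined_df
)

summary(nb_fe_model)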




Statistical Model Interpretation (County-Level Analysis)

We modeled overdose deaths across counties using a Negative Binomial regression, appropriate for count data with overdispersion. The predictors included Conviction Count, Arrest Count, and Population, all measured annually at the county level. Below are interpretations of the results from the models:


1. Convictions and Overdose Deaths (Non-Lagged Model)

glm.nb(Overdose_Deaths ~ Conviction_Count + Population, data = combined_df)
  • Conviction Count was significantly positively associated with overdose deaths (β = 0.0345, p = 0.0068).
  • Population was also a strong positive predictor (p < 0.001).
  • This suggests that counties with more convictions in a given year also tended to have more overdose deaths, after controlling for population size.

🧠 Note: This model reflects concurrent associations, not causality.


2. Convictions and Overdose Deaths (Lagged Model)

glm.nb(Overdose_Deaths ~ Lag_Convictions + Population, data = combined_df)
  • Lagged Conviction Count (previous year) remained significantly associated with overdose deaths (β = 0.0384, p = 0.0091).
  • This indicates that higher conviction counts in the previous year are associated with more overdose deaths in the current year.

📌 This strengthens the evidence that increased convictions may not reduce overdose deaths over time — and may be associated with increased deaths at the county level.


3. Arrests and Overdose Deaths (Non-Lagged Model)

glm.nb(Overdose_Deaths ~ Arrest_Count + Population, data = combined_df)
  • Arrest Count was positively associated with overdose deaths (β = 0.0246, p = 0.0375).
  • This again suggests that counties with more arrests also experienced more overdose deaths, controlling for population.

4. Arrests and Overdose Deaths (Lagged Model)

glm.nb(Overdose_Deaths ~ Lag_Arrest + Population, data = combined_df)
  • Lagged Arrest Count was also positively associated with overdose deaths one year later (β = 0.0264, p = 0.0328).
  • This suggests a possible delayed relationship, where higher arrest activity may not prevent future overdoses, and may even correlate with higher overdose risk the following year.

✅ Model Fit Notes

  • Negative Binomial was appropriate due to overdispersion (confirmed via residual deviance / df).
  • All models adjusted for county population, making the estimates relative to county size.
  • These results describe associations, not causality. It may reflect broader systemic or contextual factors (e.g., criminal justice response intensity and public health outcomes).

plot(combined_df$Conviction_Count, residuals(nb_model, type = "deviance"))

Interpretation of the residual plot. The x-axis is Conviction_Count, the y-axis is the deviance residuals from the negative binomial model (nb_model), and each point is one county-year observation. The goal is to check for patterns or structure that might signal a misfit (e.g., nonlinearity).

What looks good: residuals are mostly centered around 0, which is what we expect; there is no obvious curved pattern, which suggests the linearity assumption on the log scale is okay; and there is no major funnel shape (increasing spread) that would suggest heteroskedasticity.

What to note: there is a large mass of zeros (Conviction_Count = 0), which is expected if most counties had no convictions in some years. The residual spread appears fairly even at non-zero counts, but the data are sparse there (fewer points at higher counts), so detecting nonlinearity is harder.

Conclusion: no major red flags in terms of functional form. But because so many values of Conviction_Count are zero or near zero, the model's estimate of that effect rests mostly on a few non-zero data points. A spline, or categorizing conviction counts (e.g., 0 vs 1–5 vs 6+), could still be used to test robustness if theory supports a threshold; a sketch of the categorized version follows.
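
The categorization idea could look like this (my addition; the 0 / 1–5 / 6+ cut points are illustrative rather than theory-driven):

library(MASS)

# Robustness check: treat convictions as a coarse categorical predictor
combined_df$Conviction_Cat <- cut(
  combined_df$Conviction_Count,
  breaks = c(-Inf, 0, 5, Inf),
  labels = c("0", "1-5", "6+")
)

nb_cat_model <- glm.nb(Overdose_Deaths ~ Conviction_Cat + Population, data = combined_df)
summary(nb_cat_model)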

# Running a spline model as comparison

library(splines)

nb_spline <- glm.nb(Overdose_Deaths ~ ns(Conviction_Count, df = 3) + Population, data = combined_df)
## Warning in ns(Conviction_Count, df = 3): shoving 'interior' knots matching
## boundary knots to inside
summary(nb_spline)
## 
## Call:
## glm.nb(formula = Overdose_Deaths ~ ns(Conviction_Count, df = 3) + 
##     Population, data = combined_df, init.theta = 5.084799512, 
##     link = log)
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   3.725e+00  8.562e-02  43.511   <2e-16 ***
## ns(Conviction_Count, df = 3)1 1.660e-01  3.048e-01   0.545   0.5860    
## ns(Conviction_Count, df = 3)2 4.685e-01  2.040e-01   2.297   0.0216 *  
## ns(Conviction_Count, df = 3)3 7.225e-01  3.852e-01   1.875   0.0607 .  
## Population                    2.306e-06  1.557e-07  14.808   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.0848) family taken to be 1)
## 
##     Null deviance: 310.22  on 125  degrees of freedom
## Residual deviance: 130.65  on 121  degrees of freedom
## AIC: 1356.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.085 
##           Std. Err.:  0.661 
## 
##  2 x log-likelihood:  -1344.921



🔍 What the warning means

Warning: shoving ‘interior’ knots matching boundary knots to inside

This simply means that your spline’s internal knots were originally placed at the boundaries of your Conviction_Count values — but since splines need those interior knots inside the range (not at the edge), R adjusted them slightly.

No action needed unless the distribution of Conviction_Count is extremely skewed or has very few unique values.


📊 Model interpretation

Model type:

  • Negative binomial model (which handles overdispersion well)
  • Spline applied to Conviction_Count (df = 3 = 3 degrees of freedom = flexible but not too wiggly)

🔢 Coefficients:

  • Population remains highly significant (p < .001), as expected.

  • Spline terms:

    • The 2nd spline basis is statistically significant (p = 0.0216)
    • The 3rd term is marginally significant (p = 0.0607)
    • The 1st term is not significant.

Together, this suggests that the relationship between Conviction_Count and Overdose_Deaths is not strictly linear — there’s non-linearity, possibly thresholds or bends in the relationship.


📈 Fit / Model comparison

| Model | AIC |
|---|---|
| Spline NB model | 1356.9 |
| Linear NB model (no spline) | 1353.0 |

The spline model fits slightly worse by AIC (a difference of about 4), so the added flexibility does not improve the fit.


🧠 What this tells us

  • There may be non-linearity in the conviction-overdose relationship, but it’s not strong enough to decisively outperform the linear version.

  • You could justify sticking with the simpler linear model unless:

    • You want to capture local effects visually,
    • Or plan to explore higher df splines or more flexible models like GAMs.

👉 Recommendation

If your goal is interpretability and parsimony, the linear model is appropriate.

If you want exploratory insight, plotting the spline fit visually can help reveal where the non-linear patterns lie; a sketch follows.
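
A sketch of that plot (my addition): predict from the spline model over a grid of conviction counts, holding population at its median:

library(ggplot2)

# Predicted overdose deaths across conviction counts, population held at its median
pred_grid <- data.frame(
  Conviction_Count = seq(0, max(combined_df$Conviction_Count, na.rm = TRUE), by = 1),
  Population = median(combined_df$Population, na.rm = TRUE)
)
pred_grid$Predicted_Deaths <- predict(nb_spline, newdata = pred_grid, type = "response")

ggplot(pred_grid, aes(x = Conviction_Count, y = Predicted_Deaths)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +
  labs(title = "Spline Fit: Predicted Overdose Deaths by Conviction Count",
       x = "Conviction Count", y = "Predicted Overdose Deaths") +
  theme_minimal()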

  1. Independence of Observations

What it means: most regression models, including the negative binomial, assume that each observation is independent of the others. The model expects that the overdose count in Camden in 2012 doesn't "know" anything about Camden in 2013. But in panel data (data across counties over multiple years), that assumption is probably not true.

Why it might be violated: overdose trends within a county are correlated over time (counties with high deaths one year tend to have high deaths the next year). This creates autocorrelation, or intra-group correlation, which violates the independence assumption.

Consequences: standard errors may be underestimated, so p-values may be too small, making results look more significant than they are.

What you can do: use cluster-robust standard errors, which adjust the standard errors to account for repeated measures within counties.

library(sandwich)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
coeftest(nb_model, vcov = vcovCL, cluster = ~County)
## 
## z test of coefficients:
## 
##                    Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)      3.7219e+00 1.9067e-01 19.5203 < 2.2e-16 ***
## Conviction_Count 3.4499e-02 1.7353e-02  1.9881    0.0468 *  
## Population       2.3046e-06 3.4481e-07  6.6837 2.329e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



Model Summary (Negative Binomial Regression with Cluster-Robust Standard Errors)

Outcome Variable:

  • Overdose_Deaths — modeled as a count variable.

Predictors:

  • Conviction_Count (key independent variable)
  • Population (control for size of county/year)

Key Coefficients and Interpretation

| Variable | Estimate | p-value | Interpretation |
|---|---|---|---|
| Conviction_Count | 0.0345 | 0.0468 | Statistically significant (p < 0.05). Each additional conviction is associated with a ~3.5% increase in expected overdose deaths, holding population constant (exp(0.0345) ≈ 1.035). |
| Population | 2.3046e-06 | < 0.001 | Highly significant. Larger populations are associated with higher expected overdose deaths, as expected. |
| Intercept | 3.7219 | < 0.001 | Baseline log-count of overdose deaths when other predictors are zero (not meaningful in isolation). |

📌 Takeaways

  • The relationship between Conviction_Count and Overdose_Deaths remains positive and statistically significant, even after accounting for population size.
  • The effect size is moderate: every 10 additional convictions is associated with roughly a 41% increase in expected overdose deaths (exp(10 × 0.0345) ≈ 1.41).
  • This might suggest that convictions are either not serving as a deterrent or are correlated with underlying conditions (e.g., higher drug activity) that also lead to higher overdose deaths.



🔁 “Adjusts standard errors to account for repeated measures within counties” — What it Means

In your dataset, each county appears multiple times — once for each year. This means:

  • The same county contributes multiple observations (e.g., Essex County in 2018, 2019, 2020, etc.).
  • These observations aren’t truly independent — they share unobserved characteristics (e.g., socioeconomic conditions, health systems, law enforcement practices, etc.).

✅ Why It Matters

Most regression models (like glm.nb) assume independence of observations. If that assumption is violated:

  • Standard errors can be underestimated.
  • This increases the chance of false positives (Type I errors).

🛠️ What the Adjustment Does

By clustering standard errors at the county level, you’re telling the model:

“Observations within a county might be correlated — please adjust for that.”

Specifically, it:

  • Allows within-county correlation.
  • Makes the standard errors more robust and conservative.
  • Helps ensure valid inference (e.g., p-values and confidence intervals are more accurate).

📊 Analogy

Imagine testing students in a classroom. Their scores are likely to be more similar to each other than to students in another class (due to same teacher, same materials, etc.). Treating all student scores as totally independent would be misleading.

Same idea here — counties are your “clusters.”
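
Beyond cluster-robust standard errors, another common way to handle repeated measures within counties is a random-intercept (mixed-effects) negative binomial model; a minimal sketch (my addition), assuming the lme4 package is installed:

library(lme4)

# Negative binomial model with a random intercept for each county;
# scaling Population helps the optimizer converge
nb_mixed <- glmer.nb(
  Overdose_Deaths ~ Conviction_Count + scale(Population) + (1 | County),
  data = combined_df
)

summary(nb_mixed)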