Project 3 Data 101

Analyzing the Relationship Between Opioid Settlement Funding and Treatment Commitments

A. Introduction

Does the total amount of opioid settlement funds received by a state (in millions) significantly predict the total dollar amount committed to Treatment and Recovery services? This project utilizes the Opioid Settlement Expenditures data set (updated December 2024), which tracks how billions of dollars from legal settlements are being allocated across the United States. The data set contains 51 cases, representing all 50 states and the District of Columbia. While the original data includes dozens of variables regarding prevention, harm reduction, and administrative costs, this analysis focuses on two primary quantitative variables: funds_millions (the total settlement funds received by a state, scaled to millions) and treatment_numeric (the total dollar amount committed to Treatment and Recovery). The data was sourced from the KFF Health News Opioid Settlement Tracker.

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(broom)

# Step 4: Importing 
file_path <- "opioid-settlement-expenditures-downloadable-data-121224-3.xlsx"
raw_data <- read_excel(file_path, sheet = "Summary Data", col_names = FALSE)

## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
## • `` -> `...25`
## • `` -> `...26`
## • `` -> `...27`
## • `` -> `...28`
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
## • `` -> `...32`
## • `` -> `...33`
## • `` -> `...34`
## • `` -> `...35`
## • `` -> `...36`
## • `` -> `...37`
## • `` -> `...38`
## • `` -> `...39`
## • `` -> `...40`
## • `` -> `...41`

# Manipulation (Using 3+ dplyr functions)
analysis_df <- raw_data %>%
  select(state_name = 1, funds_raw = 3, treatment_raw = 5) %>% 
  filter(row_number() > 3) %>%                              
  mutate(                                                    
    # Cleaning: strip every character except digits and decimal points
    # so that as.numeric() does not emit "NAs introduced by coercion" warnings
    funds_numeric = as.numeric(gsub("[^0-9.]", "", funds_raw)),
    treatment_numeric = as.numeric(gsub("[^0-9.]", "", treatment_raw)),
    funds_millions = funds_numeric / 1000000
  ) %>%
  # dplyr: filtering (Removes any rows that turned into NAs or are empty)
  filter(!is.na(treatment_numeric), !is.na(funds_millions))
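
To make the cleaning step concrete, here is a minimal sketch of what the gsub() pattern does to a few made-up cell values. The strings in raw_values are hypothetical and were not taken from the spreadsheet. Note that the pattern would also strip a minus sign, and that text-only cells become empty strings, which as.numeric() turns into NA.

```r
# Hypothetical raw cells, invented for illustration only
raw_values <- c("$1,234,567.89", " $250,000 ", "N/A", "")

# Same pattern as in the pipeline: keep digits and decimal points only
stripped <- gsub("[^0-9.]", "", raw_values)
cleaned <- as.numeric(stripped)

# Text-only and empty cells end up as NA, which the final filter() removes
data.frame(raw = raw_values, stripped = stripped, cleaned = cleaned)
```

This is why the pipeline ends with a filter() on missing values: any row whose cell held no digits at all is dropped rather than kept as a zero.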

B. EDA Visualization

Exploratory Data Analysis (EDA) via the scatterplot reveals a positive, roughly linear relationship between the variables: as states receive larger total settlements, their dollar commitment to Treatment and Recovery tends to increase. However, the plot also highlights several outliers: states receiving massive settlements (over $300 million) that pull the regression line upward.

# EDA Visualization: Scatterplot for Regression prep
ggplot(analysis_df, aes(x = funds_millions, y = treatment_numeric)) +
  geom_point(color = "darkgreen", alpha = 0.6) +
  geom_smooth(method = "lm", color = "blue", se = TRUE) +
  labs(title = "Settlement Funding vs. Treatment Allocation",
       x = "Total Funds Received (Millions of $)",
       y = "Dollars Committed to Treatment & Recovery") +
  theme_minimal()

C. Regression Analysis

To address the research question, I performed a simple linear regression. Before modeling, the data was cleaned using positional selection to extract the State, Total Funds, and Treatment allocations, and string manipulation was used to strip currency symbols and commas so the values could be converted to numeric format. The fitted model uses 50 observations (48 residual degrees of freedom), so at least one of the 51 jurisdictions was dropped during cleaning.

# Regression Model
model1 <- lm(treatment_numeric ~ funds_millions, data = analysis_df)

# Model Summary with coefficients and p-values
summary(model1)

## 
## Call:
## lm(formula = treatment_numeric ~ funds_millions, data = analysis_df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -83574024 -23033344  -9267948  20427450 182836222 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    21592557    7153960   3.018  0.00406 ** 
## funds_millions   417595      90141   4.633 2.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42300000 on 48 degrees of freedom
## Multiple R-squared:  0.309,  Adjusted R-squared:  0.2946 
## F-statistic: 21.46 on 1 and 48 DF,  p-value: 2.782e-05

# Confidence Intervals
confint(model1)

##                    2.5 %     97.5 %
## (Intercept)    7208556.5 35976557.1
## funds_millions  236354.7   598836.2
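
The slope interval reported by confint() can be reproduced by hand as estimate ± t × standard error. The sketch below assumes only the rounded values printed in the summary output, so it should match the confint() result up to rounding.

```r
# Values copied from the printed summary(model1) output
estimate  <- 417595   # slope for funds_millions
std_error <- 90141    # its standard error
df_resid  <- 48       # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Lower and upper bounds of the interval
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

Small differences from the confint() output come from using the rounded estimate and standard error rather than the full-precision values stored in the model object.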

Interpretation:

The final model was constructed using the lm() function:

Predicted Treatment = 21,592,557 + 417,595 × (Funds in Millions)

Model Summary and Interpretation

Coefficients: The Intercept is $21,592,557. This represents the predicted treatment commitment for a state receiving zero settlement funds (a theoretical baseline). The Slope for funds_millions is $417,595.

Interpretation: For every additional one million dollars a state receives in settlement funds, the model predicts an average increase of $417,595 committed specifically to Treatment and Recovery services.

Statistical Significance: The p-value for the slope is 2.78 × 10^-5, which is far below the alpha = 0.05 threshold. This indicates that total settlement funding is a statistically significant predictor of treatment commitment.

Confidence Intervals: We are 95% confident that each additional $1 million in total funds is associated with an average increase in treatment commitment of between $236,355 and $598,836.

Model Fit: The Multiple R-squared is 0.309, meaning approximately 30.9% of the variation in state treatment spending can be explained by the total amount of settlement money received.
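
As a worked example of the fitted equation, the sketch below plugs a settlement size into the reported coefficients. The $100 million input is a hypothetical value chosen for illustration, not a state from the data. It yields a predicted commitment of about $63.4 million, and dividing the slope by one million shows that the model predicts roughly 42 cents of treatment commitment per additional settlement dollar.

```r
# Coefficients copied from the fitted model reported in this section
intercept <- 21592557
slope     <- 417595    # dollars of treatment per $1 million of funds

# Predicted treatment commitment for a given settlement size (in millions)
predict_treatment <- function(funds_millions) {
  intercept + slope * funds_millions
}

# Hypothetical state receiving $100 million (illustrative input only)
predict_treatment(100)

# Slope expressed per settlement dollar rather than per million
slope / 1e6
```

With an R-squared of about 0.31, individual states can sit far from this predicted value, so the number is best read as an average rather than a forecast for any one state.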

D. Model Assumptions and Diagnostics

# Diagnostic Plots
par(mfrow = c(2, 2))
plot(model1)

Based on the diagnostic plots, I evaluated the five core assumptions of linear regression:

Linearity: The Residuals vs Fitted plot shows a red line that is relatively flat and stays near zero, suggesting that a linear model is appropriate for this data.

Independence: I assumed independence as each state manages its budget and remediation strategy separately.

Homoscedasticity: In the Scale-Location plot, the spread of residuals increases as fitted values increase (a “fan” shape). This indicates some heteroscedasticity, likely driven by large-population states (observations 1, 9, and 46, the points labeled in the plots).

Normality of Residuals: The Normal Q-Q plot shows that most residuals follow the dashed line, but there is a noticeably heavy tail at the upper end. This suggests the residuals are not perfectly normal and are affected by the extreme values of high-funding states.

Multicollinearity: Not applicable as this model uses only one predictor.
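
The visual checks above can be complemented with numeric diagnostics. The sketch below is only an illustration: it fits a stand-in model on R's built-in mtcars data (not the settlement data), and the 4/n and 2p/n cutoffs are common rules of thumb rather than formal tests. The same calls could be applied to model1.

```r
# Stand-in model on R's built-in mtcars data (NOT the settlement data)
fit <- lm(mpg ~ wt, data = mtcars)
n <- nobs(fit)           # number of observations used in the fit
p <- length(coef(fit))   # number of estimated coefficients

# Influence: Cook's distance above 4/n is a common rule-of-thumb flag
cooks <- cooks.distance(fit)
influential <- names(cooks)[cooks > 4 / n]

# Leverage: hat values above 2p/n are often treated as high leverage
lev <- hatvalues(fit)
high_leverage <- names(lev)[lev > 2 * p / n]

# Normality of residuals: Shapiro-Wilk test as a complement to the Q-Q plot
normality <- shapiro.test(residuals(fit))

influential
high_leverage
normality$p.value
```

Listing the flagged row names next to the plot labels makes it easier to say which specific observations are driving the heteroscedasticity and heavy tail.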

E. Conclusion and Future Directions

The regression analysis finds a statistically significant positive relationship between settlement size and treatment allocation. With a p-value of 2.78 × 10^-5, we reject the null hypothesis that the slope is zero. However, the R-squared value of 0.309 suggests that while funding size matters, it explains only about a third of the story. Other factors, such as a state’s existing infrastructure, political priorities, or overdose rates, likely account for much of the remaining 69% of spending variance.

Limitations and Future Research: A major limitation is the influence of high-leverage outliers (points 1 and 46), which represent states with the largest settlements. For future research, I plan to use a logarithmic transformation of the funds to handle this skewness, or to include “State Population” as a second predictor to see whether the relationship holds when adjusting for state size.
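
A minimal sketch of the proposed log transformation follows. It uses simulated, right-skewed data as a stand-in for the settlement variables, so the numbers it produces say nothing about the real states. The data frame sim and its generating parameters are assumptions made for illustration.

```r
set.seed(101)

# Simulated, right-skewed data standing in for the settlement variables
n <- 50
sim <- data.frame(funds_millions = rlnorm(n, meanlog = 4, sdlog = 1))
sim$treatment_numeric <- exp(12 + 1 * log(sim$funds_millions) + rnorm(n, sd = 0.3))

# Log-log model: the slope is read as an elasticity (percent change in
# treatment dollars per one percent change in funds)
log_model <- lm(log(treatment_numeric) ~ log(funds_millions), data = sim)
coef(log_model)

# Compare the spread of the raw and logged predictor
summary(sim$funds_millions)
summary(log(sim$funds_millions))
```

On the real data, the same formula could be applied to analysis_df, and a population column (not present in the current data frame) could be added as a second term on the right-hand side of the formula.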

F. References

KFF Health News. (2024). Opioid Settlement Expenditures Data.

Arinze Ugbah

2026-04-30