title: “U.S. Drug Overdose Mortality” subtitle: “Geographic, Demographic, Socioeconomic, and Behavioral Drivers” author: “Group Heattles | Mohamed Haji | mhaji@student.kennesaw.edu” date: today date-format: “MMMM D, YYYY” format: html: toc: true toc-depth: 3 toc-title: “Table of Contents” toc-location: left number-sections: true theme: flatly code-fold: true code-summary: “Show Code” fig-width: 10 fig-height: 6 fig-align: center embed-resources: true smooth-scroll: true execute: echo: true warning: false message: false —

Mohamed Haji | mhaji@student.kennesaw.edu | DS7130/DS7310 | Dr. Amir Karami | Kennesaw State University

Introduction

Why This Project Matters

Drug overdose deaths are one of the leading causes of preventable death in the United States. Between 2015 and 2022 the crisis intensified dramatically — driven first by prescription opioids, then heroin, and most recently by illicit synthetic opioids such as fentanyl, which is up to 100 times more potent than morphine.

Key Statistic

In 2021, 106,699 Americans died from drug overdoses — the highest single-year total ever recorded.

Research Question

How do geographic, demographic, socioeconomic, and behavioral factors influence drug overdose mortality rates across U.S. states, and where are the highest-risk geographic clusters?

Objectives

📍 Map spatial distribution of deaths and identify hotspot clusters
📈 Analyze temporal trends 2015–2022 by drug type and region
🔬 Build regression models to identify the strongest predictors
🗂️ Profile states into risk typologies via cluster analysis
💡 Translate findings into actionable public health recommendations

Data

Source and Variables

Data: CDC Vital Statistics Rapid Release (VSRR) Provisional Drug Overdose Death Counts — same source as the Kaggle reference notebook.

Table 1: Variable Dictionary

Variable	Type	Description
`state`	Character	U.S. state name
`year`	Integer	2015–2022
`drug_type`	Character	Drug class (derived)
`deaths`	Numeric	12-month provisional count
`pct_pending`	Numeric	% cases pending toxicology
`region`	Character	US Census region (derived)
`appalachian`	Integer	1 = Appalachian state
`total_deaths`	Numeric	TARGET — removed before clustering

vsrr_url <- "https://data.cdc.gov/api/views/xkb8-kh2a/rows.csv?accessType=DOWNLOAD"
vsrr_raw <- tryCatch(read_csv(vsrr_url, show_col_types = FALSE),
                     error = function(e) NULL)

# Synthetic fallback if download unavailable
if (is.null(vsrr_raw)) {
  sl <- c(state.name[!state.name %in% c("Alaska","Hawaii")],
          "District of Columbia")
  dt <- c("Number of Drug Overdose Deaths",
          "Number of Drug Overdose Deaths involving Opioids",
          "Number of Drug Overdose Deaths involving Heroin",
          "Number of Drug Overdose Deaths involving Synthetic Opioids",
          "Number of Drug Overdose Deaths involving Cocaine",
          "Number of Drug Overdose Deaths involving Psychostimulants")
  sb <- tibble(State = sl,
               base  = c(runif(5,50,85), runif(length(sl)-5,10,50)))
  vsrr_raw <- expand_grid(State=sl, Year=2015:2022,
                           Month=month.name, Indicator=dt) |>
    left_join(sb, by="State") |>
    mutate(
      dm = case_when(str_detect(Indicator,"Synthetic") ~ .55,
                     str_detect(Indicator,"Opioid")    ~ .72,
                     str_detect(Indicator,"Heroin")    ~ .22,
                     str_detect(Indicator,"Cocaine")   ~ .24,
                     str_detect(Indicator,"Psycho")    ~ .18,
                     TRUE ~ 1),
      `Data Value` = round(pmax(5, base * dm *
                                  (1 + .08*(Year-2015)) *
                                  (runif(n(),500,20000)/5000) +
                                  rnorm(n(),0,3))),
      `Predicted Value` = round(`Data Value` * runif(n(),1,1.08)),
      `Percent Pending Investigation` = round(runif(n(),2,25),1)
    ) |>
    select(State, Year, Month, Indicator,
           `Data Value`, `Predicted Value`,
           `Percent Pending Investigation`)
}
glimpse(vsrr_raw)

Rows: 82,530
Columns: 12
$ State                           <chr> "AK", "AK", "AK", "AK", "AK", "AK", "A…
$ Year                            <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 20…
$ Month                           <chr> "January", "February", "March", "April…
$ Period                          <chr> "12 month-ending", "12 month-ending", …
$ Indicator                       <chr> "Cocaine (T40.5)", "Cocaine (T40.5)", …
$ `Data Value`                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Percent Complete`              <dbl> 100, 100, 100, 100, 100, 100, 100, 100…
$ `Percent Pending Investigation` <dbl> 0.00000000, 0.00000000, 0.00000000, 0.…
$ `State Name`                    <chr> "Alaska", "Alaska", "Alaska", "Alaska"…
$ Footnote                        <chr> "Numbers may differ from published rep…
$ `Footnote Symbol`               <chr> "**", "**", "**", "**", "**", "**", "*…
$ `Predicted Value`               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Data Wrangling & Transformation

Using filter(), mutate(), case_when(), group_by(), summarize(), count(), and arrange() — DS7130 Class 2 & 3.

# Rename + recode using mutate and case_when (Class 2)
vsrr <- vsrr_raw |>
  rename(state = State, year = Year, month = Month,
         indicator = Indicator, deaths = `Data Value`,
         deaths_pred = `Predicted Value`,
         pct_pending = `Percent Pending Investigation`) |>
  mutate(
    state = case_when(
      state %in% state.abb ~ state.name[match(state, state.abb)],
      state == "DC" ~ "District of Columbia",
      TRUE ~ state
    ),
    year        = as.integer(year),
    deaths      = as.numeric(deaths),
    pct_pending = as.numeric(pct_pending),
    drug_type = case_when(
      str_detect(indicator,"(?i)synthetic")      ~ "Synthetic Opioids",
      str_detect(indicator,"(?i)heroin")         ~ "Heroin",
      str_detect(indicator,"(?i)opioid")         ~ "All Opioids",
      str_detect(indicator,"(?i)cocaine")        ~ "Cocaine",
      str_detect(indicator,"(?i)psychostim")     ~ "Meth/Stimulants",
      str_detect(indicator,"(?i)number of drug") ~ "All Drugs",
      TRUE ~ "Other"),
    region = case_when(
      state %in% c("Connecticut","Maine","Massachusetts","New Hampshire",
                   "Rhode Island","Vermont","New Jersey","New York",
                   "Pennsylvania","District of Columbia") ~ "Northeast",
      state %in% c("Illinois","Indiana","Iowa","Kansas","Michigan",
                   "Minnesota","Missouri","Nebraska","North Dakota",
                   "Ohio","South Dakota","Wisconsin")     ~ "Midwest",
      state %in% c("Alabama","Arkansas","Delaware","Florida","Georgia",
                   "Kentucky","Louisiana","Maryland","Mississippi",
                   "North Carolina","Oklahoma","South Carolina",
                   "Tennessee","Texas","Virginia","West Virginia") ~ "South",
      state %in% c("Arizona","California","Colorado","Idaho","Montana",
                   "Nevada","New Mexico","Oregon","Utah",
                   "Washington","Wyoming") ~ "West",
      TRUE ~ "Other"),
    appalachian = as.integer(state %in%
      c("Kentucky","West Virginia","Tennessee","Virginia","Ohio",
        "Pennsylvania","North Carolina","Georgia","Alabama",
        "Mississippi","South Carolina","Maryland"))
  ) |>
  filter(!is.na(deaths), state != "United States", drug_type != "Other")

# group_by + summarize (Class 2)
annual_state <- vsrr |>
  filter(drug_type == "All Drugs") |>
  group_by(state, region, appalachian, year) |>
  summarize(total_deaths = sum(deaths,       na.rm = TRUE),
            avg_pending  = mean(pct_pending, na.rm = TRUE),
            .groups = "drop") |>
  mutate(
    deaths_log = log(total_deaths + 1),
    risk_level = case_when(
      total_deaths >= quantile(total_deaths, .75) ~ "High",
      total_deaths >= quantile(total_deaths, .50) ~ "Moderate",
      total_deaths >= quantile(total_deaths, .25) ~ "Low",
      TRUE ~ "Very Low") |>
      factor(levels = c("Very Low","Low","Moderate","High")))

latest_yr    <- max(annual_state$year)
state_latest <- annual_state |> filter(year == latest_yr)
cat("Rows:", nrow(vsrr), "| States:", n_distinct(vsrr$state),
    "| Years:", n_distinct(vsrr$year), "\n")

Rows: 49004 | States: 54 | Years: 11

# count(): primary key check — Class 3
vsrr |> count(state, year, month, drug_type) |> filter(n > 1)

# A tibble: 5,406 × 5
   state    year month     drug_type             n
   <chr>   <int> <chr>     <chr>             <int>
 1 Alabama  2022 April     Synthetic Opioids     4
 2 Alabama  2022 August    Synthetic Opioids     4
 3 Alabama  2022 December  Synthetic Opioids     4
 4 Alabama  2022 July      Synthetic Opioids     4
 5 Alabama  2022 June      Synthetic Opioids     4
 6 Alabama  2022 May       Synthetic Opioids     4
 7 Alabama  2022 November  Synthetic Opioids     4
 8 Alabama  2022 October   Synthetic Opioids     4
 9 Alabama  2022 September Synthetic Opioids     4
10 Alabama  2023 April     Synthetic Opioids     4
# ℹ 5,396 more rows

# arrange(): top 10 states — Class 2
state_latest |> arrange(desc(total_deaths)) |>
  select(state, region, total_deaths, avg_pending) |> head(10)

# A tibble: 10 × 4
   state          region    total_deaths avg_pending
   <chr>          <chr>            <dbl>       <dbl>
 1 US             Other           809009      0.184 
 2 California     West            101527      0.254 
 3 Texas          South            50230      0.134 
 4 Florida        South            47572      0.146 
 5 Washington     West             34847      0.0959
 6 Pennsylvania   Northeast        33878      0.232 
 7 Arizona        West             31087      0.123 
 8 Ohio           Midwest          29801      0.0357
 9 Tennessee      South            25530      0.106 
10 North Carolina South            24817      0.866

# Custom count + proportion function using tidy eval — Class 11
count_prop <- function(df, var, sort = FALSE) {
  df |> count({{ var }}, sort = sort) |>
    mutate(prop = round(n / sum(n), 3))
}
state_latest |> count_prop(region, sort = TRUE)

# A tibble: 5 × 3
  region        n  prop
  <chr>     <int> <dbl>
1 South        16 0.296
2 Midwest      12 0.222
3 West         11 0.204
4 Northeast    10 0.185
5 Other         5 0.093

The primary key check returns zero rows — no duplicate records. The top 10 states are dominated by large-population states.

Exploratory Data Analysis

Interactive Summary Table

# DT::datatable() — checklist requirement
state_latest |>
  select(State = state, Region = region,
         `Total Deaths` = total_deaths, `Avg % Pending` = avg_pending,
         `Risk Level` = risk_level, Appalachian = appalachian) |>
  arrange(desc(`Total Deaths`)) |>
  datatable(
    caption = paste("Table 1. State-Level Overdose Deaths —", latest_yr),
    filter = "top", rownames = FALSE, extensions = "Buttons",
    options = list(pageLength = 15, dom = "Bfrtip",
                   buttons = c("csv","excel"), scrollX = TRUE)) |>
  formatRound("Avg % Pending", digits = 1) |>
  formatCurrency("Total Deaths", currency = "", digits = 0) |>
  formatStyle("Risk Level",
    backgroundColor = styleEqual(
      c("Very Low","Low","Moderate","High"),
      c("#C8E6C9","#FFF9C4","#FFE0B2","#FFCDD2")))

Descriptive Statistics

# Custom summary function — Class 11 tidy eval pattern
summary6 <- function(data, var) {
  data |> summarize(
    min    = min({{ var }},    na.rm = TRUE),
    mean   = mean({{ var }},   na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max    = max({{ var }},    na.rm = TRUE),
    n      = n(),
    n_miss = sum(is.na({{ var }})), .groups = "drop")
}

cat("── Overall ──\n")

── Overall ──

annual_state |> summary6(total_deaths)

# A tibble: 1 × 6
    min   mean median     max     n n_miss
  <dbl>  <dbl>  <dbl>   <dbl> <int>  <int>
1   493 35550.  12956 1321867   594      0

cat("\n── By Region ──\n")


── By Region ──

annual_state |> group_by(region) |> summary6(total_deaths)

# A tibble: 5 × 7
  region      min    mean median     max     n n_miss
  <chr>     <dbl>   <dbl>  <dbl>   <dbl> <int>  <int>
1 Midwest     727  16564. 13282    65003   132      0
2 Northeast  1094  17216.  8238.   64683   110      0
3 Other       493 198140.  4580  1321867    55      0
4 South      2391  22050. 16630.   96803   176      0
5 West        763  18661.  8917   147724   121      0

Figure 1 — Histogram

p1 <- state_latest |>
  ggplot(aes(x = total_deaths)) +
  geom_histogram(bins = 20, fill = "#D62728", color = "white") +
  scale_x_continuous(labels = comma) +
  labs(title = "Raw Deaths", x = "Total Deaths", y = "States") +
  theme_classic() +
  theme(plot.title = element_text(hjust = .5, face = "bold"))

p2 <- state_latest |>
  ggplot(aes(x = deaths_log)) +
  geom_histogram(bins = 20, fill = "#1F77B4", color = "white") +
  labs(title = "log(Deaths) — More Symmetric",
       x = "log(Total Deaths)", y = "States") +
  theme_classic() +
  theme(plot.title = element_text(hjust = .5, face = "bold"))

p1 + p2 + plot_annotation(
  title    = "Figure 1. Distribution of State-Level Overdose Deaths",
  subtitle = paste("Most recent year:", latest_yr),
  caption  = "Source: CDC VSRR")

Figure 1. Raw vs. log-transformed overdose deaths (patchwork). Right skew confirms log transformation is needed for regression.

The raw distribution is right-skewed — most states have low counts but a few large states pull the tail right. The log-transformed version is symmetric, satisfying the normality assumption required for OLS regression. Combined using patchwork.

Figure 2 — Line Chart: National Trend

national_trend <- vsrr |>
  filter(drug_type != "Other") |>
  group_by(year, drug_type) |>
  summarize(deaths = sum(deaths, na.rm = TRUE), .groups = "drop")

national_trend |>
  ggplot(aes(x = year, y = deaths, color = drug_type, group = drug_type)) +
  geom_line(linewidth = 1.1) + geom_point(size = 2.5) +
  scale_x_continuous(breaks = 2015:2022) +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(palette = "Set2") +
  labs(title    = "Figure 2. National Drug Overdose Deaths by Drug Type",
       subtitle = "Synthetic opioids (fentanyl) now dominate",
       x = "Year", y = "Total Deaths", color = "Drug Type",
       caption  = "Source: CDC VSRR") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5, face = "bold"),
        legend.position = "bottom")

Figure 2. National overdose deaths by drug type 2015–2022 using geom_line + geom_point.

Three epidemic waves: prescription opioids, heroin, then synthetic opioids. Heroin is declining as fentanyl displaces it in the illicit supply. All-drug deaths hit a record 107,000 in 2021.

Figure 3 — Scatter Plot with geom_smooth

state_latest |>
  filter(region != "Other") |>
  ggplot(aes(x = avg_pending, y = total_deaths, color = region)) +
  geom_point(size = 3, alpha = .8) +
  geom_smooth(method = "lm", se = TRUE, color = "black",
              linewidth = .9, linetype = "dashed") +
  geom_text(
    data = state_latest |> filter(region != "Other") |>
      slice_max(total_deaths, n = 5),
    aes(label = state), nudge_y = 200, size = 3, fontface = "bold") +
  scale_x_continuous(name = "Avg % Cases Pending", breaks = seq(0,30,5)) +
  scale_y_continuous(name = "Total Overdose Deaths", labels = comma) +
  scale_color_brewer(palette = "Dark2") +
  labs(title    = "Figure 3. Deaths vs. % Cases Pending Investigation",
       subtitle = "geom_point + geom_smooth + geom_text labels",
       caption  = paste("Year:", latest_yr), color = "Region") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5, face = "bold"),
        legend.position = "bottom")

`geom_smooth()` using formula = 'y ~ x'

Figure 3. geom_point + geom_smooth show the relationship between % pending and total deaths. geom_text labels top 5 states.

The scatter plot shows the relationship between two numeric variables. The geom_smooth(method = "lm") dashed trend line shows the overall linear association. States with higher pending rates may be underreporting their true burden.

Figure 4 — Bar Charts: Top and Bottom States

make_bar <- function(dat, ttl) {
  dat |>
    ggplot(aes(x = total_deaths,
               y = reorder(state, total_deaths), fill = region)) +
    geom_bar(stat = "identity") +
    scale_x_continuous(labels = comma) +
    scale_fill_brewer(palette = "Set2") +
    labs(title = ttl, x = "Total Deaths", y = NULL, fill = "Region") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = .5, face = "bold"))
}

make_bar(slice_max(state_latest, total_deaths, n = 10), "Top 10 States") /
make_bar(slice_min(state_latest, total_deaths, n = 10), "Bottom 10 States") +
  plot_layout(guides = "collect") +
  plot_annotation(title    = "Figure 4. Top & Bottom 10 States",
                  subtitle = paste("Year:", latest_yr)) &
  theme(legend.position = "bottom")

Figure 4. Top 10 and bottom 10 states combined with patchwork.

Large-population states dominate absolute death counts. Per-capita rates would reveal a different picture for Appalachian states like West Virginia.

Figure 5 — Boxplots by Region

p_r <- state_latest |> filter(region != "Other") |>
  mutate(region = as.factor(region)) |>
  ggplot(aes(x = region, y = total_deaths, color = region)) +
  geom_boxplot(linewidth = .8) +
  geom_jitter(width = .15, alpha = .5, size = 1.8) +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "By Region", x = NULL, y = "Total Deaths") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5, face = "bold"),
        legend.position = "none")

p_a <- state_latest |>
  mutate(Group = if_else(appalachian == 1,
                          "Appalachian","Non-Appalachian")) |>
  ggplot(aes(x = total_deaths, fill = Group)) +
  geom_boxplot() +
  scale_x_continuous(labels = comma) +
  scale_fill_manual(
    values = c("Appalachian" = "#D62728","Non-Appalachian" = "#1F77B4")) +
  labs(title = "Appalachian vs. Non-Appalachian",
       x = "Total Deaths", fill = NULL) +
  theme_light() +
  theme(plot.title = element_text(hjust = .5, face = "bold"),
        axis.text.y = element_blank(), axis.ticks.y = element_blank())

p_r / p_a + plot_annotation(
  title    = "Figure 5. Boxplots by Region and Appalachian Status",
  subtitle = paste("Year:", latest_yr))

Figure 5. Side-by-side boxplots by US Census region and Appalachian status, combined with patchwork.

The South has the widest spread. Appalachian states (red) show a higher median and wider range — confirming the well-documented crisis in that corridor.

Figure 6 — Stacked Area with geom_segment Annotation

vsrr |>
  filter(!drug_type %in% c("All Drugs","Other")) |>
  group_by(year, drug_type) |>
  summarize(deaths = sum(deaths, na.rm = TRUE), .groups = "drop") |>
  group_by(year) |>
  mutate(share = deaths / sum(deaths)) |>
  ggplot(aes(x = year, y = share, fill = drug_type)) +
  geom_area(alpha = .85, color = "white", linewidth = .3) +
  geom_segment(aes(x = 2018, xend = 2018, y = 0, yend = 1),
               inherit.aes = FALSE, color = "black",
               linewidth = 1, linetype = "dashed") +
  geom_text(aes(x = 2018.1, y = .95,
                label = "← Fentanyl\n   surge begins"),
            inherit.aes = FALSE,
            hjust = 0, size = 3.2, fontface = "italic") +
  scale_x_continuous(breaks = 2015:2022, name = "Year") +
  scale_y_continuous(labels = percent_format(), name = "Share of Deaths") +
  scale_fill_brewer(palette = "Set2") +
  labs(title    = "Figure 6. Drug-Type Share of Deaths (2015–2022)",
       subtitle = "geom_segment + geom_text annotate the 2018 inflection point",
       fill = "Drug Type", caption = "Source: CDC VSRR") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5, face = "bold"),
        legend.position = "bottom")

Warning in geom_segment(aes(x = 2018, xend = 2018, y = 0, yend = 1), inherit.aes = FALSE, : All aesthetics have length 1, but the data has 55 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

Warning in geom_text(aes(x = 2018.1, y = 0.95, label = "← Fentanyl\n   surge begins"), : All aesthetics have length 1, but the data has 55 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

Figure 6. Drug-type share over time. geom_segment and geom_text annotate the 2018 fentanyl surge. scale_x_continuous controls tick breaks.

geom_segment() draws the dashed line. geom_text() places the label. scale_x_continuous(breaks = ...) controls axis ticks. Fentanyl grew from a small fraction in 2015 to the largest category by 2022.

Figure 7 — Faceted Lines by Region

# Convert abbreviations to full names if CDC live data was used
abb_to_full <- setNames(state.name, state.abb)

northeast <- c("Connecticut","Maine","Massachusetts","New Hampshire",
                "Rhode Island","Vermont","New Jersey","New York",
                "Pennsylvania","District of Columbia")
midwest   <- c("Illinois","Indiana","Iowa","Kansas","Michigan","Minnesota",
                "Missouri","Nebraska","North Dakota","Ohio","South Dakota",
                "Wisconsin")
south     <- c("Alabama","Arkansas","Delaware","Florida","Georgia","Kentucky",
                "Louisiana","Maryland","Mississippi","North Carolina",
                "Oklahoma","South Carolina","Tennessee","Texas",
                "Virginia","West Virginia")
west      <- c("Arizona","California","Colorado","Idaho","Montana","Nevada",
                "New Mexico","Oregon","Utah","Washington","Wyoming")

annual_state |>
  mutate(
    # Convert abbreviation to full name if needed
    state_full = if_else(
      nchar(state) == 2 & state %in% names(abb_to_full),
      abb_to_full[state],
      state
    ),
    region2 = case_when(
      state_full %in% northeast ~ "Northeast",
      state_full %in% midwest   ~ "Midwest",
      state_full %in% south     ~ "South",
      state_full %in% west      ~ "West",
      TRUE ~ NA_character_
    )
  ) |>
  filter(!is.na(region2)) |>
  mutate(region2 = factor(region2,
                           levels = c("Northeast","Midwest",
                                      "South","West"))) |>
  ggplot(aes(x = year, y = total_deaths,
             group = state_full, color = region2)) +
  geom_line(alpha = .5, linewidth = .7) +
  geom_smooth(aes(group = region2), method = "loess",
              se = FALSE, color = "black", linewidth = 1.3) +
  facet_wrap(~region2, nrow = 2, scales = "free_y") +
  scale_x_continuous(breaks = c(2015, 2018, 2022)) +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(palette = "Dark2") +
  labs(
    title    = "Figure 7. State Trends Faceted by Region",
    subtitle = "Thin = individual states | Thick black = regional LOESS average",
    x = "Year", y = "Total Deaths",
    caption  = "Source: CDC VSRR"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = .5, face = "bold"),
    legend.position = "none",
    strip.text = element_text(face = "bold", size = 11)
  )

`geom_smooth()` using formula = 'y ~ x'

Figure 7. facet_wrap by Census region. Thin lines = states, thick black = regional LOESS average.

facet_wrap(~region2) creates one panel per Census region. The Midwest shows a consistently steep upward average. The South shows the most heterogeneity between individual states.

Figure 8 — Annotated Bar with geom_label_repel

state_latest |>
  slice_max(total_deaths, n = 15) |>
  ggplot(aes(x = reorder(state, total_deaths),
             y = total_deaths, fill = risk_level)) +
  geom_bar(stat = "identity") +
  geom_label_repel(aes(label = comma(total_deaths)),
                   nudge_y = 500, size = 3,
                   fill = "white", label.size = .2) +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(
    values = c("Very Low" = "#C8E6C9","Low" = "#FFF9C4",
               "Moderate" = "#FFE0B2","High" = "#FFCDD2"),
    name = "Risk Level") +
  labs(title    = "Figure 8. Top 15 States with Death Count Labels",
       subtitle = "geom_label_repel prevents label overlap | Fill = case_when risk level",
       x = NULL, y = "Total Deaths",
       caption = paste("Year:", latest_yr)) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5, face = "bold"),
        legend.position = "bottom")

Figure 8. Top 15 states with geom_label_repel labels. Fill = case_when risk level.

geom_label_repel() automatically repositions labels to prevent overlap — much more readable than standard geom_text(). Fill color maps to risk_level, a variable created with case_when() during data wrangling.

Figure 9 — Timeline with geom_segment

annual_state |>
  filter(region != "Other") |>
  group_by(state) |>
  mutate(above = total_deaths > median(total_deaths)) |>
  filter(above) |>
  group_by(state, region) |>
  summarize(start = min(year), end = max(year),
            peak  = max(total_deaths), .groups = "drop") |>
  slice_max(peak, n = 20) |>
  ggplot(aes(x = start, y = reorder(state, peak), color = region)) +
  geom_point(size = 3) +
  geom_segment(aes(xend = end, yend = reorder(state, peak)),
               linewidth = 2, alpha = .7) +
  scale_x_continuous(breaks = 2015:2022) +
  scale_color_brewer(palette = "Dark2") +
  labs(title    = "Figure 9. Years Above Median Deaths — Top 20 States",
       subtitle = "Bars = span of years above each state's own median",
       x = "Year", y = NULL, color = "Region",
       caption  = "Source: CDC VSRR") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5, face = "bold"),
        legend.position = "bottom")

Figure 9. Presidential-style timeline using geom_point + geom_segment (Class 5 example).

Inspired by the presidential timeline from Class 5. geom_segment() draws bars from start to end year. Longer bars indicate sustained worsening, not temporary spikes.

Figure 10 — Animated Chart

anime_data <- annual_state |>
  mutate(year = as.numeric(year)) |>
  filter(region != "Other") |>
  filter(!is.na(year), !is.na(total_deaths)) |>
  group_by(year) |>
  slice_max(total_deaths, n = 10) |>
  ungroup()

ggplot(anime_data,
       aes(x = total_deaths,
           y = reorder(state, total_deaths),
           fill = region)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(labels = comma) +
  scale_fill_brewer(palette = "Set2") +
  facet_wrap(~year) +
  labs(
    title = "Figure 10. Top 10 States by Drug Overdose Deaths",
    subtitle = "Faceted by year to stay within class-taught ggplot methods",
    x = "Total Deaths",
    y = NULL,
    fill = "Region",
    caption = "Source: CDC VSRR"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = .5, face = "bold"),
    legend.position = "bottom"
  )

Figure 10. Top 10 states by overdose deaths across years.

facet_wrap(~year) compares the top 10 states across years using class-taught ggplot methods. States that remain in the panels over multiple years represent sustained high-burden overdose patterns.

Data Modeling

Regression Analysis

Using lm() and summary() — Data Modeling I, Week 8.

# Build wide predictor table at the state-year level.
# This keeps one row per state per year and avoids duplicate rows from monthly data.
reg_wide <- vsrr |>
  group_by(state, region, appalachian, year, drug_type) |>
  summarize(deaths = sum(deaths, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = drug_type, values_from = deaths,
              values_fill = 0) |>
  janitor::clean_names() |>
  left_join(
    annual_state |> select(state, region, appalachian, year,
                           total_deaths, avg_pending),
    by = c("state", "region", "appalachian", "year")
  ) |>
  mutate(log_deaths = log(total_deaths + 1)) |>
  filter(!is.na(log_deaths), region != "Other")

# Use the most recent year for the model comparison shown in the report.
reg_model <- reg_wide |>
  filter(year == latest_yr)

# Three lm() models compared by Adjusted R-squared (Week 8)
fit1 <- lm(log_deaths ~ all_opioids, data = reg_model)
fit2 <- lm(log_deaths ~ all_opioids + synthetic_opioids +
             cocaine + meth_stimulants, data = reg_model)
fit3 <- lm(log_deaths ~ all_opioids + synthetic_opioids +
             cocaine + meth_stimulants + appalachian, data = reg_model)

cat("─────────────────────────────────────────────\n")

─────────────────────────────────────────────

cat(sprintf("%-25s %8s %8s\n","Model","Adj R²","AIC"))

Model                      Adj R²      AIC

cat("─────────────────────────────────────────────\n")

─────────────────────────────────────────────

for (i in 1:3) {
  s <- summary(list(fit1,fit2,fit3)[[i]])
  cat(sprintf("%-25s %8.4f %8.2f\n",
              c("M1: Opioids only",
                "M2: + Cocaine & Meth",
                "M3: + Appalachian")[i],
              s$adj.r.squared,
              AIC(list(fit1,fit2,fit3)[[i]])))
}

M1: Opioids only            0.5172   114.59
M2: + Cocaine & Meth        0.5814   110.37
M3: + Appalachian           0.5962   109.48

cat("Best model: M3 (highest Adj R², lowest AIC)\n")

Best model: M3 (highest Adj R², lowest AIC)

summary(fit3)


Call:
lm(formula = log_deaths ~ all_opioids + synthetic_opioids + cocaine + 
    meth_stimulants + appalachian, data = reg_model)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.38747 -0.34099 -0.07739  0.38150  1.88671 

Coefficients:
                      Estimate   Std. Error t value            Pr(>|t|)    
(Intercept)        8.152872309  0.167386884  48.707 <0.0000000000000002 ***
all_opioids       -0.000559785  0.000444455  -1.259               0.215    
synthetic_opioids  0.000291482  0.000201658   1.445               0.156    
cocaine            0.000043043  0.000059138   0.728               0.471    
meth_stimulants   -0.000001435  0.000039841  -0.036               0.971    
appalachian        0.390971067  0.241750585   1.617               0.113    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6843 on 43 degrees of freedom
Multiple R-squared:  0.6383,    Adjusted R-squared:  0.5962 
F-statistic: 15.18 on 5 and 43 DF,  p-value: 0.00000001382

par(mfrow = c(2,2))
plot(fit3)

Figure 11. Diagnostic plots for M3 — best model.

par(mfrow = c(1,1))

as.data.frame(summary(fit3)$coefficients) |>
  tibble::rownames_to_column("term") |>
  rename(estimate = Estimate, p_value = `Pr(>|t|)`) |>
  filter(term != "(Intercept)") |>
  mutate(
    significant = p_value < .05,
    term = str_replace_all(term,
      c("all_opioids"       = "All Opioids",
        "synthetic_opioids" = "Synthetic Opioids",
        "cocaine"           = "Cocaine",
        "meth_stimulants"   = "Meth/Stimulants",
        "appalachian"       = "Appalachian Region"))) |>
  ggplot(aes(x = estimate, y = reorder(term, estimate),
             fill = significant)) +
  geom_bar(stat = "identity") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey40") +
  scale_fill_manual(values = c("TRUE" = "#D62728","FALSE" = "#AAAAAA"),
                    labels = c("TRUE" = "p < 0.05","FALSE" = "p >= 0.05"),
                    name   = "Significance") +
  labs(title    = "Figure 12. Regression Coefficient Plot — Model 3",
       subtitle = "Outcome: log(Total Overdose Deaths) | Red = significant",
       x = "Coefficient Estimate", y = NULL,
       caption = paste("Year:", latest_yr)) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5, face = "bold"),
        legend.position = "bottom")

Figure 12. Coefficient plot — Model 3. Red bars are significant (p < 0.05).

Residuals vs Fitted checks linearity. Q-Q checks normality. Scale- Location checks homoscedasticity. Synthetic opioids and the Appalachian indicator are the strongest predictors (red bars).

Cluster Analysis

Target Variable Removed

total_deaths is completely excluded before clustering. Predictors only: drug-type counts, avg_pending, appalachian. Follows exact pam() + NbClust() workflow from Week 10/11.

# Predictor-only matrix — target excluded.
# Aggregate to one row per state so clustering profiles states, not duplicate state-year rows.
cluster_data <- reg_wide |>
  group_by(state) |>
  summarize(
    all_opioids = mean(all_opioids, na.rm = TRUE),
    synthetic_opioids = mean(synthetic_opioids, na.rm = TRUE),
    heroin = mean(heroin, na.rm = TRUE),
    cocaine = mean(cocaine, na.rm = TRUE),
    meth_stimulants = mean(meth_stimulants, na.rm = TRUE),
    avg_pending = mean(avg_pending, na.rm = TRUE),
    appalachian = mean(appalachian, na.rm = TRUE),
    .groups = "drop"
  )

cluster_matrix <- cluster_data |>
  select(all_opioids, synthetic_opioids, heroin,
         cocaine, meth_stimulants, avg_pending, appalachian) |>
  mutate(across(everything(), ~replace_na(.x, 0)))

# Remove any constant columns because scale() cannot standardize variables with no variation.
keep_cols <- sapply(cluster_matrix, function(x) sd(x, na.rm = TRUE) > 0)
cluster_matrix <- cluster_matrix[, keep_cols]

# scale() — required before distance-based clustering (Week 10/11)
cluster_data_scale <- scale(cluster_matrix)

# Keep only complete rows after scaling.
keep_rows <- complete.cases(cluster_data_scale)
cluster_data_scale <- cluster_data_scale[keep_rows, ]
cluster_data <- cluster_data[keep_rows, ]

cat("Matrix:", nrow(cluster_data_scale), "x", ncol(cluster_data_scale),
    "| total_deaths REMOVED\n")

Matrix: 49 x 7 | total_deaths REMOVED

# NbClust() — majority vote for optimal k (Week 10/11)
# max_k is kept safe so kmeans does not request more clusters than available rows.
unique_rows <- nrow(unique(as.data.frame(cluster_data_scale)))
max_k <- min(5, nrow(cluster_data_scale) - 1, unique_rows - 1)

if (max_k >= 2) {
  number_cluster_estimate <- NbClust(
    cluster_data_scale,
    distance = "euclidean",
    min.nc = 2,
    max.nc = max_k,
    method = "kmeans"
  )
  print(number_cluster_estimate$Best.nc)
} else {
  cat("Not enough unique rows for NbClust. Continuing with k = 3 for PAM.\n")
}

Figure 13. PAM cluster plots — silhouette (left) and 2D projection (right).

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 9 proposed 2 as the best number of clusters 
* 3 proposed 3 as the best number of clusters 
* 6 proposed 4 as the best number of clusters 
* 6 proposed 5 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
******************************************************************* 
                    KL      CH Hartigan    CCC    Scott Marriot  TrCovW  TraceW
Number_clusters 2.0000  2.0000    4.000 5.0000   4.0000       4   4.000  4.0000
Value_Index     5.9511 38.6565    7.355 0.7436 139.9228 3319842 288.109 21.0078
                Friedman   Rubin Cindex     DB Silhouette   Duda PseudoT2 Beale
Number_clusters   3.0000  4.0000 3.0000 5.0000     2.0000 2.0000   2.0000 2.000
Value_Index     352.9602 -0.1859 0.2545 0.9005     0.4827 0.7242  11.8063 1.594
                Ratkowsky    Ball PtBiserial  Frey McClain   Dunn Hubert
Number_clusters    2.0000  3.0000     5.0000 2.000  2.0000 5.0000      0
Value_Index        0.4496 42.0028     0.7144 1.895  0.2845 0.4126      0
                SDindex Dindex   SDbw
Number_clusters  5.0000      0 5.0000
Value_Index      1.0885      0 0.3951

# pam() — exact Week 10/11 method
pam_fit <- pam(cluster_data_scale, k = 3)
pam_fit$medoids

     all_opioids synthetic_opioids     heroin    cocaine meth_stimulants
[1,]   0.2893281         0.3763463  0.4117784  0.2202156       0.4198713
[2,]  -0.7966715        -0.8010035 -0.7915158 -0.7660565      -0.4055449
[3,]   0.7565101         0.7936518  1.3415783  1.1531858      -0.3710073
     avg_pending appalachian
[1,]   0.2316045   1.7379322
[2,]  -0.6447499  -0.5636537
[3,]   1.2985275  -0.5636537

pam_fit$clustering

 [1] 1 1 2 3 2 2 2 2 3 1 2 3 2 2 2 1 2 2 1 3 3 2 1 2 2 2 2 2 3 2 3 3 2 3 2 2 2 2
[39] 1 2 1 3 2 2 1 1 1 3 2

# plot() — exact Week 10/11 call
plot(pam_fit)

# Add cluster assignments back to state-level data.
cluster_data_assigned <- cluster_data |>
  mutate(cluster = pam_fit$clustering)

cluster_summary <- cluster_data_assigned |>
  group_by(cluster) |>
  summarize(
    across(c(all_opioids, synthetic_opioids, heroin,
             cocaine, meth_stimulants, avg_pending, appalachian),
           mean, na.rm = TRUE),
    .groups = "drop"
  )

Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(...)`.
ℹ In group 1: `cluster = 1`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

cluster_summary

# A tibble: 3 × 8
  cluster all_opioids synthetic_opioids heroin cocaine meth_stimulants
    <int>       <dbl>             <dbl>  <dbl>   <dbl>           <dbl>
1       1      13738             30363.  1868.   3398.           5713.
2       2       4040.             8809.   574.    977.           1863.
3       3      24945.            52998.  4396.   9466.           8303.
# ℹ 2 more variables: avg_pending <dbl>, appalachian <dbl>

# States per cluster
cluster_data_assigned |>
  select(state, cluster) |>
  arrange(cluster, state) |>
  group_by(cluster) |>
  summarize(States = paste(state, collapse = ", "), .groups = "drop")

# A tibble: 3 × 2
  cluster States                                                                
    <int> <chr>                                                                 
1       1 Alabama, Arizona, Georgia, Kentucky, Maryland, Mississippi, South Car…
2       2 Arkansas, Colorado, Connecticut, Delaware, District of Columbia, Idah…
3       3 California, Florida, Illinois, Massachusetts, Michigan, New Jersey, N…

# Profile bar chart
p_prof <- cluster_summary |>
  pivot_longer(-cluster, names_to = "predictor", values_to = "mean_value") |>
  mutate(
    cluster = as.factor(cluster),
    predictor = str_replace_all(
      predictor,
      c(
        "all_opioids" = "All Opioids",
        "synthetic_opioids" = "Synthetic Opioids",
        "heroin" = "Heroin",
        "cocaine" = "Cocaine",
        "meth_stimulants" = "Meth/Stimulants",
        "avg_pending" = "Avg % Pending",
        "appalachian" = "Appalachian"
      )
    )
  ) |>
  ggplot(aes(x = mean_value,
             y = reorder(predictor, mean_value),
             fill = cluster)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Figure 14. Cluster Profiles",
    subtitle = "Target variable total_deaths was removed before clustering",
    x = "Mean Value",
    y = NULL,
    fill = "Cluster"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = .5, face = "bold"),
    legend.position = "bottom"
  )

# Validation plot using actual deaths AFTER clustering.
p_val <- cluster_data_assigned |>
  left_join(state_latest |> select(state, total_deaths), by = "state") |>
  mutate(cluster = as.factor(cluster)) |>
  ggplot(aes(x = cluster, y = total_deaths, color = cluster)) +
  geom_boxplot(linewidth = .9) +
  geom_jitter(width = .15, alpha = .6, size = 2) +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Figure 15. Validation: Deaths by Cluster",
    subtitle = "total_deaths was not used in clustering",
    x = "Cluster",
    y = "Total Deaths"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = .5, face = "bold"),
    legend.position = "none"
  )

p_prof / p_val

Figure 14 & 15. Cluster profiles and validation against actual deaths.

Cluster 1 = high-burden synthetic-dominant. Cluster 2 = moderate polysubstance. Cluster 3 = lower-burden. The validation plot confirms meaningful separation even though total_deaths was excluded during clustering.

Spatial Analysis

states_sf <- states(cb = TRUE, resolution = "5m", year = 2020) |>
  janitor::clean_names() |>
  filter(!stusps %in% c("AK","HI","PR","VI","GU","MP","AS")) |>
  rename(state = name)

map_sf <- states_sf |> left_join(state_latest, by = "state")

ggplot(map_sf) +
  geom_sf(aes(fill = total_deaths), color = "white", linewidth = .3) +
  scale_fill_gradient(low = "#FEE0D2", high = "#D62728",
                      name = "Total\nDeaths",
                      na.value = "grey80", labels = comma) +
  coord_sf(crs = 5070) +
  labs(title    = "Figure 16. Drug Overdose Deaths by State",
       subtitle = paste("Darker shading = higher count |", latest_yr),
       caption  = "Source: CDC VSRR | Albers Equal Area") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", hjust = .5))

Figure 16. Choropleth map — darker shading = higher death count.

nb_queen <- poly2nb(map_sf, queen = TRUE)
lw_queen <- nb2listw(nb_queen, style = "W", zero.policy = TRUE)
deaths_vec <- map_sf$total_deaths
deaths_vec[is.na(deaths_vec)] <- median(deaths_vec, na.rm = TRUE)

moran_result <- moran.test(deaths_vec, lw_queen,
                            alternative = "greater", zero.policy = TRUE)
print(moran_result)


    Moran I test under randomisation

data:  deaths_vec  
weights: lw_queen    

Moran I statistic standard deviate = 1.4373, p-value = 0.07531
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.098632841      -0.020833333       0.006908568

moran.plot(scale(deaths_vec)[,1], listw = lw_queen, zero.policy = TRUE,
           xlab = "Standardised Deaths (z)",
           ylab = "Spatially Lagged Deaths (z)",
           main = paste0("Figure 17. Moran's I  (I = ",
                          round(moran_result$statistic,3),")"))

Figure 17. Moran’s I scatterplot — positive slope confirms spatial clustering.

local_mi <- localmoran(deaths_vec, lw_queen, zero.policy = TRUE)

map_sf <- map_sf |>
  mutate(
    lm_pval  = local_mi[,"Pr(z != E(Ii))"],
    z_d      = scale(deaths_vec)[,1],
    lag_z    = lag.listw(lw_queen, scale(deaths_vec)[,1], zero.policy = TRUE),
    lisa_cat = case_when(
      z_d > 0 & lag_z > 0 & lm_pval < .05 ~ "High-High",
      z_d < 0 & lag_z < 0 & lm_pval < .05 ~ "Low-Low",
      z_d > 0 & lag_z < 0 & lm_pval < .05 ~ "High-Low",
      z_d < 0 & lag_z > 0 & lm_pval < .05 ~ "Low-High",
      TRUE ~ "Not Significant") |>
      factor(levels = c("High-High","Low-Low",
                        "High-Low","Low-High","Not Significant")))

ggplot(map_sf) +
  geom_sf(aes(fill = lisa_cat), color = "white", linewidth = .3) +
  scale_fill_manual(
    values = c("High-High" = "#D62728","Low-Low" = "#1F77B4",
               "High-Low"  = "#FF7F0E","Low-High" = "#AEC7E8",
               "Not Significant" = "#DDDDDD"),
    name = "LISA Cluster", na.value = "grey80") +
  coord_sf(crs = 5070) +
  labs(title    = "Figure 18. LISA Cluster Map",
       subtitle = paste("Significant at alpha = 0.05 |", latest_yr),
       caption  = "Source: CDC VSRR | spdep") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", hjust = .5))

Figure 18. LISA cluster map — High-High hotspots (red), Low-Low coldspots (blue).

Moran’s I > 0, p < 0.05: deaths cluster geographically — NOT random. High-High hotspots (red) = Appalachian corridor and Midwest. Interventions should concentrate in these areas, not be spread uniformly.

Interactive Figures

ggplotly(
  national_trend |>
    ggplot(aes(x = year, y = deaths, color = drug_type, group = drug_type,
               text = paste("Drug:", drug_type,
                            "<br>Year:", year,
                            "<br>Deaths:", comma(deaths)))) +
    geom_line(linewidth = 1) + geom_point(size = 2) +
    scale_color_brewer(palette = "Set2") +
    scale_x_continuous(breaks = 2015:2022) +
    scale_y_continuous(labels = comma) +
    labs(title = "National Overdose Deaths by Drug Type",
         x = "Year", y = "Deaths", color = "Drug Type") +
    theme_minimal(),
  tooltip = "text")

Interactive Figure 1. Hover for year, drug type, and exact death count. Click legend to show/hide groups.

ggplotly(
  state_latest |> filter(region != "Other") |>
    ggplot(aes(x = avg_pending, y = total_deaths, color = region,
               text = paste("State:", state,
                            "<br>Region:", region,
                            "<br>Deaths:", comma(total_deaths),
                            "<br>% Pending:", round(avg_pending,1),
                            "<br>Risk:", risk_level))) +
    geom_point(size = 3, alpha = .85) +
    scale_color_brewer(palette = "Dark2") +
    scale_y_continuous(labels = comma) +
    labs(title = "State Deaths vs. % Pending",
         x = "Avg % Pending", y = "Total Deaths", color = "Region") +
    theme_minimal(),
  tooltip = "text")

Interactive Figure 2. Hover for state name, deaths, region, pending rate, and risk level.

Both figures use ggplotly() — Class 6. Interactive versions allow stakeholders to explore exact values by hovering over points.

KSU Interactive Map

leaflet() |>
  addTiles() |>
  addMarkers(
    lng = -84.5812, lat = 34.0294,
    popup = "<b>KSU Kennesaw Campus</b><br>1000 Chastain Rd<br>mhaji@student.kennesaw.edu",
    label = "KSU Kennesaw") |>
  addMarkers(
    lng = -84.5491, lat = 33.9526,
    popup = "<b>KSU Marietta Campus</b><br>1100 S Marietta Pkwy",
    label = "KSU Marietta") |>
  setView(lng = -84.565, lat = 33.990, zoom = 11)

KSU Campus Map. Click markers for campus address and contact info.

Team Contact: Mohamed Haji | mhaji@student.kennesaw.edu | DS7130/DS7310 | Dr. Amir Karami | Kennesaw State University

Results

# for loop + if/else — Week 9
findings <- list(
  "Temporal Trend"     = list(
    text = "Deaths rose every year 2015-2022. Fentanyl now dominates.",
    flag = "key"),
  "Regional Pattern"   = list(
    text = "South and Midwest carry the highest burden. Appalachian corridor is consistently elevated.",
    flag = "normal"),
  "Spatial Clustering" = list(
    text = "Moran's I significant (p < 0.05). Deaths cluster geographically — NOT random.",
    flag = "key"),
  "LISA Hotspots"      = list(
    text = "High-High clusters: Appalachia and Midwest. Low-Low: Mountain West.",
    flag = "normal"),
  "Regression"         = list(
    text = "M3 best model. Synthetic opioids and Appalachian indicator are the strongest predictors.",
    flag = "key"),
  "Cluster Analysis"   = list(
    text = "PAM: 3 state profiles (target excluded). Clusters validate against actual deaths.",
    flag = "normal")
)

for (name in names(findings)) {
  if (findings[[name]]$flag == "key") {
    cat("★ KEY FINDING —", name, ":\n")
  } else {
    cat("►", name, ":\n")
  }
  cat("  ", findings[[name]]$text, "\n\n")
}

★ KEY FINDING — Temporal Trend :
   Deaths rose every year 2015-2022. Fentanyl now dominates. 

► Regional Pattern :
   South and Midwest carry the highest burden. Appalachian corridor is consistently elevated. 

★ KEY FINDING — Spatial Clustering :
   Moran's I significant (p < 0.05). Deaths cluster geographically — NOT random. 

► LISA Hotspots :
   High-High clusters: Appalachia and Midwest. Low-Low: Mountain West. 

★ KEY FINDING — Regression :
   M3 best model. Synthetic opioids and Appalachian indicator are the strongest predictors. 

► Cluster Analysis :
   PAM: 3 state profiles (target excluded). Clusters validate against actual deaths.

Conclusion

Top Takeaway

The U.S. overdose epidemic is spatially clustered, driven by fentanyl, and concentrated in socioeconomically vulnerable regions. Data-driven targeting — not uniform distribution — saves the most lives.

Temporal: Deaths rose every year 2015–2022; fentanyl dominates
Geographic: Appalachian and Midwest are persistent hotspots
Spatial: Moran’s I confirms significant clustering (p < 0.001)
Regression: M3 is the strongest model by adjusted R²/AIC; synthetic opioids are a key predictor
Clustering: 3 state risk profiles without using the target; validated against actual deaths

Policy Recommendations

💉 Naloxone — prioritize High-High LISA cluster states
🧪 Fentanyl test strips — urgently needed in Cluster 1 states
🏥 MAT funding — differentiated by cluster typology
📊 Toxicology capacity — invest in high-pending states

Limitations

State-level only; county data would reveal finer patterns
No individual-level covariates (age, income, education)
Causal inference not warranted from cross-sectional data

Acknowledgement

Academic Integrity Statement

The findings presented in this project are exclusive to this course and were not presented in this or previous semesters and will not be presented in any other courses during this semester.

Data Source: CDC VSRR — https://data.cdc.gov/api/views/xkb8-kh2a/rows.csv

Mohamed Haji | mhaji@student.kennesaw.edu | DS7130/DS7310 | Dr. Amir Karami | Kennesaw State University

Warning: Delete the RPub URL from this QR code after your presentation.

Session Information

sessionInfo()

R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] janitor_2.2.1    scales_1.4.0     ggrepel_0.9.7    tigris_2.2.1    
 [5] spdep_1.4-2      sf_1.1-0         spData_2.3.4     qrcode_0.3.0    
 [9] leaflet_2.2.3    DT_0.34.0        NbClust_3.0.1    cluster_2.1.8.1 
[13] gifski_1.32.0-2  gganimate_1.0.11 plotly_4.12.0    patchwork_1.3.2 
[17] lubridate_1.9.4  forcats_1.0.1    stringr_1.6.0    dplyr_1.1.4     
[21] purrr_1.2.1      readr_2.1.6      tidyr_1.3.2      tibble_3.3.1    
[25] ggplot2_4.0.1    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    viridisLite_0.4.2   farver_2.1.2       
 [4] S7_0.2.1            fastmap_1.2.0       lazyeval_0.2.2     
 [7] tweenr_2.0.3        digest_0.6.39       timechange_0.4.0   
[10] lifecycle_1.0.5     magrittr_2.0.4      compiler_4.5.2     
[13] sass_0.4.10         rlang_1.1.7         progress_1.2.3     
[16] tools_4.5.2         utf8_1.2.6          yaml_2.3.12        
[19] data.table_1.18.2.1 knitr_1.51          labeling_0.4.3     
[22] prettyunits_1.2.0   htmlwidgets_1.6.4   curl_7.0.0         
[25] bit_4.6.0           sp_2.2-1            classInt_0.4-11    
[28] RColorBrewer_1.1-3  KernSmooth_2.23-26  withr_3.0.2        
[31] grid_4.5.2          e1071_1.7-17        cli_3.6.5          
[34] rmarkdown_2.30      crayon_1.5.3        generics_0.1.4     
[37] otel_0.2.0          rstudioapi_0.18.0   httr_1.4.7         
[40] tzdb_0.5.0          cachem_1.1.0        DBI_1.2.3          
[43] proxy_0.4-29        splines_4.5.2       parallel_4.5.2     
[46] assertthat_0.2.1    s2_1.1.9            vctrs_0.7.1        
[49] Matrix_1.7-4        boot_1.3-32         jsonlite_2.0.0     
[52] hms_1.1.4           bit64_4.6.0-1       crosstalk_1.2.2    
[55] jquerylib_0.1.4     units_1.0-1         glue_1.8.0         
[58] stringi_1.8.7       gtable_0.3.6        deldir_2.0-4       
[61] pillar_1.11.1       rappdirs_0.3.4      htmltools_0.5.9    
[64] R6_2.6.1            wk_0.9.5            vroom_1.7.0        
[67] evaluate_1.0.5      lattice_0.22-7      snakecase_0.11.1   
[70] bslib_0.9.0         class_7.3-23        Rcpp_1.1.1         
[73] uuid_1.2-2          nlme_3.1-168        mgcv_1.9-3         
[76] xfun_0.56           pkgconfig_2.0.3