Final Project

Introduction

Is there a connection between age and the number of overdoses?

Drug overdose has emerged as one of the most pressing public health crises of the modern era, affecting individuals, families, and communities across the globe. An overdose occurs when a person consumes a quantity of a drug, whether illicit, prescription, or over-the-counter, that overwhelms the body’s ability to process it safely, leading to severe physiological consequences or death. Over the past several decades, the frequency and severity of drug overdoses have increased dramatically, driven by factors such as the widespread availability of opioids, the rise of synthetic drugs like fentanyl, polysubstance use, and gaps in healthcare access. What was once viewed primarily as an individual problem is now widely recognized as a complex societal issue influenced by economic, social, and structural conditions.

Data Analysis

The dataset will first be inspected to understand its structure, variable types, and the presence of missing or invalid values. Any placeholder symbols or inconsistent entries will be recoded as missing values, and rows with incomplete information in key variables such as age and year will be removed to ensure data accuracy. Relevant variables will then be selected and converted into appropriate data types, such as factors for categorical analysis. Next, I will generate contingency tables to summarize counts across age groups and years. A Chi-square test of independence will be conducted to examine whether there is a statistically significant relationship between age group and year of overdose occurrence. To support the statistical findings, I will create dot-line graphs that visualize trends and changes in counts across years for different age groups. Finally, the results will be interpreted by comparing observed and expected values, allowing for meaningful conclusions about patterns, trends, and potential demographic differences in drug overdose outcomes.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
setwd("C:/Users/Thu Nguyen/Downloads/Drugs")
Drugs <- read_csv("Drugs.csv")

## Rows: 6228 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): INDICATOR, PANEL, UNIT, STUB_NAME, STUB_LABEL, AGE, FLAG
## dbl (8): PANEL_NUM, UNIT_NUM, STUB_NAME_NUM, STUB_LABEL_NUM, YEAR, YEAR_NUM,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Drugs)

## # A tibble: 6 × 15
##   INDICATOR    PANEL PANEL_NUM UNIT  UNIT_NUM STUB_NAME STUB_NAME_NUM STUB_LABEL
##   <chr>        <chr>     <dbl> <chr>    <dbl> <chr>             <dbl> <chr>     
## 1 Drug overdo… All …         0 Deat…        1 Total                 0 All perso…
## 2 Drug overdo… All …         0 Deat…        1 Total                 0 All perso…
## 3 Drug overdo… All …         0 Deat…        1 Total                 0 All perso…
## 4 Drug overdo… All …         0 Deat…        1 Total                 0 All perso…
## 5 Drug overdo… All …         0 Deat…        1 Total                 0 All perso…
## 6 Drug overdo… All …         0 Deat…        1 Total                 0 All perso…
## # ℹ 7 more variables: STUB_LABEL_NUM <dbl>, YEAR <dbl>, YEAR_NUM <dbl>,
## #   AGE <chr>, AGE_NUM <dbl>, ESTIMATE <dbl>, FLAG <chr>

str(Drugs)

## spc_tbl_ [6,228 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ INDICATOR     : chr [1:6228] "Drug overdose death rates" "Drug overdose death rates" "Drug overdose death rates" "Drug overdose death rates" ...
##  $ PANEL         : chr [1:6228] "All drug overdose deaths" "All drug overdose deaths" "All drug overdose deaths" "All drug overdose deaths" ...
##  $ PANEL_NUM     : num [1:6228] 0 0 0 0 0 0 0 0 0 0 ...
##  $ UNIT          : chr [1:6228] "Deaths per 100,000 resident population, age-adjusted" "Deaths per 100,000 resident population, age-adjusted" "Deaths per 100,000 resident population, age-adjusted" "Deaths per 100,000 resident population, age-adjusted" ...
##  $ UNIT_NUM      : num [1:6228] 1 1 1 1 1 1 1 1 1 1 ...
##  $ STUB_NAME     : chr [1:6228] "Total" "Total" "Total" "Total" ...
##  $ STUB_NAME_NUM : num [1:6228] 0 0 0 0 0 0 0 0 0 0 ...
##  $ STUB_LABEL    : chr [1:6228] "All persons" "All persons" "All persons" "All persons" ...
##  $ STUB_LABEL_NUM: num [1:6228] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ YEAR          : num [1:6228] 1999 2000 2001 2002 2003 ...
##  $ YEAR_NUM      : num [1:6228] 1 2 3 4 5 6 7 8 9 10 ...
##  $ AGE           : chr [1:6228] "All ages" "All ages" "All ages" "All ages" ...
##  $ AGE_NUM       : num [1:6228] 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ ESTIMATE      : num [1:6228] 6.1 6.2 6.8 8.2 8.9 9.4 10.1 11.5 11.9 11.9 ...
##  $ FLAG          : chr [1:6228] NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   INDICATOR = col_character(),
##   ..   PANEL = col_character(),
##   ..   PANEL_NUM = col_double(),
##   ..   UNIT = col_character(),
##   ..   UNIT_NUM = col_double(),
##   ..   STUB_NAME = col_character(),
##   ..   STUB_NAME_NUM = col_double(),
##   ..   STUB_LABEL = col_character(),
##   ..   STUB_LABEL_NUM = col_double(),
##   ..   YEAR = col_double(),
##   ..   YEAR_NUM = col_double(),
##   ..   AGE = col_character(),
##   ..   AGE_NUM = col_double(),
##   ..   ESTIMATE = col_double(),
##   ..   FLAG = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Drugs_clean <- Drugs %>%
  mutate(across(where(is.character), ~na_if(.x, "?")))
colSums(is.na(Drugs_clean))

##      INDICATOR          PANEL      PANEL_NUM           UNIT       UNIT_NUM 
##              0              0              0              0              0 
##      STUB_NAME  STUB_NAME_NUM     STUB_LABEL STUB_LABEL_NUM           YEAR 
##              0              0              0              0              0 
##       YEAR_NUM            AGE        AGE_NUM       ESTIMATE           FLAG 
##              0              0              0           1111           5117

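# Keep only the two variables used in the contingency table.
# AGE_NUM is numeric, so filtering it against text labels such as
# "Death" / "No Death" would never remove a row and is omitted here.
Drugs_clean <- Drugs_clean %>%
  select(AGE_NUM, YEAR_NUM)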

#drop rows with any missing values
Drugs_clean <- Drugs_clean %>%
  filter(rowSums(is.na(.)) == 0)

Drugs_clean <- Drugs_clean %>%
  mutate(
    AGE_NUM = factor(AGE_NUM),
    YEAR_NUM = factor(YEAR_NUM)
  )

#inspect cleaned data
str(Drugs_clean)

## tibble [6,228 × 2] (S3: tbl_df/tbl/data.frame)
##  $ AGE_NUM : Factor w/ 10 levels "1.1","1.2","1.3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ YEAR_NUM: Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

colSums(is.na(Drugs_clean))

##  AGE_NUM YEAR_NUM 
##        0        0

observed_dataset <- table(Drugs_clean$AGE_NUM, Drugs_clean$YEAR_NUM)
observed_dataset

##       
##          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##   1.1  144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144
##   1.2   18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18
##   1.3   18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18
##   1.4   18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18
##   1.5   18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18
##   1.6   18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18
##   1.7   18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18
##   1.8   18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18
##   1.9   18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18
##   1.91  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18  18
##       
##         19  20
##   1.1  144 252
##   1.2   18  18
##   1.3   18  18
##   1.4   18  18
##   1.5   18  18
##   1.6   18  18
##   1.7   18  18
##   1.8   18  18
##   1.9   18  18
##   1.91  18  18

chisq.test(observed_dataset)$expected

##       
##                1         2         3         4         5         6         7
##   1.1  146.80925 146.80925 146.80925 146.80925 146.80925 146.80925 146.80925
##   1.2   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.3   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.4   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.5   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.6   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.7   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.8   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.9   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.91  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##       
##                8         9        10        11        12        13        14
##   1.1  146.80925 146.80925 146.80925 146.80925 146.80925 146.80925 146.80925
##   1.2   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.3   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.4   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.5   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.6   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.7   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.8   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.9   17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##   1.91  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786  17.68786
##       
##               15        16        17        18        19        20
##   1.1  146.80925 146.80925 146.80925 146.80925 146.80925 198.62428
##   1.2   17.68786  17.68786  17.68786  17.68786  17.68786  23.93064
##   1.3   17.68786  17.68786  17.68786  17.68786  17.68786  23.93064
##   1.4   17.68786  17.68786  17.68786  17.68786  17.68786  23.93064
##   1.5   17.68786  17.68786  17.68786  17.68786  17.68786  23.93064
##   1.6   17.68786  17.68786  17.68786  17.68786  17.68786  23.93064
##   1.7   17.68786  17.68786  17.68786  17.68786  17.68786  23.93064
##   1.8   17.68786  17.68786  17.68786  17.68786  17.68786  23.93064
##   1.9   17.68786  17.68786  17.68786  17.68786  17.68786  23.93064
##   1.91  17.68786  17.68786  17.68786  17.68786  17.68786  23.93064

The Chi-square test is appropriate here because every expected cell count is greater than 5 (the smallest expected value is about 17.69).
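The expected-count condition can also be checked in code rather than by eye. The sketch below is only an illustration: it rebuilds the observed table from the counts printed above (assuming those printed values are exact) instead of reading `Drugs.csv`, and the names `observed`, `expected`, `min_expected`, and `all_ok` are introduced here for the example.

```r
# Rebuild the observed table from the printed counts (assumed exact):
# 18 records in every cell, except age code 1.1, which has 144 per year
# and 252 in year 20.
observed <- matrix(18, nrow = 10, ncol = 20,
                   dimnames = list(
                     AGE_NUM  = c(paste0("1.", 1:9), "1.91"),
                     YEAR_NUM = as.character(1:20)))
observed["1.1", ] <- 144
observed["1.1", "20"] <- 252

# Expected count under independence:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Common rule of thumb: all expected counts should be at least 5.
min_expected <- min(expected)
all_ok <- all(expected >= 5)

print(min_expected)
print(all_ok)
```

With the real data, the same check could be written as `all(chisq.test(observed_dataset)$expected >= 5)`.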

plot_data <- Drugs_clean %>%
  count(YEAR_NUM, AGE_NUM)

ggplot(plot_data, aes(x = AGE_NUM, y = n, group = YEAR_NUM, color = YEAR_NUM)) +
  geom_line() +
  geom_point(size = 3) +
  labs(
    title = "Dot-Line Graph of Yearly Counts by Age Group",
    x = "Age Group",
    y = "Count",
    color = "Year"
  ) +
  theme_minimal()

The dot-line graph illustrates the distribution of counts across age groups for each year, where each count is the number of dataset records for a given age code and year. Overall, the counts are heavily concentrated in the first age category (AGE_NUM 1.1, which the raw data labels "All ages"), with a sharp drop for every category after it. Beyond that first category, the counts stay at 18 for each age group in every year, so the lines for different years lie almost exactly on top of one another, indicating little variation over time.

chi <- chisq.test(observed_dataset)
chi

## 
##  Pearson's Chi-squared test
## 
## data:  observed_dataset
## X-squared = 29.535, df = 171, p-value = 1

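The results of the Chi-square test of independence support this visual interpretation. The test produced a Chi-square statistic of 29.535 with 171 degrees of freedom and a p-value of 1. Because the p-value is far above the conventional 0.05 significance level, the null hypothesis of independence is not rejected: the distribution of counts across age groups does not differ significantly from year to year.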
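To make the reported statistic easier to follow, the sketch below recomputes it by hand as the sum of (observed - expected)^2 / expected over all cells, with degrees of freedom (rows - 1) * (columns - 1). It is an illustration only: the table is rebuilt from the printed counts (assumed exact) rather than from `Drugs.csv`, and the names `statistic`, `dof`, and `p_value` are introduced here for the example.

```r
# Observed table rebuilt from the printed counts (assumed exact).
observed <- matrix(18, nrow = 10, ncol = 20,
                   dimnames = list(
                     AGE_NUM  = c(paste0("1.", 1:9), "1.91"),
                     YEAR_NUM = as.character(1:20)))
observed["1.1", ] <- 144
observed["1.1", "20"] <- 252

# Expected counts under independence of age code and year.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson Chi-square statistic: sum over cells of (O - E)^2 / E.
statistic <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1) = 9 * 19.
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the Chi-square distribution.
p_value <- pchisq(statistic, df = dof, lower.tail = FALSE)

print(round(statistic, 3))
print(dof)
print(p_value)
```

A statistic this far below its degrees of freedom means the observed counts sit very close to the expected counts, which is why the p-value is essentially 1.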

Conclusion

Based on the dot-line graph, counts are far higher in the first age category (AGE_NUM 1.1, labeled "All ages" in the raw data) than in the remaining categories, where they drop sharply and level off at relatively low values. This pattern remains consistent across the dataset. The lines representing different years largely overlap, indicating little variation over time, which agrees with the Chi-square result (p-value of 1): there is no significant association between age group and year. Overall, the graph suggests that the counts differ strongly by age category while remaining stable from year to year. Because these counts are numbers of records rather than death rates, the ESTIMATE values would need to be compared across age groups to confirm how overdose frequency itself varies with age.

Reference https://catalog.data.gov/dataset/drug-overdose-death-rates-by-drug-type-sex-age-race-and-hispanic-origin-united-states-3f72f

Thu Nguyen

2025-12-16