Data110_FinalProject_2026

Author

Emmanuel Gkatongoni

Source: Clay Behavioral Health Center

Introduction

Drug and substance abuse continues to be a major issue throughout the United States and affects millions of people every year. The purpose of this project is to explore substance use trends across different states and years in order to better understand how different forms of drug and alcohol use may relate to each other. The dataset used in this project contains information about marijuana use, alcohol use disorder, tobacco use, cocaine use, and population estimates for multiple age groups across the United States.

The data appears to come from public health survey reporting sources connected to substance use monitoring in the United States, specifically national survey estimates collected and reported across states and years. The dataset includes both categorical and quantitative variables. Some categorical variables include State and Year, while quantitative variables include marijuana use rates, alcohol use disorder rates, tobacco use rates, cocaine use rates, and population values for individuals aged 18-25. For this project, I will mainly focus on variables related to marijuana use and examine whether factors such as alcohol use disorder, tobacco use, and cocaine use may help explain changes in marijuana use rates among young adults.

Although the dataset contains extensive information, there was no detailed ReadMe or documentation file included that fully explained the exact survey methodology or sampling procedures used to collect the data. However, based on the structure of the dataset and the variables provided, the information appears to come from large-scale survey estimates collected over multiple years and organized by state and age group.

I chose this topic because substance abuse is something that has personally affected people close to me, and it is an issue I have seen impact individuals and families in real life. I have also had some personal experience with seeing the effects of drug abuse, which made me interested in understanding how widespread these problems truly are across the country. Because of this, I wanted to explore the data to see how different forms of substance use are connected and how many people may be affected by these issues.

Libraries/Dataset loading/cleaning

# Load Libraries

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
Warning: package 'plotly' was built under R version 4.3.3

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(dplyr)
library(readr)
library(broom)
#load dataset
drugs <- read_csv("drugs.csv")
Rows: 867 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (52): Year, Population.12-17, Population.18-25, Population.26+, Totals.A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(drugs)
Rows: 867
Columns: 53
$ State                                               <chr> "Alabama", "Alaska…
$ Year                                                <dbl> 2002, 2002, 2002, …
$ `Population.12-17`                                  <dbl> 380805, 69400, 485…
$ `Population.18-25`                                  <dbl> 499453, 62791, 602…
$ `Population.26+`                                    <dbl> 2812905, 368460, 3…
$ `Totals.Alcohol.Use Disorder Past Year.12-17`       <dbl> 18, 4, 36, 14, 173…
$ `Totals.Alcohol.Use Disorder Past Year.18-25`       <dbl> 68, 12, 117, 53, 5…
$ `Totals.Alcohol.Use Disorder Past Year.26+`         <dbl> 138, 27, 258, 101,…
$ `Rates.Alcohol.Use Disorder Past Year.12-17`        <dbl> 48.336, 61.479, 73…
$ `Rates.Alcohol.Use Disorder Past Year.18-25`        <dbl> 136.490, 187.891, …
$ `Rates.Alcohol.Use Disorder Past Year.26+`          <dbl> 49.068, 73.677, 77…
$ `Totals.Alcohol.Use Past Month.12-17`               <dbl> 57, 11, 91, 39, 48…
$ `Totals.Alcohol.Use Past Month.18-25`               <dbl> 254, 38, 352, 162,…
$ `Totals.Alcohol.Use Past Month.26+`                 <dbl> 1048, 206, 1774, 6…
$ `Rates.Alcohol.Use Past Month.12-17`                <dbl> 150.033, 158.988, …
$ `Rates.Alcohol.Use Past Month.18-25`                <dbl> 509.551, 598.311, …
$ `Rates.Alcohol.Use Past Month.26+`                  <dbl> 372.703, 559.151, …
$ `Totals.Tobacco.Cigarette Past Month.12-17`         <dbl> 52, 9, 62, 37, 235…
$ `Totals.Tobacco.Cigarette Past Month.18-25`         <dbl> 196, 28, 234, 154,…
$ `Totals.Tobacco.Cigarette Past Month.26+`           <dbl> 728, 92, 919, 539,…
$ `Rates.Tobacco.Cigarette Past Month.12-17`          <dbl> 136.906, 132.517, …
$ `Rates.Tobacco.Cigarette Past Month.18-25`          <dbl> 392.404, 439.749, …
$ `Rates.Tobacco.Cigarette Past Month.26+`            <dbl> 258.844, 249.578, …
$ `Totals.Illicit Drugs.Cocaine Used Past Year.12-17` <dbl> 6, 2, 16, 4, 53, 1…
$ `Totals.Illicit Drugs.Cocaine Used Past Year.18-25` <dbl> 27, 5, 51, 18, 259…
$ `Totals.Illicit Drugs.Cocaine Used Past Year.26+`   <dbl> 49, 5, 86, 26, 410…
$ `Rates.Illicit Drugs.Cocaine Used Past Year.12-17`  <dbl> 16.556, 24.400, 32…
$ `Rates.Illicit Drugs.Cocaine Used Past Year.18-25`  <dbl> 54.892, 83.680, 85…
$ `Rates.Illicit Drugs.Cocaine Used Past Year.26+`    <dbl> 17.513, 13.838, 25…
$ `Totals.Marijuana.New Users.12-17`                  <dbl> 20, 4, 25, 13, 158…
$ `Totals.Marijuana.New Users.18-25`                  <dbl> 18, 2, 18, 10, 126…
$ `Totals.Marijuana.New Users.26+`                    <dbl> 2, 0, 3, 1, 17, 3,…
$ `Rates.Marijuana.New Users.12-17`                   <dbl> 59.732, 77.736, 64…
$ `Rates.Marijuana.New Users.18-25`                   <dbl> 62.325, 84.250, 60…
$ `Rates.Marijuana.New Users.26+`                     <dbl> 0.914, 1.625, 1.36…
$ `Totals.Marijuana.Used Past Month.12-17`            <dbl> 24, 8, 38, 19, 241…
$ `Totals.Marijuana.Used Past Month.18-25`            <dbl> 62, 15, 91, 50, 63…
$ `Totals.Marijuana.Used Past Month.26+`              <dbl> 73, 26, 122, 57, 9…
$ `Rates.Marijuana.Used Past Month.12-17`             <dbl> 63.662, 110.781, 7…
$ `Rates.Marijuana.Used Past Month.18-25`             <dbl> 124.672, 239.907, …
$ `Rates.Marijuana.Used Past Month.26+`               <dbl> 25.967, 71.362, 36…
$ `Totals.Marijuana.Used Past Year.12-17`             <dbl> 49, 13, 82, 37, 44…
$ `Totals.Marijuana.Used Past Year.18-25`             <dbl> 119, 24, 166, 87, …
$ `Totals.Marijuana.Used Past Year.26+`               <dbl> 141, 46, 215, 104,…
$ `Rates.Marijuana.Used Past Year.12-17`              <dbl> 127.535, 188.730, …
$ `Rates.Marijuana.Used Past Year.18-25`              <dbl> 237.880, 389.026, …
$ `Rates.Marijuana.Used Past Year.26+`                <dbl> 50.275, 124.566, 6…
$ `Totals.Tobacco.Use Past Month.12-17`               <dbl> 63, 11, 73, 46, 29…
$ `Totals.Tobacco.Use Past Month.18-25`               <dbl> 226, 30, 240, 169,…
$ `Totals.Tobacco.Use Past Month.26+`                 <dbl> 930, 112, 1032, 66…
$ `Rates.Tobacco.Use Past Month.12-17`                <dbl> 166.578, 163.918, …
$ `Rates.Tobacco.Use Past Month.18-25`                <dbl> 451.976, 484.270, …
$ `Rates.Tobacco.Use Past Month.26+`                  <dbl> 330.659, 304.220, …

Before creating visualizations or building models, the dataset needed to be cleaned and organized. I selected the variables that were most relevant to the focus of this project and filtered out rows with missing values in important variables related to marijuana use. I also converted the Year variable into a factor so it could be treated as a categorical variable during analysis and visualization.

# Data cleaning and Wrangling
drugs_clean <- drugs %>%
  select(
    State,
    Year,
    `Rates.Marijuana.Used Past Year.18-25`,
    `Rates.Alcohol.Use Disorder Past Year.18-25`,
    `Rates.Tobacco.Use Past Month.18-25`,
    `Rates.Illicit Drugs.Cocaine Used Past Year.18-25`
  ) %>%
  filter(
    !is.na(`Rates.Marijuana.Used Past Year.18-25`)
  ) %>%
  mutate(
    Year = as.factor(Year)
  ) %>%
  arrange(State)

summary(drugs_clean)
    State                Year     Rates.Marijuana.Used Past Year.18-25
 Length:867         2002   : 51   Min.   :168.1                       
 Class :character   2003   : 51   1st Qu.:268.4                       
 Mode  :character   2004   : 51   Median :300.2                       
                    2005   : 51   Mean   :314.1                       
                    2006   : 51   3rd Qu.:353.1                       
                    2007   : 51   Max.   :532.0                       
                    (Other):561                                       
 Rates.Alcohol.Use Disorder Past Year.18-25 Rates.Tobacco.Use Past Month.18-25
 Min.   : 71.22                             Min.   :165.1                     
 1st Qu.:120.07                             1st Qu.:347.6                     
 Median :148.27                             Median :416.9                     
 Mean   :151.18                             Mean   :402.8                     
 3rd Qu.:177.74                             3rd Qu.:464.0                     
 Max.   :272.94                             Max.   :588.8                     
                                                                              
 Rates.Illicit Drugs.Cocaine Used Past Year.18-25
 Min.   : 18.34                                  
 1st Qu.: 42.85                                  
 Median : 55.66                                  
 Mean   : 56.92                                  
 3rd Qu.: 67.48                                  
 Max.   :122.38                                  
                                                 
# Exploratory summary: mean rate of each substance measure for ages 18-25
drugs_clean %>%
  summarize(
    Average_Marijuana_Use = mean(`Rates.Marijuana.Used Past Year.18-25`, na.rm = TRUE),
    Average_Alcohol_Disorder = mean(`Rates.Alcohol.Use Disorder Past Year.18-25`, na.rm = TRUE),
    Average_Tobacco_Use = mean(`Rates.Tobacco.Use Past Month.18-25`, na.rm = TRUE)
  )
# A tibble: 1 × 3
  Average_Marijuana_Use Average_Alcohol_Disorder Average_Tobacco_Use
                  <dbl>                    <dbl>               <dbl>
1                  314.                     151.                403.

The summary statistics provide a quick overview of average substance use rates among individuals aged 18-25 across all 867 state-year rows: about 314 for past-year marijuana use, 151 for past-year alcohol use disorder, and 403 for past-month tobacco use. These values help establish the general scale of marijuana, alcohol, and tobacco use before moving into deeper statistical analysis and visualizations.
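The dataset does not document the units of the Totals and Rates columns. One way to check them, sketched below, is to recompute a rate from the first row of the glimpse() output (Alabama, 2002). The assumption being tested is mine and is not stated anywhere in the data: that totals are recorded in thousands of people and rates are expressed per 1,000 people.

```r
# Values copied from the first row of the glimpse() output (Alabama, 2002).
population_18_25 <- 499453    # Population.18-25
total_aud_18_25  <- 68        # Totals.Alcohol.Use Disorder Past Year.18-25
reported_rate    <- 136.490   # Rates.Alcohol.Use Disorder Past Year.18-25

# Assumption (not documented in the dataset): totals are in thousands of
# people and rates are expressed per 1,000 people.
implied_rate <- (total_aud_18_25 * 1000) / population_18_25 * 1000

# The total is rounded to a whole number of thousands, so only approximate
# agreement with the reported rate should be expected.
print(round(implied_rate, 1))
print(abs(implied_rate - reported_rate))
```

If this reading of the units is right, the average marijuana rate of about 314 would correspond to roughly 31% of 18 to 25 year olds reporting use in the past year. This is an inference from the numbers, not something the data source confirms.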

Multiple Linear Regression Analysis

To better understand the factors associated with marijuana use among individuals aged 18-25, I created a multiple linear regression model. The response variable for the model is the rate of marijuana use in the past year among individuals aged 18-25. The predictor variables include alcohol use disorder rates, tobacco use rates, cocaine use rates, and year.

The regression equation for the model is:

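\[Y = \beta_0 + \beta_1(Alcohol) + \beta_2(Tobacco) + \beta_3(Cocaine) + \sum_{k=2003}^{2018} \gamma_k \, \mathbb{1}(Year = k) + \varepsilon\]

Where:

  • Y = Marijuana use rate among individuals aged 18-25
  • Alcohol = Alcohol use disorder rate
  • Tobacco = Tobacco use rate
  • Cocaine = Cocaine use rate
  • Year = Year of observation, entered as a factor with 2002 as the reference level, so each year from 2003 to 2018 has its own coefficient measuring the difference from 2002
  • ε = Error term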
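
Because Year was converted to a factor during cleaning, lm() does not fit a single slope for it. Instead it creates one indicator column per year other than the first. The sketch below shows this on a small made-up data frame (the values and the three-year range are illustrative only, not taken from the dataset), using base R's model.matrix() to display the design matrix that lm() would build.

```r
# Toy data frame (made-up values, for illustration only).
toy <- data.frame(
  Year    = factor(c(2002, 2003, 2004, 2002, 2003, 2004)),
  Alcohol = c(136, 140, 150, 188, 180, 175)
)

# model.matrix() returns the design matrix that lm() builds from a formula.
# The first factor level (2002) is absorbed into the intercept, and every
# other level gets its own 0/1 indicator column.
design <- model.matrix(~ Alcohol + Year, data = toy)

print(colnames(design))
print(design)
```

The same mechanism explains why the regression output lists coefficients named Year2003 through Year2018 but none for 2002.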
model <- lm(
  `Rates.Marijuana.Used Past Year.18-25` ~
    `Rates.Alcohol.Use Disorder Past Year.18-25` +
    `Rates.Tobacco.Use Past Month.18-25` +
    `Rates.Illicit Drugs.Cocaine Used Past Year.18-25` +
    Year,
  data = drugs_clean
)

summary(model)

Call:
lm(formula = `Rates.Marijuana.Used Past Year.18-25` ~ `Rates.Alcohol.Use Disorder Past Year.18-25` + 
    `Rates.Tobacco.Use Past Month.18-25` + `Rates.Illicit Drugs.Cocaine Used Past Year.18-25` + 
    Year, data = drugs_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-107.252  -25.327   -2.639   24.884  143.150 

Coefficients:
                                                    Estimate Std. Error t value
(Intercept)                                         47.72518   15.20669   3.138
`Rates.Alcohol.Use Disorder Past Year.18-25`         0.42922    0.06315   6.797
`Rates.Tobacco.Use Past Month.18-25`                -0.04227    0.02641  -1.600
`Rates.Illicit Drugs.Cocaine Used Past Year.18-25`   2.88625    0.09335  30.918
Year2003                                           -10.70150    7.65523  -1.398
Year2004                                           -20.01631    7.65699  -2.614
Year2005                                           -20.75634    7.65823  -2.710
Year2006                                           -13.43812    7.66943  -1.752
Year2007                                             1.51048    7.71662   0.196
Year2008                                            30.18266    7.77125   3.884
Year2009                                            64.24317    7.88009   8.153
Year2010                                            79.54714    8.02212   9.916
Year2011                                            91.10052    8.08485  11.268
Year2012                                           103.19933    8.22848  12.542
Year2013                                           110.31732    8.42469  13.095
Year2014                                           100.97084    8.66886  11.648
Year2015                                            91.98875    8.98942  10.233
Year2016                                            96.28898    9.17704  10.492
Year2017                                            99.56351    9.40088  10.591
Year2018                                           117.08666    9.72672  12.038
                                                   Pr(>|t|)    
(Intercept)                                        0.001757 ** 
`Rates.Alcohol.Use Disorder Past Year.18-25`       2.01e-11 ***
`Rates.Tobacco.Use Past Month.18-25`               0.109883    
`Rates.Illicit Drugs.Cocaine Used Past Year.18-25`  < 2e-16 ***
Year2003                                           0.162499    
Year2004                                           0.009105 ** 
Year2005                                           0.006858 ** 
Year2006                                           0.080107 .  
Year2007                                           0.844858    
Year2008                                           0.000111 ***
Year2009                                           1.28e-15 ***
Year2010                                            < 2e-16 ***
Year2011                                            < 2e-16 ***
Year2012                                            < 2e-16 ***
Year2013                                            < 2e-16 ***
Year2014                                            < 2e-16 ***
Year2015                                            < 2e-16 ***
Year2016                                            < 2e-16 ***
Year2017                                            < 2e-16 ***
Year2018                                            < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.64 on 847 degrees of freedom
Multiple R-squared:  0.6715,    Adjusted R-squared:  0.6641 
F-statistic: 91.12 on 19 and 847 DF,  p-value: < 2.2e-16

Diagnostic Plots

par(mfrow = c(2,2))
plot(model)

Regression Interpretation

The multiple linear regression model was used to analyze how alcohol use disorder rates, tobacco use rates, cocaine use rates, and year were associated with marijuana use among individuals aged 18-25. The overall model was statistically significant, with an F-statistic of 91.12 on 19 and 847 degrees of freedom and a p-value smaller than 0.001, suggesting that the predictor variables collectively explained a significant amount of variation in marijuana use rates.

The model produced an Adjusted R-squared value of approximately 0.664, meaning that about 66.4% of the variation in marijuana use rates could be explained by the variables included in the model, after adjusting for the number of predictors. This indicates a relatively strong model for a public health and behavioral dataset.

Among the substance use predictors, cocaine use rates had the strongest relationship with marijuana use rates, with a coefficient estimate of approximately 2.89 (t = 30.92, p < 0.001). Holding the other predictors constant, a one-unit increase in the cocaine use rate was associated with an increase of about 2.89 units in the marijuana use rate. Alcohol use disorder rates were also statistically significant and positively associated with marijuana use (estimate of about 0.43). Tobacco use rates had a small negative coefficient (about -0.04) that was not statistically significant at the 0.05 level (p = 0.11), suggesting that tobacco use is not a strong predictor in this specific model once the other variables are included.

The year coefficients are measured relative to 2002, the reference year. Compared with 2002, the coefficients for 2004 and 2005 were significantly negative, while every year from 2008 through 2018 had a positive and statistically significant coefficient, reaching about 117 in 2018. This suggests that, after adjusting for the other predictors, marijuana use rates among individuals aged 18-25 generally increased over the later years included in the dataset.

The diagnostic plots showed that the residuals were mostly centered around zero, although their spread widened somewhat as fitted values increased, which may indicate mild heteroscedasticity. The Q-Q plot suggested that the residuals were approximately normally distributed with some deviations at the tails. Overall, the regression diagnostics indicated that the model performed reasonably well while still showing some natural variability expected in real-world behavioral and public health data.
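
To give a sense of how precisely the cocaine coefficient is estimated, the following sketch computes a 95% confidence interval by hand from the numbers printed in the summary(model) output. It assumes the usual t-based interval for a linear model coefficient; the variable names are only illustrative.

```r
# Values copied from the summary(model) output.
estimate  <- 2.88625   # cocaine coefficient
std_error <- 0.09335   # its standard error
df_resid  <- 847       # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(
  lower = estimate - t_crit * std_error,
  upper = estimate + t_crit * std_error
)

print(round(ci, 2))
```

The interval comes out at roughly 2.7 to 3.1, well away from zero. With the fitted model object available, confint(model) or broom::tidy(model, conf.int = TRUE) should report the same kind of interval for every coefficient at once.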

Visualization 1: Marijuana Use by State

state_marijuana <- drugs_clean %>%
  group_by(State) %>%
  summarize(
    Average_Marijuana_Use = mean(`Rates.Marijuana.Used Past Year.18-25`, na.rm = TRUE)
  ) %>%
  arrange(desc(Average_Marijuana_Use)) %>%
  slice_head(n = 15)

ggplot(state_marijuana, aes(
  x = reorder(State, Average_Marijuana_Use),
  y = Average_Marijuana_Use,
  fill = Average_Marijuana_Use
)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient(
    low = "lightblue",
    high = "darkblue",
    name = "Average Marijuana Use Rate"
  ) +
  labs(
    title = "Top 15 States by Average Marijuana Use Rate Among Ages 18-25",
    x = "State",
    y = "Average Marijuana Use Rate",
    caption = "Data Source: Public substance use survey estimates organized by state and year"
  ) +
  theme_minimal()

This visualization shows the top 15 states with the highest average marijuana use rates among individuals aged 18-25. Vermont had the highest average marijuana use rate in the dataset, followed closely by Rhode Island, New Hampshire, and Massachusetts. Many of the states with the highest marijuana use rates were located in the Northeast or Western regions of the United States. The differences between states may reflect variations in marijuana laws, cultural attitudes, access to substances, or other social and environmental factors. The visualization also demonstrates that marijuana use rates vary substantially across states, suggesting that location may play an important role in substance use behavior among young adults.

Visualization 2: Relationship Between Tobacco Use and Marijuana Use

scatter_plot <- ggplot(
  drugs_clean,
  aes(
    x = `Rates.Tobacco.Use Past Month.18-25`,
    y = `Rates.Marijuana.Used Past Year.18-25`,
    color = Year
  )
) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Relationship Between Tobacco Use and Marijuana Use Among Ages 18-25",
    x = "Tobacco Use Rate",
    y = "Marijuana Use Rate",
    color = "Year",
    caption = "Data Source: Public substance use survey estimates organized by state and year"
  ) +
  theme_dark()

ggplotly(scatter_plot)
`geom_smooth()` using formula = 'y ~ x'

The scatterplot explores the relationship between tobacco use and marijuana use among individuals aged 18-25 across different states and years. Although there is a large amount of variability in the data, the trend line suggests a slight negative relationship between tobacco use rates and marijuana use rates in this dataset. The visualization also shows how marijuana use rates changed across years, with later years generally appearing to have higher marijuana use values. The interactive format makes it easier to examine clusters of points and identify possible trends or outliers throughout the dataset. Even though the relationship between tobacco and marijuana use was not especially strong in the regression model, the visualization still helps demonstrate how substance use behaviors may vary across states and over time.
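
The black trend line pools every state and year together, whereas the regression model estimates the tobacco coefficient after accounting for year. These two quantities do not have to agree. The sketch below uses made-up numbers (not values from the dataset) to show the mechanism: when x falls and y rises from one year to the next, a pooled slope can be negative even if the slope within each year is positive.

```r
# Toy data (made-up values): x falls and y rises from one year to the next,
# while within each year y increases with x.
toy <- data.frame(
  year = factor(rep(c(2002, 2010, 2018), each = 3)),
  x    = c(50, 51, 52, 40, 41, 42, 30, 31, 32),
  y    = c(10, 11, 13, 20, 21, 23, 30, 31, 33)
)

# Pooled slope: ignores year, like a single trend line over all points.
pooled <- coef(lm(y ~ x, data = toy))["x"]

# Year-adjusted slope: each year gets its own intercept.
adjusted <- coef(lm(y ~ x + year, data = toy))["x"]

print(c(pooled = unname(pooled), adjusted = unname(adjusted)))
```

This only illustrates why the two views can differ. In this project both the trend line and the regression coefficient for tobacco were slightly negative, so no sign reversal was observed.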

Tableau Visualization

In addition to the R visualizations, I also created a Tableau dashboard to explore substance use trends across different states and years. The dashboard focuses on marijuana use rates among individuals aged 18-25 and allows viewers to interact with the data more dynamically.

My Tableau Public visualization

The Tableau dashboard makes it easier to compare marijuana use rates across states while also allowing the viewer to filter and explore the data interactively. This type of visualization is especially useful for large datasets because it provides a more flexible way to identify patterns and trends that may not be as obvious in static graphs alone.

Background Research

Substance abuse continues to be a significant public health issue in the United States, especially among younger populations. According to the National Institute on Drug Abuse (NIDA), marijuana is one of the most commonly used drugs in the United States, particularly among young adults. Researchers have found that substance use behaviors are often connected, meaning that individuals who use tobacco or alcohol may also be more likely to use marijuana or other drugs.

Studies have also shown that social environment, mental health, peer influence, and access to substances can all affect substance use rates. In recent years, changing marijuana laws across different states may have also influenced usage rates among young adults. Because of these factors, analyzing patterns across states and years can help provide a better understanding of how substance use trends vary throughout the country.

In addition, the Centers for Disease Control and Prevention (CDC) states that substance abuse can have long-term effects on physical health, mental health, relationships, education, and employment. Understanding these trends is important because substance abuse impacts not only individuals, but also families and communities.

Conclusion

The purpose of this project was to explore substance use trends across different states and years in the United States, with a focus on marijuana use among individuals aged 18-25. Using multiple linear regression and several visualizations, I analyzed how alcohol use disorder, tobacco use, and cocaine use may relate to marijuana use rates. The regression results suggested that some substance use behaviors are connected and may help explain differences in marijuana use across states and years.

The visualizations also helped reveal patterns within the dataset. The state comparison chart showed that some states had noticeably higher marijuana use rates than others, while the interactive scatterplot suggested a slight negative relationship between tobacco use and marijuana use, consistent with the small, non-significant tobacco coefficient in the regression model. The Tableau dashboard further improved the analysis by allowing the data to be explored interactively across states and years.

One interesting part of this project was seeing how widespread substance abuse trends are throughout the country. Because this topic has personally affected people close to me, it made the analysis feel more meaningful and helped me better understand how many individuals and communities may be impacted by substance abuse. While the dataset provided strong information for analysis, one limitation was the lack of detailed documentation about the exact data collection methodology.

If I had more time, I would have liked to include additional visualizations involving geographic maps or more advanced interactive features to compare substance use trends over time. Overall, this project demonstrated how data analysis and visualization can be used to better understand important public health issues and identify patterns within large datasets.

References

National Institute on Drug Abuse. “Marijuana Research Report.” National Institutes of Health, U.S. Department of Health and Human Services, https://nida.nih.gov/publications/research-reports/marijuana/marijuana-addictive

Centers for Disease Control and Prevention. “Understanding Drug Overdose and Prevention.” CDC, https://www.cdc.gov/drugoverdose/index.html