Project 3

DRUGS

Intro

This dataset focuses on substance abuse, examining different types of drugs such as cigarettes, marijuana, cocaine, and alcohol. It was collected in the United States and emphasizes variations across different age groups and states. The data was gathered from individual states as part of the NSDUH study (National Survey on Drug Use and Health) and spans from 2002 to 2018.

It offers a comprehensive overview of substance use behaviors across three age groups: 12–17, 18–25, and 26 and older. For each state and year, the dataset reports population size along with the total number of users and the rate per 1,000 people for alcohol, tobacco, cigarettes, marijuana, and cocaine. It covers both past-month and past-year use, as well as the number of new marijuana users within the last year.

I chose this dataset because I wanted to analyze substance use trends across the country for my age group (18–25), as I feel this is a critical period when many people begin developing addictions. I’m hoping to explore patterns across all states to identify any noticeable trends. This analysis is important because it can help determine whether progress is being made in the fight against drug abuse. It also allows for comparisons between states with high and low substance abuse rates, potentially revealing the impact of different laws, policies, and prevention efforts implemented across the country.

  1. Load the necessary libraries.

Load Library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)

library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(ggplot2)         

Load your dataset using the read_csv() command (do NOT use read.csv() ).

setwd("~/Desktop/data 110")   # folder that holds drugs.csv

drugs <- read_csv("drugs.csv")
Rows: 867 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (52): Year, Population.12-17, Population.18-25, Population.26+, Totals.A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
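
The 867 rows reported above are consistent with one row per state per year: 17 years (2002 to 2018) times 51 jurisdictions gives 867, assuming the 51 are the 50 states plus the District of Columbia (which does appear in the tables later in this report). A minimal sketch of how that structure could be checked is shown here on a made-up toy table; the state names "A", "B", and "C" are placeholders, not values from the dataset.

```r
library(dplyr)

# Toy stand-in for the real data: 3 hypothetical states x 2 years.
toy <- data.frame(
  State = rep(c("A", "B", "C"), each = 2),
  Year  = rep(c(2017, 2018), times = 3)
)

# Count rows for each State-Year pair; any count above 1 is a duplicate.
dupes <- toy |>
  count(State, Year) |>
  filter(n > 1)

n_states <- n_distinct(toy$State)
n_years  <- n_distinct(toy$Year)

nrow(dupes) == 0                  # TRUE when no State-Year pair repeats
nrow(toy) == n_states * n_years   # TRUE when every state has every year
```

Running the same two checks on drugs instead of toy would show whether the real data has exactly one row for every state and year.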

Statistical Analysis

cor(drugs$`Rates.Tobacco.Use Past Month.18-25`, drugs$`Rates.Alcohol.Use Past Month.18-25`) # Check correlation
[1] 0.3643252
fit1 <- lm(`Rates.Alcohol.Use Past Month.18-25` ~ `Rates.Tobacco.Use Past Month.18-25`, data = drugs) # predicting alcohol use from tobacco use
summary(fit1)

Call:
lm(formula = `Rates.Alcohol.Use Past Month.18-25` ~ `Rates.Tobacco.Use Past Month.18-25`, 
    data = drugs)

Residuals:
     Min       1Q   Median       3Q      Max 
-246.381  -43.437    7.697   50.185  173.461 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          463.89586   12.78948   36.27   <2e-16 ***
`Rates.Tobacco.Use Past Month.18-25`   0.35850    0.03116   11.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 72.68 on 865 degrees of freedom
Multiple R-squared:  0.1327,    Adjusted R-squared:  0.1317 
F-statistic: 132.4 on 1 and 865 DF,  p-value: < 2.2e-16
plot(fit1)
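
One way to read the output above: in a regression with a single predictor, R-squared equals the squared correlation. Here 0.3643252 squared is about 0.1327, which matches the Multiple R-squared, so tobacco use accounts for roughly 13% of the variation in alcohol use among 18–25 year olds across all states and years. The sketch below checks that identity on simulated data; x and y are hypothetical stand-ins for the two rate columns, not the real values.

```r
# Simulated data only; x and y are hypothetical stand-ins for the two rate columns.
set.seed(1)
x <- rnorm(100, mean = 300, sd = 50)
y <- 450 + 0.4 * x + rnorm(100, sd = 70)

r   <- cor(x, y)               # Pearson correlation
fit <- lm(y ~ x)               # simple linear regression
r2  <- summary(fit)$r.squared  # R-squared reported by summary()

all.equal(r^2, r2)   # TRUE: squared correlation equals R-squared
```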

Filtering

drugs_recent <- drugs |> 
  filter(Year == 2018) |>                        # keep only the most recent year
  select(State,                                  # keep only the columns I needed
         "Rates.Alcohol.Use Past Month.12-17", 
         "Rates.Alcohol.Use Past Month.18-25", 
         "Rates.Alcohol.Use Past Month.26+",
         "Rates.Marijuana.Used Past Year.12-17",
         "Rates.Marijuana.Used Past Year.18-25",
         "Rates.Marijuana.Used Past Year.26+",
         "Rates.Tobacco.Use Past Month.12-17",
         "Rates.Tobacco.Use Past Month.18-25",
         "Rates.Tobacco.Use Past Month.26+") |>
  rename(                                        # shorter, easier-to-read names
    Alcohol_12_17 = "Rates.Alcohol.Use Past Month.12-17",
    Alcohol_18_25 = "Rates.Alcohol.Use Past Month.18-25",
    Alcohol_26_plus = "Rates.Alcohol.Use Past Month.26+",
    Marijuana_12_17 = "Rates.Marijuana.Used Past Year.12-17",
    Marijuana_18_25 = "Rates.Marijuana.Used Past Year.18-25",
    Marijuana_26_plus = "Rates.Marijuana.Used Past Year.26+",
    Tobacco_12_17 = "Rates.Tobacco.Use Past Month.12-17",
    Tobacco_18_25 = "Rates.Tobacco.Use Past Month.18-25",
    Tobacco_26_plus = "Rates.Tobacco.Use Past Month.26+"
  )

I filtered the data to 2018 and selected the variables I wanted to explore: the alcohol, marijuana, and tobacco use rates for each of the three age groups. I then renamed them to create shorter, more understandable column titles.
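
As a side note, dplyr's select() can rename columns while selecting them (new_name = old_name), so the select and rename steps could be combined. The sketch below shows the idea on a made-up two-row table; the values are invented and only two of the original column names are used.

```r
library(dplyr)

# Toy data with two of the original long column names; the values are made up.
toy <- data.frame(
  State = c("A", "B"),
  Year  = c(2018, 2018),
  `Rates.Alcohol.Use Past Month.18-25` = c(500, 600),
  `Rates.Tobacco.Use Past Month.18-25` = c(300, 250),
  check.names = FALSE   # keep the spaces and dashes in the column names
)

toy_recent <- toy |>
  filter(Year == 2018) |>
  select(
    State,
    Alcohol_18_25 = `Rates.Alcohol.Use Past Month.18-25`,  # select and rename in one step
    Tobacco_18_25 = `Rates.Tobacco.Use Past Month.18-25`
  )

names(toy_recent)
```

Either approach gives the same columns; the combined form just avoids typing each long name twice.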

Plot 1: Histogram of Alcohol Use Among 18–25 Year Olds for the year 2018

p1 <- drugs_recent |> 
  ggplot(aes(x = Alcohol_18_25)) +   # alcohol use rate on the x-axis
  geom_histogram(fill = "#69b3a2", bins = 10, color = "pink") +   # histogram with 10 bins
  labs(
    title = "Alcohol Use (18–25) in 2018",   # plot title
    x = "Alcohol Use Rate per 1,000",        # x-axis label
    y = "Frequency",                         # y-axis label (number of states)
    caption = "Source: NSDUH"
  ) +
  theme_minimal()   # clean background theme
p1

This histogram shows how past-month alcohol use rates (per 1,000) among 18–25 year olds are distributed across states in 2018.

Marijuana Use by Age Group (2018)

I will categorize the ages of marijuana users and create a chart to identify which age group has the highest usage across the United States.

marijuana_long <- drugs_recent |> 
  
  pivot_longer(cols = starts_with("Marijuana"), 
               names_to = "Age_Group",
               values_to = "Rate") |>   # Turn age groups into a single column
  
  mutate(Age_Group = case_when(   # Make the names shorter
    Age_Group == "Marijuana_12_17" ~ "12–17",
    Age_Group == "Marijuana_18_25" ~ "18–25",
    Age_Group == "Marijuana_26_plus" ~ "26+"
  ))
marijuana_long
# A tibble: 153 × 9
   State Alcohol_12_17 Alcohol_18_25 Alcohol_26_plus Tobacco_12_17 Tobacco_18_25
   <chr>         <dbl>         <dbl>           <dbl>         <dbl>         <dbl>
 1 Alab…          82.2          458.            476.          53.1          306.
 2 Alab…          82.2          458.            476.          53.1          306.
 3 Alab…          82.2          458.            476.          53.1          306.
 4 Alas…          87.3          535.            570.          65.6          258.
 5 Alas…          87.3          535.            570.          65.6          258.
 6 Alas…          87.3          535.            570.          65.6          258.
 7 Ariz…          79.1          490.            539.          29.3          215.
 8 Ariz…          79.1          490.            539.          29.3          215.
 9 Ariz…          79.1          490.            539.          29.3          215.
10 Arka…          84.1          487.            441.          54.7          309.
# ℹ 143 more rows
# ℹ 3 more variables: Tobacco_26_plus <dbl>, Age_Group <chr>, Rate <dbl>

Plot 2: Bar Graph

p2 <- marijuana_long |> 
  
  group_by(Age_Group) |>   # Group by each age group
  
  summarise(Average_Rate = mean(Rate, na.rm = TRUE)) |>  # Get average rate for each group
  
  ggplot(aes(x = Age_Group, y = Average_Rate, fill = Age_Group)) +   # Setup bar graph
  
  geom_col() +   # Make the bars
  
  labs(
    title = "Average Marijuana Use by Age Group (2018)",  # Title
    
    x = "Age Group",    # x-axis
    
    y = "Average Use Rate per 1,000"    # y-axis
  ) +
  theme_minimal()   # Clean theme

p2  # Show the plot

It’s interesting that the 18–25 age group has a much higher average marijuana use rate than the other age brackets. I’m surprised that the rate for those aged 26 and older is so similar to the rate for the 12–17 group.
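
One caveat about these averages: mean(Rate) treats every state equally, so a small state counts as much as a large one. The dataset also has population columns (for example Population.18-25), so a population-weighted average is possible and would sit closer to a national rate. The sketch below illustrates the difference with made-up numbers for three hypothetical states; none of these values come from the dataset.

```r
# Made-up numbers for three hypothetical states; not values from the dataset.
toy <- data.frame(
  State      = c("A", "B", "C"),
  Rate       = c(500, 400, 300),     # users per 1,000
  Population = c(1000, 1000, 8000)   # people in the age group
)

simple_avg   <- mean(toy$Rate)                           # each state counts equally
weighted_avg <- weighted.mean(toy$Rate, toy$Population)  # each person counts equally

simple_avg    # 400
weighted_avg  # 330
```

In this toy case the large state has the lowest rate, so the weighted average comes out lower than the simple one.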

Plot 3: Top 10 States with Highest Drug Use (18–25)

top10_tobacco <- drugs |> 
  
  filter(Year == 2018) |>  # 2018 data
  
  arrange(desc(`Rates.Tobacco.Use Past Month.18-25`)) |>  # Highest tobacco use first
  
  slice_head(n = 10) |>  # Select top 10 states
  
  select(State, `Rates.Tobacco.Use Past Month.18-25`) |> 
  
  rename(Rate = `Rates.Tobacco.Use Past Month.18-25`)

top10_tobacco
# A tibble: 10 × 2
   State          Rate
   <chr>         <dbl>
 1 West Virginia  400.
 2 Montana        398.
 3 Wyoming        381.
 4 Oklahoma       354.
 5 Kentucky       351.
 6 Tennessee      332.
 7 Indiana        331.
 8 Ohio           331.
 9 Mississippi    321.
10 Vermont        316.
top10_cocaine <- drugs |>
  
  filter(Year == 2018) |>  #  2018 data
  
  arrange(desc(`Rates.Illicit Drugs.Cocaine Used Past Year.18-25`)) |>  # Highest cocaine use first
  
  slice_head(n = 10) |>  # Select top 10 states
  
  select(State, `Rates.Illicit Drugs.Cocaine Used Past Year.18-25`) |>
  
  rename(Rate = `Rates.Illicit Drugs.Cocaine Used Past Year.18-25`)  

top10_cocaine
# A tibble: 10 × 2
   State          Rate
   <chr>         <dbl>
 1 Colorado      110. 
 2 New Hampshire  92.3
 3 Vermont        90.9
 4 Oregon         84.1
 5 California     77.1
 6 Nevada         74.1
 7 Maine          72.5
 8 Massachusetts  69.8
 9 Montana        69.2
10 Rhode Island   66.6
top10_marijuana <- drugs |> 
  
  filter(Year == 2018) |>  #  2018 data
  
  arrange(desc(`Rates.Marijuana.Used Past Year.18-25`)) |>  # highest marijuana use
  
  slice_head(n = 10) |>  # select top 10
  
  select(State, `Rates.Marijuana.Used Past Year.18-25`) |>

  rename(Rate = `Rates.Marijuana.Used Past Year.18-25`)  
  
top10_marijuana
# A tibble: 10 × 2
   State                 Rate
   <chr>                <dbl>
 1 Vermont               522.
 2 District of Columbia  498.
 3 Maine                 486.
 4 Colorado              485.
 5 Oregon                473.
 6 Massachusetts         466.
 7 New Hampshire         459.
 8 Washington            455.
 9 Rhode Island          450.
10 Michigan              441.
top10_alcohol <- drugs |> 
  
  filter(Year == 2018) |>  #  2018 data
  
  arrange(desc(`Rates.Alcohol.Use Past Month.18-25`)) |>  # highest alcohol use rate
  
  slice_head(n = 10) |>  # Keep top 10 states
  
  select(State, `Rates.Alcohol.Use Past Month.18-25`) |> 
  
  rename(Rate = `Rates.Alcohol.Use Past Month.18-25`)  

top10_alcohol
# A tibble: 10 × 2
   State                 Rate
   <chr>                <dbl>
 1 District of Columbia  706.
 2 New Hampshire         689.
 3 North Dakota          678.
 4 Vermont               677.
 5 Wisconsin             664.
 6 Rhode Island          662.
 7 Connecticut           656.
 8 Minnesota             652.
 9 Massachusetts         641.
10 Iowa                  639.
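
The four top-10 tables above repeat the same steps with a different column each time. As a possible simplification, the steps could be wrapped in a small helper. The function top_states() below is a hypothetical name introduced for illustration (it is not part of the original analysis), and it is demonstrated on a made-up table.

```r
library(dplyr)

# Hypothetical helper (not part of the original analysis): returns the n rows
# with the highest values of rate_col for one year.
top_states <- function(data, rate_col, target_year = 2018, n = 10) {
  data |>
    filter(Year == target_year) |>         # keep one year
    arrange(desc(.data[[rate_col]])) |>    # highest rate first
    slice_head(n = n) |>                   # keep the top n rows
    select(State, Rate = all_of(rate_col)) # keep State and rename the rate column
}

# Made-up data to show the call; values are not from the dataset.
toy <- data.frame(
  State = c("A", "B", "C", "A"),
  Year  = c(2018, 2018, 2018, 2017),
  `Rates.Alcohol.Use Past Month.18-25` = c(500, 650, 600, 700),
  check.names = FALSE
)

top2 <- top_states(toy, "Rates.Alcohol.Use Past Month.18-25", n = 2)
top2
```

With the real data, a call such as top_states(drugs, "Rates.Alcohol.Use Past Month.18-25") should follow the same steps used to build top10_alcohol.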

Plot 3

p3 <- top10_alcohol |> 
  
  ggplot(aes(x = reorder(State, Rate), y = Rate, fill = State)) +
  
  geom_col() +   # Make vertical bars
  
  coord_flip() + # Flip bars to horizontal
  
  labs(
    title = "Top 10 States with Highest Alcohol Use (18–25) in 2018",  # Title
    
    x = "State",        # X-axis 
    
    y = "Alcohol Use Rate per 1,000",  # Y-axis 
    
    caption = "Source: NSDUH"   #  source
  ) +
  
  theme_minimal()  # Clean white background

p3  # Show the plot

Summary

Working with this data was a lot of fun, even though I initially felt overwhelmed by the number of variables and unsure about the direction of my project. After some guidance from Professor Saidi, I gained clarity on my focus. I decided to explore several of the substances included in the dataset, concentrating on the 18 to 25 age group. I chose this demographic because I believe it represents a significant portion of substance abuse, especially with alcohol and marijuana being prevalent in college environments. The 2018 data supported this expectation for marijuana: the 18–25 group had the highest average use rate of the three age groups, which highlights the importance of addressing these issues among young adults.

Sources

OpenAI ChatGPT: https://chatgpt.com/
https://www.rpubs.com/rsaidi
https://www.youtube.com/
https://en.wikipedia.org/wiki/Legality_of_cannabis_by_U.S._jurisdiction
https://www.samhsa.gov/data/data-we-collect/nsduh-national-survey-drug-use-and-health
https://evokewellness.com/blog/top-10-most-used-drugs/