At the VERY TOP of your document, include an image that connects your topic with your dataset.
Intro
This dataset focuses on substance abuse, examining different types of drugs such as cigarettes, marijuana, cocaine, and alcohol. It was collected in the United States and emphasizes variations across different age groups and states. The data was gathered from individual states as part of the NSDUH study (National Survey on Drug Use and Health) and spans from 2002 to 2018.
It offers a comprehensive overview of substance use behaviors across age groups, specifically those aged 12–17, 18–25, and 26 and older. The dataset presents annual statistics on population size, along with the total number and rate per 1,000 individuals for alcohol use, tobacco use, cigarette smoking, marijuana use, and cocaine use. Additionally, it provides insights into both past-month and past-year usage, as well as the number of new marijuana users within the last year.
I chose this dataset because I wanted to analyze substance use trends across the country for my age group (18–25), as I feel this is a critical period when many people begin developing addictions. I’m hoping to explore patterns across all states to identify any noticeable trends. This analysis is important because it can help determine whether progress is being made in the fight against drug abuse. It also allows for comparisons between states with high and low substance abuse rates, potentially revealing the impact of different laws, policies, and prevention efforts implemented across the country.
Load the necessary libraries.
Load Library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(ggplot2)
Load your dataset using the read_csv() command (do NOT use read.csv() ).
Rows: 867 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): State
dbl (52): Year, Population.12-17, Population.18-25, Population.26+, Totals.A...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
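The loading call itself is not shown above, only its output. As a minimal sketch of that step, the block below writes a tiny stand-in CSV to a temporary path and reads it back with read_csv(). The file, the object name drugs_demo, and the two-row subset are assumptions for illustration; the two rates are the rounded 2018 values printed later in this document.

```r
library(readr)

# Hypothetical stand-in for the real file: a tiny CSV written to a temp path.
# The two rows use the rounded 2018 alcohol rates printed later in this document.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "State,Year,Rates.Alcohol.Use Past Month.18-25",
  "District of Columbia,2018,706",
  "New Hampshire,2018,689"
), csv_path)

# read_csv() keeps column names as written (spaces and hyphens included),
# so they need backticks or quotes when referenced later.
drugs_demo <- read_csv(csv_path, show_col_types = FALSE)
names(drugs_demo)

# Base read.csv() rewrites the same names with make.names() by default.
names(read.csv(csv_path))
```

This may be one practical reason to prefer read_csv() here: the column names used in the rest of the analysis, such as `Rates.Alcohol.Use Past Month.18-25`, only match if they are read in unchanged.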
Statistical Analysis
cor(drugs$"Rates.Tobacco.Use Past Month.18-25", drugs$"Rates.Alcohol.Use Past Month.18-25") # Check correlation
[1] 0.3643252
fit1 <- lm(`Rates.Alcohol.Use Past Month.18-25` ~ `Rates.Tobacco.Use Past Month.18-25`, data = drugs) # predicting alcohol use from tobacco use
summary(fit1)
Call:
lm(formula = `Rates.Alcohol.Use Past Month.18-25` ~ `Rates.Tobacco.Use Past Month.18-25`,
data = drugs)
Residuals:
Min 1Q Median 3Q Max
-246.381 -43.437 7.697 50.185 173.461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 463.89586 12.78948 36.27 <2e-16 ***
`Rates.Tobacco.Use Past Month.18-25` 0.35850 0.03116 11.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 72.68 on 865 degrees of freedom
Multiple R-squared: 0.1327, Adjusted R-squared: 0.1317
F-statistic: 132.4 on 1 and 865 DF, p-value: < 2.2e-16
plot(fit1)
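For a regression with a single predictor, the Multiple R-squared equals the squared correlation between the two variables, so the two outputs above can be checked against each other. The sketch below does that arithmetic with the reported numbers and then confirms the identity on simulated data. The simulated x and y and the names fit_demo, r, and r_squared are assumptions for illustration, not part of the dataset.

```r
# Reported correlation from the cor() output above.
r_reported <- 0.3643252
r_reported^2  # compare with the reported Multiple R-squared of 0.1327

# Simulated data (hypothetical, for illustration only).
set.seed(1)
x <- rnorm(200)
y <- 0.4 * x + rnorm(200)

fit_demo <- lm(y ~ x)
r_squared <- summary(fit_demo)$r.squared
r <- cor(x, y)

# In simple linear regression these two quantities agree.
all.equal(r^2, r_squared)
```

Read this way, tobacco use accounts for roughly 13% of the variation in alcohol use rates among 18–25 year olds in this model.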
Filtering
drugs_recent <- drugs |>
  filter(Year == 2018) |> # filter for most recent year
  select(
    State,
    "Rates.Alcohol.Use Past Month.12-17",
    "Rates.Alcohol.Use Past Month.18-25",
    "Rates.Alcohol.Use Past Month.26+",
    "Rates.Marijuana.Used Past Year.12-17",
    "Rates.Marijuana.Used Past Year.18-25",
    "Rates.Marijuana.Used Past Year.26+",
    "Rates.Tobacco.Use Past Month.12-17",
    "Rates.Tobacco.Use Past Month.18-25",
    "Rates.Tobacco.Use Past Month.26+"
  ) |> # columns I needed
  rename( # shorter column names
    Alcohol_12_17 = "Rates.Alcohol.Use Past Month.12-17",
    Alcohol_18_25 = "Rates.Alcohol.Use Past Month.18-25",
    Alcohol_26_plus = "Rates.Alcohol.Use Past Month.26+",
    Marijuana_12_17 = "Rates.Marijuana.Used Past Year.12-17",
    Marijuana_18_25 = "Rates.Marijuana.Used Past Year.18-25",
    Marijuana_26_plus = "Rates.Marijuana.Used Past Year.26+",
    Tobacco_12_17 = "Rates.Tobacco.Use Past Month.12-17",
    Tobacco_18_25 = "Rates.Tobacco.Use Past Month.18-25",
    Tobacco_26_plus = "Rates.Tobacco.Use Past Month.26+"
  )
I filtered the data to 2018 and kept the variables I wanted to explore: the alcohol, marijuana, and tobacco use rates for all three age groups. I then renamed them to create shorter, more understandable column titles.
Plot 1: Histogram of Alcohol Use Among 18–25 Year Olds for the year 2018
p1 <- drugs_recent |>
  ggplot(aes(x = Alcohol_18_25)) + # Plot for my x-axis
  geom_histogram(fill = "#69b3a2", bins = 10, color = "pink") + # making the histogram
  labs(
    title = "Alcohol Use (18–25) in 2018", # Title of my histogram
    x = "Alcohol Use Rate per 1,000", # x-axis
    y = "Frequency", # y-axis
    caption = "Source: NSDUH"
  ) +
  theme_minimal() # Clean background theme
p1
This histogram shows how past-month alcohol use rates (per 1,000) among individuals aged 18 to 25 are distributed across the United States in 2018. Each bar counts the states whose rate falls in that range.
Plot 2: Bar Graph of Marijuana Use by Age Group (2018)
I will categorize the ages of marijuana users and create a chart to identify which age group has the highest usage across the United States.
marijuana_long <- drugs_recent |>
  pivot_longer(
    cols = starts_with("Marijuana"),
    names_to = "Age_Group",
    values_to = "Rate"
  ) |> # Turn age groups into a single column
  mutate(Age_Group = case_when( # Make the names shorter
    Age_Group == "Marijuana_12_17" ~ "12–17",
    Age_Group == "Marijuana_18_25" ~ "18–25",
    Age_Group == "Marijuana_26_plus" ~ "26+"
  ))
marijuana_long
p2 <- marijuana_long |>
  group_by(Age_Group) |> # Group by each age group
  summarise(Average_Rate = mean(Rate, na.rm = TRUE)) |> # Get average rate for each group
  ggplot(aes(x = Age_Group, y = Average_Rate, fill = Age_Group)) + # Setup bar graph
  geom_col() + # Make the bars
  labs(
    title = "Average Marijuana Use by Age Group (2018)", # Title
    x = "Age Group", # x-axis
    y = "Average Use Rate per 1,000" # y-axis
  ) +
  theme_minimal() # Clean theme
p2 # Show the plot
It’s interesting that the 18–25 age group differs so markedly from the other age brackets in marijuana use. I’m surprised that the rate for those aged 26 and older is so close to the rate for the 12–17 group.
Plot 3: Top 10 States with Highest Drug Use (18–25)
top10_tobacco <- drugs |>
  filter(Year == 2018) |> # 2018 data
  arrange(desc(`Rates.Tobacco.Use Past Month.18-25`)) |> # Highest tobacco use first
  slice_head(n = 10) |> # Select top 10 states
  select(State, `Rates.Tobacco.Use Past Month.18-25`) |>
  rename(Rate = `Rates.Tobacco.Use Past Month.18-25`)
top10_tobacco
top10_cocaine <- drugs |>
  filter(Year == 2018) |> # 2018 data
  arrange(desc(`Rates.Illicit Drugs.Cocaine Used Past Year.18-25`)) |> # Highest cocaine use first
  slice_head(n = 10) |> # Select top 10 states
  select(State, `Rates.Illicit Drugs.Cocaine Used Past Year.18-25`) |>
  rename(Rate = `Rates.Illicit Drugs.Cocaine Used Past Year.18-25`)
top10_cocaine
# A tibble: 10 × 2
State Rate
<chr> <dbl>
1 Colorado 110.
2 New Hampshire 92.3
3 Vermont 90.9
4 Oregon 84.1
5 California 77.1
6 Nevada 74.1
7 Maine 72.5
8 Massachusetts 69.8
9 Montana 69.2
10 Rhode Island 66.6
top10_marijuana <- drugs |>
  filter(Year == 2018) |> # 2018 data
  arrange(desc(`Rates.Marijuana.Used Past Year.18-25`)) |> # highest marijuana use
  slice_head(n = 10) |> # select top 10
  select(State, `Rates.Marijuana.Used Past Year.18-25`) |>
  rename(Rate = `Rates.Marijuana.Used Past Year.18-25`)
top10_marijuana
# A tibble: 10 × 2
State Rate
<chr> <dbl>
1 Vermont 522.
2 District of Columbia 498.
3 Maine 486.
4 Colorado 485.
5 Oregon 473.
6 Massachusetts 466.
7 New Hampshire 459.
8 Washington 455.
9 Rhode Island 450.
10 Michigan 441.
top10_alcohol <- drugs |>
  filter(Year == 2018) |> # 2018 data
  arrange(desc(`Rates.Alcohol.Use Past Month.18-25`)) |> # highest alcohol use rate
  slice_head(n = 10) |> # Keep top 10 states
  select(State, `Rates.Alcohol.Use Past Month.18-25`) |>
  rename(Rate = `Rates.Alcohol.Use Past Month.18-25`)
top10_alcohol
# A tibble: 10 × 2
State Rate
<chr> <dbl>
1 District of Columbia 706.
2 New Hampshire 689.
3 North Dakota 678.
4 Vermont 677.
5 Wisconsin 664.
6 Rhode Island 662.
7 Connecticut 656.
8 Minnesota 652.
9 Massachusetts 641.
10 Iowa 639.
Plot 3
p3 <- top10_alcohol |>
  ggplot(aes(x = reorder(State, Rate), y = Rate, fill = State)) +
  geom_col() + # Make vertical bars
  coord_flip() + # Flip bars to horizontal
  labs(
    title = "Top 10 States with Highest Alcohol Use (18–25) in 2018", # Title
    x = "State", # X-axis
    y = "Alcohol Use Rate per 1,000", # Y-axis
    caption = "Source: NSDUH" # source
  ) +
  theme_minimal() # Clean white background
p3 # Show the plot
Plot 4: Marijuana Trends in Top 10 States
maryJane10 <- drugs |>
  filter(
    State %in% c(
      "Vermont", "District of Columbia", "Maine", "Colorado", "Oregon",
      "Massachusetts", "New Hampshire", "Washington", "Rhode Island", "Michigan"
    ),
    Year >= 2003 & Year <= 2018
  ) |> # Limit to recent years
  select(State, Year, `Rates.Marijuana.Used Past Year.18-25`) |>
  rename(Rate = `Rates.Marijuana.Used Past Year.18-25`)
maryJane10
# A tibble: 160 × 3
State Year Rate
<chr> <dbl> <dbl>
1 Colorado 2003 312.
2 District of Columbia 2003 309.
3 Maine 2003 369.
4 Massachusetts 2003 381.
5 Michigan 2003 321.
6 New Hampshire 2003 398.
7 Oregon 2003 342.
8 Rhode Island 2003 411.
9 Vermont 2003 433.
10 Washington 2003 295.
# ℹ 150 more rows
highchart() |>
  hc_add_series(
    data = maryJane10,
    type = "line",
    hcaes(x = Year, y = Rate, group = State)
  ) |>
  hc_title(text = "Marijuana Use Trends (Ages 18–25) in Top 10 States") |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Use Rate per 1,000")) |>
  hc_tooltip(shared = TRUE)
This chart tracks marijuana use from 2003 to 2018 in the 10 states with the highest 2018 usage among 18–25 year olds. Across all 10 states, marijuana consumption has risen over the period. Of the 10 states listed, 9 have legalized recreational marijuana, while New Hampshire has not. Given that marijuana is now legal for medical use in 39 out of 50 states and for recreational use in 24 states, I believe this trend will continue. It seems likely that, over time, marijuana will become legal in all states.
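As a rough check on the direction of the trend, the sketch below compares each state's 2003 and 2018 rates using the rounded values printed in the tables above. The data frame name marijuana_change and its column names are assumptions for illustration, and the comparison covers only the two endpoints, so it shows net change rather than whether every year rose.

```r
# Rounded rates per 1,000 (ages 18-25) as printed in the tables above.
marijuana_change <- data.frame(
  State = c(
    "Colorado", "District of Columbia", "Maine", "Massachusetts", "Michigan",
    "New Hampshire", "Oregon", "Rhode Island", "Vermont", "Washington"
  ),
  Rate_2003 = c(312, 309, 369, 381, 321, 398, 342, 411, 433, 295),
  Rate_2018 = c(485, 498, 486, 466, 441, 459, 473, 450, 522, 455)
)

# Net change between the first and last year shown in the chart.
marijuana_change$Change <- marijuana_change$Rate_2018 - marijuana_change$Rate_2003

# Sort so the largest increases come first.
marijuana_change <- marijuana_change[order(-marijuana_change$Change), ]
marijuana_change

# TRUE only if every one of the 10 states ended higher than it started.
all(marijuana_change$Change > 0)
```

Because the inputs are rounded to whole numbers, the differences are approximate, but the sign of each change is not affected.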
Summary
Working with this data was a lot of fun, even though I initially felt overwhelmed by the numerous variables and unsure about the direction of my project. However, after some guidance from Professor Saidi, I gained clarity on my focus. I decided to explore the impact of various substances included in the dataset, particularly concentrating on the age group of 18 to 25. I chose this demographic because I believe it represents a significant portion of substance abuse, especially with alcohol and marijuana being prevalent in college environments. My hypothesis turned out to be accurate, highlighting the importance of addressing these issues among young adults.
Sources
OpenAI ChatGPT: https://chatgpt.com/
https://www.rpubs.com/rsaidi
https://www.youtube.com/
https://en.wikipedia.org/wiki/Legality_of_cannabis_by_U.S._jurisdiction
https://www.samhsa.gov/data/data-we-collect/nsduh-national-survey-drug-use-and-health
https://evokewellness.com/blog/top-10-most-used-drugs/