My topic is drug poisoning deaths in the United States and what contributes the most to them. The most important part about this topic is understanding what drug poisoning is. Contrary to what it might seem, a drug overdose is not the same as drug poisoning. An overdose is from taking too much of a drug or taking drugs that contradict with each other. Drug poisoning is when you are accidentally exposed to a substance via the environment, people, etc.
My data set is from the NCHS(National Center for Health Statistics), which is a part of the CDC. There are a couple of questions I have about my data set. One question I will answer via linear regression: which variables most significantly contribute to deaths via drug poisoning. A few smaller questions that I want to answer are whether or not there is a difference in the rates people die based on sex, and how the deaths have changed from the years 2000-2016.
This topic doesn’t have a particularly strong link to me, and thus there isn’t a prominent reason why I chose it. My first reason for considering it was that it met all of the requirements for this project’s data set, and there were a good amount of observations to work with. However, I ultimately picked this data set due to a friend I have. He volunteers at a local fire station and has taken many classes related to fire safety, drug safety, CPR, etc. One day, completely out of the blue, he offered me a pack of Narcan. He said that he had received like three packs from a class he had recently taken. This was more than a year ago, and the Narcan itself is definitely expired. Even so, I still keep it near my desk(my desk is in an empty closet, and I have it hanging from the clothing hanger bar thingy). I suppose looking up and seeing the Narcan is what really made me settle on this topic and data set. (photos of narcan at the end)
Important Variables and Definitions (order does not indicate significance):
year : the year of the death
deaths : the amount of deaths of that observation
population : the population of the state for that particular observation
age_adjusted_rate : the rate of death that has been adjusted so as to not be swayed by the variety in ages. (can you tell that I misunderstood this variable and didn’t realize until I was completely finished with all of my graphs and my statistical test, but when I removed it the adjusted R squared went down so I decided to just keep it in and save myself the trouble in exchange for admitting that I was a little slow when I started this project.)
Loading libraries
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.4.2
library(treemap)
## Warning: package 'treemap' was built under R version 4.4.2
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.4.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(RColorBrewer)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.4.3
Loading in the data set
setwd("~/data 110")
drugmortality <- read_csv("NCHS_-_Drug_Poisoning_Mortality_by_State__United_States_20250427.csv")
## Rows: 2862 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): State, Sex, Age Group, Race and Hispanic Origin, State Crude Rate ...
## dbl (13): Year, Deaths, Population, Crude Death Rate, Standard Error for Cru...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(drugmortality)
## # A tibble: 6 × 19
## State Year Sex `Age Group` Race and Hispanic Ori…¹ Deaths Population
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 Alabama 1999 Both Sexes All Ages All Races-All Origins 169 4430143
## 2 Alabama 2000 Both Sexes All Ages All Races-All Origins 197 4447100
## 3 Alabama 2001 Both Sexes All Ages All Races-All Origins 216 4467634
## 4 Alabama 2002 Both Sexes All Ages All Races-All Origins 211 4480089
## 5 Alabama 2003 Both Sexes All Ages All Races-All Origins 197 4503491
## 6 Alabama 2004 Both Sexes All Ages All Races-All Origins 283 4530729
## # ℹ abbreviated name: ¹`Race and Hispanic Origin`
## # ℹ 12 more variables: `Crude Death Rate` <dbl>,
## # `Standard Error for Crude Rate` <dbl>,
## # `Lower Confidence Limit for Crude Rate` <dbl>,
## # `Upper Confidence Limit for Crude Rate` <dbl>, `Age-adjusted Rate` <dbl>,
## # `Standard Error for Age-adjusted Rate` <dbl>,
## # `Lower Confidence Limit for Age-adjusted Rate` <dbl>, …
Cleaning the data: lowering the capital letters, replacing spaces and dashes with underscores.
names(drugmortality) <- tolower(names(drugmortality))
names(drugmortality) <- gsub(" ","_",names(drugmortality))
names(drugmortality) <- gsub("-","_",names(drugmortality))
head(drugmortality)
## # A tibble: 6 × 19
## state year sex age_group race_and_hispanic_origin deaths population
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 Alabama 1999 Both Sexes All Ages All Races-All Origins 169 4430143
## 2 Alabama 2000 Both Sexes All Ages All Races-All Origins 197 4447100
## 3 Alabama 2001 Both Sexes All Ages All Races-All Origins 216 4467634
## 4 Alabama 2002 Both Sexes All Ages All Races-All Origins 211 4480089
## 5 Alabama 2003 Both Sexes All Ages All Races-All Origins 197 4503491
## 6 Alabama 2004 Both Sexes All Ages All Races-All Origins 283 4530729
## # ℹ 12 more variables: crude_death_rate <dbl>,
## # standard_error_for_crude_rate <dbl>,
## # lower_confidence_limit_for_crude_rate <dbl>,
## # upper_confidence_limit_for_crude_rate <dbl>, age_adjusted_rate <dbl>,
## # standard_error_for_age_adjusted_rate <dbl>,
## # lower_confidence_limit_for_age_adjusted_rate <dbl>,
## # upper_confidence_limit_for_age_adjusted_rate <dbl>, …
I wanted to see if there was anything that might indicate a significant difference between the death rates for females and males. To do so I graphed geom density ridges.
sexdeaths <- ggplot(drugmortality, aes(x = crude_death_rate, y = age_group, fill = sex)) +
geom_density_ridges(alpha = 0.4) +
scale_fill_manual(values = c("#157021", "#970483", "#041197"),
labels = c("Both Sexes", "Female", "Male")) +
labs(x = "crude death rate",
y = "age groups",
title = "Death Rates Based on Sex",
caption = "NCHS - Drug Poisoning Mortality by State: United States") +
theme_calc()
sexdeaths
## Picking joint bandwidth of 1.76
In general, visually it appears that while there is a slightly significant difference in the rate of which males and females die. Females die at a noticeably lower rate across most age groups. It’s not by a whole lot, but it’s there.
I needed to yoink out “United States” from the state variable because it was throwing off the count, so I filtered it out.
countdata <- drugmortality |>
select(state, deaths) |>
filter(!state == "United States")
head(countdata)
## # A tibble: 6 × 2
## state deaths
## <chr> <dbl>
## 1 Alabama 169
## 2 Alabama 197
## 3 Alabama 216
## 4 Alabama 211
## 5 Alabama 197
## 6 Alabama 283
Finding the states with the highest deaths so that I can graph about four to five states instead of 50. I don’t really want to combine all of the states just for this purpose–they are separated due to different years–so I will just pick the first four states I see.
counts <- count(countdata, state, deaths) |>
group_by(state) |>
arrange(desc(deaths))
head(counts)
## # A tibble: 6 × 3
## # Groups: state [3]
## state deaths n
## <chr> <dbl> <int>
## 1 Florida 4728 1
## 2 California 4659 1
## 3 California 4654 1
## 4 Pennsylvania 4627 1
## 5 California 4521 1
## 6 California 4452 1
Creating an interactive graph using Highchart.
HCData <- drugmortality |>
filter(state %in% c("California","Pennsylvania", "Ohio", "Florida")) |>
arrange(year)
cols <- c("#a21e1a", "#a2681a", "#086613", "#f26fff")
highchart() |>
hc_add_series(data = HCData,
type = "line", hcaes(x = year,
y = deaths,
group = state)) |>
hc_colors(cols) |>
hc_xAxis(title = list(text="Year")) |>
hc_yAxis(title = list(text="Death Count")) |>
hc_plotOptions(series = list(marker = list(symbol = "circle"))) |>
hc_legend(align = "right", # "LEADING LINES!" my middle school yearbook teacher screams. "MAKE THE LINES LOOK LIKE THEY'RE LEADING TO SOMETHING!" she screams. I listen and put the legend in the top right.
verticalAlign = "top") |>
hc_title(text= "Death Counts from 2000 to 2016") |>
hc_caption(text = "NCHS - Drug Poisoning Mortality by State: United States")
Honestly, I did not realize the data only went up to 2016. That’s my bad. The data was last updated very recently, so I though it went up to more recent years, but I guess they either edited the data or made corrections.
The chart shows that the deaths have been going up steadily, though took a dip around 2012. I’m not sure what might have caused this, as the only truly notable aspect of that year was Obama’s reelection. Which, honestly, for some people that might have actually been the cause. However, immediately after it began climbing again. 2014 was a terrible year, though once again I do not know what could have been the cause. Either way, something caused it to drop and then immediately start to climb again. California is in its own world though, it’s been consistently rising the whole time with one blip in 2000-2001.
Doing the linear regression and also using scipen because I prefer to be able to actually see just how small or big the numbers are.
lmData <- drugmortality |>
select(deaths, age_adjusted_rate, year, population) |>
filter(!is.na(deaths), !is.na(age_adjusted_rate), !is.na(year), !is.na(population))
options(scipen = 999)
lmModel <- lm(deaths ~ age_adjusted_rate + year + population, data = lmData)
summary(lmModel)
##
## Call:
## lm(formula = deaths ~ age_adjusted_rate + year + population,
## data = lmData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16286.4 -546.6 -24.7 582.0 22305.5
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -122382.920368460 34220.344689049 -3.576
## age_adjusted_rate 136.184341278 14.704263143 9.262
## year 60.095490120 17.097418941 3.515
## population 0.000123862 0.000001369 90.470
## Pr(>|t|)
## (Intercept) 0.000363 ***
## age_adjusted_rate < 0.0000000000000002 ***
## year 0.000457 ***
## population < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2407 on 1130 degrees of freedom
## Multiple R-squared: 0.8817, Adjusted R-squared: 0.8814
## F-statistic: 2807 on 3 and 1130 DF, p-value: < 0.00000000000000022
The model is well fitted, as it predicts 88.14 percent of the variance present, and this variables are very significant as the pvalue is nearly zero.
Equation: y = -122382.920 + (136.184 * age_adjusted_rate) + (60.095 * year) + (0.000123 * population)
Interpretation: For each unit the age_adjusted_rate, year, and population increase, the estimated amount of deaths increases by 126.184, 60.095, and 0.000123 respectively. When all three variables are at zero, the assumed number of deaths is at -122382.920.
I just want to say, there’s no way the intercept is THAT low, so I am automatically assuming I did something wrong, but I don’t know what. Therefore, I am moving on.
Graphing my model with the code Ria gave me. Ty Ria, shout out to Ria Rhodes.
lmData$predicted <- predict(lmModel)
ggplot(lmData, aes(predicted ,deaths)) +
geom_point() +
geom_smooth(color = "#985100", method = lm, se = TRUE, fill = "#00ad41") +
labs(title = "Linear Regression Model Graph",
x = "Predictors: Year, Population, Age_adjusted_rate",
caption = "NCHS - Drug Poisoning Mortality by State: United States") +
theme_calc()
## `geom_smooth()` using formula = 'y ~ x'
I just wanna say the way that I struggled to get this to work was insane. Apparently my previous method of removing NAs from the linear model just didn’t work and it was causing problems here. It took so long to figure out that the disparity in the amounts of observations(which were the issue) was because the NAs were, in fact, NOT removed when I thought they were.
The graph itself is sort of confusing to look at. I know the odd look might be due to the method I used to graph it, but the way the points are grouped is really confusing to me. The SE is barely visible, so that’s good, the model is very precise. Overall, as the predictors go up, the deaths also go up. However, the way the points are grouped is quite interesting. Its shape vaguely reminisces a fish bone, and I can’t quite tell if that’s a good thing or not.
plot(lmModel)
Overall it’s not a terrible model, though I do notice that the Residuals vs Fitted graph is identical to the graph I did for my model. Not sure what that indicates because I am only in my second semester, but I’m sure I’ll find out soon.
The Q-Q plot also isn’t terrible. The larger the predictors get, the less accurate the predictions are, and there are multiple outliers. But the model if very tightly fitted towards the center. So, yippiee.
The general consensus in the diagnostic plots seems to be that the model is very good at predicting the deaths when the predictors are also low, so as the predictors get higher the model gets less accurate.
Here I am graphing the residuals histogram.
hist(residuals(lmModel))
The residuals graph indicates that the line is pretty accurate, as the distribution is very normal.
The year, population of the state, and the age rate that has been standardized are the most prominent in predicting deaths by drug poisoning. The linear regression model that led me to this conclusion has a p-value of almost zero and an Adjusted R-Square of 88.14. This means the model predicts a good amount of the variance, and the variables are very significant. The residuals are very normally distributed around 0, but as the predictor variables get higher, the accuracy goes down. This answers my question, though the results are pretty boring.
NCHS - Drug Poisoning Mortality by State: United States | Data | Centers for Disease Control and Prevention. 4 June 2018, data.cdc.gov/National-Center-for-Health-Statistics/NCHS-Drug-Poisoning-Mortality-by-State-United-Stat/xbxb-epbu/about_data.
Poisoning/Prescription Drug Overdose. www.tn.gov/health/health-program-areas/fhw/injury-and-violence-prevention-programs/injury-topics/poisoning-prescription-drug-overdose.html.
————————————————————————
Extra information that isn’t a part of my project and is just here to fulfill my desire for irrelevant commentary.
And a deep sigh of relaxation comes from both me and the cat (Luna) currently sitting on my lap, who has bore witness to the entirety of this project. Granted, it was through a shaky view as she was almost always resting on my arm, which does need to be moving in order to code.
My father told me to do this:
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(lmModel)
## age_adjusted_rate year population
## 1.539218 1.539763 1.000466
He is currently an economist for the government and used to work as a data analyst. He also teaches ECON at MC. His name is Kuo-liang Chang, or Matt Chang. You can find a picture of him with a single Google search. It’s kind of creepy.
According to him, this means that there is no particularly dominating variable, so yippie?
Photographic evidence of the Narcan for educational purposes I guess? I don’t think many people know what Narcan actually looks like.