Introduction to the Topic and Data Set:

My topic is drug poisoning deaths in the United States and what contributes the most to them. The most important part of this topic is understanding what drug poisoning is. Contrary to what it might seem, a drug overdose is not the same thing as drug poisoning. An overdose comes from taking too much of a drug or taking drugs that interact badly with each other. Drug poisoning is when you are accidentally exposed to a harmful substance, whether through the environment, other people, etc.

image source: https://www.tn.gov/health/health-program-areas/fhw/injury-and-violence-prevention-programs/injury-topics/poisoning-prescription-drug-overdose.html

My data set is from the NCHS (National Center for Health Statistics), which is part of the CDC. There are a couple of questions I have about my data set. One question I will answer via linear regression: which variables contribute most significantly to deaths from drug poisoning? A few smaller questions I want to answer are whether there is a difference in the rates at which people die based on sex, and how the deaths have changed over the years 2000-2016.

This topic doesn't have a particularly strong link to me, so there isn't a prominent reason why I chose it. My first reason for considering it was that it met all of the requirements for this project's data set, and there were a good number of observations to work with. However, I ultimately picked this data set because of a friend of mine. He volunteers at a local fire station and has taken many classes related to fire safety, drug safety, CPR, etc. One day, completely out of the blue, he offered me a pack of Narcan. He said that he had received something like three packs from a class he had recently taken. This was more than a year ago, and the Narcan itself is definitely expired. Even so, I still keep it near my desk (my desk is in an empty closet, and I have it hanging from the clothing hanger bar thingy). I suppose looking up and seeing the Narcan is what really made me settle on this topic and data set. (Photos of the Narcan are at the end.)

Important Variables and Definitions (order does not indicate significance):

  1. year : the year of the death

  2. deaths : the number of deaths for that observation

  3. population : the population of the state for that particular observation

  4. age_adjusted_rate : the death rate adjusted so that it isn't skewed by differences in the age makeup of the population. (Can you tell that I misunderstood this variable and didn't realize it until I was completely finished with all of my graphs and my statistical test? When I removed it, the adjusted R-squared went down, so I decided to just keep it in and save myself the trouble, in exchange for admitting that I was a little slow when I started this project. There's a tiny sketch of how age adjustment works right below.)
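
Since I tripped over this variable, here is a tiny sketch of how age adjustment works in general (the direct method): take each age group's death rate and weight it by that age group's share of a standard population. The rates and weights below are completely made up, just to show the arithmetic.

# hypothetical age-specific death rates (per 100,000) and standard population weights
age_specific_rate <- c(younger = 5, middle = 20, older = 60)
std_pop_weight    <- c(younger = 0.4, middle = 0.4, older = 0.2)   # weights sum to 1

# age-adjusted rate = weighted sum of the age-specific rates
sum(age_specific_rate * std_pop_weight)   # 22 per 100,000 in this made-up example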


Loading libraries

library(tidyverse)
## Warning: package 'purrr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.4.2
library(treemap)
## Warning: package 'treemap' was built under R version 4.4.2
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.4.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(RColorBrewer)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.4.3

Loading in the data set

setwd("~/data 110")
drugmortality <- read_csv("NCHS_-_Drug_Poisoning_Mortality_by_State__United_States_20250427.csv")
## Rows: 2862 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): State, Sex, Age Group, Race and Hispanic Origin, State Crude Rate ...
## dbl (13): Year, Deaths, Population, Crude Death Rate, Standard Error for Cru...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(drugmortality)
## # A tibble: 6 × 19
##   State    Year Sex        `Age Group` Race and Hispanic Ori…¹ Deaths Population
##   <chr>   <dbl> <chr>      <chr>       <chr>                    <dbl>      <dbl>
## 1 Alabama  1999 Both Sexes All Ages    All Races-All Origins      169    4430143
## 2 Alabama  2000 Both Sexes All Ages    All Races-All Origins      197    4447100
## 3 Alabama  2001 Both Sexes All Ages    All Races-All Origins      216    4467634
## 4 Alabama  2002 Both Sexes All Ages    All Races-All Origins      211    4480089
## 5 Alabama  2003 Both Sexes All Ages    All Races-All Origins      197    4503491
## 6 Alabama  2004 Both Sexes All Ages    All Races-All Origins      283    4530729
## # ℹ abbreviated name: ¹​`Race and Hispanic Origin`
## # ℹ 12 more variables: `Crude Death Rate` <dbl>,
## #   `Standard Error for Crude Rate` <dbl>,
## #   `Lower Confidence Limit for Crude Rate` <dbl>,
## #   `Upper Confidence Limit for Crude Rate` <dbl>, `Age-adjusted Rate` <dbl>,
## #   `Standard Error for Age-adjusted Rate` <dbl>,
## #   `Lower Confidence Limit for Age-adjusted Rate` <dbl>, …

Cleaning the data: converting the column names to lowercase and replacing spaces and dashes with underscores.

names(drugmortality) <- tolower(names(drugmortality))       
names(drugmortality) <- gsub(" ","_",names(drugmortality))
names(drugmortality) <- gsub("-","_",names(drugmortality))
head(drugmortality)
## # A tibble: 6 × 19
##   state    year sex        age_group race_and_hispanic_origin deaths population
##   <chr>   <dbl> <chr>      <chr>     <chr>                     <dbl>      <dbl>
## 1 Alabama  1999 Both Sexes All Ages  All Races-All Origins       169    4430143
## 2 Alabama  2000 Both Sexes All Ages  All Races-All Origins       197    4447100
## 3 Alabama  2001 Both Sexes All Ages  All Races-All Origins       216    4467634
## 4 Alabama  2002 Both Sexes All Ages  All Races-All Origins       211    4480089
## 5 Alabama  2003 Both Sexes All Ages  All Races-All Origins       197    4503491
## 6 Alabama  2004 Both Sexes All Ages  All Races-All Origins       283    4530729
## # ℹ 12 more variables: crude_death_rate <dbl>,
## #   standard_error_for_crude_rate <dbl>,
## #   lower_confidence_limit_for_crude_rate <dbl>,
## #   upper_confidence_limit_for_crude_rate <dbl>, age_adjusted_rate <dbl>,
## #   standard_error_for_age_adjusted_rate <dbl>,
## #   lower_confidence_limit_for_age_adjusted_rate <dbl>,
## #   upper_confidence_limit_for_age_adjusted_rate <dbl>, …
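
Side note: if the janitor package were installed, its clean_names() function would do this same lowercase-and-underscore cleanup in one step. Just an alternative sketch, not something I ran here.

# alternative (not run): janitor's clean_names() does the same cleanup in one line
# drugmortality <- janitor::clean_names(drugmortality)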

I wanted to see if there was anything that might indicate a significant difference between the death rates for females and males. To do so, I made a density ridge plot (geom_density_ridges) of the crude death rate by age group, filled by sex.

sexdeaths <- ggplot(drugmortality, aes(x = crude_death_rate, y = age_group, fill = sex)) +
  geom_density_ridges(alpha = 0.4) +
  scale_fill_manual(values = c("#157021", "#970483", "#041197"),
                    labels = c("Both Sexes", "Female", "Male")) +
  labs(x = "crude death rate",
       y = "age groups",
       title = "Death Rates Based on Sex",
       caption = "NCHS - Drug Poisoning Mortality by State: United States") +
  theme_calc()

sexdeaths
## Picking joint bandwidth of 1.76

Visually, it appears that there is a modest difference in the rates at which males and females die: females die at a noticeably lower rate across most age groups. It's not by a whole lot, but it's there.
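
To put a rough number on that eyeball judgment, here is a quick sanity-check sketch comparing the average crude death rate for the Female and Male rows (leaving out the combined "Both Sexes" rows). It isn't part of the main analysis.

# average crude death rate by sex, ignoring the combined "Both Sexes" rows
drugmortality |>
  filter(sex %in% c("Female", "Male")) |>
  group_by(sex) |>
  summarise(mean_crude_rate = mean(crude_death_rate, na.rm = TRUE))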

I needed to yoink out “United States” from the state variable because it was throwing off the count, so I filtered it out.

countdata <- drugmortality |>
  select(state, deaths) |>
  filter(!state == "United States")
head(countdata)
## # A tibble: 6 × 2
##   state   deaths
##   <chr>    <dbl>
## 1 Alabama    169
## 2 Alabama    197
## 3 Alabama    216
## 4 Alabama    211
## 5 Alabama    197
## 6 Alabama    283

Finding the states with the highest deaths so that I can graph about four to five states instead of all 50. I don't really want to sum each state across years just for this purpose (the rows are separated by year), so I will just pick the first four states I see at the top of the sorted output. (There's an alternative sketch after the output below.)

counts <- count(countdata, state, deaths) |>
  group_by(state) |>
  arrange(desc(deaths))
head(counts)
## # A tibble: 6 × 3
## # Groups:   state [3]
##   state        deaths     n
##   <chr>         <dbl> <int>
## 1 Florida        4728     1
## 2 California     4659     1
## 3 California     4654     1
## 4 Pennsylvania   4627     1
## 5 California     4521     1
## 6 California     4452     1
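
As an alternative sketch: instead of eyeballing the sorted rows, you could total the deaths per state across all years and keep the four largest. I'm filtering to the "Both Sexes" / "All Ages" rows (the values shown in head() earlier) so nothing gets double-counted if other breakdowns exist in the data.

# total deaths per state across all years, then keep the four largest
drugmortality |>
  filter(state != "United States",
         sex == "Both Sexes", age_group == "All Ages") |>
  group_by(state) |>
  summarise(total_deaths = sum(deaths, na.rm = TRUE)) |>
  slice_max(total_deaths, n = 4)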

Creating an interactive graph using Highchart.

HCData <- drugmortality |>
  filter(state %in% c("California","Pennsylvania", "Ohio", "Florida")) |>
  arrange(year)
cols <- c("#a21e1a", "#a2681a", "#086613", "#f26fff")

highchart() |>
  hc_add_series(data = HCData,
                   type = "line", hcaes(x = year,
                   y = deaths, 
                   group = state)) |>
  hc_colors(cols) |>
  hc_xAxis(title = list(text="Year")) |>
  hc_yAxis(title = list(text="Death Count")) |>
  hc_plotOptions(series = list(marker = list(symbol = "circle"))) |>      
  hc_legend(align = "right",                    # "LEADING LINES!" my middle school yearbook teacher screams. "MAKE THE LINES LOOK LIKE THEY'RE LEADING TO SOMETHING!" she screams. I listen and put the legend in the top right.
            verticalAlign = "top") |>
  hc_title(text= "Death Counts from 2000 to 2016") |>
  hc_caption(text = "NCHS - Drug Poisoning Mortality by State: United States")

Honestly, I did not realize the data only went up to 2016. That's my bad. The data was last updated very recently, so I thought it went up to more recent years, but I guess they either edited the data or made corrections.

The chart shows that the deaths have been going up steadily, though they took a dip around 2012. I'm not sure what might have caused this, as the only truly notable aspect of that year was Obama's reelection. Which, honestly, for some people might have actually been the cause. However, immediately afterward the counts began climbing again. 2014 was a terrible year, though once again I do not know what the cause could have been. Either way, something caused the counts to drop and then immediately start to climb again. California is in its own world, though: it has been rising consistently the whole time, with one blip in 2000-2001.
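
To double-check those dips against the raw numbers rather than the chart, a small sketch like this would tabulate deaths per year for the four charted states, again sticking to the "Both Sexes" / "All Ages" rows so each state-year is only counted once.

# yearly death counts for the four charted states, one column per state
drugmortality |>
  filter(state %in% c("California", "Pennsylvania", "Ohio", "Florida"),
         sex == "Both Sexes", age_group == "All Ages") |>
  group_by(year, state) |>
  summarise(deaths = sum(deaths), .groups = "drop") |>
  pivot_wider(names_from = state, values_from = deaths)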

Linear Regression

Fitting the linear regression, and also setting scipen because I prefer to be able to actually see just how small or big the numbers are instead of reading scientific notation.

lmData <- drugmortality |>
  select(deaths, age_adjusted_rate, year, population) |>
  filter(!is.na(deaths), !is.na(age_adjusted_rate), !is.na(year), !is.na(population))

options(scipen = 999)
lmModel <- lm(deaths ~ age_adjusted_rate + year + population, data = lmData)
summary(lmModel)
## 
## Call:
## lm(formula = deaths ~ age_adjusted_rate + year + population, 
##     data = lmData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16286.4   -546.6    -24.7    582.0  22305.5 
## 
## Coefficients:
##                            Estimate        Std. Error t value
## (Intercept)       -122382.920368460   34220.344689049  -3.576
## age_adjusted_rate     136.184341278      14.704263143   9.262
## year                   60.095490120      17.097418941   3.515
## population              0.000123862       0.000001369  90.470
##                               Pr(>|t|)    
## (Intercept)                   0.000363 ***
## age_adjusted_rate < 0.0000000000000002 ***
## year                          0.000457 ***
## population        < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2407 on 1130 degrees of freedom
## Multiple R-squared:  0.8817, Adjusted R-squared:  0.8814 
## F-statistic:  2807 on 3 and 1130 DF,  p-value: < 0.00000000000000022

The model is well fitted: it explains 88.14 percent of the variance present, and the variables are highly significant, as their p-values are nearly zero.

Equation: y = -122382.920 + (136.184 * age_adjusted_rate) + (60.095 * year) + (0.000123 * population)

Interpretation: for each one-unit increase in age_adjusted_rate, year, and population, the estimated number of deaths increases by 136.184, 60.095, and 0.000123 respectively. When all three variables are zero, the predicted number of deaths is -122382.920.

I just want to say, there's no way a real intercept would be THAT low, but that's because it corresponds to year = 0 and population = 0, which is nowhere near the actual data, so it doesn't have a meaningful interpretation on its own. Therefore, I am moving on.
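
Rather than reading the intercept literally, a more useful sanity check is to plug realistic values into the fitted model with predict(). The values below are made up (roughly in the range of a mid-sized state), just to show the mechanics.

# predicted deaths for a hypothetical state-year; the input values are invented
hypothetical <- data.frame(age_adjusted_rate = 15,
                           year = 2015,
                           population = 5000000)
predict(lmModel, newdata = hypothetical, interval = "confidence")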

Graphing my model with the code Ria gave me. Ty Ria, shout out to Ria Rhodes.

lmData$predicted <- predict(lmModel)

ggplot(lmData, aes(predicted, deaths)) + 
  geom_point() + 
  geom_smooth(color = "#985100", method = lm, se = TRUE, fill = "#00ad41") +
  labs(title = "Linear Regression Model Graph",
       x = "Predictors: Year, Population, Age_adjusted_rate",
       caption = "NCHS - Drug Poisoning Mortality by State: United States") +
  theme_calc()
## `geom_smooth()` using formula = 'y ~ x'

I just wanna say, the way I struggled to get this to work was insane. Apparently my previous method of removing NAs for the linear model just didn't work, and it was causing problems here. It took so long to figure out that the disparity in the number of observations (which was the issue) existed because the NAs were, in fact, NOT removed when I thought they were.

The graph itself is sort of confusing to look at. I know the odd look might be due to the method I used to graph it, but the way the points are grouped is really confusing to me. The SE band is barely visible, which is good: the model is very precise. Overall, as the predicted values go up, the actual deaths also go up. However, the way the points are grouped is quite interesting. The shape vaguely resembles a fish bone, and I can't quite tell if that's a good thing or not.
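
One small thing that might make this predicted-vs-actual plot easier to read is a 45-degree reference line instead of a fitted smooth: points on the line are predicted exactly right, points above it are under-predicted, and points below it are over-predicted. A minimal sketch:

# same scatter, but with a y = x "perfect prediction" line instead of a smooth
ggplot(lmData, aes(predicted, deaths)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, color = "#985100") +   # perfect-prediction line
  labs(title = "Predicted vs. Actual Deaths",
       x = "Predicted deaths",
       y = "Actual deaths",
       caption = "NCHS - Drug Poisoning Mortality by State: United States") +
  theme_calc()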

Diagnostic Plots

plot(lmModel)

Overall it's not a terrible model, though I do notice that the Residuals vs Fitted plot looks a lot like the graph I made of my model, which makes sense, since both are plotted against the model's predicted (fitted) values. Not sure what else that indicates because I am only in my second semester, but I'm sure I'll find out soon.

The Q-Q plot also isn't terrible. The residuals follow the normal line closely toward the center but drift away from it in the tails, where there are multiple outliers. So the model's errors look well-behaved for most observations and less so for the extremes. So, yippiee.

The general takeaway from the diagnostic plots seems to be that the model is very good at predicting the deaths when the predicted values are low, and it gets less accurate as they get larger.
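
Minor side note: plot(lmModel) draws the four diagnostic plots one after another. If you would rather see them all in a single 2-by-2 panel for easier comparison, a quick sketch is:

# show all four diagnostic plots in one 2 x 2 panel, then reset the layout
par(mfrow = c(2, 2))
plot(lmModel)
par(mfrow = c(1, 1))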

Here I am graphing the residuals histogram.

hist(residuals(lmModel))

The histogram indicates that the model is pretty accurate, as the residuals are roughly normally distributed and centered around zero.

Conclusion

The year, the population of the state, and the age-adjusted death rate are the most prominent predictors of deaths by drug poisoning. The linear regression model that led me to this conclusion has a p-value of almost zero and an adjusted R-squared of 0.8814. This means the model explains a good amount of the variance, and the variables are very significant. The residuals are very normally distributed around 0, but as the predicted values get higher, the accuracy goes down. This answers my question, though the results are pretty boring.

Works Cited

NCHS - Drug Poisoning Mortality by State: United States | Data | Centers for Disease Control and Prevention. 4 June 2018, data.cdc.gov/National-Center-for-Health-Statistics/NCHS-Drug-Poisoning-Mortality-by-State-United-Stat/xbxb-epbu/about_data.

Poisoning/Prescription Drug Overdose. www.tn.gov/health/health-program-areas/fhw/injury-and-violence-prevention-programs/injury-topics/poisoning-prescription-drug-overdose.html.

————————————————————————

Extra information that isn’t a part of my project and is just here to fulfill my desire for irrelevant commentary.

And a deep sigh of relaxation comes from both me and the cat (Luna) currently sitting on my lap, who has borne witness to the entirety of this project. Granted, it was through a shaky view, as she was almost always resting on my arm, which does need to be moving in order to code.

My father told me to do this:

library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(lmModel)
## age_adjusted_rate              year        population 
##          1.539218          1.539763          1.000466

He is currently an economist for the government and used to work as a data analyst. He also teaches ECON at MC. His name is Kuo-liang Chang, or Matt Chang. You can find a picture of him with a single Google search. It’s kind of creepy.

According to him, this means that no variable is dominating or overlapping with the others: VIF values close to 1 indicate the predictors aren't strongly correlated with each other, so no coefficient estimate is being inflated by the rest. So yippie?

Photographic evidence of the Narcan for educational purposes I guess? I don’t think many people know what Narcan actually looks like.