For this project, I investigated the Center for Disease Control (CDC) dataset of opioid overdose deaths in the USA from 1999 to 2016.
The data is sourced from the CDC’s National Center for Health Statistics (NCHS). For more information on the NCHS and its publications, see: https://www.cdc.gov/nchs/products/index.htm
The dataset contains observations of opioid death rates by the following categorization:
State
Age Group (0-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75+, All Ages)
Population
Year
Sex
Race and Hispanic Origin (Hispanic, Non-Hispanic Black, Non-Hispanic White, All Races-All Origins)
For these categorizations, a number of statistical columns are provided. We will consider the values in Crude Death Rate to be the best available estimate for drug overdose deaths for the given observation point. Some other data points provided are:
Standard Error for Crude Rate
Age-adjusted Rate
Lower Confidence Limit for Crude Rate
Upper Confidence Limit for Crude Rate
In the future, I would also like to investigate how this error and confidence data might influence the significance of the observed trends in the dataset.
According to the CDC, drug overdose deaths quadrupled from 1999 to 2019. The CDC describes the opioid epidemic as occurring in three waves:
The first wave began with an increasing practice of opioid prescription in the 1990s
The second began in 2010, with a rapid increase in overdose deaths involving heroin
The third began in 2013 with significant increases in overdose deaths involving synthetic opioids, particularly fentanyl.[1]
library(BSDA)
library(tidyverse)
library(multcomp)
library(ggpubr)
library(ggthemes)
library(ggtext)
df <- readr::read_csv("/Users/daniellefevre/Documents/MATH\ 217/drug_poisonings/NCHS_-_Drug_Poisoning_Mortality_by_State__United_States.csv")
The effect of the opioid epidemic on the country has not been uniform, with certain geographic regions and demographics being affected worse than others. In this report, I will examine how these factors may correlate to the severity of the opioid epidemic, and how this may have changed over time. The data from the CDC reports a “Crude Death Rate”, which will be considered as a proxy for the overall severity of the epidemic at any point of observation.
While the opioid crisis has certainly not spared either gender, there seems to be a frequent narrative around men facing a greater risk of overdose and accounting for a higher percentage of deaths from overdose. For example, a Guardian article published in 2017 questions “Is North America’s opioid epidemic a crisis of masculinity?”[3] While drugabuse.gov notes that “illicit drug use is more likely to result in emergency department visits or overdose deaths for men than for women.”
While it is true that men account for an overall greater occurrence of overdose deaths in the dataset, let us investigate the extent to which this difference is statistically significant.
Compute mean Crude Death Rate by Gender for all years:
my_data <- df
my_data$Sex <- ordered(my_data$Sex, levels = c("Male", "Female", "Both Sexes"))
group_by(my_data, Sex) %>%
summarise(
count = n(),
mean = mean(`Crude Death Rate`, na.rm = TRUE),
sd = sd(`Crude Death Rate`, na.rm = TRUE)
)
Let us assume a null relationship, that is, that the overall crude death rate for men and women is actually the same. We will use a t-test to test the alternative hypothesis: that the mean crude death rates for men and women from opioids have a difference greater than 0.
tsum.test(mean.x = 13.08, n.x = 648, s.x = 11.42, mean.y =7.01, n.y = 648, s.y = 648, alternative = "two.sided")
Welch Modified Two-Sample t-Test
data: Summarized x and y
t = 0.23842, df = 647.4, p-value = 0.8116
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-43.92375 56.06375
sample estimates:
mean of x mean of y
13.08 7.01
The 95% confidence interval includes zero. Therefore we fail to reject the null hypothesis, finding no compelling evidence that the difference in mean crude death rate from drug overdoses between 1999 and 2016 is non-zero.
Statistically speaking, it seems that the opioid crisis has not been categorically worse for men or women. While it is perhaps still worth noting that men account for a greater occurrence of deaths in the actual data, we should also be cautious of rhetoric that attempts to paint the opioid crisis as either a predominantly male or female issue.
In this section, we will take a look at whether any possible correlation exists between these two variables.
The motivation for this part of the investigation is based on a few different things. There is an unfortunate and extremely fallacious association between poverty and alcohol and substance abuse. I suspect that this association would not ultimately be born out by a robust survey of the data. However, on the other hand, it is not at all difficult to imagine why reduced economic circumstances could be correlated to increased deaths from drug overdoses. Aside from individuals turning to substances out of idleness and desperation, lower income could imply lower municipal budgets for things like treatment and rehabilitation.
For this section we use US Census data on the median household income; both by state and for the country as a whole.
Note: The raw US census data can be found in the file median_household_income_by_Census.csv. Necessary Pre-processing steps on this data are carried out in preprocess_median_incomes.Rmd. At the end of executing each line of code in this file, median_incomes_preprocessed.csv will be generated in the source folder. Both files are included with this report.
median_income <- readr::read_csv(file = "/Users/daniellefevre/Documents/MATH\ 217/drug_poisonings/median_incomes_preprocessed.csv")
median_income_transpose <- as_tibble(cbind(nms = names(median_income), t(median_income)))
# The state names are now in the first row, so we'll reset the names and drop
# the row from the table
names(median_income_transpose) <- median_income_transpose[1,]
median_income_transpose <- median_income_transpose[seq(2, nrow(median_income_transpose)),]
# The transposition caused the numeric vectors to become Character strings,
# So we will manually coerce each vector to an integer
median_income_transpose <- median_income_transpose %>% mutate(across(names(median_income_transpose), as.integer))
# The first column contains the original column headers, so for now we'll drop
# this, and add it back later
temp <- median_income_transpose[,seq(2, ncol(median_income_transpose))]
We are interested in the deviation of each state from the national average, so we will subtract each state column from the “United States” column.
First, we need this function:
subtract_from_natl_avg <- function(x) (x - temp$`United States`);
Now we apply it across each column of temp using mutate_all()
deviation_from_national_mean_income <- temp %>% mutate_all(subtract_from_natl_avg)
deviation_from_national_mean_income$Year <- seq(2019, 1984)
dev <- pivot_longer(deviation_from_national_mean_income,
cols = !Year,
names_to = "State",
values_to = "Difference From National Median")
#library(ggtext)
dev %>%
ggplot(aes(x = Year, y = `Difference From National Median`, group = State)) +
geom_line() +
labs(title = "State Median Income Difference from<br>National Median Income Difference; 1984 - 2019") +
theme(plot.title.position = "panel",
plot.title = element_markdown(hjust = 0.5)) +
scale_y_continuous(labels = scales::dollar_format()) +
theme_hc()
We can see here that the variance of median household income between states has been increasing over the past few decades.
How does this data measure up against state trends in the opioid epidemic? Let’s highlight some states which were particularly impacted:
top_10 <- df %>%
group_by(State) %>%
summarise(avg = mean(`Crude Death Rate`)) %>%
arrange(desc(avg)) %>%
head(10)
top_10
# Note: I ran into errors trying to use the State column from top_10, so
# in the next line I just create the vector of state names directly
top_10_states <- c("West Virginia", "New Mexico", "Nevada", "Kentucky", "Utah", "Rhode Island",
"Pennsylvania", "Oklahoma", "Arizona", "Tennessee")
top <- dev %>% filter(State %in% top_10_states)
remainder <- dev %>% filter(!(State %in% top_10_states))
dev$top_10 <- if_else(dev$State %in% top_10_states, "Top 20% impact from opioid crisis", "Rest of U.S.")
dev$top_10 <- factor(dev$top_10, levels = c("Rest of U.S.", "Top 20% impact from opioid crisis"))
p <- dev %>%
ggplot() +
geom_line(mapping = aes(x = Year, y = `Difference From National Median`, color = top_10, group = State,
text = paste(as.character(State),
as.character(Year),
paste("Median income: ", dollar_format()(`Difference From National Median`), " (relative to Nation as a whole)"),
sep = "\n"))) +
labs(title = "State Deviation from National Median Household Income<br>and Effect of the Opioid Crisis",
color = "") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = scales::dollar_format()) +
theme_hc()
fig <- ggplotly(p, tooltip = c("text"))
fig
NA
In this plot, the 10 states which were on average the most significantly impacted by the opioid crisis during the years in dataset are highlighted against the rest of the country. it appears that states with below the national median household income tended to fare worse in the opioid crisis, though we can see that 2-3 of the 10 hardest hit states still were above the national median household income. If you hover the mouse pointer over this plot, the tooltip will give the State name corresponding to each line. One thing I noticed here that directly contradicts my hypothesis is that while not in the top 10, DC and Maryland have been particularly hard hit by the opioid crisis as well, and they show the greatest positive deviation in household income from the national median in recent years of any states. I believe an argument could be made not to include DC in an otherwise State-level analysis, and in the future it might be interesting to see how excluding DC influences the result. Though it might be a stretch to leave Maryland out, it’s also worth noting that of course Maryland’s income distribution could be likewise skewed from the fact that many Maryland residents work in the District of Columbia.
In order to compute the correlation coefficients, we first need just a bit more manipulation of the data.
Use the pivot_wider function to move the state Crude Death Rate from columns to rows, and remove the rows of census data which do not correspond to years in the opioid dataset:
cdr <- df %>%
dplyr::select("State", `Crude Death Rate`, "Year") %>%
filter(State != "United States") %>%
pivot_wider(names_from = State, values_from = `Crude Death Rate`)
deviation_from_national_mean_income <- deviation_from_national_mean_income %>%
filter(Year %in% cdr$Year)
Finally we’ll use mapply() to compute the correlation coefficient for each state:
statewise_correlation <- mapply(cor, cdr[-1],
deviation_from_national_mean_income[-1])
# The following multiplation by -1 is done so that the result is positive for
# increasing overdoses and decreasing median income
statewise_correlation <- -1*statewise_correlation
statewise_correlation
Alabama Alaska Arizona Arkansas California Colorado Connecticut
-0.666940271 0.526071423 -0.343345612 -0.312858702 0.473856280 0.398913711 0.734020925
Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois
-0.624237064 0.193095716 -0.668810181 -0.739843105 0.191817306 0.179822632 -0.408459197
Indiana Iowa Kansas Kentucky Louisiana Maine Maryland
-0.641938870 0.716353625 0.032190058 -0.809797975 -0.603058656 -0.189769850 0.389802677
Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska
0.775718100 -0.824506124 0.067424853 -0.593530326 -0.097753544 0.695273533 0.518053818
Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota
-0.818249649 0.891892583 0.541304413 -0.444885994 0.390100004 -0.433298225 0.581456765
Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota
-0.747852115 -0.002586998 0.741563036 0.604772757 0.165363791 -0.482942739 0.721251815
Tennessee Texas Utah Vermont Virginia Washington West Virginia
-0.577845381 0.645982340 0.517933673 0.500928674 0.420718710 0.491076814 0.027752530
Wisconsin Wyoming
-0.389648294 0.951088746
Interestingly, in some states, the course of the opioid epidemic is very tightly correlated with the relative change in difference of median household income from the rest of the country, while in other states it seems to be the exact opposite.
mean(statewise_correlation)
[1] 0.05222436
sd(statewise_correlation)
[1] 0.5576218
Taking the dataset as a whole, it would seem that income trends and the impact of the opioid crisis are overall not closely related.
As noted previously, the CDC considers the opioid crisis as having come in distinct waves, each marked by distinct trends. It is possible that looking at the dataset in this manner is glossing over more localized trends.
More specifically, the CDC considers that the first wave of the opioid epidemic was driven by an increase in opioid prescriptions, whereas the second and third waves have been driven by illegal drugs. I am curious whether we might see some difference in the apparent correlation between individuals’ financial means and opioid overdose trends in waves 2 and 3 vs. wave 1. To that end, we’ll perform a similar analysis as the previous section for each wave separately:
# Drop Wyoming from cdr and United States from
# deviation_from_national_mean_income
cdr_income <- cdr[-52] %>% right_join( deviation_from_national_mean_income[-1],
by = c("Year"),
suffix = c(".cdr", ".income"))
wave_1 <- cdr_income %>%
filter(Year < 2010)
wave_2 <- cdr_income %>%
filter(Year >= 2010, Year < 2013)
wave_3 <- cdr_income %>%
filter(Year >= 2013)
statewise_correlation <- mapply(cor, wave_1[,2:51],
wave_1[,52:101])
statewise_correlation <- -1*statewise_correlation
statewise_correlation
Alabama.cdr Alaska.cdr Arizona.cdr Arkansas.cdr California.cdr Colorado.cdr
0.48401191 -0.15780070 0.25062835 0.17350939 -0.48193773 -0.41657854
Connecticut.cdr Delaware.cdr District of Columbia.cdr Florida.cdr Georgia.cdr Hawaii.cdr
-0.76855307 0.69748222 0.24943031 -0.02261499 0.50130707 -0.46349809
Idaho.cdr Illinois.cdr Indiana.cdr Iowa.cdr Kansas.cdr Kentucky.cdr
-0.24364551 0.38147848 0.86429561 -0.35390903 0.45242607 0.66456635
Louisiana.cdr Maine.cdr Maryland.cdr Massachusetts.cdr Michigan.cdr Minnesota.cdr
0.07673144 -0.32315600 -0.05830165 -0.60096788 0.84299882 0.62348184
Mississippi.cdr Missouri.cdr Montana.cdr Nebraska.cdr Nevada.cdr New Hampshire.cdr
0.86501236 0.74604640 -0.43615198 -0.36436656 -0.29292696 -0.84437630
New Jersey.cdr New Mexico.cdr New York.cdr North Carolina.cdr North Dakota.cdr Ohio.cdr
-0.20055702 -0.07569806 -0.07851634 0.76427356 -0.78173938 0.68317504
Oklahoma.cdr Oregon.cdr Pennsylvania.cdr Rhode Island.cdr South Carolina.cdr South Dakota.cdr
-0.40034599 0.06907923 0.03093761 -0.49711407 0.92000132 -0.69859385
Tennessee.cdr Texas.cdr Utah.cdr Vermont.cdr Virginia.cdr Washington.cdr
0.66516771 0.47883202 -0.36619122 -0.18963367 -0.72758749 -0.84219008
West Virginia.cdr Wisconsin.cdr
-0.29820580 0.62593701
mean(statewise_correlation)
[1] 0.02251304
sd(statewise_correlation)
[1] 0.5361566
statewise_correlation <- mapply(cor, wave_2[,2:51],
wave_2[,52:101])
statewise_correlation <- -1*statewise_correlation
statewise_correlation
Alabama.cdr Alaska.cdr Arizona.cdr Arkansas.cdr California.cdr Colorado.cdr
-0.61070987 -0.71775611 0.86876686 0.73257265 0.97254551 0.71344397
Connecticut.cdr Delaware.cdr District of Columbia.cdr Florida.cdr Georgia.cdr Hawaii.cdr
0.98934043 -0.65192459 0.95134242 0.76834255 0.99850316 -0.26057623
Idaho.cdr Illinois.cdr Indiana.cdr Iowa.cdr Kansas.cdr Kentucky.cdr
-0.09951711 0.66178964 0.73506995 -0.75299068 -0.91219587 0.92172208
Louisiana.cdr Maine.cdr Maryland.cdr Massachusetts.cdr Michigan.cdr Minnesota.cdr
-0.99394872 -0.44914109 -0.91875020 -0.93904099 0.62461680 -0.77666966
Mississippi.cdr Missouri.cdr Montana.cdr Nebraska.cdr Nevada.cdr New Hampshire.cdr
0.09359080 0.59224457 0.87944128 0.55926645 0.56959359 0.99240411
New Jersey.cdr New Mexico.cdr New York.cdr North Carolina.cdr North Dakota.cdr Ohio.cdr
-0.66390621 0.71558154 0.70139383 0.49125416 0.95171066 0.99929550
Oklahoma.cdr Oregon.cdr Pennsylvania.cdr Rhode Island.cdr South Carolina.cdr South Dakota.cdr
0.03204155 -0.93863286 -0.90638406 -0.13728076 0.11031329 0.28264178
Tennessee.cdr Texas.cdr Utah.cdr Vermont.cdr Virginia.cdr Washington.cdr
-0.43851405 0.37654557 -0.05478378 0.99579560 -0.73914002 -0.18161308
West Virginia.cdr Wisconsin.cdr
0.97727695 -0.98697738
mean(statewise_correlation)
[1] 0.1425599
sd(statewise_correlation)
[1] 0.724573
statewise_correlation <- mapply(cor, wave_3[,2:51],
wave_3[,52:101])
statewise_correlation <- -1*statewise_correlation
statewise_correlation
Alabama.cdr Alaska.cdr Arizona.cdr Arkansas.cdr California.cdr Colorado.cdr
0.93995766 -0.82768161 -0.48781565 0.45928960 -0.79669799 0.21234606
Connecticut.cdr Delaware.cdr District of Columbia.cdr Florida.cdr Georgia.cdr Hawaii.cdr
-0.80695586 0.55996568 0.06149972 0.56291157 0.75159728 0.07808964
Idaho.cdr Illinois.cdr Indiana.cdr Iowa.cdr Kansas.cdr Kentucky.cdr
0.32652694 0.39335628 0.05904805 0.49692633 -0.87026872 0.91056098
Louisiana.cdr Maine.cdr Maryland.cdr Massachusetts.cdr Michigan.cdr Minnesota.cdr
0.86717478 0.96828159 0.24010059 -0.82285222 -0.54375729 0.06447112
Mississippi.cdr Missouri.cdr Montana.cdr Nebraska.cdr Nevada.cdr New Hampshire.cdr
0.82031569 0.71424608 0.90813596 -0.88020978 0.29479300 0.69238332
New Jersey.cdr New Mexico.cdr New York.cdr North Carolina.cdr North Dakota.cdr Ohio.cdr
0.59807751 -0.61972774 -0.72375771 -0.76640136 0.16645901 -0.06622447
Oklahoma.cdr Oregon.cdr Pennsylvania.cdr Rhode Island.cdr South Carolina.cdr South Dakota.cdr
-0.55335599 -0.18152957 -0.05225413 0.81061099 -0.61538093 0.96812447
Tennessee.cdr Texas.cdr Utah.cdr Vermont.cdr Virginia.cdr Washington.cdr
-0.87750995 0.89991492 0.26338133 0.78910389 0.61813905 -0.90373372
West Virginia.cdr Wisconsin.cdr
0.77016286 0.37778070
mean(statewise_correlation)
[1] 0.1249524
sd(statewise_correlation)
[1] 0.6444564
Interestingly, while still not a strong predictor for opioid overdoses, we do see that the mean correlation coefficient between deviation from national median household income and opioid overdose death rates is about 6-7 times higher during waves 2 and 3 than wave 1. Unfortunately, I did not have time to determine if this result was statistically significant, as the standard error varies for every measurement in the census data, but it would be great to investigate this in the future.
In this section, we will carry out a linear regression analysis. We will consider the Crude Death Rate as a linear function of state, year, sex, and difference from the national household median income in the state of residence.
new <- left_join(df, dev, by = c("State", "Year"))
names(new) <- tolower(names(new))
names(new) <- gsub(" ", "", names(new))
fit1 <- lm(crudedeathrate ~ state + year + sex + differencefromnationalmedian, data = new )
summary(fit1)
Call:
lm(formula = crudedeathrate ~ state + year + sex + differencefromnationalmedian,
data = new)
Residuals:
Min 1Q Median 3Q Max
-17.948 -4.018 -0.530 2.292 50.212
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.177e+03 5.456e+01 -21.576 < 2e-16 ***
stateAlaska 7.220e+00 3.170e+00 2.277 0.02285 *
stateArizona 6.154e+00 2.561e+00 2.403 0.01634 *
stateArkansas -2.075e-01 2.509e+00 -0.083 0.93409
stateCalifornia 2.349e+00 2.845e+00 0.826 0.40896
stateColorado 6.508e+00 2.968e+00 2.193 0.02840 *
stateConnecticut 6.579e+00 3.282e+00 2.005 0.04509 *
stateDelaware 5.916e+00 2.761e+00 2.143 0.03223 *
stateDistrict of Columbia 7.221e+00 2.785e+00 2.593 0.00958 **
stateFlorida 4.620e+00 2.518e+00 1.835 0.06661 .
stateGeorgia 4.528e-01 2.556e+00 0.177 0.85939
stateHawaii 2.713e+00 3.052e+00 0.889 0.37407
stateIdaho 1.030e+00 2.559e+00 0.402 0.68742
stateIllinois 2.497e+00 2.709e+00 0.922 0.35659
stateIndiana 3.083e+00 2.549e+00 1.209 0.22667
stateIowa -1.974e+00 2.637e+00 -0.748 0.45432
stateKansas 4.817e-02 2.571e+00 0.019 0.98505
stateKentucky 8.146e+00 2.497e+00 3.263 0.00112 **
stateLouisiana 3.254e+00 2.502e+00 1.300 0.19359
stateMaine 3.377e+00 2.542e+00 1.328 0.18413
stateMaryland 8.934e+00 3.370e+00 2.651 0.00807 **
stateMassachusetts 7.577e+00 3.037e+00 2.495 0.01265 *
stateMichigan 3.884e+00 2.612e+00 1.487 0.13721
stateMinnesota 1.720e-01 3.026e+00 0.057 0.95468
stateMississippi -1.222e+00 2.531e+00 -0.483 0.62930
stateMissouri 4.225e+00 2.587e+00 1.633 0.10257
stateMontana 1.304e+00 2.497e+00 0.522 0.60162
stateNebraska -2.952e+00 2.660e+00 -1.110 0.26714
stateNevada 1.030e+01 2.638e+00 3.903 9.73e-05 ***
stateNew Hampshire 8.074e+00 3.342e+00 2.416 0.01577 *
stateNew Jersey 4.319e+00 3.221e+00 1.341 0.18008
stateNew Mexico 1.129e+01 2.497e+00 4.521 6.41e-06 ***
stateNew York 2.905e-01 2.628e+00 0.111 0.91198
stateNorth Carolina 2.093e+00 2.505e+00 0.836 0.40349
stateNorth Dakota -4.562e+00 2.584e+00 -1.766 0.07757 .
stateOhio 6.086e+00 2.556e+00 2.381 0.01735 *
stateOklahoma 5.818e+00 2.502e+00 2.326 0.02010 *
stateOregon 2.962e+00 2.640e+00 1.122 0.26209
statePennsylvania 7.416e+00 2.634e+00 2.816 0.00489 **
stateRhode Island 8.396e+00 2.715e+00 3.092 0.00201 **
stateSouth Carolina 1.683e+00 2.500e+00 0.673 0.50083
stateSouth Dakota -3.474e+00 2.558e+00 -1.358 0.17457
stateTennessee 5.225e+00 2.496e+00 2.093 0.03644 *
stateTexas -1.008e-01 2.568e+00 -0.039 0.96869
stateUnited States 1.742e+00 1.965e+00 0.886 0.37551
stateUtah 9.547e+00 2.930e+00 3.259 0.00113 **
stateVermont 2.888e+00 2.693e+00 1.072 0.28366
stateVirginia 2.131e+00 3.034e+00 0.702 0.48259
stateWashington 5.913e+00 2.879e+00 2.054 0.04012 *
stateWest Virginia 1.219e+01 2.510e+00 4.857 1.25e-06 ***
stateWisconsin 2.467e+00 2.694e+00 0.916 0.35991
year 5.905e-01 2.713e-02 21.763 < 2e-16 ***
sexFemale -2.967e+00 4.160e-01 -7.133 1.25e-12 ***
sexMale 3.104e+00 4.160e-01 7.462 1.13e-13 ***
differencefromnationalmedian -1.650e-04 9.822e-05 -1.680 0.09301 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.488 on 2789 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.2552, Adjusted R-squared: 0.2408
F-statistic: 17.7 on 54 and 2789 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit1)
In this case, we see that the model is not very effective. The diagnostic plots do not indicate homogeneity of variance.
What if we limit the scope of the model to the states with the most significant p-values?
top_2 <- new %>%
filter(state %in% c("West Virginia", "New Mexico", "Nevada", "Kentucky", "Utah", "Rhode Island",
"Pennsylvania", "Oklahoma", "Arizona", "Tennessee"))
fit2 <- lm(crudedeathrate ~ state + year + differencefromnationalmedian, data = top_2 )
summary(fit2)
Call:
lm(formula = crudedeathrate ~ state + year + differencefromnationalmedian,
data = top_2)
Residuals:
Min 1Q Median 3Q Max
-11.6597 -1.8726 -0.0945 1.7964 17.4264
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.073e+03 9.953e+01 -20.832 < 2e-16 ***
stateKentucky 3.438e+00 1.355e+00 2.538 0.01206 *
stateNevada 3.503e+00 1.178e+00 2.975 0.00336 **
stateNew Mexico 6.318e+00 1.285e+00 4.915 2.09e-06 ***
stateOklahoma 5.933e-01 1.228e+00 0.483 0.62962
statePennsylvania 6.556e-01 1.173e+00 0.559 0.57693
stateRhode Island 1.111e+00 1.272e+00 0.873 0.38376
stateTennessee 3.000e-01 1.296e+00 0.231 0.81729
stateUtah 1.198e+00 1.601e+00 0.749 0.45510
stateWest Virginia 7.951e+00 1.502e+00 5.295 3.69e-07 ***
year 1.040e+00 4.960e-02 20.972 < 2e-16 ***
differencefromnationalmedian 5.963e-05 1.161e-04 0.514 0.60806
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.391 on 168 degrees of freedom
Multiple R-squared: 0.7648, Adjusted R-squared: 0.7494
F-statistic: 49.66 on 11 and 168 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit2)
In this case, the diagnostic plots indicate much greater homogeneity of variance, so we can regard this more limited model as more useful than the first.
Looking at the P-values, we see that the year and residence among the states which were particularly hard hit by the epidemic were the most significant predictors. Based on this result, we can reject the null hypothesis that the opioid crisis was uniform among different states. Furthermore, the passage of time was a stronger predictor than income deviation or state of residence. This suggests that the opioid crisis was an event which unfolded with relatively little regard for econimic diffferences, at least between different states. Rather, simply the progression of the epidemic itself, and residence in one of the the particularly hard hit states, were the strongest predictors of overdose death rate. Further research is needed to explain why states such as West Virginia or New Mexico were significantly more impacted than others.
We can see from these plots that deviation from national median household income was a much stronger predictor among the states most affected by the opioid crisis than among the nation as a whole.
Let us now shift gears and examine the variation among different age groups. Personally, I tend to think of drug addiction and overdose as a trend among young people, but this may be based on my experience of being frequently warned off of drugs as a teenager. I am interested to see how this compares with the data.
#ages <- df
#ages$`Age Group` <- ordered(ages$`Age Group`, levels = c("0-14", "15-24", "25-34", "45-54", "55-64", "65-74", "75+"))
# Todo: fix order of X axis
ggboxplot(df, x = "Age Group", y = "Crude Death Rate",
color = "Age Group", #pallette = c("#00AFBB", "#E7B800", "#FC4E07"),
#order = c("Male", "Female", "Both Sexes"),
ylab = "Crude Death Rate", xlab = "Age Group")
ggline(df, x = "Age Group", y = "Crude Death Rate",
add = c("mean_se", "jitter"))
It appears from these plots that opioid overdoses tend to be more common among adults than teenagers.
We can also see that overdoses are exceedingly uncommon among children under 14 and the elderly. This is not particularly surprising, and we might as well leave these groups out of our analysis of variance, as they will only skew the result. We will test for a difference in mean crude death rate between the age groups of 25-34, 35-44, 45-54, and 55-64; the null hypothesis being that there is no meaningful difference in the crude death rate between these groups.
df_aov <- df %>%
filter(`Age Group` != "0–14",
`Age Group` != "15–24",
`Age Group` != "All Ages",
`Age Group` != "65–74",
`Age Group` != "75+")
res.aov <- aov(`Crude Death Rate` ~ `Age Group`, data = df_aov)
summary(res.aov)
Df Sum Sq Mean Sq F value Pr(>F)
`Age Group` 3 10287 3429 40.52 <2e-16 ***
Residuals 860 72774 85
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Since the p-value is less than a significance level of 0.05, we can conclude there is a statistically significant difference in the mean crude death rate among the age groups of 25-34, 35-44, 45-54, and 55-64.
library(multcomp)
summary(glht(res.aov, lincft = mcp(group = "Tukey")))
Simultaneous Tests for General Linear Hypotheses
Fit: aov(formula = `Crude Death Rate` ~ `Age Group`, data = df_aov)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
(Intercept) == 0 14.1942 0.6259 22.678 <0.001 ***
`Age Group`35–44 == 0 4.6639 0.8852 5.269 <0.001 ***
`Age Group`45–54 == 0 6.8756 0.8852 7.768 <0.001 ***
`Age Group`55–64 == 0 -1.6893 0.8852 -1.908 0.158
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
plot(res.aov, 1)
There is no apparent relationship between residuals and group means, so we can assume the homogeneity of variances.
plot(res.aov, 2)
shapiro.test(x = residuals(object = res.aov))
Shapiro-Wilk normality test
data: residuals(object = res.aov)
W = 0.92656, p-value < 2.2e-16
The Shapiro-Wilk p-value of 2.2e-16 < 0.001 provides strong evidence of non-normality, and so the assumptions required for this analysis of variance to be valid may not be satisfied.
Since the normal assumption may not be valid, we can apply the Kruskal-Wallis test, which does not require this assumption:
kruskal.test(`Crude Death Rate` ~ `Age Group`, data = df_aov)
Kruskal-Wallis rank sum test
data: Crude Death Rate by Age Group
Kruskal-Wallis chi-squared = 141.89, df = 3, p-value < 2.2e-16
Based on our p-value of 2.2e-16 < 0.05, we can reject the the null hypothesis and conclude there is meaningful evidence of a difference in crude death rate from drug overdose between the age groups of 25-34, 35-44, 45-54, and 55-64.
In this report, we investigated CDC data on drug overdoses between 1999 and 2016. We found that while men do account for a greater occurrence of drug overdose deaths in the dataset, the difference between men and women was not statistically significant. We compared trends in the drug overdose rate with between-state trends in household income. We found that overall, trends in household income relative to the rest of the country were very loosely correlated with trends in drug overdose deaths, household financial means may have been a more significant factor during the waves 2 and 3 than during wave 1, however more analysis is needed to draw a firm conclusion. Through linear regression analysis, we determined that year and state of residence were the most important predictors of drug overdose death rate in the dataset. Finally, we examined the breakdown of drug overdose deaths between different age groups, and found a statistically significant difference in overdose death rates between adults aged 25-34, 35-44, 45-54, and 55-64, all of which experienced an overdose death rate that was comparable to one another and significantly higher than the other age groups in the dataset.
[1] https://www.cdc.gov/drugoverdose/epidemic/index.html
[3] https://www.theguardian.com/world/2017/jul/12/opioids-crisis-men-overdoses-psychology