Introduction

I grew up in New Hampshire, which has been called “Ground Zero for Opioids” (US News). President Trump even called New Hampshire “a drug-infested den” (Washington Post). However, New Hampshire isn’t the only state or country facing an opioid crisis. On October 26, 2017 President Trump declared the opioid crisis a public health emergency. As the leading cause of deaths for Americans under 50 years old, the current overdose epidemic is the deadliest drug crisis in American History. To learn more about the crisis please watch the brief video below:

The Facts on America’s Opioid Epidemic

Although we know the magnitude of the problem, what causes a person to become addicted to opioids in the first place? Elaine Fehrman, Vincent Egan, and Evgeny M. Mirkes viewed this question from a psychological standpoint and collected data from 1885 respondents from over seven countries to evaluate how personality traits and demographics affect drug consumption.

In this analysis, I will use their data to evaluate and analyze the risk of drug consumption for 18 drugs by utilizing:

  • Data cleaning

  • Data visualization

  • Association analysis

Identifying drug-use risk allows targeted intervention to happen before a person may become susceptible to the highly addictive nature of drugs and the risk of potential overdose. As an epidemic in our country, the first step to a solution is truly understanding the cause and risks at the root of the problem.

Packages

To analyze this data, I will use the following R packages:

library(data.table)   # Importing Data
library(tidyverse)    # Manipulating and visualizing data
library(DT)           # Viewing data in table format
library(arules)       # Association analysis
library(arulesViz)    # Visualizing for association analysis

Data Preparation

The Drug consumption (quantified) Data Set was donated by Evgeny M. Mirkes to the University of California Irvine in October 2016. The data and codebook can be found in the UCI Machine Learning Repository. The data is composed of 32 columns and 1885 rows and does not contain any missing values.

Elaine Fehrman collected the data between March 2011 and March 2012 using an online survey tool from Survey Gizmo. Her collection methods are outline below:

“The study recruited 2051 participants over a 12-month recruitment period. Of these persons, 166 did not respond correctly to a validity check built into the middle of the scale, so were presumed to being inattentive to the questions being asked. Nine of these persons were found to also have endorsed using a fictitious recreational drug, and which was included precisely to identify respondents who over-claim, as have other studies of this kind. This led a useable sample of 1885 participants (male/female = 943/942).”

Except from: Fehrman, E.; Muhammad, A. K.; Mirkes, E. M.; Egan, V.; Gorban, A. N. The Five Factor Model of personality and evaluation of drug consumption risk.

Although the data does not contain any missing values, the data could be presented in a cleaner format, especially because the data contains many categorical variables. Take a look:

Load Data

To begin, I will load the raw data and view the unclean dataset.

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data"
drug.use <- fread(url, sep = ",", header = FALSE, showProgress = FALSE)
column.names <- c("ID", "Age", "Gender", "Education", "Country", "Ethnicity", "Nscore", "Escore", "Oscore", "Ascore", "Cscore", "Impulsive", "SS", "Alcohol", "Amphet", "Amyl", "Benzos", "Caff", "Cannabis", "Choc", "Coke", "Crack", "Ecstasy", "Heroin", "Ketamine", "Legalh", "LSD", "Meth", "Mushrooms", "Nicotine", "Semer", "VSA")
setnames(drug.use, column.names)

View the first 5 rows of the raw data:

datatable(drug.use, options = list(scrollX = TRUE, searching = FALSE, pageLength = 5), caption = 'Table 1: Raw Drug Use Data')

Clean Data

To clean that data I will perform the following tasks:

  • Remove column 1

  • Create categorical data for columns 2:6

  • Update the categorical data for the drug use in Alcohol:VSA to be binary: 0 = nonuser, 1 = user

drug.use[drug.use == "CL0"]<- 0
drug.use[drug.use == "CL1"]<- 0
drug.use[drug.use == "CL2"]<- 1
drug.use[drug.use == "CL3"]<- 1
drug.use[drug.use == "CL4"]<- 1
drug.use[drug.use == "CL5"]<- 1
drug.use[drug.use == "CL6"]<- 1

drug.clean <- drug.use %>%
                as_tibble %>%
                mutate_at(vars(Age:Ethnicity), funs(as.factor)) %>%
                mutate(Age = factor(Age, labels = c("18_24", "25_34", "35_44", "45_54", "55_64", "65_"))) %>%
                mutate(Gender = factor(Gender, labels = c("Male", "Female"))) %>%
                mutate(Education = factor(Education, labels = c("Under16", "At16", "At17", "At18", "SomeCollege","ProfessionalCert", "Bachelors", "Masters", "Doctorate"))) %>%
                mutate(Country = factor(Country, labels = c("USA", "NewZealand", "Other", "Australia", "Ireland","Canada","UK"))) %>%
                mutate(Ethnicity = factor(Ethnicity, labels = c("Black", "Asian", "White", "White/Black", "Other", "White/Asian", "Black/Asian"))) %>%
                mutate_at(vars(Alcohol:VSA), funs(as.numeric)) %>%
                select(-ID)

View the clean data:

datatable(drug.clean, options = list(scrollX = TRUE, searching = FALSE, pageLength = 5), caption = 'Table 2: Clean Drug Use Data')

Data Visualization

The visualizations will focus on heroin use, as it is a primary driver of the opioid epidemic.

Visualization of Categorical Data

age.hist <- ggplot(drug.clean, aes(Age))+
  geom_bar(aes(fill=as.factor(Heroin)), width = 0.5,show.legend=F) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Age")

education.hist <- ggplot(drug.clean, aes(Education))+
  geom_bar(aes(fill=as.factor(Heroin)), width = 0.5, show.legend=F) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Education")

country.hist <- ggplot(drug.clean, aes(Country))+
  geom_bar(aes(fill=as.factor(Heroin)), width = 0.5, show.legend=F) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Country")

ethnicity.hist <- ggplot(drug.clean, aes(Ethnicity))+
  geom_bar(aes(fill=as.factor(Heroin)), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Ethnicity", fill = "Heroin Use")

gridExtra::grid.arrange(age.hist, education.hist, country.hist, ethnicity.hist, nrow = 2)

Visualization of Continuous Data

Nscore.den <- ggplot(drug.clean, aes(Nscore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Nscore", 
       subtitle="Neuroticism",
       x="Nscore")

Escore.den <- ggplot(drug.clean, aes(Escore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Escore", 
       subtitle="Extraversion",
       x="Escore")

Oscore.den <- ggplot(drug.clean, aes(Oscore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Oscore", 
       subtitle="Openness to experience",
       x="Oscore")

Ascore.den <- ggplot(drug.clean, aes(Ascore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Ascore", 
       subtitle="Agreeableness",
       x="Ascore")

Cscore.den <- ggplot(drug.clean, aes(Cscore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Cscore", 
       subtitle="Conscientiousness",
       x="Ascore")

Impulsive.den <- ggplot(drug.clean, aes(Impulsive))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Impulsive", 
       subtitle="Impulsiveness",
       x="Impulsive")

SS.den <- ggplot(drug.clean, aes(SS))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8) + 
  labs(title="SS", 
       subtitle="Sensation Seeking",
       x="SS",
       fill="Heroin Use")

gridExtra::grid.arrange(Nscore.den, Escore.den, Oscore.den, Ascore.den, Cscore.den, Impulsive.den, SS.den, nrow = 3)

Associaton Rules

Association rules allow us to assess what drugs are being used together. To analyze association rules, a subset of just the drug variables are seperated from the clean dataset. Chocolate and caffeine were removed as those are commonly consumed items. Several drug associations are evident in the graphs and charts.

drug.arules0 <- drug.clean %>%
  select(13:31,-Choc, -Caff)

drug.arules <- as(as.matrix(drug.arules0), "transactions")

itemFrequencyPlot(drug.arules, support = 0.1, cex.names=0.8)

basket_rules <- apriori(drug.arules,parameter = list(sup = 0.3, conf = 0.7,target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5     0.3      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 565 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[17 item(s), 1885 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [82 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(subset(basket_rules, size(basket_rules)>3))
##      lhs                              rhs        support   confidence
## [1]  {Amphet,Cannabis,Nicotine}    => {Alcohol}  0.3098143 0.9881557 
## [2]  {Alcohol,Amphet,Nicotine}     => {Cannabis} 0.3098143 0.9831650 
## [3]  {Alcohol,Amphet,Cannabis}     => {Nicotine} 0.3098143 0.9068323 
## [4]  {Cannabis,Mushrooms,Nicotine} => {Alcohol}  0.3140584 0.9899666 
## [5]  {Alcohol,Mushrooms,Nicotine}  => {Cannabis} 0.3140584 0.9916248 
## [6]  {Alcohol,Cannabis,Mushrooms}  => {Nicotine} 0.3140584 0.8875562 
## [7]  {Cannabis,Coke,Nicotine}      => {Alcohol}  0.3151194 0.9933110 
## [8]  {Alcohol,Coke,Nicotine}       => {Cannabis} 0.3151194 0.9753695 
## [9]  {Alcohol,Cannabis,Coke}       => {Nicotine} 0.3151194 0.9152542 
## [10] {Benzos,Cannabis,Nicotine}    => {Alcohol}  0.3082228 0.9830795 
## [11] {Alcohol,Benzos,Nicotine}     => {Cannabis} 0.3082228 0.9416532 
## [12] {Alcohol,Benzos,Cannabis}     => {Nicotine} 0.3082228 0.8870229 
## [13] {Cannabis,Legalh,Nicotine}    => {Alcohol}  0.3458886 0.9893778 
## [14] {Alcohol,Legalh,Nicotine}     => {Cannabis} 0.3458886 0.9863843 
## [15] {Alcohol,Cannabis,Legalh}     => {Nicotine} 0.3458886 0.8993103 
## [16] {Cannabis,Ecstasy,Nicotine}   => {Alcohol}  0.3442971 0.9923547 
## [17] {Alcohol,Ecstasy,Nicotine}    => {Cannabis} 0.3442971 0.9848255 
## [18] {Alcohol,Cannabis,Ecstasy}    => {Nicotine} 0.3442971 0.9001387 
##      lift     count
## [1]  1.025137 584  
## [2]  1.465032 584  
## [3]  1.352357 584  
## [4]  1.027015 592  
## [5]  1.477639 592  
## [6]  1.323610 592  
## [7]  1.030485 594  
## [8]  1.453416 594  
## [9]  1.364916 594  
## [10] 1.019871 581  
## [11] 1.403175 581  
## [12] 1.322815 581  
## [13] 1.026405 652  
## [14] 1.469830 652  
## [15] 1.341139 652  
## [16] 1.029493 649  
## [17] 1.467507 649  
## [18] 1.342375 649
plot(head(sort(basket_rules, by="lift"), 20), method = "graph")

plot(basket_rules, method="grouped")