Capstone Proposal: An Analysis of Drug Consumption Risk

Opioid Introduction

I grew up in New Hampshire, which has been called “Ground Zero for Opioids” (US News). President Trump even called New Hampshire “a drug-infested den” (Washington Post). However, New Hampshire isn’t the only state or country facing an opioid crisis. On October 26, 2017 President Trump declared the opioid crisis a public health emergency. As the leading cause of deaths for Americans under 50 years old, the current overdose epidemic is the deadliest drug crisis in American History. To learn more about the crisis please watch the brief video below:

Capstone Overview

Although we know the magnitude of the problem, what causes a person to become addicted to opioids in the first place? Elaine Fehrman, Vincent Egan, and Evgeny M. Mirkes viewed this question from a psychological standpoint and collected data from 1885 respondents from over seven countries to evaluate how personality traits and demographics affect drug consumption.

In this analysis, I will use their data to evaluate and analyze the risk of drug consumption for heroin by utilizing:

Data cleaning
Data visualization
Predictive modeling with logistic response
Association analysis

Identifying drug-use risk allows targeted intervention to happen before a person may become susceptible to the highly addictive nature of drugs and the risk of potential overdose. As an epidemic in our country, the first step to a solution is truly understanding the cause and risks at the root of the problem.

Packages

To analyze this data, I will use the following R packages:

library(data.table)   # Importing Data
library(tidyverse)    # Manipulating and visualizing data
library(DT)           # Viewing data in table format
library(arules)       # Association analysis
library(arulesViz)    # Visualizing for association analysis

#Additional packages will be added for the predictive models

Data Preparation

The Drug consumption (quantified) Data Set was donated by Evgeny M. Mirkes to the University of California Irvine in October 2016. The data and codebook can be found in the UCI Machine Learning Repository. The data is composed of 32 columns and 1885 rows and does not contain any missing values.

Elaine Fehrman collected the data between March 2011 and March 2012 using an online survey tool from Survey Gizmo. Her collection methods are outline below:

“The study recruited 2051 participants over a 12-month recruitment period. Of these persons, 166 did not respond correctly to a validity check built into the middle of the scale, so were presumed to being inattentive to the questions being asked. Nine of these persons were found to also have endorsed using a fictitious recreational drug, and which was included precisely to identify respondents who over-claim, as have other studies of this kind. This led a useable sample of 1885 participants (male/female = 943/942).”

Except from: Fehrman, E.; Muhammad, A. K.; Mirkes, E. M.; Egan, V.; Gorban, A. N. The Five Factor Model of personality and evaluation of drug consumption risk.

Although the data does not contain any missing values, the data could be presented in a cleaner format, especially because the data contains many categorical variables. Take a look:

Load Data

To begin, I will load the raw data and view the unclean dataset.

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data"
drug.use <- fread(url, sep = ",", header = FALSE, showProgress = FALSE)
column.names <- c("ID", "Age", "Gender", "Education", "Country", "Ethnicity", "Nscore", "Escore", "Oscore", "Ascore", "Cscore", "Impulsive", "SS", "Alcohol", "Amphet", "Amyl", "Benzos", "Caff", "Cannabis", "Choc", "Coke", "Crack", "Ecstasy", "Heroin", "Ketamine", "Legalh", "LSD", "Meth", "Mushrooms", "Nicotine", "Semer", "VSA")
setnames(drug.use, column.names)

View the first 5 rows of the raw data:

datatable(drug.use, options = list(scrollX = TRUE, searching = FALSE, pageLength = 5), caption = 'Table 1: Raw Drug Use Data')

Clean Data

To clean that data I will perform the following tasks:

Remove column 1
Create categorical data for columns 2:6
Update the categorical data for the drug use in Alcohol:VSA to be binary: 0 = nonuser, 1 = user

drug.use[drug.use == "CL0"]<- 0
drug.use[drug.use == "CL1"]<- 0
drug.use[drug.use == "CL2"]<- 1
drug.use[drug.use == "CL3"]<- 1
drug.use[drug.use == "CL4"]<- 1
drug.use[drug.use == "CL5"]<- 1
drug.use[drug.use == "CL6"]<- 1

drug.clean <- drug.use %>%
                as_tibble %>%
                mutate_at(vars(Age:Ethnicity), funs(as.factor)) %>%
                mutate(Age = factor(Age, labels = c("18_24", "25_34", "35_44", "45_54", "55_64", "65_"))) %>%
                mutate(Gender = factor(Gender, labels = c("Male", "Female"))) %>%
                mutate(Education = factor(Education, labels = c("Under16", "At16", "At17", "At18", "SomeCollege","ProfessionalCert", "Bachelors", "Masters", "Doctorate"))) %>%
                mutate(Country = factor(Country, labels = c("USA", "NewZealand", "Other", "Australia", "Ireland","Canada","UK"))) %>%
                mutate(Ethnicity = factor(Ethnicity, labels = c("Black", "Asian", "White", "White/Black", "Other", "White/Asian", "Black/Asian"))) %>%
                mutate_at(vars(Alcohol:VSA), funs(as.numeric)) %>%
                select(-ID)

View the clean data:

datatable(drug.clean, options = list(scrollX = TRUE, searching = FALSE, pageLength = 5), caption = 'Table 2: Clean Drug Use Data')

Exploratory Visualization

Visualization of Categorical Data

age.hist <- ggplot(drug.clean, aes(Age))+
  geom_bar(aes(fill=as.factor(Heroin)), width = 0.5,show.legend=F) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Age")

education.hist <- ggplot(drug.clean, aes(Education))+
  geom_bar(aes(fill=as.factor(Heroin)), width = 0.5, show.legend=F) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Education")

country.hist <- ggplot(drug.clean, aes(Country))+
  geom_bar(aes(fill=as.factor(Heroin)), width = 0.5, show.legend=F) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Country")

ethnicity.hist <- ggplot(drug.clean, aes(Ethnicity))+
  geom_bar(aes(fill=as.factor(Heroin)), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Ethnicity", fill = "Heroin Use")

gridExtra::grid.arrange(age.hist, education.hist, country.hist, ethnicity.hist, nrow = 2)

Visualization of Continuous Data

Nscore.den <- ggplot(drug.clean, aes(Nscore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Nscore", 
       subtitle="Neuroticism",
       x="Nscore")

Escore.den <- ggplot(drug.clean, aes(Escore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Escore", 
       subtitle="Extraversion",
       x="Escore")

Oscore.den <- ggplot(drug.clean, aes(Oscore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Oscore", 
       subtitle="Openness to experience",
       x="Oscore")

Ascore.den <- ggplot(drug.clean, aes(Ascore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Ascore", 
       subtitle="Agreeableness",
       x="Ascore")

Cscore.den <- ggplot(drug.clean, aes(Cscore))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Cscore", 
       subtitle="Conscientiousness",
       x="Ascore")

Impulsive.den <- ggplot(drug.clean, aes(Impulsive))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8, show.legend=F) + 
  labs(title="Impulsive", 
       subtitle="Impulsiveness",
       x="Impulsive")

SS.den <- ggplot(drug.clean, aes(SS))+
  geom_density(aes(fill=factor(Heroin)), alpha=0.8) + 
  labs(title="SS", 
       subtitle="Sensation Seeking",
       x="SS",
       fill="Heroin Use")

gridExtra::grid.arrange(Nscore.den, Escore.den, Oscore.den, Ascore.den, Cscore.den, Impulsive.den, SS.den, nrow = 3)

Proposed Models

Methods to be employed for predictive modeling:

*GLM

*Tree

*GAM

*LDA

*SVM

*Random Forest

*Neural Network

*Elastic Net, Lasso or Ridge Regression

In each case, heroin use will be a binary response indicating a user or non-user. The indicator variables are both categorical and continuous and are overviewed in the exploratory visualization section. The models will be compared to find the best predictive model.