Project Proposal: An Analysis of Drug Consumption Risk

Introduction

I grew up in New Hampshire, which has been called “Ground Zero for Opioids” (US News). President Trump even called New Hampshire “a drug-infested den” (Washington Post). However, New Hampshire is far from the only place facing an opioid crisis. On October 26, 2017, President Trump declared the opioid crisis a public health emergency. As the leading cause of death for Americans under 50 years old, the current overdose epidemic is the deadliest drug crisis in American history. To learn more about the crisis, please watch the brief video below:

Although we know the magnitude of the problem, what causes a person to become addicted to opioids in the first place? Elaine Fehrman, Vincent Egan, and Evgeny M. Mirkes viewed this question from a psychological standpoint and collected data from 1885 respondents from over seven countries to evaluate how personality traits and demographics affect drug consumption.

In this analysis, I will use their data to evaluate and analyze the risk of drug consumption for 18 drugs by utilizing:

* Data exploration
* Cluster analysis
* Linear regression
* Association analysis

Identifying drug-use risk allows targeted intervention to happen before a person becomes susceptible to the highly addictive nature of drugs and the risk of potential overdose. Because this is an epidemic in our country, the first step toward a solution is truly understanding the causes and risks at the root of the problem.

Packages

To analyze this data, I will use the following R packages:

library(data.table)   # Importing Data
library(dplyr)        # Manipulating data
library(tidyverse)    # Tidying data
library(DT)           # Viewing data in table format
library(ggplot2)      # Visualizing data
library(fpc)          # K-means clustering
library(leaps)        # Selecting variables (linear regression)
library(arules)       # Association analysis
library(arulesViz)    # Visualizing for association analysis

Data Preparation

The Drug consumption (quantified) Data Set was donated by Evgeny M. Mirkes to the University of California Irvine in October 2016. The data and codebook can be found in the UCI Machine Learning Repository. The data is composed of 32 columns and 1885 rows and does not contain any missing values.

Elaine Fehrman collected the data between March 2011 and March 2012 using an online survey tool from Survey Gizmo. Her collection methods are outlined below:

“The study recruited 2051 participants over a 12-month recruitment period. Of these persons, 166 did not respond correctly to a validity check built into the middle of the scale, so were presumed to being inattentive to the questions being asked. Nine of these persons were found to also have endorsed using a fictitious recreational drug, and which was included precisely to identify respondents who over-claim, as have other studies of this kind. This led a useable sample of 1885 participants (male/female = 943/942).”

Excerpt from: Fehrman, E.; Muhammad, A. K.; Mirkes, E. M.; Egan, V.; Gorban, A. N. The Five Factor Model of personality and evaluation of drug consumption risk.
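The sample-size arithmetic in the excerpt can be checked in a couple of lines. This sketch assumes the nine over-claimers are counted among the 166 excluded respondents, which is how the phrase "nine of these persons" reads; the variable names are illustrative only.

```r
# Figures taken from the excerpt above.
recruited             <- 2051
failed_validity_check <- 166   # assumed to include the nine over-claimers
males                 <- 943
females               <- 942

# Usable sample after exclusions, and the reported male/female split.
usable <- recruited - failed_validity_check
print(usable)            # 1885
print(males + females)   # 1885
```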

Although the data does not contain any missing values, the data could be presented in a cleaner format, especially because the data contains many categorical variables. Take a look:

Load Data

To begin, I will load the raw data and view the unclean dataset.

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data"
drug.use <- fread(url, sep = ",", header = FALSE, showProgress = FALSE)
column.names <- c("ID", "Age", "Gender", "Education", "Country", "Ethnicity", "Nscore", "Escore", "Oscore", "Ascore", "Cscore", "Impulsive", "SS", "Alcohol", "Amphet", "Amyl", "Benzos", "Caff", "Cannabis", "Choc", "Coke", "Crack", "Ecstasy", "Heroin", "Ketamine", "Legalh", "LSD", "Meth", "Mushrooms", "Nicotine", "Semer", "VSA")
setnames(drug.use, column.names)

View the first 5 rows of the raw data:

datatable(drug.use, options = list(scrollX = TRUE, searching = FALSE, pageLength = 5))
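The earlier description states that the data has 32 columns, 1885 rows, and no missing values. A minimal sketch of how that could be confirmed after loading follows. The helper name `check_shape` and the toy data frame are hypothetical and exist only so the example runs without a download; with the real table the call would be `check_shape(drug.use, 1885, 32)`.

```r
# Hypothetical helper (not part of the original analysis): reports whether a
# data frame has the expected dimensions and contains no missing values.
check_shape <- function(df, expected_rows, expected_cols) {
  list(
    rows_ok    = nrow(df) == expected_rows,
    cols_ok    = ncol(df) == expected_cols,
    no_missing = !any(is.na(df))
  )
}

# Toy stand-in for the real table.
toy <- data.frame(
  ID      = 1:3,
  Nscore  = c(0.31, -0.68, -0.47),
  Alcohol = c("CL5", "CL5", "CL6")
)

result <- check_shape(toy, expected_rows = 3, expected_cols = 3)
print(result)

# The same helper flags a table that does contain a missing value.
toy_with_gap <- toy
toy_with_gap$Nscore[2] <- NA
result_gap <- check_shape(toy_with_gap, expected_rows = 3, expected_cols = 3)
```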

Clean Data

To clean the data, I will perform the following tasks:

Remove column 1
Create categorical data for columns 2:6

# Convert the coded demographic columns to labelled factors and drop the ID column.
# factor() assigns the labels in ascending order of the numeric codes.
drug.clean <- drug.use %>%
                as_tibble() %>%
                mutate_at(vars(Age:Ethnicity), as.factor) %>%
                mutate(Age = factor(Age, labels = c("18_24", "25_34", "35_44", "45_54", "55_64", "65_"))) %>%
                mutate(Gender = factor(Gender, labels = c("Male", "Female"))) %>%
                mutate(Education = factor(Education, labels = c("Under16", "At16", "At17", "At18", "SomeCollege", "ProfessionalCert", "Bachelors", "Masters", "Doctorate"))) %>%
                mutate(Country = factor(Country, labels = c("USA", "NewZealand", "Other", "Australia", "Ireland", "Canada", "UK"))) %>%
                mutate(Ethnicity = factor(Ethnicity, labels = c("Black", "Asian", "White", "WhiteBlack", "Other", "WhiteAsian", "BlackAsian"))) %>%
                select(-ID)

## Warning in combine_vars(vars, ind_list): '.Random.seed' is not an integer
## vector but of type 'NULL', so ignored
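The cleaning step relies on `factor()` assigning `labels` in ascending order of the underlying numeric codes, so each label vector has to be listed in that order. The sketch below illustrates this on a toy vector. The two Gender codes are an assumption about the codebook values and should be checked against it; the variable names are hypothetical. Passing `levels` explicitly is one way to make the code-to-label mapping visible rather than implicit.

```r
# Toy vector of coded values (assumed codes: -0.48246 = Male, 0.48246 = Female).
gender_codes <- c(0.48246, -0.48246, -0.48246, 0.48246)

# Implicit mapping: labels are matched to the sorted unique codes,
# so "Male" goes to the smaller code and "Female" to the larger one.
gender_implicit <- factor(gender_codes, labels = c("Male", "Female"))

# Explicit mapping: each label is paired with a stated code.
gender_explicit <- factor(
  gender_codes,
  levels = c(-0.48246, 0.48246),
  labels = c("Male", "Female")
)

print(as.character(gender_implicit))
print(all(gender_implicit == gender_explicit))
```

With the explicit form, a typo or an unexpected code shows up as `NA` instead of silently shifting every label by one position.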

View the clean data:

datatable(drug.clean, options = list(scrollX = TRUE, searching = FALSE, pageLength = 5))

Proposed Exploratory Data Analysis

Goals for the project:

Improve the formatting of this R Markdown document
Update variable values using dplyr and piping operations
Visualize data using ggplot2
Learn how to make a floating table of contents
Perform machine learning techniques learned in Data Mining I and II
Provide recommendations for drug intervention programs based on results
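As a rough illustration of the planned cluster analysis, the sketch below runs k-means on simulated data. It is only a sketch under stated assumptions: it uses base R's `kmeans()` rather than the fpc package listed above, the two score columns are simulated stand-ins rather than the survey data, and the choice of two clusters is arbitrary.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Simulated stand-ins for two standardized score columns (not the survey data):
# two well-separated groups of 50 observations each.
scores <- data.frame(
  Nscore = c(rnorm(50, mean = -2, sd = 0.5), rnorm(50, mean = 2, sd = 0.5)),
  Escore = c(rnorm(50, mean = -2, sd = 0.5), rnorm(50, mean = 2, sd = 0.5))
)

# Fit k-means with two clusters; nstart = 10 tries several random starts
# and keeps the solution with the lowest within-cluster sum of squares.
fit <- kmeans(scores, centers = 2, nstart = 10)

print(table(fit$cluster))  # number of observations per cluster
print(fit$centers)         # cluster centers on the two score columns
```

On the real data, the number of clusters would need to be chosen from the data, for example by comparing solutions across several values of `centers`.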