Hello

This week in data mining, we were given the 2018 treatment episode data set admissions. This data contained 1.9 million observations of data concerning individuals entering rehab. For our analysis, we chose to examine the effect that cannabis legislation had on rehab patients. The first step was finding out what each state’s individual cannabis legislation was and inputting it into our program. Note that we created two variables: - Cannabis Status recreational, medical, medical with THC limit and illegal - lim_THC illegal/medical with THC limit

# Create Variable For MJ_Status
# Note that Georgia and Oregon did not have enough data for this set
# Nebraska is the only state Decriminalized but illegal
IllegalStateKeys <- c(16, 25, 26, 30, 31)
RecreationalStateKeys <- c(2,4,6,8,11,17,23,32,34,46,50,53)
Medical_lim_THC_StateKeys <- c(1,10,18:21,37,45,47,48,51,55,56)
Medical_StateKeys <- c(5,9,12,15,22,24,27,28,29,33,35,36,38:44,49,54,72)

THC_illegal = c(IllegalStateKeys, Medical_lim_THC_StateKeys)
THC_legal = c(RecreationalStateKeys, Medical_StateKeys)


data$Cannabis_Status <- NA # Declare an empty variable
data$Cannabis_Status[data$STFIPS %in% IllegalStateKeys] = "Illegal" # assign Char "Illegal" in %in% c
data$Cannabis_Status[data$STFIPS %in% RecreationalStateKeys] = "Recreational" # assign Char "Recreational" in %in% c
data$Cannabis_Status[data$STFIPS %in% Medical_lim_THC_StateKeys] = "Medical_lim_THC" # assign Char "Medical_lim_THC" in %in% c
data$Cannabis_Status[data$STFIPS %in% Medical_StateKeys] = "Medical" # assign Char "Medical" in %in% c
# Make Cannabis Status a factor
data$Cannabis_Status <- as.factor(data$Cannabis_Status)

# Make new variable lim_THC
data$lim_THC <- NA # Declare an empty variable
data$lim_THC[data$STFIPS %in% THC_illegal] = "Illegal" # assign char "Illegal"
data$lim_THC[data$STFIPS %in% THC_legal] = "legal" # assign char "Illegal"

# Make new variable a factor
data$lim_THC <- as.factor(data$lim_THC)

Now that we have our states coded, let’s sort the number of patients according to their legislation and show the variation of states legislation.

One of the issues was identifying the correct independent and dependent variables with such an extensive data set. A variable of interest to us in this process was NOPRIOR. NOPRIOR indicated the number of previous times a patient had been to rehab. We concluded that this could essentially be split into two groups—patients in recovery for the first time and patients in their second attempt.

readmitteddata <- data %>%
  filter(NOPRIOR >= 0) %>%
  mutate(readmitted=ifelse(NOPRIOR > 1, "TRUE", "FALSE"))
readmitteddata$readmitted <- as.factor(readmitteddata$readmitted)

# Histogram of readmittance cases by state
Readmitted_Hist <- ggplot(readmitteddata, aes(readmitted, fill = readmitted)) + geom_bar() + labs(x = "",  y = "Number of Rehab Patients",title ="Readmitted or Not")
Readmitted_Hist

1,045,967 people were being admitted for their first time, and 674,591 people returning. Now we moved on to our math calculations. For this set, we chose to use Naive Bayes techniques and identify conditional probability. Conditional probability can be interpreted as the probability of event A given event B. For our study, this is usually the probability of an event happening given cannabis is illegal or recreationally legal.

# Start constructing Naive Bayes 

# select the data columns you want and put into smaller data set NaiveData 
# or its gonna slow everything down

NaiveData <- readmitteddata %>%
  select(readmitted, Cannabis_Status)

# Split into training and test set
set.seed(1)
dt = sort(sample(nrow(NaiveData), nrow(NaiveData)*.8))
training<-NaiveData[dt,]
testing<-NaiveData[-dt,]


Bayes1 <- naiveBayes(readmitted ~ Cannabis_Status, data = training)
Bayes1
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     FALSE      TRUE 
## 0.6080536 0.3919464 
## 
## Conditional probabilities:
##        Cannabis_Status
## Y          Illegal    Medical Medical_lim_THC Recreational
##   FALSE 0.05437103 0.41925293      0.17235854   0.35401749
##   TRUE  0.16501790 0.45592436      0.11093564   0.26812211
# 

In this, we see that rehab patients in states where cannabis is illegal are x3 more likely to be returning patients than first-time patients. Let us dig deeper. In this Naive Bayes, we calculate lim_THC probability with ARRESTS, METHUSE and, PSYPROB. Note that ARRESTS is arrests within the previous thirty days, METHUSE is if opioid drugs are part of the patient’s plan, and PSYPROB is the presence of mental or substance use disorders.

# Start constructing Naive Bayes 
#
# select the data columns you want and put into smaller data set NaiveData 
# or its gonna slow everything down

# remember to use readmitteddata if you want the readmitted variable, just data if not
NaiveData <- data %>% # Use the assignment operator into new DS and pipe with training data,
  select(lim_THC, ARRESTS, METHUSE, PSYPROB) %>% # Select all variables you want
  filter(ARRESTS >= 0,
         METHUSE >= 0,
         PSYPROB >= 0)
# Make numerics into factors
NaiveData$ARRESTS <- as.factor(NaiveData$ARRESTS) # # Any sets you just filtered need to be turned into factors
NaiveData$METHUSE <- as.factor(NaiveData$METHUSE)
NaiveData$PSYPROB <- as.factor(NaiveData$PSYPROB)

# From here everything else is standard
# Split into training and test set
set.seed(1)
dt = sort(sample(nrow(NaiveData), nrow(NaiveData)*.8))
training<-NaiveData[dt,]
testing<-NaiveData[-dt,]


Naive2 <- naiveBayes(lim_THC ~., data = training)
Naive2  
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   Illegal     legal 
## 0.2309957 0.7690043 
## 
## Conditional probabilities:
##          ARRESTS
## Y                   0           1           2
##   Illegal 0.930830631 0.062218695 0.006950674
##   legal   0.913262133 0.070204249 0.016533618
## 
##          METHUSE
## Y                 1         2
##   Illegal 0.1537506 0.8462494
##   legal   0.1948012 0.8051988
## 
##          PSYPROB
## Y                 1         2
##   Illegal 0.4090048 0.5909952
##   legal   0.4236565 0.5763435

There are two key takeaways from this. First, there was a 25% probability difference of METHUSE in states with illegal THC vs legal. States with illegal THC had noticeably higher rates for needing opioid medications. Next, we saw that PSYPROB probability was very close. This means that regardless of THC’s legality, patients had similar chances for entering rehab with some sort of disorder. Next, we wanted to look at healthcare. Part of the revenue gained from not enforcing cannabis legislation or the tax revenue from its sales is supposed to go towards funding rehabilitation centers.

# remember to use readmitteddata if you want the readmitted variable, just data if not
NaiveData <- data %>% # Use the assignment operator into new DS and pipe with training data,
  select(PRIMPAY,Cannabis_Status) %>% # Select all variables you want
  filter(PRIMPAY >= 0) # Filter any and all variables with -9s to be >= 0
NaiveData$PRIMPAY <- as.factor(NaiveData$PRIMPAY) # # Any sets you just filtered need to be turned into factors

# From here everything else is standard
# Split into training and test set
set.seed(1)
dt = sort(sample(nrow(NaiveData), nrow(NaiveData)*.8))
training<-NaiveData[dt,]
testing<-NaiveData[-dt,]


Naive2 <- naiveBayes(Cannabis_Status ~., data = training)
Naive2  
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         Illegal         Medical Medical_lim_THC    Recreational 
##       0.0170158       0.4000960       0.1594044       0.4234838 
## 
## Conditional probabilities:
##                  PRIMPAY
## Y                           1           2           3           4           5
##   Illegal         0.054442246 0.020383590 0.001376107 0.084974628 0.834007053
##   Medical         0.038750786 0.030849928 0.015867558 0.626713682 0.188856131
##   Medical_lim_THC 0.066469584 0.064156002 0.009933714 0.302197903 0.491204715
##   Recreational    0.089895600 0.065266148 0.029042503 0.629663164 0.137354036
##                  PRIMPAY
## Y                           6           7
##   Illegal         0.000602047 0.004214329
##   Medical         0.051059300 0.047902615
##   Medical_lim_THC 0.003644810 0.062393272
##   Recreational    0.006064921 0.042713629

In our probability calculations, we found that in states where THC is legal, Medicaid is the most likely form of paying for rehab. Meanwhile, in states where THC is illegal, ‘Other government payments’ were the most likely form of paying. Therefore we could not find any conclusive evidence that cannabis revenues funded more rehabilitation. Many patients admitted to rehab had more than one drug of abuse. The data also included the route they administered this drug. We investigated this data to identify any potential effects.

NaiveData <- data %>% # Use the assignment operator into new DS and pipe with training data,
  select(ROUTE1,ROUTE2, ROUTE3, Cannabis_Status) %>% # Select all variables you want
  filter(ROUTE1 >= 0,
         ROUTE2 >= 0,
         ROUTE3 >=0) # Filter any and all variables with -9s to be >= 0
NaiveData$ROUTE1 <- as.factor(NaiveData$ROUTE1) # # Any sets you just filtered need to be turned into factors
NaiveData$ROUTE2 <- as.factor(NaiveData$ROUTE2) # # Any sets you just filtered need to be turned into factors
NaiveData$ROUTE3 <- as.factor(NaiveData$ROUTE3) # # Any sets you just filtered need to be turned into factors

# From here everything else is standard
# Split into training and test set
set.seed(1)
dt = sort(sample(nrow(NaiveData), nrow(NaiveData)*.8))
training<-NaiveData[dt,]
testing<-NaiveData[-dt,]


Naive2 <- naiveBayes(Cannabis_Status ~., data = training)
Naive2  
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         Illegal         Medical Medical_lim_THC    Recreational 
##      0.09955849      0.50998340      0.18935127      0.20110685 
## 
## Conditional probabilities:
##                  ROUTE1
## Y                           1           2           3           4           5
##   Illegal         0.296696163 0.151559671 0.137039188 0.408794287 0.005910690
##   Medical         0.317496949 0.192090923 0.181525922 0.301282607 0.007603599
##   Medical_lim_THC 0.285070149 0.282034239 0.139220723 0.287782708 0.005892180
##   Recreational    0.304653012 0.260220219 0.144055613 0.289024576 0.002046581
## 
##                  ROUTE2
## Y                           1           2           3           4           5
##   Illegal         0.341795073 0.329085380 0.136936691 0.186921316 0.005261540
##   Medical         0.296633740 0.359803640 0.178497822 0.154606514 0.010458284
##   Medical_lim_THC 0.315501105 0.382560583 0.142903336 0.152029030 0.007005946
##   Recreational    0.275273582 0.446729699 0.132046750 0.143463627 0.002486342
## 
##                  ROUTE3
## Y                           1           2           3           4           5
##   Illegal         0.397519560 0.361474598 0.125046978 0.110492330 0.005466535
##   Medical         0.344836556 0.416950690 0.144688486 0.082832541 0.010691727
##   Medical_lim_THC 0.354626619 0.422602260 0.130059101 0.085742002 0.006970018
##   Recreational    0.320602811 0.475567884 0.124807604 0.075638922 0.003382778

The exciting takeaway here was in the admission route for the secondary and tertiary drug route of admission. While rates were relatively similar for primary drug admission, we saw a noticeable increase in smoking secondary and tertiary drugs in states where cannabis was recreationally legal. While I would not describe this as a gateway effect, we see that recreational states had drastically higher rates of smoking other drugs of abuse than others. This dataset included a series of variables called drug flags. Drug flags were binary variables that indicated whether that substance was present at the time of the patient’s admission. Here we calculated the probability of the drug flags based on the legality of cannabis.

NaiveData <- data %>% # Use the assignment operator into new DS and pipe with training data,
  select(COKEFLG,MARFLG, HERFLG, MTHAMFLG, Cannabis_Status) %>% # Select all variables you want
  filter(COKEFLG >= 0,
         MARFLG >= 0,
         HERFLG >=0,
         MTHAMFLG >= 0) # Filter any and all variables with -9s to be >= 0
NaiveData$COKEFLG <- as.factor(NaiveData$COKEFLG) # # Any sets you just filtered need to be turned into factors
NaiveData$MARFLG <- as.factor(NaiveData$MARFLG) # # Any sets you just filtered need to be turned into factors
NaiveData$HERFLG <- as.factor(NaiveData$MTHAMFLG) # # Any sets you just filtered need to be turned into factors
NaiveData$MTHAMFLG <- as.factor(NaiveData$MTHAMFLG)
# From here everything else is standard
# Split into training and test set
set.seed(1)
dt = sort(sample(nrow(NaiveData), nrow(NaiveData)*.8))
training<-NaiveData[dt,]
testing<-NaiveData[-dt,]


Naive2 <- naiveBayes(Cannabis_Status ~., data = training)
Naive2  
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         Illegal         Medical Medical_lim_THC    Recreational 
##      0.09058841      0.44883146      0.15840089      0.30217924 
## 
## Conditional probabilities:
##                  COKEFLG
## Y                         0         1
##   Illegal         0.7403008 0.2596992
##   Medical         0.7501399 0.2498601
##   Medical_lim_THC 0.8334917 0.1665083
##   Recreational    0.8839783 0.1160217
## 
##                  MARFLG
## Y                         0         1
##   Illegal         0.7854923 0.2145077
##   Medical         0.6906322 0.3093678
##   Medical_lim_THC 0.6154204 0.3845796
##   Recreational    0.7474525 0.2525475
## 
##                  HERFLG
## Y                          0          1
##   Illegal         0.92195765 0.07804235
##   Medical         0.89104945 0.10895055
##   Medical_lim_THC 0.77551137 0.22448863
##   Recreational    0.75740964 0.24259036
## 
##                  MTHAMFLG
## Y                          0          1
##   Illegal         0.92195765 0.07804235
##   Medical         0.89104945 0.10895055
##   Medical_lim_THC 0.77551137 0.22448863
##   Recreational    0.75740964 0.24259036

Here we might have our biggest takeaway. The MARFLG (Marijuana Flag) had nearly identical probabilities of being raised whether cannabis was recreationally legal or illegal. This is incredible, in our opinion. States with illegal cannabis see a similar likelihood of patients entering treatment with cannabis in their systems as states where it is recreationally legal. Our final investigation will be building a predictive model. For our model, we predicted readmitted by Cannabis_Status and SERVICES.

NaiveData <- readmitteddata %>% # Use the assignment operator into new DS and pipe with training data,
  select(readmitted, Cannabis_Status, SERVICES) %>% # Select all variables you want
  filter(SERVICES >= 0) # Filter any and all variables with -9s to be >= 0
NaiveData$SERVICES <- as.factor(NaiveData$SERVICES) # # Any sets you just filtered need to be turned into factors
# From here everything else is standard
# Split into training and test set
set.seed(1)
dt = sort(sample(nrow(NaiveData), nrow(NaiveData)*.8))
training<-NaiveData[dt,]
testing<-NaiveData[-dt,]


Naive2 <- naiveBayes(readmitted ~ Cannabis_Status + SERVICES, data = training)
# Predict outcomes
predictions2 <- predict(Naive2, training)

# Confusion Matrix - train data
(confusion2 <- table(predictions2, training$readmitted))
##             
## predictions2  FALSE   TRUE
##        FALSE 715461 364243
##        TRUE  121492 175250
# Check misclassification percent
1 - sum(diag(confusion2)) / sum(confusion2)
## [1] 0.3528907
# Now we repeat for our test data
# Predict outcomes
predictions2 <- predict(Naive2, testing)

# Confusion Matrix - train data
(confusion2 <- table(predictions2, testing$readmitted))
##             
## predictions2  FALSE   TRUE
##        FALSE 178526  91020
##        TRUE   30488  44078
# Check misclassification percent
(1 - sum(diag(confusion2)) / sum(confusion2))
## [1] 0.353106

Our prediction model predicted about 2/3 outcomes (outcomes being whether or not that patient was readmitted or not). This is underwhelming and really should be considered just okay. Overall, this study went really well, but more work should be placed on predictive modelling for drug use. Thanks for reading!