Data 622 HW #4 LATE

Part #0: data exploration and preparation

The data set for this HW is of unknown origin. However, we are told that it is “from a real-life research project.”

# Load data
library(tidyverse)
cols_adhd <- cols(
  .default = col_integer(),
  initial = col_character())
adhd <- readr::read_csv('data/ADHD_data.csv', col_types = cols_adhd) %>% 
  select(-initial)

We have a data dictionary, although many fields are missing context. I’ve summarized what we know and what we can assume:

Group	Type	Notes
`age`	Continuous	18 to 69, no NAs
`sex`	Binary	1 = male, 2 = female, no NAs
`race`	Categorical	1 = white, 2 = black, 3 = hispanic, 6 = missing
ADHD self report: `adhd_1` through `adhd_18`, `adhd_total`	Ordered discrete	Each of the questions has the same scale, but we do not know the questions or what a high or low value means. However, it is likely that higher scores mean a higher likelihood that the individual would be diagnosed with ADHD. The total matches the sum of the 18 questions, except for rows 39, 43, and 114
Mood disorder questionnaire: `mood_q1a` through `mood_q1m`, `mood_q2`	Ordered binary	1 = yes, 0 = no. Again, we do not know the questions
`mood_q3`	Ordered discrete	0 = no problem, 1 = minor, 2 = moderate, 3 = serious
`mood_total`	Continuous	Except for row 107 `mood_total` is the sum of the other mood questions. Question 3 has a different scale than the others, but the researchers still used a straight sum for the total field, so Q3 has a higher weight on the total. Because of the “direction” of the labels for Q3, we can assume that higher values mean higher likelihood the subject would be diagnosed with a mood disorder like depression
`subs_alcohol`, `subs_thc`, etc.	Ordered discrete	0 = no use, 1 = use, 2 = abuse, 3 = dependence. Four observations have NAs (53, 73, 106, 162) and they have NAs in all fields
`court_order`	Ordered binary	1 = yes, 0 = no. We are not told what this means
`education`	Continuous	Grade level achieved. 9 NAs
`violence_history`	Ordered binary	1 = yes, 0 = no. We do not know what kind of violence this is. 5 NAs
`disorderly_conduct`	Ordered binary	1 = yes, 0 = no. 11 NAs
`suicide_attempt`	Ordered binary	1 = yes, 0 = no. 13 NAs
`abuse_history`	Partially ordered discrete	0 = No, 1 = Physical, 2 = Sexual, 3 = Emotional, 4 = P&S, 5 = P&E, 6 = S&E, 7 = P&S&E. 14 NAs
`non_substance_related`	Ordered discrete	“Non-substance-related Dx.” Dx is a common abbreviation for diagnosis. 0 = none, 1 = one, 2 = More than one. We do not know what kind of diagnoses these are. They could be mood disorders, hospitals admissions, etc. 22 NAs
`substance_related`	Ordered discrete	“Substance-related Dx.” 0 = none, 1 = one, 2 = two, 3 = three or more. 23 NAs. Record 175 is NA but not NA for `non_substance_related`. All others with NA here are NA in both fields
`psych_meds`	Ordered discrete	118 NAs or 67% of the data set. 0 = none, 1 = one psychotropic med, 2 = more than one psychotropic med
Removed: `initial`	Character	This column is an identifier and won’t help in analysis. It was dropped on load

Create dummy variables for categorical data

k-means, SVM, and PCA require variables that are numeric where a distance measure can be defined, e.g., Euclidean distance. This means that discrete variables need to be ordered or binary.

I need to convert race and abuse_history.

There is one Hispanic individual and six unknown. I’ll convert this to a binary variable: 1 being “white” and 2 being “non-white”

table(adhd$race, useNA = "always")
## 
##    1    2    3    6 <NA> 
##   72  100    1    2    0

The abuse history column’s encoding is trying to show the different possible combinations of physical, sexual, and emotional abuse. One could argue that there is no way to encode the combinations in an ordered way. For example, which is more severe, “physical and emotional” or “physical and sexual?” Instead, we can use three indicator variables for this information.

table(adhd$abuse_history, useNA = "always")
## 
##    0    1    2    3    4    5    6    7 <NA> 
##  101    8   20    4    6   10    4    8   14

Drop columns where meaning is unclear

We don’t need to know the meaning of our data to use these unsupervised methods, however, using columns where the meaning is unclear doesn’t sit well with me. The column court_order is not defined sufficiently for me to want to use it. I will drop this column below.

Deal with missing values

The methods we’ll use don’t work with missing values. A few columns have small numbers of NAs. However, psych_meds is NA 67% of the time. We should treat this separately.

What columns have missing values, and how many in each?

# Count of NAs by column
column_NAs <- adhd %>% 
  is.na() %>% 
  colSums() %>% 
  unname() %>% 
  data.frame(
    column = names(adhd), 
    NA_count = .) %>% 
  filter(NA_count > 0) %>% 
  arrange(NA_count)
print(column_NAs)
##                   column NA_count
## 1           subs_alcohol        4
## 2               subs_thc        4
## 3           subs_cocaine        4
## 4        subs_stimulants        4
## 5         subs_sedatives        4
## 6           subs_opioids        4
## 7            court_order        5
## 8              education        9
## 9       violence_history       11
## 10    disorderly_conduct       11
## 11       suicide_attempt       13
## 12         abuse_history       14
## 13 non_substance_related       22
## 14     substance_related       23
## 15            psych_meds      118

What rows have missing values? As we go down the list of columns with missing values, how many rows do we add to the set with NAs?

column_NAs$new_row_NAs <- NA_integer_
row_NAs <- 0L
for (i in 1:nrow(column_NAs)) {
  nom <- column_NAs$column[i]
  col <- adhd[[nom]]
  row_NAs_col <- ifelse(is.na(col), 1L, 0L)
  row_NAs <- row_NAs + row_NAs_col
  column_NAs$new_row_NAs[i] <- sum(row_NAs_col == 1 & row_NAs == 1)
}
column_NAs$cumulative_NAs <- cumsum(column_NAs$new_row_NAs)
print(column_NAs)
##                   column NA_count new_row_NAs cumulative_NAs
## 1           subs_alcohol        4           4              4
## 2               subs_thc        4           0              4
## 3           subs_cocaine        4           0              4
## 4        subs_stimulants        4           0              4
## 5         subs_sedatives        4           0              4
## 6           subs_opioids        4           0              4
## 7            court_order        5           1              5
## 8              education        9           6             11
## 9       violence_history       11           1             12
## 10    disorderly_conduct       11           0             12
## 11       suicide_attempt       13           2             14
## 12         abuse_history       14           2             16
## 13 non_substance_related       22          16             32
## 14     substance_related       23           1             33
## 15            psych_meds      118          87            120

If we treat psych_meds separately, there are 23 rows with NAs, or 13% of our observations. One strategy could be to simply drop these rows. There are several strategies we could investigate:

drop rows with NAs
impute using a measure of central tendency, e.g., mean or mode
search for correlations with other variables and predict missing values
add indicator variables for rows with missing data
random assignment

The problem with #3 is that the rows that have NAs tend to have NAs in the other columns that we might want to use as predictors. For example, disorderly_conduct and violence_history overlap in their NAs perfectly! violence_history and abuse_history overlap near perfectly.

For #4, we still can’t leave the NAs, so we have to combine it with #2 or #3. Also, I don’t want to have a dozen indicator variables – one for each column with missing data. However, since there is so much overlap in the observations that have NAs, it would be good to include an indicator has_NAs which is 1 if an observation has an NA in any column.

I’m going to follow this strategy for NAs:

create an indicator variable has_NAs and set it to 1 for all observations that have NA in any column except psych_meds
create a separate indicator for psych_meds
replace NAs in psych_meds with 0. About 20% of our data has non-zero, non-missing values, i.e., they are taking at least one medication. About 20% of the population takes medication for mental disorders. It stands to reason that our missing values will tend to be zero
replace NAs in other columns with the mode of that column

Prepare the data

temp_has_NAs <- adhd %>% 
  select(-psych_meds) %>% 
  is.na %>% 
  rowSums %>% `>`(1) %>% `*`(1)
           
adhd2 <- adhd %>% 
  mutate(
    abuse_physical      = ifelse(abuse_history %in% c(1, 4, 5, 7), 1, 0),
    abuse_sexual        = ifelse(abuse_history %in% c(2, 4, 6, 7), 1, 0),
    abuse_emotional     = ifelse(abuse_history %in% c(3, 5, 6, 7), 1, 0),
    race_white          = ifelse(race == 1, 1, 2),
    psych_meds_NAs      = ifelse(is.na(psych_meds), 1, 0),
    psych_meds          = coalesce(psych_meds, 0),
    has_NAs             = temp_has_NAs,
    subs_alcohol        = coalesce(subs_alcohol, 0),
    subs_thc            = coalesce(subs_thc, 0),
    subs_cocaine        = coalesce(subs_cocaine, 0),
    subs_stimulants     = coalesce(subs_stimulants, 0),
    subs_sedatives      = coalesce(subs_sedatives, 0),
    subs_opioids        = coalesce(subs_opioids, 0),
    court_order         = coalesce(court_order, 0),
    education           = coalesce(education, 12),
    violence_history    = coalesce(violence_history, 0),
    disorderly_conduct  = coalesce(disorderly_conduct, 1),
    suicide_attempt     = coalesce(suicide_attempt, 0),
    suicide_attempt_f   = factor(ifelse(suicide_attempt == 0, 'No', 'Yes'))  ) %>% 
  select(
    -abuse_history, 
    -race, 
    -substance_related, 
    -non_substance_related)

Other EDA

Please see the appendix for a more in-depth survey of our data.

Part #1: Clustering

I used the tutorial at https://uc-r.github.io/kmeans_clustering to help me with this problem.

library(cluster)
library(factoextra)
library(gridExtra)
library(NbClust)

adhd_scaled <- adhd2 %>% 
  select(-suicide_attempt_f) %>% 
  scale %>% 
  as_tibble

k2 <- kmeans(adhd_scaled, centers = 2, nstart = 25)
k3 <- kmeans(adhd_scaled, centers = 3, nstart = 25)
k4 <- kmeans(adhd_scaled, centers = 4, nstart = 25)
k5 <- kmeans(adhd_scaled, centers = 5, nstart = 25)

p1 <- fviz_cluster(k2, geom = "point",  data = adhd_scaled) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = adhd_scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = adhd_scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = adhd_scaled) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, p4, nrow = 2)

What is the optimal number of clusters? The University of Cincinnati tutorial suggests these methods:

Elbow method
Silhouette method
Gap statistic
NbClust package

set.seed(123)

fviz_nbclust(adhd_scaled, kmeans, method = "wss")

fviz_nbclust(adhd_scaled, kmeans, method = "silhouette")

gap_stat <- clusGap(
  adhd_scaled, 
  FUN = kmeans, 
  nstart = 25,
  K.max = 10, 
  B = 50)
fviz_gap_stat(gap_stat)


set.seed(342)
nbc <- NbClust(adhd_scaled, method = 'kmeans')

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 13 proposed 2 as the best number of clusters 
## * 8 proposed 3 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

The NbClust package runs 30 different methods for picking k. Thirteen of them said k = 2 is best.

adhd2$k2 <- k2$cluster
adhd_tall <- adhd2 %>% 
  mutate(ID = 1:nrow(adhd2)) %>% 
  select(
    ID,
    age,
    sex,
    adhd_total,
    mood_total,
    subs_alcohol,
    subs_thc,
    subs_cocaine,
    subs_stimulants,
    subs_sedatives,
    subs_opioids,
    court_order,
    education,
    violence_history,
    disorderly_conduct,
    suicide_attempt,
    psych_meds,
    abuse_physical,
    abuse_sexual,
    abuse_emotional,
    race_white,
    psych_meds_NAs,
    has_NAs,
    k2) %>% 
  pivot_longer(
    cols = -c('k2', 'ID'), 
    names_to = 'field', 
    values_to = 'value')

ggplot(data = adhd_tall) +
  aes(x = value, y = k2) +
  geom_jitter() +
  facet_wrap(~field, scales = 'free')

Two things pop out in this last plot: the adhd_total score correlates most highly with the cluster assignment when k = 2 and nearly all users of stimulants fall into the first cluster.

ggplot(adhd2) +
  aes(x = adhd_total, group = k2, fill = k2) + 
  geom_density(alpha = 0.5)

table(adhd2$k2, adhd2$subs_stimulants)
##    
##      0  1  3
##   1 81  0  1
##   2 83  6  4

This simple partitioning could be defined as “low adhd scores” and “high adhd scores or stimulant users.”

Part #2: Principal components analysis

pca1 <- prcomp(adhd_scaled)

# Add PCs for the SVM part below
# I'm going to use them as the two dimensions for plot.svm
adhd2$PC1 <- pca1$x[, 1]
adhd2$PC2 <- pca1$x[, 2]

adhd2 %>% 
  ggplot(., aes(PC1, PC2)) + 
    geom_text(aes(label = adhd_total), size = 3) +
    geom_point(
      aes(size = adhd_total, color = k2), alpha = 0.2)

We know from the k-means part that adhd_total is partitioned well by the first two principal components.

I like to use PCA to summarize columns that are thematically related. For example, what if we only look at the adhd_* columns? Do we see anything interesting?

pca2 <- adhd_scaled %>% 
  select(starts_with('adhd')) %>% 
  prcomp()

plot(summary(pca2)$importance[3, ])

It appears to take quite a few PCs to explain 80% plus of the variation. What if we left out the total?

pca3 <- adhd_scaled %>% 
  select(starts_with('adhd')) %>% 
  select(-adhd_total) %>% 
  prcomp()

plot(summary(pca3)$importance[3, ])

It looks about the same. It does not appear that we can roll the adhd_* questions up any better than the adhd_total column already does!

Part #3: Support vector machine

First, let’s split our data between training and test.

set.seed(3)
L_train <- sample(
  c(TRUE, FALSE), 
  size = nrow(adhd2),
  replace = TRUE,
  prob = c(0.7, 0.3))
train <- adhd2[L_train, ]
test <- adhd2[!L_train, ]
dim(train)
## [1] 122  59
dim(test)
## [1] 53 59
table(L_train, adhd2$k2)
##        
## L_train  1  2
##   FALSE 24 29
##   TRUE  58 64

What variables should we include? It makes sense that we don’t need to include the individual questions, and can just use adhd_total and mood_total for the questionnaires.

library(randomForest)

rf1 <- train %>% 
  select(
    -matches('adhd_[^t]'),
    -matches('mood_[^t]'),
    -k2, -PC1, -PC2, -suicide_attempt_f) %>% 
  randomForest(
    factor(suicide_attempt) ~ ., 
    data = .,
    importance = TRUE)

varImpPlot(rf1)

This quick random forest variable importance plot reinforces what I would have guessed are the important variables in this exercise. Leaving off the individual questions is OK, in my opinion.

There doesn’t seem to be an obvious separation when using PCs. I’ll try mood_total and adhd_total too.

ggplot(train) +
  aes(
    x = PC1, y = PC2, 
    color = suicide_attempt_f, shape = suicide_attempt_f,
    size = age) +
  geom_point(alpha = 0.75)


ggplot(train) +
  aes(
    x = adhd_total, y = mood_total, 
    color = suicide_attempt_f, shape = suicide_attempt_f) +
  geom_jitter(size = 3, alpha = 0.75)

SVMs work in hyperplanes, and it may be impossible for us to see what the correct separation is, or whether it is working.
Using the accuracy of classification on the test set will be the only way to judge between different kernels.

Here is a simple example using a linear kernel with default hyperparameters. There is overfitting and we will want to add a penalty for misclassification.

library(kernlab)
library(e1071)
library(caret)

form <- formula(
  suicide_attempt_f ~ 
    age + sex + adhd_total + mood_total + 
    subs_alcohol + subs_thc + subs_cocaine + 
    subs_stimulants + subs_sedatives + subs_opioids +
    court_order + education + violence_history + 
    disorderly_conduct + psych_meds + abuse_physical +
    abuse_sexual + abuse_emotional + race_white +
    psych_meds_NAs + has_NAs)

svm1 <- svm(formula = form, data = train, kernel = "linear")

pred1_train <- predict(svm1, train)
pred1_test <- predict(svm1, test)

confusionMatrix(pred1_train, train$suicide_attempt_f)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  81  13
##        Yes  6  22
##                                           
##                Accuracy : 0.8443          
##                  95% CI : (0.7675, 0.9036)
##     No Information Rate : 0.7131          
##     P-Value [Acc > NIR] : 0.0005407       
##                                           
##                   Kappa : 0.5952          
##                                           
##  Mcnemar's Test P-Value : 0.1686686       
##                                           
##             Sensitivity : 0.9310          
##             Specificity : 0.6286          
##          Pos Pred Value : 0.8617          
##          Neg Pred Value : 0.7857          
##              Prevalence : 0.7131          
##          Detection Rate : 0.6639          
##    Detection Prevalence : 0.7705          
##       Balanced Accuracy : 0.7798          
##                                           
##        'Positive' Class : No              
## 
confusionMatrix(pred1_test, test$suicide_attempt_f)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  34  10
##        Yes  5   4
##                                           
##                Accuracy : 0.717           
##                  95% CI : (0.5765, 0.8321)
##     No Information Rate : 0.7358          
##     P-Value [Acc > NIR] : 0.6865          
##                                           
##                   Kappa : 0.1779          
##                                           
##  Mcnemar's Test P-Value : 0.3017          
##                                           
##             Sensitivity : 0.8718          
##             Specificity : 0.2857          
##          Pos Pred Value : 0.7727          
##          Neg Pred Value : 0.4444          
##              Prevalence : 0.7358          
##          Detection Rate : 0.6415          
##    Detection Prevalence : 0.8302          
##       Balanced Accuracy : 0.5788          
##                                           
##        'Positive' Class : No              
##

The built in tune function doesn’t do a great job out of the box. Why? Because we want to get the model that does the best job of predicting “yes” cases. We do not want to think someone is not at risk of suicide when they actually are. As such, we will need to define our own error function. Because confusionMatrix uses the alphabetical ordering of factor levels, “No” is treated as the positive class. Therefore, we want the model that maximizes specificity.

set.seed(0)
tune1 <- tune(
  svm, 
  form, 
  data = train, 
  kernel = "linear",
  error.fun = function(r, p) {1 - specificity(r, p)},
  class.weights = 'inverse',
  ranges = list(
    cost = c(0.1, 1, 10, 100, 1000) ))

confusionMatrix(
  predict(tune1$best.model, test), 
  test$suicide_attempt_f)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  29   4
##        Yes 10  10
##                                           
##                Accuracy : 0.7358          
##                  95% CI : (0.5967, 0.8474)
##     No Information Rate : 0.7358          
##     P-Value [Acc > NIR] : 0.5713          
##                                           
##                   Kappa : 0.4026          
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.7436          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.8788          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.7358          
##          Detection Rate : 0.5472          
##    Detection Prevalence : 0.6226          
##       Balanced Accuracy : 0.7289          
##                                           
##        'Positive' Class : No              
##

The linear model does OK, especially when we weight “Yes” higher than “No.” What about using a different kernel?

set.seed(1)
tune2 <- tune(
  svm, 
  form, 
  data = train, 
  error.fun = function(r, p) {1 - specificity(r, p)},
  class.weights = 'inverse',
  cost = 1,
  ranges = list(
    kernel = c('linear', 'radial', 'polynomial', 'sigmoid'),
    gamma = c(0.5, 1, 2, 3, 4)  
  ))

confusionMatrix(
  predict(tune2$best.model, test), 
  test$suicide_attempt_f)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  29   4
##        Yes 10  10
##                                           
##                Accuracy : 0.7358          
##                  95% CI : (0.5967, 0.8474)
##     No Information Rate : 0.7358          
##     P-Value [Acc > NIR] : 0.5713          
##                                           
##                   Kappa : 0.4026          
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.7436          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.8788          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.7358          
##          Detection Rate : 0.5472          
##    Detection Prevalence : 0.6226          
##       Balanced Accuracy : 0.7289          
##                                           
##        'Positive' Class : No              
##

The best performing kernel is linear with cost of 1.

Appendix: explore the domain of each variable

# Prints `table` function for each variable
for (nom in names(adhd)) {
  col <- adhd[[nom]]
  dcol <- unique(col)
  message(paste0(nom, ': '))
  print(table(col, useNA = "always"))
  message('\n')
}
## age:
## col
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
##    1    1    3    1    6    5    5    5    8    4    3    2    4    2    1    7 
##   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49 
##    6    1    4    1    3    3    6    4    7    5   12    6    5    6    6    6 
##   50   51   52   53   54   55   56   57   61   69 <NA> 
##    5    9    2    6    3    1    3    3    3    1    0
## 
## sex:
## col
##    1    2 <NA> 
##   99   76    0
## 
## race:
## col
##    1    2    3    6 <NA> 
##   72  100    1    2    0
## 
## adhd_q1:
## col
##    0    1    2    3    4 <NA> 
##   39   43   44   30   19    0
## 
## adhd_q2:
## col
##    0    1    2    3    4 <NA> 
##   25   46   47   33   24    0
## 
## adhd_q3:
## col
##    0    1    2    3    4 <NA> 
##   26   46   46   32   25    0
## 
## adhd_q4:
## col
##    0    1    2    3    4 <NA> 
##   27   31   50   31   36    0
## 
## adhd_q5:
## col
##    0    1    2    3    4    5 <NA> 
##   33   21   32   47   41    1    0
## 
## adhd_q6:
## col
##    0    1    2    3    4 <NA> 
##   36   29   45   45   20    0
## 
## adhd_q7:
## col
##    0    1    2    3    4 <NA> 
##   22   53   54   25   21    0
## 
## adhd_q8:
## col
##    0    1    2    3    4 <NA> 
##   21   40   40   42   32    0
## 
## adhd_q9:
## col
##    0    1    2    3    4 <NA> 
##   31   43   36   41   24    0
## 
## adhd_q10:
## col
##    0    1    2    3    4 <NA> 
##   15   46   49   33   32    0
## 
## adhd_q11:
## col
##    0    1    2    3    4 <NA> 
##   16   33   48   43   35    0
## 
## adhd_q12:
## col
##    0    1    2    3    4 <NA> 
##   55   55   37   15   13    0
## 
## adhd_q13:
## col
##    0    1    2    3    4 <NA> 
##   15   29   46   47   38    0
## 
## adhd_q14:
## col
##    0    1    2    3    4 <NA> 
##   27   24   40   47   37    0
## 
## adhd_q15:
## col
##    0    1    2    3    4 <NA> 
##   50   39   35   27   24    0
## 
## adhd_q16:
## col
##    0    1    2    3    4 <NA> 
##   40   49   39   17   30    0
## 
## adhd_q17:
## col
##    0    1    2    3    4 <NA> 
##   49   41   46   22   17    0
## 
## adhd_q18:
## col
##    0    1    2    3    4 <NA> 
##   49   52   35   20   19    0
## 
## adhd_total:
## col
##    0    1    3    5    6    7    8    9   10   11   12   13   14   16   17   18 
##    1    2    1    1    3    2    1    2    2    1    4    1    4    1    8    1 
##   19   20   21   23   24   25   26   27   28   29   30   31   32   33   34   35 
##    5    3    3    1    6    4    1    2    6    2    3    7    7    3    1    3 
##   36   37   38   39   40   41   42   43   44   45   46   47   48   49   50   51 
##    3    2    3    3    6    3    5    3    2    3    3    3    6    6    3    2 
##   52   53   54   55   56   57   58   62   63   65   67   69   71   72 <NA> 
##    3    1    3    3    3    2    1    2    1    3    1    1    1    2    0
## 
## mood_q1a:
## col
##    0    1 <NA> 
##   79   96    0
## 
## mood_q1b:
## col
##    0    1 <NA> 
##   75  100    0
## 
## mood_q1c:
## col
##    0    1 <NA> 
##   80   95    0
## 
## mood_q1d:
## col
##    0    1 <NA> 
##   73  102    0
## 
## mood_q1e:
## col
##    0    1 <NA> 
##   78   97    0
## 
## mood_q1f:
## col
##    0    1 <NA> 
##   53  122    0
## 
## mood_q1g:
## col
##    0    1 <NA> 
##   49  126    0
## 
## mood_q1h:
## col
##    0    1 <NA> 
##   77   98    0
## 
## mood_q1i:
## col
##    0    1 <NA> 
##   72  103    0
## 
## mood_q1j:
## col
##    0    1 <NA> 
##  107   68    0
## 
## mood_q1k:
## col
##    0    1 <NA> 
##   90   85    0
## 
## mood_q1l:
## col
##    0    1 <NA> 
##   73  102    0
## 
## mood_q1m:
## col
##    0    1 <NA> 
##   89   86    0
## 
## mood_q2:
## col
##    0    1 <NA> 
##   49  126    0
## 
## mood_q3:
## col
##    0    1    2    3 <NA> 
##   25   25   49   76    0
## 
## mood_total:
## col
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##    9    3    5    6    4    7   10    6    8   12   13   18   12   13   12   14 
##   16   17 <NA> 
##   12   11    0
## 
## subs_alcohol:
## col
##    0    1    2    3 <NA> 
##   80   18    7   66    4
## 
## subs_thc:
## col
##    0    1    2    3 <NA> 
##  116   12    3   40    4
## 
## subs_cocaine:
## col
##    0    1    2    3 <NA> 
##  101    9    5   56    4
## 
## subs_stimulants:
## col
##    0    1    3 <NA> 
##  160    6    5    4
## 
## subs_sedatives:
## col
##    0    1    2    3 <NA> 
##  161    4    1    5    4
## 
## subs_opioids:
## col
##    0    1    3 <NA> 
##  146    4   21    4
## 
## court_order:
## col
##    0    1 <NA> 
##  155   15    5
## 
## education:
## col
##    6    7    8    9   10   11   12   13   14   15   16   17   18   19 <NA> 
##    2    2    5   12   12   23   67   15   14    1    7    2    3    1    9
## 
## violence_history:
## col
##    0    1 <NA> 
##  124   40   11
## 
## disorderly_conduct:
## col
##    0    1 <NA> 
##   45  119   11
## 
## suicide_attempt:
## col
##    0    1 <NA> 
##  113   49   13
## 
## abuse_history:
## col
##    0    1    2    3    4    5    6    7 <NA> 
##  101    8   20    4    6   10    4    8   14
## 
## non_substance_related:
## col
##    0    1    2 <NA> 
##  102   35   16   22
## 
## substance_related:
## col
##    0    1    2    3 <NA> 
##   42   61   35   14   23
## 
## psych_meds:
## col
##    0    1    2 <NA> 
##   19   21   17  118
## 


# adhd totals
s <- adhd %>% 
  select(starts_with('adhd'), -adhd_total) %>% 
  rowSums(., na.rm = TRUE)

which(adhd$adhd_total - s != 0)
## [1]  39  43 114

adhd %>% 
  select(starts_with('adhd')) %>% 
  slice(39, 43, 114) %>% 
  as.data.frame
##   adhd_q1 adhd_q2 adhd_q3 adhd_q4 adhd_q5 adhd_q6 adhd_q7 adhd_q8 adhd_q9
## 1       4       4       2       4       1       0       0       4       2
## 2       0       1       2       2       4       2       1       1       1
## 3       1       0       0       2       1       0       3       1       0
##   adhd_q10 adhd_q11 adhd_q12 adhd_q13 adhd_q14 adhd_q15 adhd_q16 adhd_q17
## 1        4        0        0        3        3        0        4        4
## 2        1        1        0        3        2        1        0        0
## 3        1        1        0        1        0        0        0        0
##   adhd_q18 adhd_total
## 1        0         35
## 2        0         31
## 3        0         10


# mood totals
s <- adhd %>% 
  select(starts_with('mood'), -mood_total) %>% 
  rowSums(., na.rm = TRUE)

which(adhd$mood_total - s != 0)
## [1] 107

adhd %>% 
  select(starts_with('mood')) %>% 
  slice(107) %>% 
  as.data.frame
##   mood_q1a mood_q1b mood_q1c mood_q1d mood_q1e mood_q1f mood_q1g mood_q1h
## 1        1        0        0        1        1        0        1        1
##   mood_q1i mood_q1j mood_q1k mood_q1l mood_q1m mood_q2 mood_q3 mood_total
## 1        0        0        0        0        0       0       1          3


# substance questions NAs
s <- adhd %>% 
  select(starts_with('subs_')) %>% 
  is.na() %>% 
  rowSums()

which(s > 0)
## [1]  53  73 106 162

adhd %>% 
  slice(53, 73, 106, 162) %>% 
  select(starts_with('subs_'))
## # A tibble: 4 x 6
##   subs_alcohol subs_thc subs_cocaine subs_stimulants subs_sedatives subs_opioids
##          <int>    <int>        <int>           <int>          <int>        <int>
## 1           NA       NA           NA              NA             NA           NA
## 2           NA       NA           NA              NA             NA           NA
## 3           NA       NA           NA              NA             NA           NA
## 4           NA       NA           NA              NA             NA           NA


# NAs in Dx questions
s <- adhd %>% 
  select(ends_with('_related')) %>% 
  is.na() %>% 
  rowSums()

which(s == 1)
## [1] 175

adhd %>% 
  slice(175) %>% 
  select(ends_with('_related'))
## # A tibble: 1 x 2
##   non_substance_related substance_related
##                   <int>             <int>
## 1                     1                NA


# Some bi-variate tables to maybe help understand variable definitions
table(adhd$court_order, adhd$subs_alcohol, useNA = 'always')
##       
##         0  1  2  3 <NA>
##   0    73 16  7 59    0
##   1     6  2  0  7    0
##   <NA>  1  0  0  0    4
table(adhd$court_order, adhd$subs_thc, useNA = 'always')
##       
##          0   1   2   3 <NA>
##   0    111   9   3  32    0
##   1      4   3   0   8    0
##   <NA>   1   0   0   0    4
table(adhd$court_order, adhd$subs_cocaine, useNA = 'always')
##       
##         0  1  2  3 <NA>
##   0    94  8  5 48    0
##   1     7  1  0  7    0
##   <NA>  0  0  0  1    4
table(adhd$court_order, adhd$subs_stimulants, useNA = 'always')
##       
##          0   1   3 <NA>
##   0    146   5   4    0
##   1     13   1   1    0
##   <NA>   1   0   0    4
table(adhd$court_order, adhd$subs_sedatives, useNA = 'always')
##       
##          0   1   2   3 <NA>
##   0    145   4   1   5    0
##   1     15   0   0   0    0
##   <NA>   1   0   0   0    4
table(adhd$court_order, adhd$subs_opioids, useNA = 'always')
##       
##          0   1   3 <NA>
##   0    131   4  20    0
##   1     14   0   1    0
##   <NA>   1   0   0    4

table(adhd$court_order, adhd$violence_history, useNA = 'always')
##       
##          0   1 <NA>
##   0    115  35    5
##   1      9   5    1
##   <NA>   0   0    5
table(adhd$court_order, adhd$disorderly_conduct, useNA = 'always')
##       
##          0   1 <NA>
##   0     45 105    5
##   1      0  14    1
##   <NA>   0   0    5
table(adhd$court_order, adhd$non_substance_related, useNA = 'always')
##       
##         0  1  2 <NA>
##   0    91 33 14   17
##   1    10  2  2    1
##   <NA>  1  0  0    4
table(adhd$court_order, adhd$substance_related, useNA = 'always')
##       
##         0  1  2  3 <NA>
##   0    39 54 32 12   18
##   1     3  6  3  2    1
##   <NA>  0  1  0  0    4

table(adhd$disorderly_conduct, adhd$violence_history, useNA = 'always')
##       
##         0  1 <NA>
##   0    45  0    0
##   1    79 40    0
##   <NA>  0  0   11
table(adhd$disorderly_conduct, adhd$non_substance_related, useNA = 'always')
##       
##         0  1  2 <NA>
##   0    20 16  7    2
##   1    77 19  9   14
##   <NA>  5  0  0    6
table(adhd$disorderly_conduct, adhd$substance_related, useNA = 'always')
##       
##         0  1  2  3 <NA>
##   0    21 13  6  2    3
##   1    21 46 26 12   14
##   <NA>  0  2  3  0    6

table(adhd$non_substance_related, adhd$violence_history, useNA = 'always')
##       
##         0  1 <NA>
##   0    73 24    5
##   1    25 10    0
##   2    13  3    0
##   <NA> 13  3    6
table(adhd$non_substance_related, adhd$substance_related, useNA = 'always')
##       
##         0  1  2  3 <NA>
##   0     8 53 29 12    0
##   1    22  6  4  2    1
##   2    12  2  2  0    0
##   <NA>  0  0  0  0   22

table(adhd$substance_related, adhd$violence_history, useNA = 'always')
##       
##         0  1 <NA>
##   0    35  7    0
##   1    42 17    2
##   2    23  9    3
##   3    10  4    0
##   <NA> 14  3    6


par(mfrow = c(1, 3))
hist(adhd$adhd_total)
hist(adhd$mood_total)
hist(adhd$age)

# Significant correlation between NAs
#   Leave off psych_meds since there are so many!
adhd %>% 
  is.na %>% 
  as_tibble %>% 
  group_by(
    subs_alcohol, subs_thc, subs_cocaine, subs_stimulants, 
    subs_sedatives, subs_opioids, court_order, education, 
    violence_history, disorderly_conduct, suicide_attempt, 
    abuse_history, non_substance_related, substance_related) %>% 
  summarize(Count = n()) %>% 
  as.matrix %>% 
  cor %>% 
  as.data.frame(row.names = FALSE) %>% 
  cbind(data.frame(column1 = names(.)), .) %>% 
  pivot_longer(-column1, names_to = 'column2', values_to = 'corr') %>% 
  filter(column1 < column2) %>% 
  filter(round(corr, 1) > 0.75) %>% 
  as.data.frame
## `summarise()` has grouped output by 'subs_alcohol', 'subs_thc', 'subs_cocaine', 'subs_stimulants', 'subs_sedatives', 'subs_opioids', 'court_order', 'education', 'violence_history', 'disorderly_conduct', 'suicide_attempt', 'abuse_history', 'non_substance_related'. You can override using the `.groups` argument.
##                  column1           column2      corr
## 1           subs_alcohol          subs_thc 1.0000000
## 2           subs_alcohol      subs_cocaine 1.0000000
## 3           subs_alcohol   subs_stimulants 1.0000000
## 4           subs_alcohol    subs_sedatives 1.0000000
## 5           subs_alcohol      subs_opioids 1.0000000
## 6           subs_cocaine          subs_thc 1.0000000
## 7           subs_cocaine   subs_stimulants 1.0000000
## 8           subs_cocaine    subs_sedatives 1.0000000
## 9           subs_cocaine      subs_opioids 1.0000000
## 10       subs_stimulants          subs_thc 1.0000000
## 11        subs_sedatives          subs_thc 1.0000000
## 12        subs_sedatives   subs_stimulants 1.0000000
## 13          subs_opioids          subs_thc 1.0000000
## 14          subs_opioids   subs_stimulants 1.0000000
## 15          subs_opioids    subs_sedatives 1.0000000
## 16           court_order      subs_alcohol 0.7784989
## 17           court_order          subs_thc 0.7784989
## 18           court_order      subs_cocaine 0.7784989
## 19           court_order   subs_stimulants 0.7784989
## 20           court_order    subs_sedatives 0.7784989
## 21           court_order      subs_opioids 0.7784989
## 22    disorderly_conduct  violence_history 1.0000000
## 23 non_substance_related substance_related 0.8539126