The data set for this HW is of unknown origin. However, we are told that it is “from a real-life research project.”
# Load data
library(tidyverse)
cols_adhd <- cols(
.default = col_integer(),
initial = col_character())
adhd <- readr::read_csv('data/ADHD_data.csv', col_types = cols_adhd) %>%
select(-initial)
We have a data dictionary, although many fields are missing context. I’ve summarized what we know and what we can assume:
| Group | Type | Notes |
|---|---|---|
age |
Continuous | 18 to 69, no NAs |
sex |
Binary | 1 = male, 2 = female, no NAs |
race |
Categorical | 1 = white, 2 = black, 3 = hispanic, 6 = missing |
ADHD self report: adhd_1 through adhd_18, adhd_total |
Ordered discrete | Each of the questions has the same scale, but we do not know the questions or what a high or low value means. However, it is likely that higher scores mean a higher likelihood that the individual would be diagnosed with ADHD. The total matches the sum of the 18 questions, except for rows 39, 43, and 114 |
Mood disorder questionnaire: mood_q1a through mood_q1m, mood_q2 |
Ordered binary | 1 = yes, 0 = no. Again, we do not know the questions |
mood_q3 |
Ordered discrete | 0 = no problem, 1 = minor, 2 = moderate, 3 = serious |
mood_total |
Continuous | Except for row 107 mood_total is the sum of the other mood questions. Question 3 has a different scale than the others, but the researchers still used a straight sum for the total field, so Q3 has a higher weight on the total. Because of the “direction” of the labels for Q3, we can assume that higher values mean higher likelihood the subject would be diagnosed with a mood disorder like depression |
subs_alcohol, subs_thc, etc. |
Ordered discrete | 0 = no use, 1 = use, 2 = abuse, 3 = dependence. Four observations have NAs (53, 73, 106, 162) and they have NAs in all fields |
court_order |
Ordered binary | 1 = yes, 0 = no. We are not told what this means |
education |
Continuous | Grade level achieved. 9 NAs |
violence_history |
Ordered binary | 1 = yes, 0 = no. We do not know what kind of violence this is. 5 NAs |
disorderly_conduct |
Ordered binary | 1 = yes, 0 = no. 11 NAs |
suicide_attempt |
Ordered binary | 1 = yes, 0 = no. 13 NAs |
abuse_history |
Partially ordered discrete | 0 = No, 1 = Physical, 2 = Sexual, 3 = Emotional, 4 = P&S, 5 = P&E, 6 = S&E, 7 = P&S&E. 14 NAs |
non_substance_related |
Ordered discrete | “Non-substance-related Dx.” Dx is a common abbreviation for diagnosis. 0 = none, 1 = one, 2 = More than one. We do not know what kind of diagnoses these are. They could be mood disorders, hospitals admissions, etc. 22 NAs |
substance_related |
Ordered discrete | “Substance-related Dx.” 0 = none, 1 = one, 2 = two, 3 = three or more. 23 NAs. Record 175 is NA but not NA for non_substance_related. All others with NA here are NA in both fields |
psych_meds |
Ordered discrete | 118 NAs or 67% of the data set. 0 = none, 1 = one psychotropic med, 2 = more than one psychotropic med |
Removed: initial |
Character | This column is an identifier and won’t help in analysis. It was dropped on load |
k-means, SVM, and PCA require variables that are numeric where a distance measure can be defined, e.g., Euclidean distance. This means that discrete variables need to be ordered or binary.
I need to convert race and abuse_history.
There is one Hispanic individual and six unknown. I’ll convert this to a binary variable: 1 being “white” and 2 being “non-white”
table(adhd$race, useNA = "always")
##
## 1 2 3 6 <NA>
## 72 100 1 2 0
The abuse history column’s encoding is trying to show the different possible combinations of physical, sexual, and emotional abuse. One could argue that there is no way to encode the combinations in an ordered way. For example, which is more severe, “physical and emotional” or “physical and sexual?” Instead, we can use three indicator variables for this information.
table(adhd$abuse_history, useNA = "always")
##
## 0 1 2 3 4 5 6 7 <NA>
## 101 8 20 4 6 10 4 8 14
We don’t need to know the meaning of our data to use these unsupervised methods, however, using columns where the meaning is unclear doesn’t sit well with me. The column court_order is not defined sufficiently for me to want to use it. I will drop this column below.
The methods we’ll use don’t work with missing values. A few columns have small numbers of NAs. However, psych_meds is NA 67% of the time. We should treat this separately.
What columns have missing values, and how many in each?
# Count of NAs by column
column_NAs <- adhd %>%
is.na() %>%
colSums() %>%
unname() %>%
data.frame(
column = names(adhd),
NA_count = .) %>%
filter(NA_count > 0) %>%
arrange(NA_count)
print(column_NAs)
## column NA_count
## 1 subs_alcohol 4
## 2 subs_thc 4
## 3 subs_cocaine 4
## 4 subs_stimulants 4
## 5 subs_sedatives 4
## 6 subs_opioids 4
## 7 court_order 5
## 8 education 9
## 9 violence_history 11
## 10 disorderly_conduct 11
## 11 suicide_attempt 13
## 12 abuse_history 14
## 13 non_substance_related 22
## 14 substance_related 23
## 15 psych_meds 118
What rows have missing values? As we go down the list of columns with missing values, how many rows do we add to the set with NAs?
column_NAs$new_row_NAs <- NA_integer_
row_NAs <- 0L
for (i in 1:nrow(column_NAs)) {
nom <- column_NAs$column[i]
col <- adhd[[nom]]
row_NAs_col <- ifelse(is.na(col), 1L, 0L)
row_NAs <- row_NAs + row_NAs_col
column_NAs$new_row_NAs[i] <- sum(row_NAs_col == 1 & row_NAs == 1)
}
column_NAs$cumulative_NAs <- cumsum(column_NAs$new_row_NAs)
print(column_NAs)
## column NA_count new_row_NAs cumulative_NAs
## 1 subs_alcohol 4 4 4
## 2 subs_thc 4 0 4
## 3 subs_cocaine 4 0 4
## 4 subs_stimulants 4 0 4
## 5 subs_sedatives 4 0 4
## 6 subs_opioids 4 0 4
## 7 court_order 5 1 5
## 8 education 9 6 11
## 9 violence_history 11 1 12
## 10 disorderly_conduct 11 0 12
## 11 suicide_attempt 13 2 14
## 12 abuse_history 14 2 16
## 13 non_substance_related 22 16 32
## 14 substance_related 23 1 33
## 15 psych_meds 118 87 120
If we treat psych_meds separately, there are 23 rows with NAs, or 13% of our observations. One strategy could be to simply drop these rows. There are several strategies we could investigate:
The problem with #3 is that the rows that have NAs tend to have NAs in the other columns that we might want to use as predictors. For example, disorderly_conduct and violence_history overlap in their NAs perfectly! violence_history and abuse_history overlap near perfectly.
For #4, we still can’t leave the NAs, so we have to combine it with #2 or #3. Also, I don’t want to have a dozen indicator variables – one for each column with missing data. However, since there is so much overlap in the observations that have NAs, it would be good to include an indicator has_NAs which is 1 if an observation has an NA in any column.
I’m going to follow this strategy for NAs:
has_NAs and set it to 1 for all observations that have NA in any column except psych_medspsych_medspsych_meds with 0. About 20% of our data has non-zero, non-missing values, i.e., they are taking at least one medication. About 20% of the population takes medication for mental disorders. It stands to reason that our missing values will tend to be zerotemp_has_NAs <- adhd %>%
select(-psych_meds) %>%
is.na %>%
rowSums %>% `>`(1) %>% `*`(1)
adhd2 <- adhd %>%
mutate(
abuse_physical = ifelse(abuse_history %in% c(1, 4, 5, 7), 1, 0),
abuse_sexual = ifelse(abuse_history %in% c(2, 4, 6, 7), 1, 0),
abuse_emotional = ifelse(abuse_history %in% c(3, 5, 6, 7), 1, 0),
race_white = ifelse(race == 1, 1, 2),
psych_meds_NAs = ifelse(is.na(psych_meds), 1, 0),
psych_meds = coalesce(psych_meds, 0),
has_NAs = temp_has_NAs,
subs_alcohol = coalesce(subs_alcohol, 0),
subs_thc = coalesce(subs_thc, 0),
subs_cocaine = coalesce(subs_cocaine, 0),
subs_stimulants = coalesce(subs_stimulants, 0),
subs_sedatives = coalesce(subs_sedatives, 0),
subs_opioids = coalesce(subs_opioids, 0),
court_order = coalesce(court_order, 0),
education = coalesce(education, 12),
violence_history = coalesce(violence_history, 0),
disorderly_conduct = coalesce(disorderly_conduct, 1),
suicide_attempt = coalesce(suicide_attempt, 0),
suicide_attempt_f = factor(ifelse(suicide_attempt == 0, 'No', 'Yes')) ) %>%
select(
-abuse_history,
-race,
-substance_related,
-non_substance_related)
Please see the appendix for a more in-depth survey of our data.
I used the tutorial at https://uc-r.github.io/kmeans_clustering to help me with this problem.
library(cluster)
library(factoextra)
library(gridExtra)
library(NbClust)
adhd_scaled <- adhd2 %>%
select(-suicide_attempt_f) %>%
scale %>%
as_tibble
k2 <- kmeans(adhd_scaled, centers = 2, nstart = 25)
k3 <- kmeans(adhd_scaled, centers = 3, nstart = 25)
k4 <- kmeans(adhd_scaled, centers = 4, nstart = 25)
k5 <- kmeans(adhd_scaled, centers = 5, nstart = 25)
p1 <- fviz_cluster(k2, geom = "point", data = adhd_scaled) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = adhd_scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = adhd_scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = adhd_scaled) + ggtitle("k = 5")
grid.arrange(p1, p2, p3, p4, nrow = 2)
What is the optimal number of clusters? The University of Cincinnati tutorial suggests these methods:
NbClust packageset.seed(123)
fviz_nbclust(adhd_scaled, kmeans, method = "wss")
fviz_nbclust(adhd_scaled, kmeans, method = "silhouette")
gap_stat <- clusGap(
adhd_scaled,
FUN = kmeans,
nstart = 25,
K.max = 10,
B = 50)
fviz_gap_stat(gap_stat)
set.seed(342)
nbc <- NbClust(adhd_scaled, method = 'kmeans')
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 13 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
The NbClust package runs 30 different methods for picking k. Thirteen of them said k = 2 is best.
adhd2$k2 <- k2$cluster
adhd_tall <- adhd2 %>%
mutate(ID = 1:nrow(adhd2)) %>%
select(
ID,
age,
sex,
adhd_total,
mood_total,
subs_alcohol,
subs_thc,
subs_cocaine,
subs_stimulants,
subs_sedatives,
subs_opioids,
court_order,
education,
violence_history,
disorderly_conduct,
suicide_attempt,
psych_meds,
abuse_physical,
abuse_sexual,
abuse_emotional,
race_white,
psych_meds_NAs,
has_NAs,
k2) %>%
pivot_longer(
cols = -c('k2', 'ID'),
names_to = 'field',
values_to = 'value')
ggplot(data = adhd_tall) +
aes(x = value, y = k2) +
geom_jitter() +
facet_wrap(~field, scales = 'free')
Two things pop out in this last plot: the adhd_total score correlates most highly with the cluster assignment when k = 2 and nearly all users of stimulants fall into the first cluster.
ggplot(adhd2) +
aes(x = adhd_total, group = k2, fill = k2) +
geom_density(alpha = 0.5)
table(adhd2$k2, adhd2$subs_stimulants)
##
## 0 1 3
## 1 81 0 1
## 2 83 6 4
This simple partitioning could be defined as “low adhd scores” and “high adhd scores or stimulant users.”
pca1 <- prcomp(adhd_scaled)
# Add PCs for the SVM part below
# I'm going to use them as the two dimensions for plot.svm
adhd2$PC1 <- pca1$x[, 1]
adhd2$PC2 <- pca1$x[, 2]
adhd2 %>%
ggplot(., aes(PC1, PC2)) +
geom_text(aes(label = adhd_total), size = 3) +
geom_point(
aes(size = adhd_total, color = k2), alpha = 0.2)
We know from the k-means part that adhd_total is partitioned well by the first two principal components.
I like to use PCA to summarize columns that are thematically related. For example, what if we only look at the adhd_* columns? Do we see anything interesting?
pca2 <- adhd_scaled %>%
select(starts_with('adhd')) %>%
prcomp()
plot(summary(pca2)$importance[3, ])
It appears to take quite a few PCs to explain 80% plus of the variation. What if we left out the total?
pca3 <- adhd_scaled %>%
select(starts_with('adhd')) %>%
select(-adhd_total) %>%
prcomp()
plot(summary(pca3)$importance[3, ])
It looks about the same. It does not appear that we can roll the adhd_* questions up any better than the adhd_total column already does!
First, let’s split our data between training and test.
set.seed(3)
L_train <- sample(
c(TRUE, FALSE),
size = nrow(adhd2),
replace = TRUE,
prob = c(0.7, 0.3))
train <- adhd2[L_train, ]
test <- adhd2[!L_train, ]
dim(train)
## [1] 122 59
dim(test)
## [1] 53 59
table(L_train, adhd2$k2)
##
## L_train 1 2
## FALSE 24 29
## TRUE 58 64
What variables should we include? It makes sense that we don’t need to include the individual questions, and can just use adhd_total and mood_total for the questionnaires.
library(randomForest)
rf1 <- train %>%
select(
-matches('adhd_[^t]'),
-matches('mood_[^t]'),
-k2, -PC1, -PC2, -suicide_attempt_f) %>%
randomForest(
factor(suicide_attempt) ~ .,
data = .,
importance = TRUE)
varImpPlot(rf1)
This quick random forest variable importance plot reinforces what I would have guessed are the important variables in this exercise. Leaving off the individual questions is OK, in my opinion.
There doesn’t seem to be an obvious separation when using PCs. I’ll try mood_total and adhd_total too.
ggplot(train) +
aes(
x = PC1, y = PC2,
color = suicide_attempt_f, shape = suicide_attempt_f,
size = age) +
geom_point(alpha = 0.75)
ggplot(train) +
aes(
x = adhd_total, y = mood_total,
color = suicide_attempt_f, shape = suicide_attempt_f) +
geom_jitter(size = 3, alpha = 0.75)
SVMs work in hyperplanes, and it may be impossible for us to see what the correct separation is, or whether it is working.
Using the accuracy of classification on the test set will be the only way to judge between different kernels.
Here is a simple example using a linear kernel with default hyperparameters. There is overfitting and we will want to add a penalty for misclassification.
library(kernlab)
library(e1071)
library(caret)
form <- formula(
suicide_attempt_f ~
age + sex + adhd_total + mood_total +
subs_alcohol + subs_thc + subs_cocaine +
subs_stimulants + subs_sedatives + subs_opioids +
court_order + education + violence_history +
disorderly_conduct + psych_meds + abuse_physical +
abuse_sexual + abuse_emotional + race_white +
psych_meds_NAs + has_NAs)
svm1 <- svm(formula = form, data = train, kernel = "linear")
pred1_train <- predict(svm1, train)
pred1_test <- predict(svm1, test)
confusionMatrix(pred1_train, train$suicide_attempt_f)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 81 13
## Yes 6 22
##
## Accuracy : 0.8443
## 95% CI : (0.7675, 0.9036)
## No Information Rate : 0.7131
## P-Value [Acc > NIR] : 0.0005407
##
## Kappa : 0.5952
##
## Mcnemar's Test P-Value : 0.1686686
##
## Sensitivity : 0.9310
## Specificity : 0.6286
## Pos Pred Value : 0.8617
## Neg Pred Value : 0.7857
## Prevalence : 0.7131
## Detection Rate : 0.6639
## Detection Prevalence : 0.7705
## Balanced Accuracy : 0.7798
##
## 'Positive' Class : No
##
confusionMatrix(pred1_test, test$suicide_attempt_f)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 34 10
## Yes 5 4
##
## Accuracy : 0.717
## 95% CI : (0.5765, 0.8321)
## No Information Rate : 0.7358
## P-Value [Acc > NIR] : 0.6865
##
## Kappa : 0.1779
##
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.8718
## Specificity : 0.2857
## Pos Pred Value : 0.7727
## Neg Pred Value : 0.4444
## Prevalence : 0.7358
## Detection Rate : 0.6415
## Detection Prevalence : 0.8302
## Balanced Accuracy : 0.5788
##
## 'Positive' Class : No
##
The built in tune function doesn’t do a great job out of the box. Why? Because we want to get the model that does the best job of predicting “yes” cases. We do not want to think someone is not at risk of suicide when they actually are. As such, we will need to define our own error function. Because confusionMatrix uses the alphabetical ordering of factor levels, “No” is treated as the positive class. Therefore, we want the model that maximizes specificity.
set.seed(0)
tune1 <- tune(
svm,
form,
data = train,
kernel = "linear",
error.fun = function(r, p) {1 - specificity(r, p)},
class.weights = 'inverse',
ranges = list(
cost = c(0.1, 1, 10, 100, 1000) ))
confusionMatrix(
predict(tune1$best.model, test),
test$suicide_attempt_f)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 29 4
## Yes 10 10
##
## Accuracy : 0.7358
## 95% CI : (0.5967, 0.8474)
## No Information Rate : 0.7358
## P-Value [Acc > NIR] : 0.5713
##
## Kappa : 0.4026
##
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.7436
## Specificity : 0.7143
## Pos Pred Value : 0.8788
## Neg Pred Value : 0.5000
## Prevalence : 0.7358
## Detection Rate : 0.5472
## Detection Prevalence : 0.6226
## Balanced Accuracy : 0.7289
##
## 'Positive' Class : No
##
The linear model does OK, especially when we weight “Yes” higher than “No.” What about using a different kernel?
set.seed(1)
tune2 <- tune(
svm,
form,
data = train,
error.fun = function(r, p) {1 - specificity(r, p)},
class.weights = 'inverse',
cost = 1,
ranges = list(
kernel = c('linear', 'radial', 'polynomial', 'sigmoid'),
gamma = c(0.5, 1, 2, 3, 4)
))
confusionMatrix(
predict(tune2$best.model, test),
test$suicide_attempt_f)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 29 4
## Yes 10 10
##
## Accuracy : 0.7358
## 95% CI : (0.5967, 0.8474)
## No Information Rate : 0.7358
## P-Value [Acc > NIR] : 0.5713
##
## Kappa : 0.4026
##
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.7436
## Specificity : 0.7143
## Pos Pred Value : 0.8788
## Neg Pred Value : 0.5000
## Prevalence : 0.7358
## Detection Rate : 0.5472
## Detection Prevalence : 0.6226
## Balanced Accuracy : 0.7289
##
## 'Positive' Class : No
##
The best performing kernel is linear with cost of 1.
# Prints `table` function for each variable
for (nom in names(adhd)) {
col <- adhd[[nom]]
dcol <- unique(col)
message(paste0(nom, ': '))
print(table(col, useNA = "always"))
message('\n')
}
## age:
## col
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 1 1 3 1 6 5 5 5 8 4 3 2 4 2 1 7
## 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 6 1 4 1 3 3 6 4 7 5 12 6 5 6 6 6
## 50 51 52 53 54 55 56 57 61 69 <NA>
## 5 9 2 6 3 1 3 3 3 1 0
##
## sex:
## col
## 1 2 <NA>
## 99 76 0
##
## race:
## col
## 1 2 3 6 <NA>
## 72 100 1 2 0
##
## adhd_q1:
## col
## 0 1 2 3 4 <NA>
## 39 43 44 30 19 0
##
## adhd_q2:
## col
## 0 1 2 3 4 <NA>
## 25 46 47 33 24 0
##
## adhd_q3:
## col
## 0 1 2 3 4 <NA>
## 26 46 46 32 25 0
##
## adhd_q4:
## col
## 0 1 2 3 4 <NA>
## 27 31 50 31 36 0
##
## adhd_q5:
## col
## 0 1 2 3 4 5 <NA>
## 33 21 32 47 41 1 0
##
## adhd_q6:
## col
## 0 1 2 3 4 <NA>
## 36 29 45 45 20 0
##
## adhd_q7:
## col
## 0 1 2 3 4 <NA>
## 22 53 54 25 21 0
##
## adhd_q8:
## col
## 0 1 2 3 4 <NA>
## 21 40 40 42 32 0
##
## adhd_q9:
## col
## 0 1 2 3 4 <NA>
## 31 43 36 41 24 0
##
## adhd_q10:
## col
## 0 1 2 3 4 <NA>
## 15 46 49 33 32 0
##
## adhd_q11:
## col
## 0 1 2 3 4 <NA>
## 16 33 48 43 35 0
##
## adhd_q12:
## col
## 0 1 2 3 4 <NA>
## 55 55 37 15 13 0
##
## adhd_q13:
## col
## 0 1 2 3 4 <NA>
## 15 29 46 47 38 0
##
## adhd_q14:
## col
## 0 1 2 3 4 <NA>
## 27 24 40 47 37 0
##
## adhd_q15:
## col
## 0 1 2 3 4 <NA>
## 50 39 35 27 24 0
##
## adhd_q16:
## col
## 0 1 2 3 4 <NA>
## 40 49 39 17 30 0
##
## adhd_q17:
## col
## 0 1 2 3 4 <NA>
## 49 41 46 22 17 0
##
## adhd_q18:
## col
## 0 1 2 3 4 <NA>
## 49 52 35 20 19 0
##
## adhd_total:
## col
## 0 1 3 5 6 7 8 9 10 11 12 13 14 16 17 18
## 1 2 1 1 3 2 1 2 2 1 4 1 4 1 8 1
## 19 20 21 23 24 25 26 27 28 29 30 31 32 33 34 35
## 5 3 3 1 6 4 1 2 6 2 3 7 7 3 1 3
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## 3 2 3 3 6 3 5 3 2 3 3 3 6 6 3 2
## 52 53 54 55 56 57 58 62 63 65 67 69 71 72 <NA>
## 3 1 3 3 3 2 1 2 1 3 1 1 1 2 0
##
## mood_q1a:
## col
## 0 1 <NA>
## 79 96 0
##
## mood_q1b:
## col
## 0 1 <NA>
## 75 100 0
##
## mood_q1c:
## col
## 0 1 <NA>
## 80 95 0
##
## mood_q1d:
## col
## 0 1 <NA>
## 73 102 0
##
## mood_q1e:
## col
## 0 1 <NA>
## 78 97 0
##
## mood_q1f:
## col
## 0 1 <NA>
## 53 122 0
##
## mood_q1g:
## col
## 0 1 <NA>
## 49 126 0
##
## mood_q1h:
## col
## 0 1 <NA>
## 77 98 0
##
## mood_q1i:
## col
## 0 1 <NA>
## 72 103 0
##
## mood_q1j:
## col
## 0 1 <NA>
## 107 68 0
##
## mood_q1k:
## col
## 0 1 <NA>
## 90 85 0
##
## mood_q1l:
## col
## 0 1 <NA>
## 73 102 0
##
## mood_q1m:
## col
## 0 1 <NA>
## 89 86 0
##
## mood_q2:
## col
## 0 1 <NA>
## 49 126 0
##
## mood_q3:
## col
## 0 1 2 3 <NA>
## 25 25 49 76 0
##
## mood_total:
## col
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 9 3 5 6 4 7 10 6 8 12 13 18 12 13 12 14
## 16 17 <NA>
## 12 11 0
##
## subs_alcohol:
## col
## 0 1 2 3 <NA>
## 80 18 7 66 4
##
## subs_thc:
## col
## 0 1 2 3 <NA>
## 116 12 3 40 4
##
## subs_cocaine:
## col
## 0 1 2 3 <NA>
## 101 9 5 56 4
##
## subs_stimulants:
## col
## 0 1 3 <NA>
## 160 6 5 4
##
## subs_sedatives:
## col
## 0 1 2 3 <NA>
## 161 4 1 5 4
##
## subs_opioids:
## col
## 0 1 3 <NA>
## 146 4 21 4
##
## court_order:
## col
## 0 1 <NA>
## 155 15 5
##
## education:
## col
## 6 7 8 9 10 11 12 13 14 15 16 17 18 19 <NA>
## 2 2 5 12 12 23 67 15 14 1 7 2 3 1 9
##
## violence_history:
## col
## 0 1 <NA>
## 124 40 11
##
## disorderly_conduct:
## col
## 0 1 <NA>
## 45 119 11
##
## suicide_attempt:
## col
## 0 1 <NA>
## 113 49 13
##
## abuse_history:
## col
## 0 1 2 3 4 5 6 7 <NA>
## 101 8 20 4 6 10 4 8 14
##
## non_substance_related:
## col
## 0 1 2 <NA>
## 102 35 16 22
##
## substance_related:
## col
## 0 1 2 3 <NA>
## 42 61 35 14 23
##
## psych_meds:
## col
## 0 1 2 <NA>
## 19 21 17 118
##
# adhd totals
s <- adhd %>%
select(starts_with('adhd'), -adhd_total) %>%
rowSums(., na.rm = TRUE)
which(adhd$adhd_total - s != 0)
## [1] 39 43 114
adhd %>%
select(starts_with('adhd')) %>%
slice(39, 43, 114) %>%
as.data.frame
## adhd_q1 adhd_q2 adhd_q3 adhd_q4 adhd_q5 adhd_q6 adhd_q7 adhd_q8 adhd_q9
## 1 4 4 2 4 1 0 0 4 2
## 2 0 1 2 2 4 2 1 1 1
## 3 1 0 0 2 1 0 3 1 0
## adhd_q10 adhd_q11 adhd_q12 adhd_q13 adhd_q14 adhd_q15 adhd_q16 adhd_q17
## 1 4 0 0 3 3 0 4 4
## 2 1 1 0 3 2 1 0 0
## 3 1 1 0 1 0 0 0 0
## adhd_q18 adhd_total
## 1 0 35
## 2 0 31
## 3 0 10
# mood totals
s <- adhd %>%
select(starts_with('mood'), -mood_total) %>%
rowSums(., na.rm = TRUE)
which(adhd$mood_total - s != 0)
## [1] 107
adhd %>%
select(starts_with('mood')) %>%
slice(107) %>%
as.data.frame
## mood_q1a mood_q1b mood_q1c mood_q1d mood_q1e mood_q1f mood_q1g mood_q1h
## 1 1 0 0 1 1 0 1 1
## mood_q1i mood_q1j mood_q1k mood_q1l mood_q1m mood_q2 mood_q3 mood_total
## 1 0 0 0 0 0 0 1 3
# substance questions NAs
s <- adhd %>%
select(starts_with('subs_')) %>%
is.na() %>%
rowSums()
which(s > 0)
## [1] 53 73 106 162
adhd %>%
slice(53, 73, 106, 162) %>%
select(starts_with('subs_'))
## # A tibble: 4 x 6
## subs_alcohol subs_thc subs_cocaine subs_stimulants subs_sedatives subs_opioids
## <int> <int> <int> <int> <int> <int>
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
# NAs in Dx questions
s <- adhd %>%
select(ends_with('_related')) %>%
is.na() %>%
rowSums()
which(s == 1)
## [1] 175
adhd %>%
slice(175) %>%
select(ends_with('_related'))
## # A tibble: 1 x 2
## non_substance_related substance_related
## <int> <int>
## 1 1 NA
# Some bi-variate tables to maybe help understand variable definitions
table(adhd$court_order, adhd$subs_alcohol, useNA = 'always')
##
## 0 1 2 3 <NA>
## 0 73 16 7 59 0
## 1 6 2 0 7 0
## <NA> 1 0 0 0 4
table(adhd$court_order, adhd$subs_thc, useNA = 'always')
##
## 0 1 2 3 <NA>
## 0 111 9 3 32 0
## 1 4 3 0 8 0
## <NA> 1 0 0 0 4
table(adhd$court_order, adhd$subs_cocaine, useNA = 'always')
##
## 0 1 2 3 <NA>
## 0 94 8 5 48 0
## 1 7 1 0 7 0
## <NA> 0 0 0 1 4
table(adhd$court_order, adhd$subs_stimulants, useNA = 'always')
##
## 0 1 3 <NA>
## 0 146 5 4 0
## 1 13 1 1 0
## <NA> 1 0 0 4
table(adhd$court_order, adhd$subs_sedatives, useNA = 'always')
##
## 0 1 2 3 <NA>
## 0 145 4 1 5 0
## 1 15 0 0 0 0
## <NA> 1 0 0 0 4
table(adhd$court_order, adhd$subs_opioids, useNA = 'always')
##
## 0 1 3 <NA>
## 0 131 4 20 0
## 1 14 0 1 0
## <NA> 1 0 0 4
table(adhd$court_order, adhd$violence_history, useNA = 'always')
##
## 0 1 <NA>
## 0 115 35 5
## 1 9 5 1
## <NA> 0 0 5
table(adhd$court_order, adhd$disorderly_conduct, useNA = 'always')
##
## 0 1 <NA>
## 0 45 105 5
## 1 0 14 1
## <NA> 0 0 5
table(adhd$court_order, adhd$non_substance_related, useNA = 'always')
##
## 0 1 2 <NA>
## 0 91 33 14 17
## 1 10 2 2 1
## <NA> 1 0 0 4
table(adhd$court_order, adhd$substance_related, useNA = 'always')
##
## 0 1 2 3 <NA>
## 0 39 54 32 12 18
## 1 3 6 3 2 1
## <NA> 0 1 0 0 4
table(adhd$disorderly_conduct, adhd$violence_history, useNA = 'always')
##
## 0 1 <NA>
## 0 45 0 0
## 1 79 40 0
## <NA> 0 0 11
table(adhd$disorderly_conduct, adhd$non_substance_related, useNA = 'always')
##
## 0 1 2 <NA>
## 0 20 16 7 2
## 1 77 19 9 14
## <NA> 5 0 0 6
table(adhd$disorderly_conduct, adhd$substance_related, useNA = 'always')
##
## 0 1 2 3 <NA>
## 0 21 13 6 2 3
## 1 21 46 26 12 14
## <NA> 0 2 3 0 6
table(adhd$non_substance_related, adhd$violence_history, useNA = 'always')
##
## 0 1 <NA>
## 0 73 24 5
## 1 25 10 0
## 2 13 3 0
## <NA> 13 3 6
table(adhd$non_substance_related, adhd$substance_related, useNA = 'always')
##
## 0 1 2 3 <NA>
## 0 8 53 29 12 0
## 1 22 6 4 2 1
## 2 12 2 2 0 0
## <NA> 0 0 0 0 22
table(adhd$substance_related, adhd$violence_history, useNA = 'always')
##
## 0 1 <NA>
## 0 35 7 0
## 1 42 17 2
## 2 23 9 3
## 3 10 4 0
## <NA> 14 3 6
par(mfrow = c(1, 3))
hist(adhd$adhd_total)
hist(adhd$mood_total)
hist(adhd$age)
# Significant correlation between NAs
# Leave off psych_meds since there are so many!
adhd %>%
is.na %>%
as_tibble %>%
group_by(
subs_alcohol, subs_thc, subs_cocaine, subs_stimulants,
subs_sedatives, subs_opioids, court_order, education,
violence_history, disorderly_conduct, suicide_attempt,
abuse_history, non_substance_related, substance_related) %>%
summarize(Count = n()) %>%
as.matrix %>%
cor %>%
as.data.frame(row.names = FALSE) %>%
cbind(data.frame(column1 = names(.)), .) %>%
pivot_longer(-column1, names_to = 'column2', values_to = 'corr') %>%
filter(column1 < column2) %>%
filter(round(corr, 1) > 0.75) %>%
as.data.frame
## `summarise()` has grouped output by 'subs_alcohol', 'subs_thc', 'subs_cocaine', 'subs_stimulants', 'subs_sedatives', 'subs_opioids', 'court_order', 'education', 'violence_history', 'disorderly_conduct', 'suicide_attempt', 'abuse_history', 'non_substance_related'. You can override using the `.groups` argument.
## column1 column2 corr
## 1 subs_alcohol subs_thc 1.0000000
## 2 subs_alcohol subs_cocaine 1.0000000
## 3 subs_alcohol subs_stimulants 1.0000000
## 4 subs_alcohol subs_sedatives 1.0000000
## 5 subs_alcohol subs_opioids 1.0000000
## 6 subs_cocaine subs_thc 1.0000000
## 7 subs_cocaine subs_stimulants 1.0000000
## 8 subs_cocaine subs_sedatives 1.0000000
## 9 subs_cocaine subs_opioids 1.0000000
## 10 subs_stimulants subs_thc 1.0000000
## 11 subs_sedatives subs_thc 1.0000000
## 12 subs_sedatives subs_stimulants 1.0000000
## 13 subs_opioids subs_thc 1.0000000
## 14 subs_opioids subs_stimulants 1.0000000
## 15 subs_opioids subs_sedatives 1.0000000
## 16 court_order subs_alcohol 0.7784989
## 17 court_order subs_thc 0.7784989
## 18 court_order subs_cocaine 0.7784989
## 19 court_order subs_stimulants 0.7784989
## 20 court_order subs_sedatives 0.7784989
## 21 court_order subs_opioids 0.7784989
## 22 disorderly_conduct violence_history 1.0000000
## 23 non_substance_related substance_related 0.8539126