Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(class)
library(rpart)
library(rpart.plot)
library(naniar)
library(xgboost)
library(DiagrammeR)
library(factoextra)
library(e1071)
library(FactoMineR)
library(skimr)

Background

For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet contains the data (in the tab named “Data”), and the data dictionary is given in the second tab. You can get as creative as you want. The assignment is designed to really get you to think about how you could use different methods.

The target variable is ‘Suicide Attempt’.

Data Dictionary

ADHD Data Dictionary

Problem Statement

  1. Conduct a thorough Exploratory Data Analysis (EDA) to understand the dataset. (20 points)
  2. Use a clustering method to find clusters of patients here. Whether you choose to use k-means clustering or hierarchical clustering is up to you as long as you reason through your work. You are free to be creative in terms of which variables or some combination of those you want to use. Can you come up with creative names for the profiles you found? (40 points)
  3. Let’s explore using Principal Component Analysis on this dataset. You will note that there are different types of questions in the dataset: column: E-W: ADHD self-report; column X – AM: mood disorders questionnaire, column AN-AS: Individual Substance Misuse; etc. You could just use ONE of the sets of questionnaire, for example, you can conduct PCA on the ADHD score, or mood disorder score, etc. Please reason through your work as you decide on which sets of variables you want to use to conduct Principal Component Analysis. What did you learn from the PCA? Can you comment on which question may have a heavy bearing on the score? (40 points)
  4. Assume you are modeling whether a patient attempted suicide (column AX). This is a binary target variable. Please use Gradient Boosting to predict whether a patient attempts suicide. Please use whatever boosting approach you deem appropriate, but be sure to walk us through your steps. (50 points)
  5. Using the same target variable (suicide attempt), please use support vector machine to model this. You might want to consider reducing the number of variables or somehow use extracted information from the variables. This can be a really fun modeling task! (50 points)

Dataset

dataset <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/HW4/ADHD_data.csv')
head(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
(First six rows of the dataset, rendered as a scrollable table with 54 columns: Initial, Age, Sex, Race, ADHD Q1–Q18 and ADHD Total, MD Q1a–Q1m, MD Q2, MD Q3 and MD TOTAL, the substance variables Alcohol, THC, Cocaine, Stimulants, Sedative-hypnotics and Opioids, and the history variables Court order, Education, Hx of Violence, Disorderly Conduct, Suicide, Abuse, Non-subst Dx, Subst Dx and Psych meds.)

Descriptive Dataset Summary

summary(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
(summary() output, rendered as a scrollable table: 175 observations. Age ranges 18–69 with mean 39.5; the ADHD items are scored on roughly a 0–4 scale (ADHD Total ranges 0–72, mean 34.3); the MD items are binary (MD TOTAL ranges 0–17, mean 10.0); Education ranges 6–19. Missing values appear only in the later columns, from 4 NAs in each substance-use variable up to 118 NAs in Psych meds.)
# Show dataset summary details
#skim(dataset)

Pre-Processing

Missing Value Analysis

Based on the above descriptive data summary, quite a few variables have missing values, so we analyzed the missing values across all attributes to identify a proper imputation technique.

## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL

dataset_missing_counts <- dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`))

dataset_missing_counts  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature NA_Count NA_Percentage
Psych meds. 118 67.428571
Subst Dx 23 13.142857
Non-subst Dx 22 12.571429
Abuse 14 8.000000
Suicide 13 7.428571
Hx of Violence 11 6.285714
Disorderly Conduct 11 6.285714
Education 9 5.142857
Court order 5 2.857143
Alcohol 4 2.285714
THC 4 2.285714
Cocaine 4 2.285714
Stimulants 4 2.285714
Sedative-hypnotics 4 2.285714
Opioids 4 2.285714
ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) + 
  geom_bar(stat = 'identity', fill = 'steelblue') +
  geom_label(aes(label = NA_Count)) +
  labs(title = 'Missing Counts') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())

# Use the naniar package to plot missing value patterns
gg_miss_upset(dataset)

Data Imputation

Based on the above missing value analysis, we perform data imputation using the mice package with the random forest ('rf') method. Before that, we remove the Initial column, since an ID is not needed for imputation; we also remove the ADHD Total and MD TOTAL subtotal columns to avoid collinearity with their constituent items.

# removing Initial, ADHD Total and MD Total columns
dataset <- dataset %>%
  select(-c('Initial','ADHD Total','MD TOTAL'))

# cleaning up the column names for the imputation function
colNamesNoSpace <- colnames(dataset) %>%
  str_remove_all(' |-|\\.')

colnames(dataset) <- colNamesNoSpace
#imputation using the random forest method ('rf')
init <- mice(dataset, maxit = 0, printFlag = FALSE)  # dry run to obtain the default predictor matrix
predM <- init$predictorMatrix
set.seed(123)
# printFlag = FALSE suppresses the per-iteration log (mice() has no 'silent' argument)
imputed <- mice(dataset, method = 'rf', predictorMatrix = predM, m = 5, printFlag = FALSE)
# complete the imputation and show summary of the imputed data
dataset <- complete(imputed)
summary(dataset) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
(summary() of the imputed dataset, rendered as a scrollable table over the 51 remaining columns: no NAs remain, and the variable distributions are essentially unchanged from the pre-imputation summary, e.g. mean Age is still 39.47 and mean Education is 11.91.)

We also checked for the presence of degenerate (near-zero-variance) variables and removed them from the dataset.

# capturing the degenerate variables
degenCols <- nearZeroVar(dataset)

# identifying them 
colnames(dataset[,degenCols])
## [1] "Stimulants"        "Sedativehypnotics"
# removing from the dataset
dataset <- dataset[,-degenCols]

Exploratory Data Analysis

# make dataset long to place distribution in a facetwrap
vars <- dataset %>%
  gather(key = 'predictor_variable', value = 'value', -Suicide) %>%
  mutate(Suicide = ifelse(Suicide==1,'Y','N'))

# Distribution of ADHD variables
vars %>%
  filter(str_detect(predictor_variable,'ADHD')) %>%
  ggplot() +
  geom_histogram(aes(x = value, y = ..density.., fill = Suicide), bins = 15) +
  labs(title = 'Distributions of ADHD Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(. ~predictor_variable, scales = 'free', ncol = 3)

# Distribution of MDQ variables
vars %>%
  filter(str_detect(predictor_variable,'MD')) %>%
  ggplot() +
  geom_histogram(aes(x = value, y = ..density.., fill = Suicide), bins = 15) +
  labs(title = 'Distributions of MDQ Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(. ~predictor_variable, scales = 'free', ncol = 3)

# Distribution of other variables
vars %>%
  filter(!str_detect(predictor_variable,'MD|ADH')) %>%
  ggplot() +
  geom_histogram(aes(x = value, y = ..density.., fill = Suicide), bins = 15) +
  labs(title = 'Distributions of Other Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(. ~predictor_variable, scales = 'free', ncol = 3)

Observations

  • The ADHD item distributions illustrate that responses of 0–3 are mostly associated with a Suicide value of N; the inverse appears to hold at the high end, where responses of 4–5 are more often associated with Y.

  • The MDQ distributions demonstrate that a response of 0 is more likely to correspond to a Suicide value of N; however, the opposite does not seem to hold.

  • The distributions of the remaining variables suggest that Abuse and Alcohol could be strong indicators of suicide attempts.

Correlation Plot: Multicollinearity Check

corrMatrix <- round(cor(dataset),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

From the plot above, we conclude that no variable pairs show high correlation, so multicollinearity does not appear to be a concern.

Splitting Data: Train/Test

We are going to do a 75%/25% split for training and test purposes.

set.seed(123)  # fix the seed so the split is reproducible
sample = sample.split(dataset$Suicide, SplitRatio = 0.75)
train = subset(dataset, sample == TRUE)
test = subset(dataset, sample == FALSE)

#head(train)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Model Building

y_train <- as.factor(train$Suicide)
y_test <- as.factor(test$Suicide)

X_train <- train %>% select(-'Suicide')
X_test <- test %>% select(-'Suicide')

Problem2: Clustering Method

K-means clustering is an example of an unsupervised machine learning algorithm. A cluster is a collection of data points aggregated together because of certain similarities. The “k” in k-means is the number of centroids, each an imaginary or real location representing the center of a cluster, and the “means” refers to averaging the data points, i.e. finding each centroid.
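Formally, k-means chooses cluster assignments that minimize the total within-cluster sum of squares (a standard formulation, stated here for reference):

$$\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $\mu_j$ is the centroid (mean) of cluster $C_j$.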

The first step is to standardize the data to have a mean of zero and a standard deviation of one. This matters because the k-means algorithm uses a distance-based measure of similarity between data points, and the features in a dataset usually have different units of measurement.

#select numeric values only (note that the clustering below is performed on the test split, X_test)
nums <- X_test %>% dplyr::select(where(is.numeric))
#center, scale, and Box-Cox transform the data for the k-means algorithm
dftrans <- preProcess(nums, method = c("center", "scale", "BoxCox"))
dftrans <- predict(dftrans, nums)

To determine the number of clusters for our algorithm, we will use the average silhouette method, which measures how well each object lies within its assigned cluster; the higher the average silhouette width, the better the clustering. The function fviz_nbclust allows us to do this with ease, as shown below.
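For reference, the silhouette width of an observation $i$ is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the average distance from $i$ to the other members of its own cluster and $b(i)$ is the average distance from $i$ to the members of the nearest other cluster; values close to 1 indicate a well-placed observation.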

library(factoextra)
#determine # of clusters
fviz_nbclust(dftrans, kmeans, method = "silhouette", k.max = 10) +
  labs(subtitle = "Silhouette method") # add subtitle

Another popular method is the Elbow method, which plots the total within-cluster sum of squares (WSS) against the number of clusters and picks the value of k at the “elbow”, where adding further clusters stops reducing WSS appreciably.

fviz_nbclust(dftrans, kmeans, method = "wss") +
  labs(subtitle = "Elbow method") # add subtitle

Both methods for choosing ‘k’ suggest that 2 clusters is the optimal number. We can visualize the resulting clusters as shown below.

final <- kmeans(dftrans, 2, nstart =25)
print(final)
## K-means clustering with 2 clusters of sizes 25, 19
## 
## Cluster means:
##           Age        Sex       Race     ADHDQ1     ADHDQ2     ADHDQ3     ADHDQ4
## 1 -0.03780229  0.2585521 -0.1736544  0.6385017  0.3843586  0.4393394  0.5424152
## 2  0.04973985 -0.3402001  0.2284926 -0.8401339 -0.5057350 -0.5780781 -0.7137043
##       ADHDQ5     ADHDQ6     ADHDQ7     ADHDQ8     ADHDQ9    ADHDQ10    ADHDQ11
## 1  0.3112391  0.4136629  0.4312272  0.5065839  0.5292581  0.4813072  0.4100311
## 2 -0.4095251 -0.5442933 -0.5674042 -0.6665578 -0.6963922 -0.6332990 -0.5395146
##      ADHDQ12    ADHDQ13    ADHDQ14    ADHDQ15    ADHDQ16    ADHDQ17    ADHDQ18
## 1  0.3326787  0.3997419  0.4782983  0.6970718  0.3248803  0.4230100  0.4833985
## 2 -0.4377351 -0.5259761 -0.6293399 -0.9171997 -0.4274741 -0.5565921 -0.6360507
##        MDQ1a      MDQ1b      MDQ1c      MDQ1d      MDQ1e      MDQ1f      MDQ1g
## 1  0.4012353  0.4154862  0.2971580  0.4012353  0.4953176  0.5535052  0.7032124
## 2 -0.5279412 -0.5466923 -0.3909974 -0.5279412 -0.6517336 -0.7282963 -0.9252795
##        MDQ1h      MDQ1i     MDQ1j      MDQ1k      MDQ1L      MDQ1m       MDQ2
## 1  0.3029965  0.2595573  0.352529  0.2887801  0.5053652  0.3220678  0.6401750
## 2 -0.3986795 -0.3415228 -0.463854 -0.3799738 -0.6649542 -0.4237735 -0.8423355
##         MDQ3      Alcohol        THC      Cocaine     Opioids Courtorder
## 1  0.3454007 -0.007060812  0.1408134 -0.008730403  0.09413079  0.1000364
## 2 -0.4544747  0.009290543 -0.1852809  0.011487373 -0.12385631 -0.1316268
##     Education HxofViolence DisorderlyConduct      Abuse NonsubstDx    SubstDx
## 1  0.05086440   0.03002316       -0.08875668  0.2150859  0.2350936  0.0445040
## 2 -0.06692684  -0.03950416        0.11678511 -0.2830078 -0.3093337 -0.0585579
##    Psychmeds
## 1  0.1108598
## 2 -0.1458681
## 
## Clustering vector:
##   2  17  22  24  37  38  40  50  56  60  64  66  67  68  75  80  85  86  92  97 
##   1   1   2   1   1   1   1   2   2   1   1   2   1   1   2   1   2   2   1   1 
##  98 101 103 105 108 109 114 116 121 122 128 131 132 133 143 148 151 152 156 163 
##   1   1   1   2   1   1   2   2   2   1   1   2   1   2   2   1   1   2   1   2 
## 164 165 166 172 
##   2   2   1   2 
## 
## Within cluster sum of squares by cluster:
## [1] 997.9955 655.8164
##  (between_SS / total_SS =  19.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
fviz_cluster(final, data = dftrans)

library(cluster)
#compute pairwise Euclidean distances between observations
d <- dist(dftrans, method = "euclidean")
#run k-means on the distance matrix and visualize with clusplot
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

Model Summary

We can record the summary of the kmeans cluster model metrics below:

# reuse the k-means fit from above (same data, k, and nstart) to profile the clusters
predict_km <- final
dftrans %>%
  mutate(Cluster = predict_km$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean") %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
(Cluster profile table, rendered as a scrollable kable of mean standardized feature values by cluster: Cluster 1 sits above the overall average on essentially every ADHD and MDQ item as well as on Abuse, while Cluster 2 sits below average on the same items, suggesting roughly a “higher-symptom” and a “lower-symptom” patient profile.)

Problem3: Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning. It is also a useful technique for exploratory data analysis, allowing us to better visualize the variation present in a dataset with many variables. For our particular use case, many of the questionnaire variables fall on Likert scales, which, when prepared for analysis, are often expanded into dummy variables; this creates many additional features and makes analysis more difficult due to the increased number of dimensions. Therefore, using PCA to reduce the number of dimensions of our entire dataset and to measure the amount of variance explained is beneficial. To do this, we use the prcomp() function:

dataset.pca <- prcomp(dataset, center = TRUE, scale = TRUE)

summary(dataset.pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     3.4879 2.15325 1.51854 1.37071 1.32056 1.30841 1.2246
## Proportion of Variance 0.2483 0.09462 0.04706 0.03834 0.03559 0.03494 0.0306
## Cumulative Proportion  0.2483 0.34289 0.38995 0.42830 0.46389 0.49883 0.5294
##                            PC8     PC9    PC10    PC11    PC12    PC13   PC14
## Standard deviation     1.19453 1.14053 1.12488 1.09600 1.03186 1.01797 1.0023
## Proportion of Variance 0.02912 0.02655 0.02582 0.02451 0.02173 0.02115 0.0205
## Cumulative Proportion  0.55855 0.58510 0.61092 0.63544 0.65717 0.67831 0.6988
##                           PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.97020 0.9469 0.91595 0.87918 0.84215 0.82789 0.82016
## Proportion of Variance 0.01921 0.0183 0.01712 0.01577 0.01447 0.01399 0.01373
## Cumulative Proportion  0.71803 0.7363 0.75345 0.76922 0.78370 0.79768 0.81141
##                           PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     0.77736 0.76284 0.74558 0.73182 0.70326 0.69314 0.6825
## Proportion of Variance 0.01233 0.01188 0.01134 0.01093 0.01009 0.00981 0.0095
## Cumulative Proportion  0.82374 0.83562 0.84696 0.85789 0.86799 0.87779 0.8873
##                           PC29    PC30    PC31   PC32   PC33    PC34    PC35
## Standard deviation     0.67280 0.65795 0.62679 0.6182 0.5979 0.57780 0.56968
## Proportion of Variance 0.00924 0.00883 0.00802 0.0078 0.0073 0.00681 0.00662
## Cumulative Proportion  0.89654 0.90537 0.91339 0.9212 0.9285 0.93530 0.94192
##                           PC36    PC37    PC38    PC39    PC40   PC41   PC42
## Standard deviation     0.55339 0.52917 0.51864 0.49580 0.47679 0.4748 0.4644
## Proportion of Variance 0.00625 0.00571 0.00549 0.00502 0.00464 0.0046 0.0044
## Cumulative Proportion  0.94817 0.95388 0.95937 0.96439 0.96903 0.9736 0.9780
##                           PC43    PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     0.45002 0.43212 0.41491 0.39393 0.37519 0.35038 0.31036
## Proportion of Variance 0.00413 0.00381 0.00351 0.00317 0.00287 0.00251 0.00197
## Cumulative Proportion  0.98217 0.98598 0.98949 0.99266 0.99553 0.99803 1.00000

The scale = 0 argument to biplot ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.

biplot(dataset.pca, scale = 0, cex=0.5)

From the biplot above, it’s difficult to tell much given the very large number of features. However, from our PCA analysis, we can also take a look at the eigenvalues that were generated by using a scree plot. The plot below shows the percentage of variance explained by each principal component.

fviz_eig(dataset.pca)

#compute standard deviation of each principal component
std_dev <- dataset.pca$sdev

#compute variance
pr_var <- std_dev^2


prop_varex <- pr_var/sum(pr_var)

round(prop_varex[1:10], 2)
##  [1] 0.25 0.09 0.05 0.04 0.04 0.03 0.03 0.03 0.03 0.03

The first principal component in our example therefore explains 25% of the variability, and the second explains 9%; together, the first ten principal components explain about 61% of the variability. We proceed to plot the PVE and the cumulative PVE to produce scree plots, as we did earlier.
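For clarity, the proportion of variance explained (PVE) by the $k$-th component is its eigenvalue as a share of the total variance,

$$\text{PVE}_k = \frac{\lambda_k}{\sum_{j} \lambda_j}, \qquad \lambda_k = \text{sdev}_k^2,$$

which is exactly what pr_var/sum(pr_var) computes above from the component standard deviations.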

#scree plot
plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

As we can see above in our plots of the PVE and the cumulative PVE, the first few principal components account for a much larger proportion of the variance explained than the remaining components (49 in total for this dataset). Additionally, the proportion of variance explained by the first principal component, at 25%, is roughly three times that of the second.

Although running PCA on the entire dataset is helpful in some ways, certain findings from a biplot are affected by the number of variables in the overall dataset. Additionally, we are seeing from our PCA analysis that the first few principal components explain a much larger proportion of the variability than later principal components. Because of this, we thought it would be interesting to dive a bit deeper into which variables seem to hold the most importance in determining the first few dimensions. To do this, we’ll use the FactoMineR and factoextra packages to determine the eigenvalues of each dimension, with particular interest in dimensions 1 and 2, since those are what we’ve plotted above and hold the highest proportion of variance explained.

eig.val <- get_eigenvalue(dataset.pca)
eig.val %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
eigenvalue variance.percent cumulative.variance.percent
Dim.1 12.1653335 24.8272112 24.82721
Dim.2 4.6364822 9.4622086 34.28942
Dim.3 2.3059660 4.7060531 38.99547
Dim.4 1.8788556 3.8343992 42.82987
Dim.5 1.7438785 3.5589358 46.38881
Dim.6 1.7119463 3.4937679 49.88258
Dim.7 1.4996351 3.0604798 52.94306
Dim.8 1.4269113 2.9120639 55.85512
Dim.9 1.3008135 2.6547214 58.50984
Dim.10 1.2653565 2.5823602 61.09220
Dim.11 1.2012221 2.4514737 63.54367
Dim.12 1.0647328 2.1729240 65.71660
Dim.13 1.0362688 2.1148343 67.83143
Dim.14 1.0046132 2.0502310 69.88166
Dim.15 0.9412784 1.9209763 71.80264
Dim.16 0.8966428 1.8298832 73.63252
Dim.17 0.8389721 1.7121880 75.34471
Dim.18 0.7729593 1.5774679 76.92218
Dim.19 0.7092249 1.4473977 78.36958
Dim.20 0.6854086 1.3987931 79.76837
Dim.21 0.6726696 1.3727952 81.14117
Dim.22 0.6042928 1.2332506 82.37442
Dim.23 0.5819237 1.1875994 83.56202
Dim.24 0.5558916 1.1344727 84.69649
Dim.25 0.5355566 1.0929727 85.78946
Dim.26 0.4945789 1.0093447 86.79881
Dim.27 0.4804497 0.9805095 87.77932
Dim.28 0.4657421 0.9504941 88.72981
Dim.29 0.4526588 0.9237935 89.65360
Dim.30 0.4328935 0.8834562 90.53706
Dim.31 0.3928620 0.8017593 91.33882
Dim.32 0.3821097 0.7798158 92.11863
Dim.33 0.3575121 0.7296164 92.84825
Dim.34 0.3338529 0.6813324 93.52958
Dim.35 0.3245365 0.6623194 94.19190
Dim.36 0.3062365 0.6249724 94.81687
Dim.37 0.2800222 0.5714738 95.38835
Dim.38 0.2689839 0.5489466 95.93730
Dim.39 0.2458200 0.5016734 96.43897
Dim.40 0.2273312 0.4639412 96.90291
Dim.41 0.2254670 0.4601368 97.36305
Dim.42 0.2156907 0.4401851 97.80323
Dim.43 0.2025136 0.4132931 98.21652
Dim.44 0.1867237 0.3810689 98.59759
Dim.45 0.1721464 0.3513193 98.94891
Dim.46 0.1551781 0.3166900 99.26560
Dim.47 0.1407682 0.2872820 99.55288
Dim.48 0.1227652 0.2505412 99.80343
Dim.49 0.0963212 0.1965739 100.00000
res.pca <- PCA(dataset, scale.unit = TRUE, ncp = 5, graph = FALSE)
var <- get_pca_var(res.pca)

An eigenvalue greater than 1 indicates that a principal component accounts for more variance than a single one of the original variables in the standardized data, which is why eigenvalue = 1 is often used as a cutoff (the Kaiser criterion) for deciding which PCs to retain. In our table the eigenvalues drop below 1 after Dim.14, so dimensions past that point provide little additional insight into our data. The scree plot above cuts this even further, showing that components beyond the first ten individually account for only a small proportion of the variance. Going forward, therefore, we will focus on the first few dimensions to see if we can garner any insights.
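As a quick check of this cutoff, we can count the components whose eigenvalues exceed 1 directly from the eigenvalue table; a minimal sketch using the eig.val object computed above:

# number of principal components retained under the Kaiser criterion (eigenvalue > 1)
n_kaiser <- sum(eig.val[, "eigenvalue"] > 1)
n_kaiser
# cumulative variance explained by the retained components
eig.val[n_kaiser, "cumulative.variance.percent"]

Per the table above, this retains 14 components, which together explain roughly 70% of the variance.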

Since our biplot above was very crowded and difficult to interpret, we decided to look a bit deeper at the quality of representation of the variables, measured by the squared cosine (cos2). It reflects a variable’s position relative to the circumference of the correlation circle (not pictured in our biplot above, but visually indicated by the length of each vector from the origin). Plotting these cos2 values across the first five PCs with corrplot, we can examine the following:

corrplot(var$cos2, is.corr=FALSE)

Typically, a high cos2 value indicates a good representation of the variable on the principal component; such a variable is positioned close to the circumference of the correlation circle, far from the origin, as we can see for variables such as ADHDQ8, ADHDQ9, ADHDQ10, ADHDQ13 and MDQ1g on the biplot. The opposite is true for variables with a low cos2 value, which fall closer to the origin.
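To make this concrete, we can list the variables best represented on the first dimension; a quick look at the cos2 matrix computed above (the top-5 cut is arbitrary):

# top 5 variables by quality of representation (cos2) on Dim.1
head(sort(var$cos2[, 1], decreasing = TRUE), 5)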

We can also look at the contribution of each variable to our principal components. Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the dataset. Plotting the variable contributions in another correlation plot, we see the following:

corrplot(var$contrib, is.corr=FALSE)

We can also see this in the barplots below, which show the variables with the highest contribution percentages for our first two PCs:

fviz_contrib(dataset.pca, choice = "var", axes = 1, top = 15)

fviz_contrib(dataset.pca, choice = "var", axes = 2, top = 15)

Similar to our biplot visualization and our cos2 values, we can confirm that ADHDQ8, ADHDQ9, ADHDQ10, ADHDQ13 and MDQ1g, along with many other ADHD and MDQ questions, contribute most to the variability explained in our dataset. This is important to note for future analysis, where we will look more closely at the features that provide the most insight into clustering and classification.

The graph of variables below summarizes these relationships: positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides.

fviz_pca_var(dataset.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )
## Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Problem4: Gradient Boosting Method

The boosting method uses trees in a sequential pattern: each successive tree is developed using information from the previous trees to reduce the remaining error.

Boosting has three main tuning parameters: the number of trees, the shrinkage (learning rate) parameter, and the number of splits in each tree.
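These parameters enter the standard gradient boosting update, in which each new tree $h_m$ is added to the ensemble scaled by the shrinkage factor $\nu$ (the learning rate):

$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$$

A smaller $\nu$ typically requires more trees but tends to generalize better.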

We will use stochastic gradient boosting in our model; this approach resamples the observations and the columns at each iteration.

The xgbTree method in caret fits models over a grid of candidate settings and chooses the best tuning parameters.
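The call that trains gbmTune is not echoed in this report; below is a minimal sketch of how it could be fit with caret’s xgbTree method (the cross-validation settings, grid size, and seed are assumptions, not the exact call used):

# assumed setup: 5-fold cross-validation, with caret building a default xgbTree grid
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
gbmTune <- train(x = X_train, y = y_train,
                 method = "xgbTree",
                 trControl = ctrl,
                 tuneLength = 3)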

Here we can see the parameters chosen for the final model.

gbmTune$bestTune
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 29     100         2 0.3     0              0.8                1       0.5

We identify the variables with the most importance to the prediction below, with Abuse having the highest importance.

varImp(gbmTune)['importance'] %>%  kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
Overall
Abuse 100.000000
ADHDQ18 71.380958
ADHDQ17 55.879768
Age 51.509844
Alcohol 51.055537
MDQ1m 49.026194
ADHDQ12 47.783539
Opioids 45.998844
ADHDQ15 43.101288
MDQ1e 38.887030
Psychmeds 35.248431
ADHDQ8 32.076875
Race 28.912055
ADHDQ16 27.731237
MDQ1d 26.337484
ADHDQ1 22.903558
ADHDQ7 22.639072
ADHDQ4 21.321127
Cocaine 21.106270
ADHDQ13 20.337441
Courtorder 18.663410
ADHDQ6 18.401690
ADHDQ2 17.755851
MDQ1b 16.918646
ADHDQ3 16.597874
MDQ1g 16.593649
MDQ1f 15.946980
Education 15.771490
ADHDQ11 14.317792
ADHDQ9 14.025456
MDQ1c 13.881699
ADHDQ5 13.273160
SubstDx 13.159069
MDQ1L 12.898276
THC 10.914987
MDQ1k 10.896834
MDQ2 8.999137
ADHDQ10 8.720960
MDQ3 8.106795
NonsubstDx 7.216400
Sex 6.533485
MDQ1a 5.566541
DisorderlyConduct 5.386087
MDQ1i 3.449383
MDQ1h 1.389954
ADHDQ14 0.000000
MDQ1j 0.000000
HxofViolence 0.000000

Visualization of the first 3 trees used in the model.

xgb.plot.tree(model = gbmTune$finalModel,trees = 1:3)

Now let’s look at how the model performed on the test dataset.

gbm_predict <- predict(gbmTune, newdata = X_test)
gbm_conf_matrix <- confusionMatrix(gbm_predict, y_test)
print(gbm_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 27  8
##          1  3  6
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.5966, 0.8681)
##     No Information Rate : 0.6818          
##     P-Value [Acc > NIR] : 0.2113          
##                                           
##                   Kappa : 0.3632          
##                                           
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.9000          
##             Specificity : 0.4286          
##          Pos Pred Value : 0.7714          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6818          
##          Detection Rate : 0.6136          
##    Detection Prevalence : 0.7955          
##       Balanced Accuracy : 0.6643          
##                                           
##        'Positive' Class : 0               
## 

Model Summary

We record the summary of the stochastic gradient boosting model metrics in a data frame:

gbm_model <- confusionMatrix(table(gbm_predict, y_test))$byClass
gbm_accuracy <- confusionMatrix(table(gbm_predict, y_test))$overall['Accuracy']
gbm_model <- data.frame(gbm_model)
gbm_model <- rbind("Accuracy" = gbm_accuracy, gbm_model)

gbm_model %>%
 kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
gbm_model
Accuracy 0.7500000
Sensitivity 0.9000000
Specificity 0.4285714
Pos Pred Value 0.7714286
Neg Pred Value 0.6666667
Precision 0.7714286
Recall 0.9000000
F1 0.8307692
Prevalence 0.6818182
Detection Rate 0.6136364
Detection Prevalence 0.7954545
Balanced Accuracy 0.6642857

Problem5: Support Vector Machine Model

Support Vector Machines are supervised learning models used to analyze data for classification problems.

For our dataset, an SVM will help us find an optimal decision boundary for classifying suicide attempts. We will build a linear model as well as a radial model and evaluate the performance of each to determine which SVM is more appropriate for our data.
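For reference, the soft-margin SVM solves the primal problem

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i} \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0,$$

where the cost parameter $C$ (the cost argument of e1071::svm) controls the trade-off between a wide margin and misclassified training points.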

Feature Selection

We first calculate the correlation between each independent variable and the target, Suicide, to identify the most promising predictors of suicide attempts.

#Compute correlation between each variable and Suicide
target_corr <- function(x, y) cor(y, x)
Suicide_Correlation <- sapply(train, target_corr, y=train$Suicide) 
#Output Correlation with Target
Suicidecorr <- data.frame(Suicide_Correlation)
Suicidecorr %>% 
  kbl(caption = "Correlation with Suicide") %>%
  kable_minimal()
Correlation with Suicide
Suicide_Correlation
Age -0.0413520
Sex 0.1350691
Race -0.1121179
ADHDQ1 0.1633275
ADHDQ2 0.1973216
ADHDQ3 0.1414619
ADHDQ4 0.0987259
ADHDQ5 0.1696575
ADHDQ6 0.0591130
ADHDQ7 0.1313229
ADHDQ8 0.1313311
ADHDQ9 0.0967320
ADHDQ10 0.1188377
ADHDQ11 0.0961761
ADHDQ12 0.0984859
ADHDQ13 0.1693613
ADHDQ14 0.1706819
ADHDQ15 0.1266747
ADHDQ16 0.1198645
ADHDQ17 0.0398154
ADHDQ18 0.0332194
MDQ1a 0.1038367
MDQ1b 0.2479469
MDQ1c -0.0916712
MDQ1d 0.1797094
MDQ1e -0.0275834
MDQ1f 0.1876374
MDQ1g 0.2997813
MDQ1h 0.0570587
MDQ1i 0.0501235
MDQ1j 0.0485245
MDQ1k 0.1708144
MDQ1L 0.1412683
MDQ1m 0.0713625
MDQ2 0.2596162
MDQ3 0.1764281
Alcohol 0.2480052
THC -0.0759040
Cocaine -0.1126979
Opioids 0.2160137
Courtorder 0.1578478
Education -0.0873615
HxofViolence 0.1378318
DisorderlyConduct 0.0144309
Suicide 1.0000000
Abuse 0.3109506
NonsubstDx 0.0476540
SubstDx 0.1086603
Psychmeds -0.0045748

There are no strong correlations present in the data. Among the independent variables, a history of abuse and alcohol use may be indicators of suicide attempts among the respondents. Certain ADHD and MDQ responses also show a moderate correlation with suicide.

We will keep the variables with a somewhat strong positive correlation, such as Abuse, Alcohol, MDQ1g, MDQ1d and ADHDQ1. For the rest, we will use the caret package to determine which features may be significant in predicting suicide attempts.

set.seed(7)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(as.factor(Suicide)~., data=dataset, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 48)
## 
##            Overall
## Abuse      0.08951
## MDQ1g      0.07775
## Alcohol    0.05840
## MDQ1b      0.05224
## MDQ2       0.05005
## MDQ1f      0.03921
## Courtorder 0.03732
## MDQ1d      0.03565
## Sex        0.03550
## MDQ1a      0.03363
## MDQ3       0.03281
## ADHDQ1     0.03123
## Opioids    0.02842
## MDQ1L      0.02681
## ADHDQ4     0.02321
## SubstDx    0.02049
## MDQ1k      0.02041
## ADHDQ7     0.01922
## Age        0.01870
## ADHDQ3     0.01854

We had already identified some of the variables above from the correlation matrix. We will keep the top 10 important variables for our SVM model.

Linear SVM

First, let’s build the linear model with all variables (including those deemed insignificant). We will set the cost parameter to 10.

set.seed(123)
linear_svm1=svm(formula=as.factor(Suicide)~.,data=train, kernel="linear", cost=10, scale=FALSE)
print(linear_svm1)
## 
## Call:
## svm(formula = as.factor(Suicide) ~ ., data = train, kernel = "linear", 
##     cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  63
test$pred1 = predict(linear_svm1, test)

confusionMatrix(test$pred1, as.factor(test$Suicide))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 25  8
##          1  5  6
##                                          
##                Accuracy : 0.7045         
##                  95% CI : (0.548, 0.8324)
##     No Information Rate : 0.6818         
##     P-Value [Acc > NIR] : 0.4435         
##                                          
##                   Kappa : 0.2778         
##                                          
##  Mcnemar's Test P-Value : 0.5791         
##                                          
##             Sensitivity : 0.8333         
##             Specificity : 0.4286         
##          Pos Pred Value : 0.7576         
##          Neg Pred Value : 0.5455         
##              Prevalence : 0.6818         
##          Detection Rate : 0.5682         
##    Detection Prevalence : 0.7500         
##       Balanced Accuracy : 0.6310         
##                                          
##        'Positive' Class : 0              
## 

Let’s see if including only the important features improves model fit.

set.seed(123)
linear_svm2=svm(formula=as.factor(Suicide)~ Abuse+MDQ1g+Alcohol+MDQ1d+MDQ1b+MDQ2+ADHDQ1+MDQ3+Opioids+MDQ1a,data=train, kernel="linear", cost=10, scale=FALSE)
print(linear_svm2)
## 
## Call:
## svm(formula = as.factor(Suicide) ~ Abuse + MDQ1g + Alcohol + MDQ1d + 
##     MDQ1b + MDQ2 + ADHDQ1 + MDQ3 + Opioids + MDQ1a, data = train, 
##     kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  73
test$pred2 = predict(linear_svm2, test)
 
confusionMatrix(test$pred2,as.factor(test$Suicide))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26 10
##          1  4  4
##                                           
##                Accuracy : 0.6818          
##                  95% CI : (0.5242, 0.8139)
##     No Information Rate : 0.6818          
##     P-Value [Acc > NIR] : 0.5717          
##                                           
##                   Kappa : 0.172           
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.8667          
##             Specificity : 0.2857          
##          Pos Pred Value : 0.7222          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.6818          
##          Detection Rate : 0.5909          
##    Detection Prevalence : 0.8182          
##       Balanced Accuracy : 0.5762          
##                                           
##        'Positive' Class : 0               
## 

The model fit has not improved after removing the insignificant features.

Radial SVM

First, we will build a radial model with all variables.

set.seed(123)
# note: e1071::svm() takes 'cost=' rather than 'C=', so the default cost = 1 is used here (see printed output below)
radial_svm = svm(as.factor(Suicide)~., data=train, C = 13)
print(radial_svm)
## 
## Call:
## svm(formula = as.factor(Suicide) ~ ., data = train, C = 13)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  111
test$pred3 = predict(radial_svm, test)
 
confusionMatrix(test$pred3,as.factor(test$Suicide))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 30 12
##          1  0  2
##                                           
##                Accuracy : 0.7273          
##                  95% CI : (0.5721, 0.8504)
##     No Information Rate : 0.6818          
##     P-Value [Acc > NIR] : 0.319382        
##                                           
##                   Kappa : 0.1852          
##                                           
##  Mcnemar's Test P-Value : 0.001496        
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.1429          
##          Pos Pred Value : 0.7143          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6818          
##          Detection Rate : 0.6818          
##    Detection Prevalence : 0.9545          
##       Balanced Accuracy : 0.5714          
##                                           
##        'Positive' Class : 0               
## 

Next, we build another radial model with only the most significant variables.

set.seed(123)
# as above, 'C' is not an svm() argument, so the default cost = 1 applies
radial_svm2 = svm(as.factor(Suicide)~Abuse+MDQ1g+Alcohol+MDQ1d+MDQ1b+MDQ2+ADHDQ1+MDQ3+Opioids+MDQ1a, data=train, C = 13)
print(radial_svm2)
## 
## Call:
## svm(formula = as.factor(Suicide) ~ Abuse + MDQ1g + Alcohol + MDQ1d + 
##     MDQ1b + MDQ2 + ADHDQ1 + MDQ3 + Opioids + MDQ1a, data = train, 
##     C = 13)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  89
test$pred4 = predict(radial_svm2, test)
 
confusionMatrix(test$pred4,as.factor(test$Suicide))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 27 12
##          1  3  2
##                                           
##                Accuracy : 0.6591          
##                  95% CI : (0.5008, 0.7951)
##     No Information Rate : 0.6818          
##     P-Value [Acc > NIR] : 0.69143         
##                                           
##                   Kappa : 0.0517          
##                                           
##  Mcnemar's Test P-Value : 0.03887         
##                                           
##             Sensitivity : 0.9000          
##             Specificity : 0.1429          
##          Pos Pred Value : 0.6923          
##          Neg Pred Value : 0.4000          
##              Prevalence : 0.6818          
##          Detection Rate : 0.6136          
##    Detection Prevalence : 0.8864          
##       Balanced Accuracy : 0.5214          
##                                           
##        'Positive' Class : 0               
## 

The fit of the radial model has not improved after removing the insignificant variables. Among the SVM models, however, the first (full-variable) radial model achieves the best accuracy.
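Rather than fixing the cost at an arbitrary value, a natural next step would be to let e1071 search over candidate values; below is a minimal sketch using tune.svm, which cross-validates by default (the parameter ranges are assumptions for illustration, not tuned choices):

set.seed(123)
# grid search over cost and gamma for the (default) radial kernel
tuned <- tune.svm(as.factor(Suicide) ~ ., data = train,
                  gamma = 10^(-3:0), cost = 10^(-1:2))
summary(tuned)           # cross-validated error for each parameter combination
best_svm <- tuned$best.model  # the model refit with the best parameters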

Model Summary

We will store the results of the last SVM model fit above (the reduced-variable radial model) in a dataframe:

SVM_Model_Final <- confusionMatrix(test$pred4,as.factor(test$Suicide))$byClass
SVM_Model_Final <- data.frame(SVM_Model_Final)
AccuracySVM <- confusionMatrix(test$pred4, as.factor(test$Suicide))$overall['Accuracy']
SVM_Model_Final<- rbind("Accuracy" = AccuracySVM, SVM_Model_Final)
tabularview <- data.frame(SVM_Model_Final)
tabularview %>%  kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
SVM_Model_Final
Accuracy 0.6590909
Sensitivity 0.9000000
Specificity 0.1428571
Pos Pred Value 0.6923077
Neg Pred Value 0.4000000
Precision 0.6923077
Recall 0.9000000
F1 0.7826087
Prevalence 0.6818182
Detection Rate 0.6136364
Detection Prevalence 0.8863636
Balanced Accuracy 0.5214286

Conclusion

Among the SVM models, the full-variable radial model achieved the highest test accuracy (0.727), ahead of the linear model (0.705), but the stochastic gradient boosting model holds the best accuracy on the test data (0.75). This appears to be the result of its better balance of sensitivity and specificity, which in turn may reflect gradient boosting’s ability to keep both bias and variance low through its sequential ensemble. The k-means clustering model, an unsupervised machine learning method, allowed us to gain insight into our mental health dataset: with an optimal k of 2, we grouped the patients into two profiles.

Also, based on the PCA, we found that the first few principal components explained a large proportion of the variance. Further analysis identified the most important variables, in terms of representation (cos2) and contribution, in the first couple of principal components.