Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(class)
library(rpart)
library(rpart.plot)
library(naniar)
library(xgboost)
library(DiagrammeR)
library(factoextra)
library(e1071)
library(FactoMineR)
library(skimr)

Background

For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet contains the data (in the tab named “Data”), and the data dictionary is given in the second tab. You can get as creative as you want. The assignment is designed to really get you to think about how you could use different methods.

The target variable is ‘Suicide Attempt’.

Data Dictionary

ADHD Data Dictionary

Problem Statement

  1. Conduct a thorough Exploratory Data Analysis (EDA) to understand the dataset. (20 points)
  2. Use a clustering method to find clusters of patients here. Whether you choose to use k-means clustering or hierarchical clustering is up to you as long as you reason through your work. You are free to be creative in terms of which variables or some combination of those you want to use. Can you come up with creative names for the profiles you found? (40 points)
  3. Let’s explore using Principal Component Analysis on this dataset. You will note that there are different types of questions in the dataset: column: E-W: ADHD self-report; column X – AM: mood disorders questionnaire, column AN-AS: Individual Substance Misuse; etc. You could just use ONE of the sets of questionnaire, for example, you can conduct PCA on the ADHD score, or mood disorder score, etc. Please reason through your work as you decide on which sets of variables you want to use to conduct Principal Component Analysis. What did you learn from the PCA? Can you comment on which question may have a heavy bearing on the score? (40 points)
  4. Assume you are modeling whether a patient attempted suicide (column AX). This is a binary target variable. Please use Gradient Boosting to predict whether a patient attempts suicide. Please use whatever boosting approach you deem appropriate, but be sure to walk us through your steps. (50 points)
  5. Using the same target variable (suicide attempt), please use support vector machine to model this. You might want to consider reducing the number of variables or somehow use extracted information from the variables. This can be a really fun modeling task! (50 points)

Dataset

dataset <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/HW4/ADHD_data.csv')
head(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
(First six rows of the dataset, rendered as a scrollable table with 54 columns: Initial, Age, Sex, Race, ADHD Q1–Q18 and ADHD Total, MD Q1a–Q1m, MD Q2, MD Q3 and MD TOTAL, the substance variables Alcohol, THC, Cocaine, Stimulants, Sedative-hypnotics and Opioids, and the history variables Court order, Education, Hx of Violence, Disorderly Conduct, Suicide, Abuse, Non-subst Dx, Subst Dx and Psych meds.)

Descriptive Dataset Summary

summary(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
(summary() output, rendered as a scrollable table: 175 observations. Age ranges 18–69 with mean 39.5; the ADHD items are scored on roughly a 0–4 scale (ADHD Total ranges 0–72, mean 34.3); the MD items are binary (MD TOTAL ranges 0–17, mean 10.0); Education ranges 6–19. Missing values appear only in the later columns, from 4 NAs in each substance-use variable up to 118 NAs in Psych meds.)
# Show dataset summary details
#skim(dataset)

Pre-Processing

Missing Value Analysis

Based on the above descriptive data summary, quite a few variables have missing values, so we analyzed the missing values across all attributes to identify a proper imputation technique.

## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL

dataset_missing_counts <- dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`))

dataset_missing_counts  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature NA_Count NA_Percentage
Psych meds. 118 67.428571
Subst Dx 23 13.142857
Non-subst Dx 22 12.571429
Abuse 14 8.000000
Suicide 13 7.428571
Hx of Violence 11 6.285714
Disorderly Conduct 11 6.285714
Education 9 5.142857
Court order 5 2.857143
Alcohol 4 2.285714
THC 4 2.285714
Cocaine 4 2.285714
Stimulants 4 2.285714
Sedative-hypnotics 4 2.285714
Opioids 4 2.285714
ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) + 
  geom_bar(stat = 'identity', fill = 'steelblue') +
  geom_label(aes(label = NA_Count)) +
  labs(title = 'Missing Counts') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())

# Use the naniar package to plot missing value patterns
gg_miss_upset(dataset)

Data Imputation

Based on the above missing value analysis, we perform data imputation using the mice package with the random forest ('rf') method. Before that, we remove the Initial column, since an ID is not needed for imputation; we also remove the ADHD Total and MD TOTAL subtotal columns to avoid collinearity with their constituent items.

# removing Initial, ADHD Total and MD Total columns
dataset <- dataset %>%
  select(-c('Initial','ADHD Total','MD TOTAL'))

# cleaning up the column names for the imputation function
colNamesNoSpace <- colnames(dataset) %>%
  str_remove_all(' |-|\\.')

colnames(dataset) <- colNamesNoSpace
#imputation using the random forest method ('rf')
init <- mice(dataset, maxit = 0, printFlag = FALSE)  # dry run to obtain the default predictor matrix
predM <- init$predictorMatrix
set.seed(123)
# printFlag = FALSE suppresses the per-iteration log (mice() has no 'silent' argument)
imputed <- mice(dataset, method = 'rf', predictorMatrix = predM, m = 5, printFlag = FALSE)
# complete the imputation and show summary of the imputed data
dataset <- complete(imputed)
summary(dataset) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
(summary() of the imputed dataset, rendered as a scrollable table over the 51 remaining columns: no NAs remain, and the variable distributions are essentially unchanged from the pre-imputation summary, e.g. mean Age is still 39.47 and mean Education is 11.91.)

We also checked for the presence of degenerate (near-zero-variance) variables and removed them from the dataset.

# capturing the degenerate variables
degenCols <- nearZeroVar(dataset)

# identifying them 
colnames(dataset[,degenCols])
## [1] "Stimulants"        "Sedativehypnotics"
# removing from the dataset
dataset <- dataset[,-degenCols]

Exploratory Data Analysis

# make dataset long to place distribution in a facetwrap
vars <- dataset %>%
  gather(key = 'predictor_variable', value = 'value', -Suicide) %>%
  mutate(Suicide = ifelse(Suicide==1,'Y','N'))

# Distribution of ADHD variables
vars %>%
  filter(str_detect(predictor_variable,'ADHD')) %>%
  ggplot() +
  geom_histogram(aes(x = value, y = ..density.., fill = Suicide), bins = 15) +
  labs(title = 'Distributions of ADHD Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(. ~predictor_variable, scales = 'free', ncol = 3)

# Distribution of MDQ variables
vars %>%
  filter(str_detect(predictor_variable,'MD')) %>%
  ggplot() +
  geom_histogram(aes(x = value, y = ..density.., fill = Suicide), bins = 15) +
  labs(title = 'Distributions of MDQ Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(. ~predictor_variable, scales = 'free', ncol = 3)

# Distribution of other variables
vars %>%
  filter(!str_detect(predictor_variable,'MD|ADH')) %>%
  ggplot() +
  geom_histogram(aes(x = value, y = ..density.., fill = Suicide), bins = 15) +
  labs(title = 'Distributions of Other Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(. ~predictor_variable, scales = 'free', ncol = 3)

Observations

  • The ADHD item distributions illustrate that responses of 0–3 are mostly associated with a Suicide value of N; the inverse appears to hold at the high end, where responses of 4–5 are more often associated with Y.

  • The MDQ distributions demonstrate that a response of 0 is more likely to correspond to a Suicide value of N; however, the opposite does not seem to hold.

  • The distributions of the remaining variables suggest that Abuse and Alcohol could be strong indicators of suicide attempts.

Correlation Plot: Multicollinearity Check

corrMatrix <- round(cor(dataset),4)
corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

From the plot above, we conclude that no variable pairs show high correlation, so multicollinearity does not appear to be a concern.

Splitting Data: Train/Test

We are going to do a 75%/25% split for training and test purposes.

set.seed(123)  # fix the seed so the split is reproducible
sample = sample.split(dataset$Suicide, SplitRatio = 0.75)
train = subset(dataset, sample == TRUE)
test = subset(dataset, sample == FALSE)

#head(train)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Model Building

y_train <- as.factor(train$Suicide)
y_test <- as.factor(test$Suicide)

X_train <- train %>% select(-'Suicide')
X_test <- test %>% select(-'Suicide')

Problem2: Clustering Method

K-means clustering is an example of an unsupervised machine learning algorithm. A cluster is a collection of data points aggregated together because of certain similarities. The “k” in k-means is the number of centroids, each an imaginary or real location representing the center of a cluster, and the “means” refers to averaging the data points, i.e. finding each centroid.
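Formally, k-means chooses cluster assignments that minimize the total within-cluster sum of squares (a standard formulation, stated here for reference):

$$\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $\mu_j$ is the centroid (mean) of cluster $C_j$.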

The first step is to standardize the data to have a mean of zero and a standard deviation of one. This matters because the k-means algorithm uses a distance-based measure of similarity between data points, and the features in a dataset usually have different units of measurement.

#select numeric values only (note that the clustering below is performed on the test split, X_test)
nums <- X_test %>% dplyr::select(where(is.numeric))
#center, scale, and Box-Cox transform the data for the k-means algorithm
dftrans <- preProcess(nums, method = c("center", "scale", "BoxCox"))
dftrans <- predict(dftrans, nums)

To determine the number of clusters for our algorithm, we will use the average silhouette method, which measures how well each object lies within its assigned cluster; the higher the average silhouette width, the better the clustering. The function fviz_nbclust allows us to do this with ease, as shown below.
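For reference, the silhouette width of an observation $i$ is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the average distance from $i$ to the other members of its own cluster and $b(i)$ is the average distance from $i$ to the members of the nearest other cluster; values close to 1 indicate a well-placed observation.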

library(factoextra)
#determine # of clusters
fviz_nbclust(dftrans, kmeans, method = "silhouette", k.max = 10) +
  labs(subtitle = "Silhouette method") # add subtitle

Another popular method is the Elbow method, which plots the total within-cluster sum of squares (WSS) against the number of clusters and picks the value of k at the “elbow”, where adding further clusters stops reducing WSS appreciably.

fviz_nbclust(dftrans, kmeans, method = "wss") +
  labs(subtitle = "Elbow method") # add subtitle

Both methods for choosing ‘k’ suggest that 2 clusters is the optimal number. We can visualize the resulting clusters as shown below.

final <- kmeans(dftrans, 2, nstart =25)
print(final)
## K-means clustering with 2 clusters of sizes 25, 19
## 
## Cluster means:
##           Age        Sex       Race     ADHDQ1     ADHDQ2     ADHDQ3     ADHDQ4
## 1 -0.03780229  0.2585521 -0.1736544  0.6385017  0.3843586  0.4393394  0.5424152
## 2  0.04973985 -0.3402001  0.2284926 -0.8401339 -0.5057350 -0.5780781 -0.7137043
##       ADHDQ5     ADHDQ6     ADHDQ7     ADHDQ8     ADHDQ9    ADHDQ10    ADHDQ11
## 1  0.3112391  0.4136629  0.4312272  0.5065839  0.5292581  0.4813072  0.4100311
## 2 -0.4095251 -0.5442933 -0.5674042 -0.6665578 -0.6963922 -0.6332990 -0.5395146
##      ADHDQ12    ADHDQ13    ADHDQ14    ADHDQ15    ADHDQ16    ADHDQ17    ADHDQ18
## 1  0.3326787  0.3997419  0.4782983  0.6970718  0.3248803  0.4230100  0.4833985
## 2 -0.4377351 -0.5259761 -0.6293399 -0.9171997 -0.4274741 -0.5565921 -0.6360507
##        MDQ1a      MDQ1b      MDQ1c      MDQ1d      MDQ1e      MDQ1f      MDQ1g
## 1  0.4012353  0.4154862  0.2971580  0.4012353  0.4953176  0.5535052  0.7032124
## 2 -0.5279412 -0.5466923 -0.3909974 -0.5279412 -0.6517336 -0.7282963 -0.9252795
##        MDQ1h      MDQ1i     MDQ1j      MDQ1k      MDQ1L      MDQ1m       MDQ2
## 1  0.3029965  0.2595573  0.352529  0.2887801  0.5053652  0.3220678  0.6401750
## 2 -0.3986795 -0.3415228 -0.463854 -0.3799738 -0.6649542 -0.4237735 -0.8423355
##         MDQ3      Alcohol        THC      Cocaine     Opioids Courtorder
## 1  0.3454007 -0.007060812  0.1408134 -0.008730403  0.09413079  0.1000364
## 2 -0.4544747  0.009290543 -0.1852809  0.011487373 -0.12385631 -0.1316268
##     Education HxofViolence DisorderlyConduct      Abuse NonsubstDx    SubstDx
## 1  0.05086440   0.03002316       -0.08875668  0.2150859  0.2350936  0.0445040
## 2 -0.06692684  -0.03950416        0.11678511 -0.2830078 -0.3093337 -0.0585579
##    Psychmeds
## 1  0.1108598
## 2 -0.1458681
## 
## Clustering vector:
##   2  17  22  24  37  38  40  50  56  60  64  66  67  68  75  80  85  86  92  97 
##   1   1   2   1   1   1   1   2   2   1   1   2   1   1   2   1   2   2   1   1 
##  98 101 103 105 108 109 114 116 121 122 128 131 132 133 143 148 151 152 156 163 
##   1   1   1   2   1   1   2   2   2   1   1   2   1   2   2   1   1   2   1   2 
## 164 165 166 172 
##   2   2   1   2 
## 
## Within cluster sum of squares by cluster:
## [1] 997.9955 655.8164
##  (between_SS / total_SS =  19.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
fviz_cluster(final, data = dftrans)

library(cluster)
#compute pairwise Euclidean distances between observations
d <- dist(dftrans, method = "euclidean")
#run k-means on the distance matrix and visualize with clusplot
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

Model Summary

We can record the summary of the kmeans cluster model metrics below:

# reuse the k-means fit from above (same data, k, and nstart) to profile the clusters
predict_km <- final
dftrans %>%
  mutate(Cluster = predict_km$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean") %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
(Cluster profile table, rendered as a scrollable kable of mean standardized feature values by cluster: Cluster 1 sits above the overall average on essentially every ADHD and MDQ item as well as on Abuse, while Cluster 2 sits below average on the same items, suggesting roughly a “higher-symptom” and a “lower-symptom” patient profile.)

Problem3: Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning. It is also a useful technique for exploratory data analysis, allowing us to better visualize the variation present in a dataset with many variables. For our particular use case, many of the questionnaire variables fall on Likert scales, which, when prepared for analysis, are often expanded into dummy variables; this creates many additional features and makes analysis more difficult due to the increased number of dimensions. Therefore, using PCA to reduce the number of dimensions of our entire dataset and to measure the amount of variance explained is beneficial. To do this, we use the prcomp() function:

dataset.pca <- prcomp(dataset, center = TRUE, scale = TRUE)

summary(dataset.pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     3.4879 2.15325 1.51854 1.37071 1.32056 1.30841 1.2246
## Proportion of Variance 0.2483 0.09462 0.04706 0.03834 0.03559 0.03494 0.0306
## Cumulative Proportion  0.2483 0.34289 0.38995 0.42830 0.46389 0.49883 0.5294
##                            PC8     PC9    PC10    PC11    PC12    PC13   PC14
## Standard deviation     1.19453 1.14053 1.12488 1.09600 1.03186 1.01797 1.0023
## Proportion of Variance 0.02912 0.02655 0.02582 0.02451 0.02173 0.02115 0.0205
## Cumulative Proportion  0.55855 0.58510 0.61092 0.63544 0.65717 0.67831 0.6988
##                           PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.97020 0.9469 0.91595 0.87918 0.84215 0.82789 0.82016
## Proportion of Variance 0.01921 0.0183 0.01712 0.01577 0.01447 0.01399 0.01373
## Cumulative Proportion  0.71803 0.7363 0.75345 0.76922 0.78370 0.79768 0.81141
##                           PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     0.77736 0.76284 0.74558 0.73182 0.70326 0.69314 0.6825
## Proportion of Variance 0.01233 0.01188 0.01134 0.01093 0.01009 0.00981 0.0095
## Cumulative Proportion  0.82374 0.83562 0.84696 0.85789 0.86799 0.87779 0.8873
##                           PC29    PC30    PC31   PC32   PC33    PC34    PC35
## Standard deviation     0.67280 0.65795 0.62679 0.6182 0.5979 0.57780 0.56968
## Proportion of Variance 0.00924 0.00883 0.00802 0.0078 0.0073 0.00681 0.00662
## Cumulative Proportion  0.89654 0.90537 0.91339 0.9212 0.9285 0.93530 0.94192
##                           PC36    PC37    PC38    PC39    PC40   PC41   PC42
## Standard deviation     0.55339 0.52917 0.51864 0.49580 0.47679 0.4748 0.4644
## Proportion of Variance 0.00625 0.00571 0.00549 0.00502 0.00464 0.0046 0.0044
## Cumulative Proportion  0.94817 0.95388 0.95937 0.96439 0.96903 0.9736 0.9780
##                           PC43    PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     0.45002 0.43212 0.41491 0.39393 0.37519 0.35038 0.31036
## Proportion of Variance 0.00413 0.00381 0.00351 0.00317 0.00287 0.00251 0.00197
## Cumulative Proportion  0.98217 0.98598 0.98949 0.99266 0.99553 0.99803 1.00000

The scale = 0 argument to biplot ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.

biplot(dataset.pca, scale = 0, cex=0.5)

From the biplot above, it’s difficult to tell much given the very large number of features. However, from our PCA analysis, we can also take a look at the eigenvalues that were generated by using a scree plot. The plot below shows the percentage of variance explained by each principal component.

fviz_eig(dataset.pca)

#compute standard deviation of each principal component
std_dev <- dataset.pca$sdev

#compute variance
pr_var <- std_dev^2


prop_varex <- pr_var/sum(pr_var)

round(prop_varex[1:10], 2)
##  [1] 0.25 0.09 0.05 0.04 0.04 0.03 0.03 0.03 0.03 0.03

The first principal component in our example therefore explains 25% of the variability, and the second explains 9%; together, the first ten principal components explain about 61% of the variability. We proceed to plot the PVE and the cumulative PVE to produce scree plots, as we did earlier.
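For clarity, the proportion of variance explained (PVE) by the $k$-th component is its eigenvalue as a share of the total variance,

$$\text{PVE}_k = \frac{\lambda_k}{\sum_{j} \lambda_j}, \qquad \lambda_k = \text{sdev}_k^2,$$

which is exactly what pr_var/sum(pr_var) computes above from the component standard deviations.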

#scree plot
plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

As we can see above in our plots of the PVE and the cumulative PVE, the first few principal components account for a much larger proportion of the variance explained than the remaining components (49 in total for this dataset). Additionally, the proportion of variance explained by the first principal component, at 25%, is roughly three times that of the second.

Although running PCA on the entire dataset is helpful in some ways, certain findings from a biplot are affected by the number of variables in the overall dataset. Additionally, we are seeing from our PCA analysis that the first few principal components explain a much larger proportion of the variability than later principal components. Because of this, we thought it would be interesting to dive a bit deeper into which variables seem to hold the most importance in determining the first few dimensions. To do this, we’ll use the FactoMineR and factoextra packages to determine the eigenvalues of each dimension, with particular interest in dimensions 1 and 2, since those are what we’ve plotted above and hold the highest proportion of variance explained.

eig.val <- get_eigenvalue(dataset.pca)
eig.val %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
eigenvalue variance.percent cumulative.variance.percent
Dim.1 12.1653335 24.8272112 24.82721
Dim.2 4.6364822 9.4622086 34.28942
Dim.3 2.3059660 4.7060531 38.99547
Dim.4 1.8788556 3.8343992 42.82987
Dim.5 1.7438785 3.5589358 46.38881
Dim.6 1.7119463 3.4937679 49.88258
Dim.7 1.4996351 3.0604798 52.94306
Dim.8 1.4269113 2.9120639 55.85512
Dim.9 1.3008135 2.6547214 58.50984
Dim.10 1.2653565 2.5823602 61.09220
Dim.11 1.2012221 2.4514737 63.54367
Dim.12 1.0647328 2.1729240 65.71660
Dim.13 1.0362688 2.1148343 67.83143
Dim.14 1.0046132 2.0502310 69.88166
Dim.15 0.9412784 1.9209763 71.80264
Dim.16 0.8966428 1.8298832 73.63252
Dim.17 0.8389721 1.7121880 75.34471
Dim.18 0.7729593 1.5774679 76.92218
Dim.19 0.7092249 1.4473977 78.36958
Dim.20 0.6854086 1.3987931 79.76837
Dim.21 0.6726696 1.3727952 81.14117
Dim.22 0.6042928 1.2332506 82.37442
Dim.23 0.5819237 1.1875994 83.56202
Dim.24 0.5558916 1.1344727 84.69649
Dim.25 0.5355566 1.0929727 85.78946
Dim.26 0.4945789 1.0093447 86.79881
Dim.27 0.4804497 0.9805095 87.77932
Dim.28 0.4657421 0.9504941 88.72981
Dim.29 0.4526588 0.9237935 89.65360
Dim.30 0.4328935 0.8834562 90.53706
Dim.31 0.3928620 0.8017593 91.33882
Dim.32 0.3821097 0.7798158 92.11863
Dim.33 0.3575121 0.7296164 92.84825
Dim.34 0.3338529 0.6813324 93.52958
Dim.35 0.3245365 0.6623194 94.19190
Dim.36 0.3062365 0.6249724 94.81687
Dim.37 0.2800222 0.5714738 95.38835
Dim.38 0.2689839 0.5489466 95.93730
Dim.39 0.2458200 0.5016734 96.43897
Dim.40 0.2273312 0.4639412 96.90291
Dim.41 0.2254670 0.4601368 97.36305
Dim.42 0.2156907 0.4401851 97.80323
Dim.43 0.2025136 0.4132931 98.21652
Dim.44 0.1867237 0.3810689 98.59759
Dim.45 0.1721464 0.3513193 98.94891
Dim.46 0.1551781 0.3166900 99.26560
Dim.47 0.1407682 0.2872820 99.55288
Dim.48 0.1227652 0.2505412 99.80343
Dim.49 0.0963212 0.1965739 100.00000
res.pca <- PCA(dataset, scale.unit = TRUE, ncp = 5, graph = FALSE)
var <- get_pca_var(res.pca)

An eigenvalue greater than 1 indicates that a principal component accounts for more variance than a single one of the original variables in the standardized data, which is why eigenvalue = 1 is often used as a cutoff (the Kaiser criterion) for deciding which PCs to retain. In our table the eigenvalues drop below 1 after Dim.14, so dimensions past that point provide little additional insight into our data. The scree plot above cuts this even further, showing that components beyond the first ten individually account for only a small proportion of the variance. Going forward, therefore, we will focus on the first few dimensions to see if we can garner any insights.
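As a quick check of this cutoff, we can count the components whose eigenvalues exceed 1 directly from the eigenvalue table; a minimal sketch using the eig.val object computed above:

# number of principal components retained under the Kaiser criterion (eigenvalue > 1)
n_kaiser <- sum(eig.val[, "eigenvalue"] > 1)
n_kaiser
# cumulative variance explained by the retained components
eig.val[n_kaiser, "cumulative.variance.percent"]

Per the table above, this retains 14 components, which together explain roughly 70% of the variance.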

Since our biplot above was very crowded and difficult to interpret, we decided to look a bit deeper at the quality of representation of the variables, measured by the squared cosine (cos2). It reflects a variable’s position relative to the circumference of the correlation circle (not pictured in our biplot above, but visually indicated by the length of each vector from the origin). Plotting these cos2 values across the first five PCs with corrplot, we can examine the following:

corrplot(var$cos2, is.corr=FALSE)

Typically, a high cos2 value indicates a good representation of the variable on the principal component; such a variable is positioned close to the circumference of the correlation circle, far from the origin, as we can see for variables such as ADHDQ8, ADHDQ9, ADHDQ10, ADHDQ13 and MDQ1g on the biplot. The opposite is true for variables with a low cos2 value, which fall closer to the origin.
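To make this concrete, we can list the variables best represented on the first dimension; a quick look at the cos2 matrix computed above (the top-5 cut is arbitrary):

# top 5 variables by quality of representation (cos2) on Dim.1
head(sort(var$cos2[, 1], decreasing = TRUE), 5)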

We can also look at the contribution of each variable to our principal components. Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the dataset. Plotting the variable contributions in another correlation plot, we see the following:

corrplot(var$contrib, is.corr=FALSE)

We can also see this in the barplots below, which show the variables with the highest contribution percentages for our first two PCs:

fviz_contrib(dataset.pca, choice = "var", axes = 1, top = 15)

fviz_contrib(dataset.pca, choice = "var", axes = 2, top = 15)

Similar to our biplot visualization and our cos2 values, we can confirm that ADHDQ8, ADHDQ9, ADHDQ10, ADHDQ13 and MDQ1g, along with many other ADHD and MDQ questions, contribute most to the variability explained in our dataset. This is important to note for future analysis, where we will look more closely at the features that provide the most insight into clustering and classification.

The graph of variables below summarizes these relationships: positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides.

fviz_pca_var(dataset.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )
## Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Problem4: Gradient Boosting Method

The boosting method uses trees in a sequential pattern: each successive tree is developed using information from the previous trees to reduce the remaining error.

Boosting has three main tuning parameters: the number of trees, the shrinkage (learning rate) parameter, and the number of splits in each tree.
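These parameters enter the standard gradient boosting update, in which each new tree $h_m$ is added to the ensemble scaled by the shrinkage factor $\nu$ (the learning rate):

$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$$

A smaller $\nu$ typically requires more trees but tends to generalize better.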

We will use stochastic gradient boosting in our model; this approach resamples the observations and the columns at each iteration.

The xgbTree method in caret fits models over a grid of candidate settings and chooses the best tuning parameters.
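The call that trains gbmTune is not echoed in this report; below is a minimal sketch of how it could be fit with caret’s xgbTree method (the cross-validation settings, grid size, and seed are assumptions, not the exact call used):

# assumed setup: 5-fold cross-validation, with caret building a default xgbTree grid
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
gbmTune <- train(x = X_train, y = y_train,
                 method = "xgbTree",
                 trControl = ctrl,
                 tuneLength = 3)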

Here we can see the parameters chosen for the final model.

gbmTune$bestTune
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 29     100         2 0.3     0              0.8                1       0.5

We identify the variables with the most importance to the prediction below, with Abuse having the highest importance.

varImp(gbmTune)['importance'] %>%  kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
Overall
Abuse 100.000000
ADHDQ18 71.380958
ADHDQ17 55.879768
Age 51.509844
Alcohol 51.055537
MDQ1m 49.026194
ADHDQ12 47.783539
Opioids 45.998844
ADHDQ15 43.101288
MDQ1e 38.887030
Psychmeds 35.248431
ADHDQ8 32.076875
Race 28.912055
ADHDQ16 27.731237
MDQ1d 26.337484
ADHDQ1 22.903558
ADHDQ7 22.639072
ADHDQ4 21.321127
Cocaine 21.106270
ADHDQ13 20.337441
Courtorder 18.663410
ADHDQ6 18.401690
ADHDQ2 17.755851
MDQ1b 16.918646
ADHDQ3 16.597874
MDQ1g 16.593649
MDQ1f 15.946980
Education 15.771490
ADHDQ11 14.317792
ADHDQ9 14.025456
MDQ1c 13.881699
ADHDQ5 13.273160
SubstDx 13.159069
MDQ1L 12.898276
THC 10.914987
MDQ1k 10.896834
MDQ2 8.999137
ADHDQ10 8.720960
MDQ3 8.106795
NonsubstDx 7.216400
Sex 6.533485
MDQ1a 5.566541
DisorderlyConduct 5.386087
MDQ1i 3.449383
MDQ1h 1.389954
ADHDQ14 0.000000
MDQ1j 0.000000
HxofViolence 0.000000

Visualization of the first 3 trees used in the model.

xgb.plot.tree(model = gbmTune$finalModel,trees = 1:3)

Now let’s look at how the model performed on the test dataset.

gbm_predict <- predict(gbmTune, newdata = X_test)
gbm_conf_matrix <- confusionMatrix(gbm_predict, y_test)
print(gbm_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 27  8
##          1  3  6
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.5966, 0.8681)
##     No Information Rate : 0.6818          
##     P-Value [Acc > NIR] : 0.2113          
##                                           
##                   Kappa : 0.3632          
##                                           
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.9000          
##             Specificity : 0.4286          
##          Pos Pred Value : 0.7714          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6818          
##          Detection Rate : 0.6136          
##    Detection Prevalence : 0.7955          
##       Balanced Accuracy : 0.6643          
##                                           
##        'Positive' Class : 0               
## 

Model Summary

We record the summary of the stochastic gradient boosting model metrics in a data frame:

gbm_model <- confusionMatrix(table(gbm_predict, y_test))$byClass
gbm_accuracy <- confusionMatrix(table(gbm_predict, y_test))$overall['Accuracy']
gbm_model <- data.frame(gbm_model)
gbm_model <- rbind("Accuracy" = gbm_accuracy, gbm_model)

gbm_model %>%
 kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
gbm_model
Accuracy 0.7500000
Sensitivity 0.9000000
Specificity 0.4285714
Pos Pred Value 0.7714286
Neg Pred Value 0.6666667
Precision 0.7714286
Recall 0.9000000
F1 0.8307692
Prevalence 0.6818182
Detection Rate 0.6136364
Detection Prevalence 0.7954545
Balanced Accuracy 0.6642857

Problem5: Support Vector Machine Model

Support Vector Machines are supervised learning models used to analyze data for classification problems.

For our dataset, an SVM will help us find an optimal decision boundary for classifying suicide attempts. We will build a linear model as well as a radial model and evaluate the performance of each to determine which SVM is more appropriate for our data.
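For reference, the soft-margin SVM solves the primal problem

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i} \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0,$$

where the cost parameter $C$ (the cost argument of e1071::svm) controls the trade-off between a wide margin and misclassified training points.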

Feature Selection

We first calculate the correlation between each independent variable and the target, Suicide, to identify the most promising predictors of suicide attempts.

#Compute correlation between each variable and Suicide
target_corr <- function(x, y) cor(y, x)
Suicide_Correlation <- sapply(train, target_corr, y=train$Suicide) 
#Output Correlation with Target
Suicidecorr <- data.frame(Suicide_Correlation)
Suicidecorr %>% 
  kbl(caption = "Correlation with Suicide") %>%
  kable_minimal()
Correlation with Suicide
Suicide_Correlation
Age -0.0413520
Sex 0.1350691
Race -0.1121179
ADHDQ1 0.1633275
ADHDQ2 0.1973216
ADHDQ3 0.1414619
ADHDQ4 0.0987259
ADHDQ5 0.1696575
ADHDQ6 0.0591130
ADHDQ7 0.1313229
ADHDQ8 0.1313311
ADHDQ9 0.0967320
ADHDQ10 0.1188377
ADHDQ11 0.0961761
ADHDQ12 0.0984859
ADHDQ13 0.1693613
ADHDQ14 0.1706819
ADHDQ15 0.1266747
ADHDQ16 0.1198645
ADHDQ17 0.0398154
ADHDQ18 0.0332194
MDQ1a 0.1038367
MDQ1b 0.2479469
MDQ1c -0.0916712
MDQ1d 0.1797094
MDQ1e -0.0275834
MDQ1f 0.1876374
MDQ1g 0.2997813
MDQ1h 0.0570587
MDQ1i 0.0501235
MDQ1j 0.0485245
MDQ1k 0.1708144
MDQ1L 0.1412683
MDQ1m 0.0713625
MDQ2 0.2596162
MDQ3 0.1764281
Alcohol 0.2480052
THC -0.0759040
Cocaine -0.1126979
Opioids 0.2160137
Courtorder 0.1578478
Education -0.0873615
HxofViolence 0.1378318
DisorderlyConduct 0.0144309
Suicide 1.0000000
Abuse 0.3109506
NonsubstDx 0.0476540
SubstDx 0.1086603
Psychmeds -0.0045748

There are no strong correlations present in the data. Among the independent variables, a history of abuse and alcohol use may be indicators of suicide attempts among the respondents. Certain ADHD and MDQ responses also show a moderate correlation with suicide.

We will keep the variables with a somewhat strong positive correlation, such as Abuse, Alcohol, MDQ1g, MDQ1d and ADHDQ1. For the rest, we will use the caret package to determine which features may be significant in predicting suicide attempts.

set.seed(7)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(as.factor(Suicide)~., data=dataset, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 48)
## 
##            Overall
## Abuse      0.08951
## MDQ1g      0.07775
## Alcohol    0.05840
## MDQ1b      0.05224
## MDQ2       0.05005
## MDQ1f      0.03921
## Courtorder 0.03732
## MDQ1d      0.03565
## Sex        0.03550
## MDQ1a      0.03363
## MDQ3       0.03281
## ADHDQ1     0.03123
## Opioids    0.02842
## MDQ1L      0.02681
## ADHDQ4     0.02321
## SubstDx    0.02049
## MDQ1k      0.02041
## ADHDQ7     0.01922
## Age        0.01870
## ADHDQ3     0.01854

We had already identified some of the variables above from the correlation matrix. We will keep the top 10 important variables for our SVM model.

Linear SVM

First, let’s build the linear model with all variables (including those deemed insignificant). We will set the cost parameter to 10.

set.seed(123)
linear_svm1=svm(formula=as.factor(Suicide)~.,data=train, kernel="linear", cost=10, scale=FALSE)
print(linear_svm1)
## 
## Call:
## svm(formula = as.factor(Suicide) ~ ., data = train, kernel = "linear", 
##     cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  63
test$pred1 = predict(linear_svm1, test)

confusionMatrix(test$pred1, as.factor(test$Suicide))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 25  8
##          1  5  6
##                                          
##                Accuracy : 0.7045         
##                  95% CI : (0.548, 0.8324)
##     No Information Rate : 0.6818         
##     P-Value [Acc > NIR] : 0.4435         
##                                          
##                   Kappa : 0.2778         
##                                          
##  Mcnemar's Test P-Value : 0.5791         
##                                          
##             Sensitivity : 0.8333         
##             Specificity : 0.4286         
##          Pos Pred Value : 0.7576         
##          Neg Pred Value : 0.5455         
##              Prevalence : 0.6818         
##          Detection Rate : 0.5682         
##    Detection Prevalence : 0.7500         
##       Balanced Accuracy : 0.6310         
##                                          
##        'Positive' Class : 0              
## 

Let’s see if including only the important features improves model fit.

set.seed(123)
linear_svm2=svm(formula=as.factor(Suicide)~ Abuse+MDQ1g+Alcohol+MDQ1d+MDQ1b+MDQ2+ADHDQ1+MDQ3+Opioids+MDQ1a,data=train, kernel="linear", cost=10, scale=FALSE)
print(linear_svm2)
## 
## Call:
## svm(formula = as.factor(Suicide) ~ Abuse + MDQ1g + Alcohol + MDQ1d + 
##     MDQ1b + MDQ2 + ADHDQ1 + MDQ3 + Opioids + MDQ1a, data = train, 
##     kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  73
test$pred2 = predict(linear_svm2, test)
 
confusionMatrix(test$pred2,as.factor(test$Suicide))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26 10
##          1  4  4
##                                           
##                Accuracy : 0.6818          
##                  95% CI : (0.5242, 0.8139)
##     No Information Rate : 0.6818          
##     P-Value [Acc > NIR] : 0.5717          
##                                           
##                   Kappa : 0.172           
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.8667          
##             Specificity : 0.2857          
##          Pos Pred Value : 0.7222          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.6818          
##          Detection Rate : 0.5909          
##    Detection Prevalence : 0.8182          
##       Balanced Accuracy : 0.5762          
##                                           
##        'Positive' Class : 0               
## 

The model fit has not improved after removing the insignificant features.

Radial SVM

First, we will build a radial model with all variables.

set.seed(123)
# note: e1071::svm() takes 'cost=' rather than 'C=', so the default cost = 1 is used here (see printed output below)
radial_svm = svm(as.factor(Suicide)~., data=train, C = 13)
print(radial_svm)
## 
## Call:
## svm(formula = as.factor(Suicide) ~ ., data = train, C = 13)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  111
test$pred3 = predict(radial_svm, test)
 
confusionMatrix(test$pred3,as.factor(test$Suicide))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 30 12
##          1  0  2
##                                           
##                Accuracy : 0.7273          
##                  95% CI : (0.5721, 0.8504)
##     No Information Rate : 0.6818          
##     P-Value [Acc > NIR] : 0.319382        
##                                           
##                   Kappa : 0.1852          
##                                           
##  Mcnemar's Test P-Value : 0.001496        
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.1429          
##          Pos Pred Value : 0.7143          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6818          
##          Detection Rate : 0.6818          
##    Detection Prevalence : 0.9545          
##       Balanced Accuracy : 0.5714          
##                                           
##        'Positive' Class : 0               
## 

Next, we build another radial model with only the most significant variables.

set.seed(123)
# as above, 'C' is not an svm() argument, so the default cost = 1 applies
radial_svm2 = svm(as.factor(Suicide)~Abuse+MDQ1g+Alcohol+MDQ1d+MDQ1b+MDQ2+ADHDQ1+MDQ3+Opioids+MDQ1a, data=train, C = 13)
print(radial_svm2)
## 
## Call:
## svm(formula = as.factor(Suicide) ~ Abuse + MDQ1g + Alcohol + MDQ1d + 
##     MDQ1b + MDQ2 + ADHDQ1 + MDQ3 + Opioids + MDQ1a, data = train, 
##     C = 13)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  89
test$pred4 = predict(radial_svm2, test)
 
confusionMatrix(test$pred4,as.factor(test$Suicide))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 27 12
##          1  3  2
##                                           
##                Accuracy : 0.6591          
##                  95% CI : (0.5008, 0.7951)
##     No Information Rate : 0.6818          
##     P-Value [Acc > NIR] : 0.69143         
##                                           
##                   Kappa : 0.0517          
##                                           
##  Mcnemar's Test P-Value : 0.03887         
##                                           
##             Sensitivity : 0.9000          
##             Specificity : 0.1429          
##          Pos Pred Value : 0.6923          
##          Neg Pred Value : 0.4000          
##              Prevalence : 0.6818          
##          Detection Rate : 0.6136          
##    Detection Prevalence : 0.8864          
##       Balanced Accuracy : 0.5214          
##                                           
##        'Positive' Class : 0               
## 

The fit of the radial model has not improved after removing the insignificant variables. Among the SVM models, however, the first (full-variable) radial model achieves the best accuracy.
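Rather than fixing the cost at an arbitrary value, a natural next step would be to let e1071 search over candidate values; below is a minimal sketch using tune.svm, which cross-validates by default (the parameter ranges are assumptions for illustration, not tuned choices):

set.seed(123)
# grid search over cost and gamma for the (default) radial kernel
tuned <- tune.svm(as.factor(Suicide) ~ ., data = train,
                  gamma = 10^(-3:0), cost = 10^(-1:2))
summary(tuned)           # cross-validated error for each parameter combination
best_svm <- tuned$best.model  # the model refit with the best parameters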

Model Summary

We will store the results of the last SVM model fit above (the reduced-variable radial model) in a dataframe:

SVM_Model_Final <- confusionMatrix(test$pred4,as.factor(test$Suicide))$byClass
SVM_Model_Final <- data.frame(SVM_Model_Final)
AccuracySVM <- confusionMatrix(test$pred4, as.factor(test$Suicide))$overall['Accuracy']
SVM_Model_Final<- rbind("Accuracy" = AccuracySVM, SVM_Model_Final)
tabularview <- data.frame(SVM_Model_Final)
tabularview %>%  kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
SVM_Model_Final
Accuracy 0.6590909
Sensitivity 0.9000000
Specificity 0.1428571
Pos Pred Value 0.6923077
Neg Pred Value 0.4000000
Precision 0.6923077
Recall 0.9000000
F1 0.7826087
Prevalence 0.6818182
Detection Rate 0.6136364
Detection Prevalence 0.8863636
Balanced Accuracy 0.5214286

Conclusion

Among the SVM models, the full-variable radial model achieved the highest test accuracy (0.727), ahead of the linear model (0.705), but the stochastic gradient boosting model holds the best accuracy on the test data (0.75). This appears to be the result of its better balance of sensitivity and specificity, which in turn may reflect gradient boosting’s ability to keep both bias and variance low through its sequential ensemble. The k-means clustering model, an unsupervised machine learning method, allowed us to gain insight into our mental health dataset: with an optimal k of 2, we grouped the patients into two profiles.

Also, based on the PCA, we found that the first few principal components explained a large proportion of the variance. Further analysis identified the most important variables, in terms of representation (cos2) and contribution, in the first couple of principal components.