library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(class)
library(rpart)
library(rpart.plot)
library(naniar)
library(xgboost)
library(DiagrammeR)
library(factoextra)
library(e1071)
library(FactoMineR)
library(skimr)
For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet contains the data (in the tab named “Data”); the data dictionary is given in the second tab. You can get as creative as you want - the assignment is designed to really get you to think about how you could use different methods.
The target variable is ‘Suicide Attempt’.
ADHD Data Dictionary
dataset <- read_csv('https://raw.githubusercontent.com/dcorrea614/DATA622/main/HW4/ADHD_data.csv')

head(dataset) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Initial | Age | Sex | Race | ADHD Q1 | ADHD Q2 | ADHD Q3 | ADHD Q4 | ADHD Q5 | ADHD Q6 | ADHD Q7 | ADHD Q8 | ADHD Q9 | ADHD Q10 | ADHD Q11 | ADHD Q12 | ADHD Q13 | ADHD Q14 | ADHD Q15 | ADHD Q16 | ADHD Q17 | ADHD Q18 | ADHD Total | MD Q1a | MD Q1b | MD Q1c | MD Q1d | MD Q1e | MD Q1f | MD Q1g | MD Q1h | MD Q1i | MD Q1j | MD Q1k | MD Q1L | MD Q1m | MD Q2 | MD Q3 | MD TOTAL | Alcohol | THC | Cocaine | Stimulants | Sedative-hypnotics | Opioids | Court order | Education | Hx of Violence | Disorderly Conduct | Suicide | Abuse | Non-subst Dx | Subst Dx | Psych meds. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
JA | 24 | 1 | 1 | 1 | 1 | 4 | 2 | 3 | 1 | 1 | 3 | 2 | 4 | 2 | 4 | 1 | 0 | 3 | 1 | 3 | 4 | 40 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 3 | 15 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 11 | 0 | 1 | 1 | 0 | 2 | 0 | 2 |
LA | 48 | 2 | 1 | 3 | 3 | 4 | 4 | 5 | 2 | 2 | 3 | 2 | 4 | 1 | 4 | 2 | 4 | 4 | 3 | 1 | 4 | 55 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 1 | 4 | 1 | 0 | 1 |
MD | 51 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 3 | 3 | 2 | 0 | 1 | 2 | 0 | 2 | 2 | 3 | 2 | 1 | 1 | 31 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 6 | 2 | 0 | 1 |
RD | 43 | 1 | 1 | 3 | 3 | 2 | 2 | 4 | 3 | 2 | 4 | 4 | 2 | 3 | 1 | 3 | 3 | 1 | 2 | 1 | 2 | 45 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 3 | 13 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 12 | 0 | 0 | 1 | 7 | 2 | 0 | 2 |
RB | 34 | 1 | 1 | 4 | 4 | 2 | 4 | 4 | 2 | 3 | 4 | 4 | 2 | 4 | 1 | 3 | 2 | 1 | 2 | 1 | 1 | 48 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 7 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 1 | 1 | 1 | 0 | 2 | 0 | 0 |
SB | 39 | 2 | 1 | 2 | 3 | 1 | 4 | 3 | 2 | 3 | 4 | 4 | 2 | 4 | 2 | 4 | 4 | 3 | 4 | 3 | 3 | 55 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 3 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 1 | 1 | 2 | 0 | 0 | 0 |
summary(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Initial | Age | Sex | Race | ADHD Q1 | ADHD Q2 | ADHD Q3 | ADHD Q4 | ADHD Q5 | ADHD Q6 | ADHD Q7 | ADHD Q8 | ADHD Q9 | ADHD Q10 | ADHD Q11 | ADHD Q12 | ADHD Q13 | ADHD Q14 | ADHD Q15 | ADHD Q16 | ADHD Q17 | ADHD Q18 | ADHD Total | MD Q1a | MD Q1b | MD Q1c | MD Q1d | MD Q1e | MD Q1f | MD Q1g | MD Q1h | MD Q1i | MD Q1j | MD Q1k | MD Q1L | MD Q1m | MD Q2 | MD Q3 | MD TOTAL | Alcohol | THC | Cocaine | Stimulants | Sedative-hypnotics | Opioids | Court order | Education | Hx of Violence | Disorderly Conduct | Suicide | Abuse | Non-subst Dx | Subst Dx | Psych meds. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Length:175 | Min. :18.00 | Min. :1.000 | Min. :1.00 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.00 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. : 0.00 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00 | Min. :0.00 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00 | Min. :0.000 | Min. : 0.00 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. : 6.0 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | |
Class :character | 1st Qu.:29.50 | 1st Qu.:1.000 | 1st Qu.:1.00 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.00 | 1st Qu.:1.000 | 1st Qu.:0.000 | 1st Qu.:1.500 | 1st Qu.:1.000 | 1st Qu.:0.000 | 1st Qu.:1.000 | 1st Qu.:0.000 | 1st Qu.:0.000 | 1st Qu.:21.00 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.00 | 1st Qu.:0.00 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.00 | 1st Qu.:1.000 | 1st Qu.: 6.50 | 1st Qu.:0.000 | 1st Qu.:0.000 | 1st Qu.:0.000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.00000 | 1st Qu.:11.0 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.000 | 1st Qu.:0.0000 | 1st Qu.:0.000 | 1st Qu.:0.0000 | |
Mode :character | Median :42.00 | Median :1.000 | Median :2.00 | Median :2.000 | Median :2.000 | Median :2.000 | Median :2.000 | Median :3.000 | Median :2.000 | Median :2.000 | Median :2.000 | Median :2.000 | Median :2.00 | Median :2.000 | Median :1.000 | Median :2.000 | Median :2.000 | Median :1.000 | Median :1.000 | Median :1.000 | Median :1.000 | Median :33.00 | Median :1.0000 | Median :1.0000 | Median :1.0000 | Median :1.0000 | Median :1.0000 | Median :1.0000 | Median :1.00 | Median :1.00 | Median :1.0000 | Median :0.0000 | Median :0.0000 | Median :1.0000 | Median :0.0000 | Median :1.00 | Median :2.000 | Median :11.00 | Median :1.000 | Median :0.000 | Median :0.000 | Median :0.0000 | Median :0.0000 | Median :0.0000 | Median :0.00000 | Median :12.0 | Median :0.0000 | Median :1.0000 | Median :0.0000 | Median :0.000 | Median :0.0000 | Median :1.000 | Median :1.0000 | |
NA | Mean :39.47 | Mean :1.434 | Mean :1.64 | Mean :1.697 | Mean :1.914 | Mean :1.909 | Mean :2.103 | Mean :2.257 | Mean :1.909 | Mean :1.829 | Mean :2.137 | Mean :1.909 | Mean :2.12 | Mean :2.274 | Mean :1.291 | Mean :2.366 | Mean :2.246 | Mean :1.634 | Mean :1.703 | Mean :1.526 | Mean :1.474 | Mean :34.32 | Mean :0.5486 | Mean :0.5714 | Mean :0.5429 | Mean :0.5829 | Mean :0.5543 | Mean :0.6971 | Mean :0.72 | Mean :0.56 | Mean :0.5886 | Mean :0.3886 | Mean :0.4857 | Mean :0.5829 | Mean :0.4914 | Mean :0.72 | Mean :2.006 | Mean :10.02 | Mean :1.345 | Mean :0.807 | Mean :1.094 | Mean :0.1228 | Mean :0.1228 | Mean :0.3918 | Mean :0.08823 | Mean :11.9 | Mean :0.2439 | Mean :0.7256 | Mean :0.3025 | Mean :1.329 | Mean :0.4379 | Mean :1.138 | Mean :0.9649 | |
NA | 3rd Qu.:48.00 | 3rd Qu.:2.000 | 3rd Qu.:2.00 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.00 | 3rd Qu.:3.000 | 3rd Qu.:2.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:2.000 | 3rd Qu.:2.000 | 3rd Qu.:47.50 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.00 | 3rd Qu.:1.00 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.00 | 3rd Qu.:3.000 | 3rd Qu.:14.00 | 3rd Qu.:3.000 | 3rd Qu.:1.500 | 3rd Qu.:3.000 | 3rd Qu.:0.0000 | 3rd Qu.:0.0000 | 3rd Qu.:0.0000 | 3rd Qu.:0.00000 | 3rd Qu.:13.0 | 3rd Qu.:0.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:2.000 | 3rd Qu.:1.0000 | 3rd Qu.:2.000 | 3rd Qu.:2.0000 | |
NA | Max. :69.00 | Max. :2.000 | Max. :6.00 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :5.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.00 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :72.00 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00 | Max. :1.00 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00 | Max. :3.000 | Max. :17.00 | Max. :3.000 | Max. :3.000 | Max. :3.000 | Max. :3.0000 | Max. :3.0000 | Max. :3.0000 | Max. :1.00000 | Max. :19.0 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :7.000 | Max. :2.0000 | Max. :3.000 | Max. :2.0000 | |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :4 | NA’s :4 | NA’s :4 | NA’s :4 | NA’s :4 | NA’s :4 | NA’s :5 | NA’s :9 | NA’s :11 | NA’s :11 | NA’s :13 | NA’s :14 | NA’s :22 | NA’s :23 | NA’s :118 |
# Show dataset summary details
#skim(dataset)
Based on the descriptive summary above, quite a few variables have missing values, so we analyzed the missingness across all attributes to identify a proper imputation technique.
## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2, function(x) {sum(is.na(x)) / length(x) * 100}))

dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL

dataset_missing_counts <- dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`))

dataset_missing_counts %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature | NA_Count | NA_Percentage |
---|---|---|
Psych meds. | 118 | 67.428571 |
Subst Dx | 23 | 13.142857 |
Non-subst Dx | 22 | 12.571429 |
Abuse | 14 | 8.000000 |
Suicide | 13 | 7.428571 |
Hx of Violence | 11 | 6.285714 |
Disorderly Conduct | 11 | 6.285714 |
Education | 9 | 5.142857 |
Court order | 5 | 2.857143 |
Alcohol | 4 | 2.285714 |
THC | 4 | 2.285714 |
Cocaine | 4 | 2.285714 |
Stimulants | 4 | 2.285714 |
Sedative-hypnotics | 4 | 2.285714 |
Opioids | 4 | 2.285714 |
ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) +
geom_bar(stat = 'identity', fill = 'steelblue') +
geom_label(aes(label = NA_Count)) +
labs(title = 'Missing Counts') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())
# Use naniar package to plot missing value patterns
gg_miss_upset(dataset)
Based on the missing value analysis above, we perform data imputation with the mice package using the random forest method. Before doing so, we remove the Initial column, since an ID column is not needed for imputation, and we also remove the ADHD Total and MD TOTAL subtotal columns to avoid collinearity with the individual question scores.
# removing Initial, ADHD Total and MD TOTAL columns
dataset <- dataset %>%
  select(-c('Initial','ADHD Total','MD TOTAL'))

# cleaning up the column names for the imputation function
colNamesNoSpace <- colnames(dataset) %>%
  str_remove_all(' |-|\\.')
colnames(dataset) <- colNamesNoSpace

# imputation using the random forest method ('rf')
init <- mice(dataset, maxit = 0)
predM <- init$predictorMatrix
set.seed(123)
imputed <- mice(dataset, method = 'rf', predictorMatrix = predM, m = 5)
##
## iter imp variable
## 1 1 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 1 2 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 1 3 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 1 4 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 1 5 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 2 1 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 2 2 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 2 3 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 2 4 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 2 5 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 3 1 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 3 2 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 3 3 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 3 4 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 3 5 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 4 1 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 4 2 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 4 3 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 4 4 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 4 5 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 5 1 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 5 2 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 5 3 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 5 4 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
## 5 5 Alcohol THC Cocaine Stimulants Sedativehypnotics Opioids Courtorder Education HxofViolence DisorderlyConduct Suicide Abuse NonsubstDx SubstDx Psychmeds
# complete the imputation and show a summary of the imputed data
dataset <- complete(imputed)
summary(dataset) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Age | Sex | Race | ADHDQ1 | ADHDQ2 | ADHDQ3 | ADHDQ4 | ADHDQ5 | ADHDQ6 | ADHDQ7 | ADHDQ8 | ADHDQ9 | ADHDQ10 | ADHDQ11 | ADHDQ12 | ADHDQ13 | ADHDQ14 | ADHDQ15 | ADHDQ16 | ADHDQ17 | ADHDQ18 | MDQ1a | MDQ1b | MDQ1c | MDQ1d | MDQ1e | MDQ1f | MDQ1g | MDQ1h | MDQ1i | MDQ1j | MDQ1k | MDQ1L | MDQ1m | MDQ2 | MDQ3 | Alcohol | THC | Cocaine | Stimulants | Sedativehypnotics | Opioids | Courtorder | Education | HxofViolence | DisorderlyConduct | Suicide | Abuse | NonsubstDx | SubstDx | Psychmeds | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. :18.00 | Min. :1.000 | Min. :1.00 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.00 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00 | Min. :0.00 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00 | Min. :0.000 | Min. :0.000 | Min. :0.0000 | Min. :0.000 | Min. :0.00 | Min. :0.00 | Min. :0.0000 | Min. :0.00000 | Min. : 6.00 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | Min. :0.000 | Min. :0.0000 | |
1st Qu.:29.50 | 1st Qu.:1.000 | 1st Qu.:1.00 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.00 | 1st Qu.:1.000 | 1st Qu.:0.000 | 1st Qu.:1.500 | 1st Qu.:1.000 | 1st Qu.:0.000 | 1st Qu.:1.000 | 1st Qu.:0.000 | 1st Qu.:0.000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.00 | 1st Qu.:0.00 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.00 | 1st Qu.:1.000 | 1st Qu.:0.000 | 1st Qu.:0.0000 | 1st Qu.:0.000 | 1st Qu.:0.00 | 1st Qu.:0.00 | 1st Qu.:0.0000 | 1st Qu.:0.00000 | 1st Qu.:11.00 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.000 | 1st Qu.:0.0000 | 1st Qu.:0.000 | 1st Qu.:0.0000 | |
Median :42.00 | Median :1.000 | Median :2.00 | Median :2.000 | Median :2.000 | Median :2.000 | Median :2.000 | Median :3.000 | Median :2.000 | Median :2.000 | Median :2.000 | Median :2.000 | Median :2.00 | Median :2.000 | Median :1.000 | Median :2.000 | Median :2.000 | Median :1.000 | Median :1.000 | Median :1.000 | Median :1.000 | Median :1.0000 | Median :1.0000 | Median :1.0000 | Median :1.0000 | Median :1.0000 | Median :1.0000 | Median :1.00 | Median :1.00 | Median :1.0000 | Median :0.0000 | Median :0.0000 | Median :1.0000 | Median :0.0000 | Median :1.00 | Median :2.000 | Median :1.000 | Median :0.0000 | Median :0.000 | Median :0.00 | Median :0.00 | Median :0.0000 | Median :0.00000 | Median :12.00 | Median :0.0000 | Median :1.0000 | Median :0.0000 | Median :0.000 | Median :0.0000 | Median :1.000 | Median :1.0000 | |
Mean :39.47 | Mean :1.434 | Mean :1.64 | Mean :1.697 | Mean :1.914 | Mean :1.909 | Mean :2.103 | Mean :2.257 | Mean :1.909 | Mean :1.829 | Mean :2.137 | Mean :1.909 | Mean :2.12 | Mean :2.274 | Mean :1.291 | Mean :2.366 | Mean :2.246 | Mean :1.634 | Mean :1.703 | Mean :1.526 | Mean :1.474 | Mean :0.5486 | Mean :0.5714 | Mean :0.5429 | Mean :0.5829 | Mean :0.5543 | Mean :0.6971 | Mean :0.72 | Mean :0.56 | Mean :0.5886 | Mean :0.3886 | Mean :0.4857 | Mean :0.5829 | Mean :0.4914 | Mean :0.72 | Mean :2.006 | Mean :1.366 | Mean :0.7886 | Mean :1.086 | Mean :0.12 | Mean :0.12 | Mean :0.3829 | Mean :0.08571 | Mean :11.91 | Mean :0.2343 | Mean :0.7257 | Mean :0.3086 | Mean :1.326 | Mean :0.4171 | Mean :1.137 | Mean :0.7771 | |
3rd Qu.:48.00 | 3rd Qu.:2.000 | 3rd Qu.:2.00 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.00 | 3rd Qu.:3.000 | 3rd Qu.:2.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:2.000 | 3rd Qu.:2.000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.00 | 3rd Qu.:1.00 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.00 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:1.0000 | 3rd Qu.:3.000 | 3rd Qu.:0.00 | 3rd Qu.:0.00 | 3rd Qu.:0.0000 | 3rd Qu.:0.00000 | 3rd Qu.:13.00 | 3rd Qu.:0.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:2.000 | 3rd Qu.:1.0000 | 3rd Qu.:2.000 | 3rd Qu.:1.0000 | |
Max. :69.00 | Max. :2.000 | Max. :6.00 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :5.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.00 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :4.000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00 | Max. :1.00 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00 | Max. :3.000 | Max. :3.000 | Max. :3.0000 | Max. :3.000 | Max. :3.00 | Max. :3.00 | Max. :3.0000 | Max. :1.00000 | Max. :19.00 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :7.000 | Max. :2.0000 | Max. :3.000 | Max. :2.0000 |
We also checked for the presence of any degenerate (near-zero variance) variables and removed them from the dataset.
# capturing the degenerate variables
degenCols <- nearZeroVar(dataset)

# identifying them
colnames(dataset[,degenCols])
## [1] "Stimulants" "Sedativehypnotics"
# removing from the dataset
dataset <- dataset[,-degenCols]
# make dataset long to place distributions in a facet_wrap
vars <- dataset %>%
  gather(key = 'predictor_variable', value = 'value', -Suicide) %>%
  mutate(Suicide = ifelse(Suicide==1,'Y','N'))
# Distribution of ADHD variables
vars %>%
  filter(str_detect(predictor_variable,'ADHD')) %>%
ggplot() +
geom_histogram(aes(x = value, y = ..density.., fill = Suicide), bins = 15) +
labs(title = 'Distributions of ADHD Variables') +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(. ~predictor_variable, scales = 'free', ncol = 3)
# Distribution of MDQ variables
vars %>%
  filter(str_detect(predictor_variable,'MD')) %>%
ggplot() +
geom_histogram(aes(x = value, y = ..density.., fill = Suicide), bins = 15) +
labs(title = 'Distributions of MDQ Variables') +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(. ~predictor_variable, scales = 'free', ncol = 3)
# Distribution of other variables
vars %>%
  filter(!str_detect(predictor_variable,'MD|ADH')) %>%
ggplot() +
geom_histogram(aes(x = value, y = ..density.., fill = Suicide), bins = 15) +
labs(title = 'Distributions of Other Variables') +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(. ~predictor_variable, scales = 'free', ncol = 3)
The ADHD distributions show that when a question's value falls between 0 and 3, the Suicide value is more likely to be N; conversely, when the value is 4 or 5, Suicide is more likely to be Y.
The MDQ distributions show that when the value is 0, the Suicide value is more likely to be N. However, the opposite does not seem to hold here.
The distributions of the remaining variables suggest that Abuse and Alcohol could be strong indicators of suicide attempts.
corrMatrix <- round(cor(dataset),4)

corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5, cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))
From the plot above, we conclude that no variable pairs are highly correlated.
We split the data 75%/25% for training and testing.
sample = sample.split(dataset$Suicide, SplitRatio = 0.75)
train = subset(dataset, sample == TRUE)
test = subset(dataset, sample == FALSE)

#head(train)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

y_train <- as.factor(train$Suicide)
y_test <- as.factor(test$Suicide)

X_train <- train %>% select(-'Suicide')
X_test <- test %>% select(-'Suicide')
K-means clustering is an example of an unsupervised machine learning algorithm. A cluster is a collection of data points aggregated together because of certain similarities. The “k” in k-means is the number of centroids needed in the dataset - an “imaginary or real location representing the center of the cluster”. The “means” in k-means refers to averaging the data, i.e., finding each centroid.
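To make the assign/update cycle concrete, here is a minimal sketch of one k-means iteration on toy data we made up (not the study data); stats::kmeans performs these two steps repeatedly, with multiple restarts.

# Illustrative only: one iteration of the k-means update on toy data
set.seed(1)
pts <- matrix(rnorm(20), ncol = 2)          # 10 points in 2 dimensions
centroids <- pts[sample(nrow(pts), 2), ]    # 2 random points as initial centroids

# assignment step: each point joins its nearest centroid (squared Euclidean distance)
cluster_id <- apply(pts, 1, function(p) which.min(colSums((t(centroids) - p)^2)))

# update step (the "means"): each centroid becomes the average of its assigned points
centroids <- rbind(colMeans(pts[cluster_id == 1, , drop = FALSE]),
                   colMeans(pts[cluster_id == 2, , drop = FALSE]))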
The first step is to standardize the data to have a mean of zero and a standard deviation of one. This is because the k-means algorithm uses a distance-based measurement to determine the similarity between data points, and features in a dataset usually have different units of measurement.
#select numeric values only
nums <- X_test %>% dplyr::select(where(is.numeric))

#center and scale data for the k-means algorithm
dftrans <- preProcess(nums, method = c("center", "scale", "BoxCox"))
dftrans <- predict(dftrans, nums)
To determine the number of clusters needed for our algorithm, we will use the average silhouette method. The average silhouette method measures how well each object lies within its cluster; the higher the average silhouette width, the better the clustering. The function fviz_nbclust allows us to do this with ease, as shown below.
library(factoextra)
#determine # of clusters
fviz_nbclust(dftrans, kmeans, method = "silhouette", k.max = 10) +
labs(subtitle = "Silhouette method") # add subtitle
Another popular method is the Elbow method, which plots the total within-cluster sum of squares against k and looks for the “elbow” where adding clusters yields diminishing returns; a by-hand sketch follows the plot below.
fviz_nbclust(dftrans, kmeans, method = "wss") +
labs(subtitle = "Elbow method") # add subtitle
Both of our methods to find ‘k’ suggest 2 clusters is the optimal number. We can visualize the results of our clusters as shown below.
final <- kmeans(dftrans, 2, nstart = 25)
print(final)
## K-means clustering with 2 clusters of sizes 25, 19
##
## Cluster means:
## Age Sex Race ADHDQ1 ADHDQ2 ADHDQ3 ADHDQ4
## 1 -0.03780229 0.2585521 -0.1736544 0.6385017 0.3843586 0.4393394 0.5424152
## 2 0.04973985 -0.3402001 0.2284926 -0.8401339 -0.5057350 -0.5780781 -0.7137043
## ADHDQ5 ADHDQ6 ADHDQ7 ADHDQ8 ADHDQ9 ADHDQ10 ADHDQ11
## 1 0.3112391 0.4136629 0.4312272 0.5065839 0.5292581 0.4813072 0.4100311
## 2 -0.4095251 -0.5442933 -0.5674042 -0.6665578 -0.6963922 -0.6332990 -0.5395146
## ADHDQ12 ADHDQ13 ADHDQ14 ADHDQ15 ADHDQ16 ADHDQ17 ADHDQ18
## 1 0.3326787 0.3997419 0.4782983 0.6970718 0.3248803 0.4230100 0.4833985
## 2 -0.4377351 -0.5259761 -0.6293399 -0.9171997 -0.4274741 -0.5565921 -0.6360507
## MDQ1a MDQ1b MDQ1c MDQ1d MDQ1e MDQ1f MDQ1g
## 1 0.4012353 0.4154862 0.2971580 0.4012353 0.4953176 0.5535052 0.7032124
## 2 -0.5279412 -0.5466923 -0.3909974 -0.5279412 -0.6517336 -0.7282963 -0.9252795
## MDQ1h MDQ1i MDQ1j MDQ1k MDQ1L MDQ1m MDQ2
## 1 0.3029965 0.2595573 0.352529 0.2887801 0.5053652 0.3220678 0.6401750
## 2 -0.3986795 -0.3415228 -0.463854 -0.3799738 -0.6649542 -0.4237735 -0.8423355
## MDQ3 Alcohol THC Cocaine Opioids Courtorder
## 1 0.3454007 -0.007060812 0.1408134 -0.008730403 0.09413079 0.1000364
## 2 -0.4544747 0.009290543 -0.1852809 0.011487373 -0.12385631 -0.1316268
## Education HxofViolence DisorderlyConduct Abuse NonsubstDx SubstDx
## 1 0.05086440 0.03002316 -0.08875668 0.2150859 0.2350936 0.0445040
## 2 -0.06692684 -0.03950416 0.11678511 -0.2830078 -0.3093337 -0.0585579
## Psychmeds
## 1 0.1108598
## 2 -0.1458681
##
## Clustering vector:
## 2 17 22 24 37 38 40 50 56 60 64 66 67 68 75 80 85 86 92 97
## 1 1 2 1 1 1 1 2 2 1 1 2 1 1 2 1 2 2 1 1
## 98 101 103 105 108 109 114 116 121 122 128 131 132 133 143 148 151 152 156 163
## 1 1 1 2 1 1 2 2 2 1 1 2 1 2 2 1 1 2 1 2
## 164 165 166 172
## 2 2 1 2
##
## Within cluster sum of squares by cluster:
## [1] 997.9955 655.8164
## (between_SS / total_SS = 19.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(final, data = dftrans)
library(cluster)

# generate cluster plot (note: distances here are computed on t(dftrans), i.e., between variables)
d <- dist(t(dftrans), method="euclidean")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
We can record the summary of the kmeans cluster model metrics below:
# re-fit the 2-cluster solution for profiling (identical to `final` printed above)
predict_km <- kmeans(dftrans, 2, nstart = 25)
dftrans %>%
  mutate(Cluster = predict_km$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean") %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Cluster | Age | Sex | Race | ADHDQ1 | ADHDQ2 | ADHDQ3 | ADHDQ4 | ADHDQ5 | ADHDQ6 | ADHDQ7 | ADHDQ8 | ADHDQ9 | ADHDQ10 | ADHDQ11 | ADHDQ12 | ADHDQ13 | ADHDQ14 | ADHDQ15 | ADHDQ16 | ADHDQ17 | ADHDQ18 | MDQ1a | MDQ1b | MDQ1c | MDQ1d | MDQ1e | MDQ1f | MDQ1g | MDQ1h | MDQ1i | MDQ1j | MDQ1k | MDQ1L | MDQ1m | MDQ2 | MDQ3 | Alcohol | THC | Cocaine | Opioids | Courtorder | Education | HxofViolence | DisorderlyConduct | Abuse | NonsubstDx | SubstDx | Psychmeds |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | -0.0378023 | 0.2585521 | -0.1736544 | 0.6385017 | 0.3843586 | 0.4393394 | 0.5424152 | 0.3112391 | 0.4136629 | 0.4312272 | 0.5065839 | 0.5292581 | 0.4813072 | 0.4100311 | 0.3326787 | 0.3997419 | 0.4782983 | 0.6970718 | 0.3248803 | 0.4230100 | 0.4833985 | 0.4012353 | 0.4154862 | 0.2971580 | 0.4012353 | 0.4953176 | 0.5535052 | 0.7032124 | 0.3029965 | 0.2595573 | 0.352529 | 0.2887801 | 0.5053652 | 0.3220678 | 0.6401750 | 0.3454007 | -0.0070608 | 0.1408134 | -0.0087304 | 0.0941308 | 0.1000364 | 0.0508644 | 0.0300232 | -0.0887567 | 0.2150859 | 0.2350936 | 0.0445040 | 0.1108598 |
2 | 0.0497399 | -0.3402001 | 0.2284926 | -0.8401339 | -0.5057350 | -0.5780781 | -0.7137043 | -0.4095251 | -0.5442933 | -0.5674042 | -0.6665578 | -0.6963922 | -0.6332990 | -0.5395146 | -0.4377351 | -0.5259761 | -0.6293399 | -0.9171997 | -0.4274741 | -0.5565921 | -0.6360507 | -0.5279412 | -0.5466923 | -0.3909974 | -0.5279412 | -0.6517336 | -0.7282963 | -0.9252795 | -0.3986795 | -0.3415228 | -0.463854 | -0.3799738 | -0.6649542 | -0.4237735 | -0.8423355 | -0.4544747 | 0.0092905 | -0.1852809 | 0.0114874 | -0.1238563 | -0.1316268 | -0.0669268 | -0.0395042 | 0.1167851 | -0.2830078 | -0.3093337 | -0.0585579 | -0.1458681 |
Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning. It is a useful technique for exploratory data analysis, allowing us to better visualize the variation present in a dataset with many variables. For our particular use case, many of the questionnaire variables fall on Likert scales, which are often expanded into dummy variables when prepared for analysis. This creates many additional features and can make analysis more difficult due to the increased number of dimensions. Therefore, using PCA to reduce the number of dimensions of our entire dataset and measure the amount of variance explained is beneficial. To do this, we’ll use the prcomp() function:
dataset.pca <- prcomp(dataset, center = TRUE, scale = TRUE)

summary(dataset.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.4879 2.15325 1.51854 1.37071 1.32056 1.30841 1.2246
## Proportion of Variance 0.2483 0.09462 0.04706 0.03834 0.03559 0.03494 0.0306
## Cumulative Proportion 0.2483 0.34289 0.38995 0.42830 0.46389 0.49883 0.5294
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.19453 1.14053 1.12488 1.09600 1.03186 1.01797 1.0023
## Proportion of Variance 0.02912 0.02655 0.02582 0.02451 0.02173 0.02115 0.0205
## Cumulative Proportion 0.55855 0.58510 0.61092 0.63544 0.65717 0.67831 0.6988
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.97020 0.9469 0.91595 0.87918 0.84215 0.82789 0.82016
## Proportion of Variance 0.01921 0.0183 0.01712 0.01577 0.01447 0.01399 0.01373
## Cumulative Proportion 0.71803 0.7363 0.75345 0.76922 0.78370 0.79768 0.81141
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.77736 0.76284 0.74558 0.73182 0.70326 0.69314 0.6825
## Proportion of Variance 0.01233 0.01188 0.01134 0.01093 0.01009 0.00981 0.0095
## Cumulative Proportion 0.82374 0.83562 0.84696 0.85789 0.86799 0.87779 0.8873
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.67280 0.65795 0.62679 0.6182 0.5979 0.57780 0.56968
## Proportion of Variance 0.00924 0.00883 0.00802 0.0078 0.0073 0.00681 0.00662
## Cumulative Proportion 0.89654 0.90537 0.91339 0.9212 0.9285 0.93530 0.94192
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.55339 0.52917 0.51864 0.49580 0.47679 0.4748 0.4644
## Proportion of Variance 0.00625 0.00571 0.00549 0.00502 0.00464 0.0046 0.0044
## Cumulative Proportion 0.94817 0.95388 0.95937 0.96439 0.96903 0.9736 0.9780
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.45002 0.43212 0.41491 0.39393 0.37519 0.35038 0.31036
## Proportion of Variance 0.00413 0.00381 0.00351 0.00317 0.00287 0.00251 0.00197
## Cumulative Proportion 0.98217 0.98598 0.98949 0.99266 0.99553 0.99803 1.00000
The scale = 0 argument to biplot ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.
biplot(dataset.pca, scale = 0, cex=0.5)
From the biplot above, it’s difficult to tell much given the very large number of features. However, from our PCA we can also examine the eigenvalues using a scree plot. The plot below shows the percentage of variance explained by each principal component.
fviz_eig(dataset.pca)
#compute standard deviation of each principal component
std_dev <- dataset.pca$sdev

#compute variance
pr_var <- std_dev^2

#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)

round(prop_varex[1:10], 2)
## [1] 0.25 0.09 0.05 0.04 0.04 0.03 0.03 0.03 0.03 0.03
The first principal component therefore explains 25% of the variability, and the second explains 9%. Together, the first ten principal components explain about 61% of the variability. We proceed to plot the PVE and cumulative PVE to produce our scree plots, as we did earlier.
#scree plot
plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")
#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
As we can see in our plots of the PVE and the cumulative PVE, the first few principal components account for a much larger proportion of the variance explained than the remaining components (49 in total for this dataset). Additionally, the proportion of variance explained by the first principal component, at 25%, is roughly three times that of the second.
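To make this concrete, a one-liner (a sketch using the prop_varex vector computed above; the 80% target is illustrative, not a threshold we committed to) finds how many components are needed to reach a given variance target:

# smallest number of PCs whose cumulative PVE reaches 80%
which(cumsum(prop_varex) >= 0.80)[1]
# returns 21 for this dataset (cumulative PVE is 79.8% at PC20 and 81.1% at PC21)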
Although running PCA on the entire dataset is helpful in some ways, certain findings from a biplot are affected by the number of variables in the overall dataset. Additionally, we are seeing from our PCA analysis that the first few principal components explain a much larger proportion of the variability than later principal components. Because of this, we thought it would be interesting to dive a bit deeper into which variables seem to hold the most importance in determining the first few dimensions. To do this, we’ll use the FactoMineR and factoextra packages to determine the eigenvalues of each dimension, with particular interest in dimensions 1 and 2, since those are what we’ve plotted above and hold the highest proportion of variance explained.
eig.val <- get_eigenvalue(dataset.pca)

eig.val %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
 | eigenvalue | variance.percent | cumulative.variance.percent |
---|---|---|---|
Dim.1 | 12.1653335 | 24.8272112 | 24.82721 |
Dim.2 | 4.6364822 | 9.4622086 | 34.28942 |
Dim.3 | 2.3059660 | 4.7060531 | 38.99547 |
Dim.4 | 1.8788556 | 3.8343992 | 42.82987 |
Dim.5 | 1.7438785 | 3.5589358 | 46.38881 |
Dim.6 | 1.7119463 | 3.4937679 | 49.88258 |
Dim.7 | 1.4996351 | 3.0604798 | 52.94306 |
Dim.8 | 1.4269113 | 2.9120639 | 55.85512 |
Dim.9 | 1.3008135 | 2.6547214 | 58.50984 |
Dim.10 | 1.2653565 | 2.5823602 | 61.09220 |
Dim.11 | 1.2012221 | 2.4514737 | 63.54367 |
Dim.12 | 1.0647328 | 2.1729240 | 65.71660 |
Dim.13 | 1.0362688 | 2.1148343 | 67.83143 |
Dim.14 | 1.0046132 | 2.0502310 | 69.88166 |
Dim.15 | 0.9412784 | 1.9209763 | 71.80264 |
Dim.16 | 0.8966428 | 1.8298832 | 73.63252 |
Dim.17 | 0.8389721 | 1.7121880 | 75.34471 |
Dim.18 | 0.7729593 | 1.5774679 | 76.92218 |
Dim.19 | 0.7092249 | 1.4473977 | 78.36958 |
Dim.20 | 0.6854086 | 1.3987931 | 79.76837 |
Dim.21 | 0.6726696 | 1.3727952 | 81.14117 |
Dim.22 | 0.6042928 | 1.2332506 | 82.37442 |
Dim.23 | 0.5819237 | 1.1875994 | 83.56202 |
Dim.24 | 0.5558916 | 1.1344727 | 84.69649 |
Dim.25 | 0.5355566 | 1.0929727 | 85.78946 |
Dim.26 | 0.4945789 | 1.0093447 | 86.79881 |
Dim.27 | 0.4804497 | 0.9805095 | 87.77932 |
Dim.28 | 0.4657421 | 0.9504941 | 88.72981 |
Dim.29 | 0.4526588 | 0.9237935 | 89.65360 |
Dim.30 | 0.4328935 | 0.8834562 | 90.53706 |
Dim.31 | 0.3928620 | 0.8017593 | 91.33882 |
Dim.32 | 0.3821097 | 0.7798158 | 92.11863 |
Dim.33 | 0.3575121 | 0.7296164 | 92.84825 |
Dim.34 | 0.3338529 | 0.6813324 | 93.52958 |
Dim.35 | 0.3245365 | 0.6623194 | 94.19190 |
Dim.36 | 0.3062365 | 0.6249724 | 94.81687 |
Dim.37 | 0.2800222 | 0.5714738 | 95.38835 |
Dim.38 | 0.2689839 | 0.5489466 | 95.93730 |
Dim.39 | 0.2458200 | 0.5016734 | 96.43897 |
Dim.40 | 0.2273312 | 0.4639412 | 96.90291 |
Dim.41 | 0.2254670 | 0.4601368 | 97.36305 |
Dim.42 | 0.2156907 | 0.4401851 | 97.80323 |
Dim.43 | 0.2025136 | 0.4132931 | 98.21652 |
Dim.44 | 0.1867237 | 0.3810689 | 98.59759 |
Dim.45 | 0.1721464 | 0.3513193 | 98.94891 |
Dim.46 | 0.1551781 | 0.3166900 | 99.26560 |
Dim.47 | 0.1407682 | 0.2872820 | 99.55288 |
Dim.48 | 0.1227652 | 0.2505412 | 99.80343 |
Dim.49 | 0.0963212 | 0.1965739 | 100.00000 |
res.pca <- PCA(dataset, scale.unit = TRUE, ncp = 5, graph = FALSE)
var <- get_pca_var(res.pca)
Eigenvalues greater than 1 indicate that a principal component accounts for more variance than a single one of the original variables in the standardized data, so this is often used as a cutoff for which PCs to retain. In our table this cutoff occurs at Dim.14: dimensions past that point provide little additional insight into our data. Our scree plot cuts this even further, showing that anything past the first ten dimensions does not account for a large proportion of the variance explained. Therefore, going forward, we will only focus on the first few dimensions to see if we can garner any insights.
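As a quick check (a sketch using the eig.val table computed above), the eigenvalue-greater-than-1 rule keeps:

# count of dimensions retained under the eigenvalue > 1 cutoff
sum(eig.val[, "eigenvalue"] > 1)
# returns 14 (Dim.14 has eigenvalue 1.0046; Dim.15 drops to 0.9413)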
Since our biplot above was very crowded and difficult to interpret, we decided to look more closely at the quality of representation of the variables. This is measured by the squared cosine (cos2), which reflects a variable’s position relative to the circumference of the correlation circle (not pictured in our biplot above, but visible as the length of each vector from the center/origin). Plotting these cos2 values across the first five PCs with corrplot, we can examine the result below:
corrplot(var$cos2, is.corr=FALSE)
Typically, a high cos2 value indicates a good representation of the variable on the principal component, and in our case the variable is positioned close to the circumference of the correlation circle (and farther away from the origin) – which we can visibly see with variables such as ADHDQ8, ADHDQ9, ADHDQ10, ADHDQ13 and MDQ1g on the biplot. The opposite is true for variables with a low cos2 value, which tend to fall closer to the origin.
We can also take a look at factors such as the contribution of variables on our principal components. Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the dataset. Therefore, when we conduct another correlation plot, but this time of variable contribution, we see the following:
corrplot(var$contrib, is.corr=FALSE)
We can also see this in the barplots below, which show the variables with the highest contribution percentages for our first two PCs:
fviz_contrib(dataset.pca, choice = "var", axes = 1, top = 15)
fviz_contrib(dataset.pca, choice = "var", axes = 2, top = 15)
Similar to our biplot visualization and our cos2 values, we can confirm that ADHDQ8, ADHDQ9, ADHDQ10, ADHDQ13 and MDQ1g etc. along with many other ADHD and MD questions contribute most to the variability explained in our dataset. This is important for us to take note of for future analysis, where we will be looking more closely at features that seem to provide more insight into clustering and classification.
The graph of variables below summarizes these relationships: positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides.
fviz_pca_var(dataset.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The boosting method uses trees in a sequential pattern: each successive tree is developed using information from the previous trees to minimize the remaining error.
The boosting method has three tuning parameters: the number of trees, the shrinkage parameter, and the number of splits in each tree.
We will use stochastic gradient boosting in our model. This approach resamples the observations and columns at each iteration.
The xgbTree method will create a model by choosing the best tuning parameters.
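The call that produced gbmTune is not shown in the rendered output; below is a minimal sketch of how it could be fit with caret (the cross-validation settings are our assumption, not necessarily the exact call used).

# sketch: fit stochastic gradient boosting via caret's xgbTree method;
# caret searches a default grid over nrounds, max_depth, eta, gamma,
# colsample_bytree, min_child_weight, and subsample, keeping the best combination
set.seed(123)
gbmTune <- train(x = X_train, y = y_train,
                 method = "xgbTree",
                 trControl = trainControl(method = "cv", number = 5))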
Here we can see the tuning parameters chosen for the model.
gbmTune$bestTune
 | nrounds | max_depth | eta | gamma | colsample_bytree | min_child_weight | subsample |
---|---|---|---|---|---|---|---|
29 | 100 | 2 | 0.3 | 0 | 0.8 | 1 | 0.5 |
We identify the variables most important to the prediction below, with Abuse having the highest importance.
varImp(gbmTune)['importance'] %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
Visualization of the first 3 trees used in the model.
xgb.plot.tree(model = gbmTune$finalModel,trees = 1:3)
Now, let’s look at how the model performed on the test dataset.
gbm_predict <- predict(gbmTune, newdata = X_test)
gbm_conf_matrix <- confusionMatrix(gbm_predict, y_test)
print(gbm_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27 8
## 1 3 6
##
## Accuracy : 0.75
## 95% CI : (0.5966, 0.8681)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.2113
##
## Kappa : 0.3632
##
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.9000
## Specificity : 0.4286
## Pos Pred Value : 0.7714
## Neg Pred Value : 0.6667
## Prevalence : 0.6818
## Detection Rate : 0.6136
## Detection Prevalence : 0.7955
## Balanced Accuracy : 0.6643
##
## 'Positive' Class : 0
##
We record the summary of the stochastic gradient boosting model metrics in a data frame:
gbm_model <- confusionMatrix(table(gbm_predict, y_test))$byClass
gbm_accuracy <- confusionMatrix(table(gbm_predict, y_test))$overall['Accuracy']
gbm_model <- data.frame(gbm_model)
gbm_model <- rbind("Accuracy" = gbm_accuracy, gbm_model)

gbm_model %>%
  kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
 | gbm_model |
---|---|
Accuracy | 0.7500000 |
Sensitivity | 0.9000000 |
Specificity | 0.4285714 |
Pos Pred Value | 0.7714286 |
Neg Pred Value | 0.6666667 |
Precision | 0.7714286 |
Recall | 0.9000000 |
F1 | 0.8307692 |
Prevalence | 0.6818182 |
Detection Rate | 0.6136364 |
Detection Prevalence | 0.7954545 |
Balanced Accuracy | 0.6642857 |
Support Vector Machines are supervised learning models used to analyze data for classification problems.
For our dataset, SVM will help us decide an optimal decision boundary which can then help classify suicide attempts. We will build a linear model as well as a radial model and evaluate the performance for each to determine which SVM model is appropriate given our data.
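Since the models below fix the cost parameter by hand, one natural extension (a sketch we added, not part of the original analysis) is to select it by cross-validation with e1071::tune.svm:

# sketch: 10-fold cross-validated selection of the cost parameter
# for a linear-kernel SVM; the candidate values are illustrative
set.seed(123)
tuned <- tune.svm(as.factor(Suicide) ~ ., data = train,
                  kernel = "linear", cost = c(0.1, 1, 10, 100))
summary(tuned)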
We would like to calculate the correlation between our independent variables and the target, Suicide, to identify the most significant predictors of suicide attempts.
#Compute correlation between each variable and Suicide
target_corr <- function(x, y) cor(y, x)
Suicide_Correlation <- sapply(train, target_corr, y=train$Suicide)

#Output correlation with target
Suicidecorr <- data.frame(Suicide_Correlation)

Suicidecorr %>%
  kbl(caption = "Correlation with Suicide") %>%
  kable_minimal()
 | Suicide_Correlation |
---|---|
Age | -0.0413520 |
Sex | 0.1350691 |
Race | -0.1121179 |
ADHDQ1 | 0.1633275 |
ADHDQ2 | 0.1973216 |
ADHDQ3 | 0.1414619 |
ADHDQ4 | 0.0987259 |
ADHDQ5 | 0.1696575 |
ADHDQ6 | 0.0591130 |
ADHDQ7 | 0.1313229 |
ADHDQ8 | 0.1313311 |
ADHDQ9 | 0.0967320 |
ADHDQ10 | 0.1188377 |
ADHDQ11 | 0.0961761 |
ADHDQ12 | 0.0984859 |
ADHDQ13 | 0.1693613 |
ADHDQ14 | 0.1706819 |
ADHDQ15 | 0.1266747 |
ADHDQ16 | 0.1198645 |
ADHDQ17 | 0.0398154 |
ADHDQ18 | 0.0332194 |
MDQ1a | 0.1038367 |
MDQ1b | 0.2479469 |
MDQ1c | -0.0916712 |
MDQ1d | 0.1797094 |
MDQ1e | -0.0275834 |
MDQ1f | 0.1876374 |
MDQ1g | 0.2997813 |
MDQ1h | 0.0570587 |
MDQ1i | 0.0501235 |
MDQ1j | 0.0485245 |
MDQ1k | 0.1708144 |
MDQ1L | 0.1412683 |
MDQ1m | 0.0713625 |
MDQ2 | 0.2596162 |
MDQ3 | 0.1764281 |
Alcohol | 0.2480052 |
THC | -0.0759040 |
Cocaine | -0.1126979 |
Opioids | 0.2160137 |
Courtorder | 0.1578478 |
Education | -0.0873615 |
HxofViolence | 0.1378318 |
DisorderlyConduct | 0.0144309 |
Suicide | 1.0000000 |
Abuse | 0.3109506 |
NonsubstDx | 0.0476540 |
SubstDx | 0.1086603 |
Psychmeds | -0.0045748 |
There are no strong correlations present in the data. Among the independent variables, history of abuse and alcohol usage may be indicators of suicide attempts among the respondents. Certain ADHD and MDQ responses also show a moderate correlation with suicide.
We will keep the variables that had a somewhat strong positive correlation, such as Abuse, Alcohol, MDQ1g, MDQ1d, and ADHDQ1. For the rest, we will use the caret package to determine which features may be significant in predicting suicide attempts.
set.seed(7)

# prepare the training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)

# train the model
model <- train(as.factor(Suicide)~., data=dataset, method="lvq", preProcess="scale", trControl=control)

# estimate variable importance
importance <- varImp(model, scale=FALSE)

# summarize importance
print(importance)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 48)
##
## Overall
## Abuse 0.08951
## MDQ1g 0.07775
## Alcohol 0.05840
## MDQ1b 0.05224
## MDQ2 0.05005
## MDQ1f 0.03921
## Courtorder 0.03732
## MDQ1d 0.03565
## Sex 0.03550
## MDQ1a 0.03363
## MDQ3 0.03281
## ADHDQ1 0.03123
## Opioids 0.02842
## MDQ1L 0.02681
## ADHDQ4 0.02321
## SubstDx 0.02049
## MDQ1k 0.02041
## ADHDQ7 0.01922
## Age 0.01870
## ADHDQ3 0.01854
We had already identified some of the variables above from the correlation matrix. We will keep the top 10 important variables for our SVM model.
First, let’s build the linear model with all variables (including those deemed insignificant). We will set the cost parameter to 10.
set.seed(123)
linear_svm1 = svm(formula=as.factor(Suicide)~., data=train, kernel="linear", cost=10, scale=FALSE)
print(linear_svm1)
##
## Call:
## svm(formula = as.factor(Suicide) ~ ., data = train, kernel = "linear",
## cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 63
test$pred1 = predict(linear_svm1, test)

confusionMatrix(test$pred1, as.factor(test$Suicide))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25 8
## 1 5 6
##
## Accuracy : 0.7045
## 95% CI : (0.548, 0.8324)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.4435
##
## Kappa : 0.2778
##
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.8333
## Specificity : 0.4286
## Pos Pred Value : 0.7576
## Neg Pred Value : 0.5455
## Prevalence : 0.6818
## Detection Rate : 0.5682
## Detection Prevalence : 0.7500
## Balanced Accuracy : 0.6310
##
## 'Positive' Class : 0
##
Let’s see if including only the important features improves model fit.
set.seed(123)
linear_svm2 = svm(formula=as.factor(Suicide)~Abuse+MDQ1g+Alcohol+MDQ1d+MDQ1b+MDQ2+ADHDQ1+MDQ3+Opioids+MDQ1a, data=train, kernel="linear", cost=10, scale=FALSE)
print(linear_svm2)
##
## Call:
## svm(formula = as.factor(Suicide) ~ Abuse + MDQ1g + Alcohol + MDQ1d +
## MDQ1b + MDQ2 + ADHDQ1 + MDQ3 + Opioids + MDQ1a, data = train,
## kernel = "linear", cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 73
test$pred2 = predict(linear_svm2, test)

confusionMatrix(test$pred2, as.factor(test$Suicide))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 26 10
## 1 4 4
##
## Accuracy : 0.6818
## 95% CI : (0.5242, 0.8139)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.5717
##
## Kappa : 0.172
##
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.8667
## Specificity : 0.2857
## Pos Pred Value : 0.7222
## Neg Pred Value : 0.5000
## Prevalence : 0.6818
## Detection Rate : 0.5909
## Detection Prevalence : 0.8182
## Balanced Accuracy : 0.5762
##
## 'Positive' Class : 0
##
The model fit has not improved after removing the insignificant features.
Next, we build a radial model with all variables.
set.seed(123)
# note: e1071::svm takes its cost via the `cost` argument; `C = 13` is silently
# ignored, so this model uses the default cost of 1 (as the output below shows)
radial_svm = svm(as.factor(Suicide)~., data=train, C = 13)
print(radial_svm)
##
## Call:
## svm(formula = as.factor(Suicide) ~ ., data = train, C = 13)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 111
test$pred3 = predict(radial_svm, test)

confusionMatrix(test$pred3, as.factor(test$Suicide))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 30 12
## 1 0 2
##
## Accuracy : 0.7273
## 95% CI : (0.5721, 0.8504)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.319382
##
## Kappa : 0.1852
##
## Mcnemar's Test P-Value : 0.001496
##
## Sensitivity : 1.0000
## Specificity : 0.1429
## Pos Pred Value : 0.7143
## Neg Pred Value : 1.0000
## Prevalence : 0.6818
## Detection Rate : 0.6818
## Detection Prevalence : 0.9545
## Balanced Accuracy : 0.5714
##
## 'Positive' Class : 0
##
We will build another radial model with only the most significant variables.
set.seed(123)
# as above, `C = 13` is ignored and the default cost of 1 is used
radial_svm2 = svm(as.factor(Suicide)~Abuse+MDQ1g+Alcohol+MDQ1d+MDQ1b+MDQ2+ADHDQ1+MDQ3+Opioids+MDQ1a, data=train, C = 13)
print(radial_svm2)
##
## Call:
## svm(formula = as.factor(Suicide) ~ Abuse + MDQ1g + Alcohol + MDQ1d +
## MDQ1b + MDQ2 + ADHDQ1 + MDQ3 + Opioids + MDQ1a, data = train,
## C = 13)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 89
test$pred4 = predict(radial_svm2, test)

confusionMatrix(test$pred4, as.factor(test$Suicide))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27 12
## 1 3 2
##
## Accuracy : 0.6591
## 95% CI : (0.5008, 0.7951)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.69143
##
## Kappa : 0.0517
##
## Mcnemar's Test P-Value : 0.03887
##
## Sensitivity : 0.9000
## Specificity : 0.1429
## Pos Pred Value : 0.6923
## Neg Pred Value : 0.4000
## Prevalence : 0.6818
## Detection Rate : 0.6136
## Detection Prevalence : 0.8864
## Balanced Accuracy : 0.5214
##
## 'Positive' Class : 0
##
The radial model’s fit has not improved after removing the insignificant variables either. In terms of raw accuracy, however, the first radial model outperforms the other SVM models.
We will store the results of our final SVM model in a data frame.
SVM_Model_Final <- confusionMatrix(test$pred4, as.factor(test$Suicide))$byClass
SVM_Model_Final <- data.frame(SVM_Model_Final)
AccuracySVM <- confusionMatrix(test$pred4, as.factor(test$Suicide))$overall['Accuracy']
SVM_Model_Final <- rbind("Accuracy" = AccuracySVM, SVM_Model_Final)
tabularview <- data.frame(SVM_Model_Final)

tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
 | SVM_Model_Final |
---|---|
Accuracy | 0.6590909 |
Sensitivity | 0.9000000 |
Specificity | 0.1428571 |
Pos Pred Value | 0.6923077 |
Neg Pred Value | 0.4000000 |
Precision | 0.6923077 |
Recall | 0.9000000 |
F1 | 0.7826087 |
Prevalence | 0.6818182 |
Detection Rate | 0.6136364 |
Detection Prevalence | 0.8863636 |
Balanced Accuracy | 0.5214286 |
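Before concluding, a short sketch (our addition, not part of the original code) lines up the stored metric frames side by side; it assumes the gbm_model and tabularview data frames created in the sections above:

# side-by-side comparison of the stored GBM and final SVM metrics
comparison <- data.frame(GBM = gbm_model[['gbm_model']],
                         SVM = tabularview[['SVM_Model_Final']],
                         row.names = rownames(gbm_model))
comparison %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))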
On balanced accuracy and kappa, the linear SVM model outperformed the radial SVM models; however, the stochastic gradient boosting model holds the best accuracy on the test data. This appears to be the result of its higher specificity, which in turn may stem from gradient boosting keeping both bias and variance low through its sequential boosting method. The k-means clustering model is an unsupervised machine learning technique that gave us insight into our mental health dataset: with an optimal k of 2, we were able to group the observations into two clusters.
Also, based on the PCA, we found that the first few principal components explained a large proportion of the variance. Further analysis identified the variables with the greatest representation in, and contribution to, the first couple of principal components.