Data 622 Homework 4: Mental Health Data Modeling

library(knitr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret) # For createDataPartition, featureplot, classification report
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(skimr)    # Used for EDA
#library(corrplot) # For correlation matrix
library(mice) # Multivariate Imputation By Chained Equations
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(parallel)
library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
#library(plyr)
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
library(ggplot2)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(stats)
library(e1071)
library(vcd)
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:e1071':
## 
##     impute
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(Boruta)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Global options
# ---------------------------------
options(max.print="108")
opts_knit$set(width=31)

Loading data

# getwd()
setwd("~/Data 622/repos/data622_fa2021/hw4")
adhd_data <- readxl::read_excel('ADHD_data.xlsx', sheet = "Data", .name_repair = "universal", na = "")
## New names:
## * `ADHD Q1` -> ADHD.Q1
## * `ADHD Q2` -> ADHD.Q2
## * `ADHD Q3` -> ADHD.Q3
## * `ADHD Q4` -> ADHD.Q4
## * `ADHD Q5` -> ADHD.Q5
## * ...
# dim(adhd_data)
str(adhd_data)
## tibble [175 × 54] (S3: tbl_df/tbl/data.frame)
##  $ Initial           : chr [1:175] "JA" "LA" "MD" "RD" ...
##  $ Age               : num [1:175] 24 48 51 43 34 39 41 48 44 27 ...
##  $ Sex               : num [1:175] 1 2 2 1 1 2 2 1 2 2 ...
##  $ Race              : num [1:175] 1 1 1 1 1 1 1 1 1 1 ...
##  $ ADHD.Q1           : num [1:175] 1 3 2 3 4 2 2 2 3 2 ...
##  $ ADHD.Q2           : num [1:175] 1 3 1 3 4 3 2 4 3 3 ...
##  $ ADHD.Q3           : num [1:175] 4 4 2 2 2 1 1 3 3 4 ...
##  $ ADHD.Q4           : num [1:175] 2 4 1 2 4 4 3 4 4 4 ...
##  $ ADHD.Q5           : num [1:175] 3 5 3 4 4 3 4 3 4 4 ...
##  $ ADHD.Q6           : num [1:175] 1 2 3 3 2 2 4 3 3 3 ...
##  $ ADHD.Q7           : num [1:175] 1 2 3 2 3 3 2 1 3 4 ...
##  $ ADHD.Q8           : num [1:175] 3 3 2 4 4 4 3 1 4 3 ...
##  $ ADHD.Q9           : num [1:175] 2 2 0 4 4 4 3 4 3 2 ...
##  $ ADHD.Q10          : num [1:175] 4 4 1 2 2 2 4 2 4 4 ...
##  $ ADHD.Q11          : num [1:175] 2 1 2 3 4 4 3 4 3 3 ...
##  $ ADHD.Q12          : num [1:175] 4 4 0 1 1 2 1 1 0 1 ...
##  $ ADHD.Q13          : num [1:175] 1 2 2 3 3 4 4 4 4 3 ...
##  $ ADHD.Q14          : num [1:175] 0 4 2 3 2 4 4 3 3 4 ...
##  $ ADHD.Q15          : num [1:175] 3 4 3 1 1 3 4 0 3 4 ...
##  $ ADHD.Q16          : num [1:175] 1 3 2 2 2 4 4 0 2 4 ...
##  $ ADHD.Q17          : num [1:175] 3 1 1 1 1 3 2 1 4 2 ...
##  $ ADHD.Q18          : num [1:175] 4 4 1 2 1 3 4 1 3 2 ...
##  $ ADHD.Total        : num [1:175] 40 55 31 45 48 55 54 41 56 56 ...
##  $ MD.Q1a            : num [1:175] 1 1 0 1 0 0 1 0 1 1 ...
##  $ MD.Q1b            : num [1:175] 1 1 0 1 1 1 1 0 1 1 ...
##  $ MD.Q1c            : num [1:175] 1 1 0 0 0 0 0 0 0 0 ...
##  $ MD.Q1d            : num [1:175] 1 1 0 0 1 1 0 0 1 0 ...
##  $ MD.Q1e            : num [1:175] 0 1 1 1 0 1 1 0 1 1 ...
##  $ MD.Q1f            : num [1:175] 1 1 1 1 1 1 1 1 1 0 ...
##  $ MD.Q1g            : num [1:175] 1 1 1 1 1 1 0 1 1 1 ...
##  $ MD.Q1h            : num [1:175] 1 1 0 1 0 1 0 0 0 0 ...
##  $ MD.Q1i            : num [1:175] 1 1 0 1 0 1 0 0 0 0 ...
##  $ MD.Q1j            : num [1:175] 1 0 0 0 0 1 0 0 0 1 ...
##  $ MD.Q1k            : num [1:175] 1 0 0 0 0 1 0 0 0 1 ...
##  $ MD.Q1L            : num [1:175] 0 1 0 1 0 1 1 1 1 1 ...
##  $ MD.Q1m            : num [1:175] 1 0 0 1 0 0 0 0 1 1 ...
##  $ MD.Q2             : num [1:175] 1 1 0 1 1 1 1 1 1 1 ...
##  $ MD.Q3             : num [1:175] 3 3 2 3 2 3 3 3 3 2 ...
##  $ MD.TOTAL          : num [1:175] 15 14 5 13 7 14 9 7 12 11 ...
##  $ Alcohol           : num [1:175] 1 0 0 1 1 1 3 0 1 0 ...
##  $ THC               : num [1:175] 1 0 0 1 1 0 3 0 0 3 ...
##  $ Cocaine           : num [1:175] 1 0 0 1 0 0 1 0 0 0 ...
##  $ Stimulants        : num [1:175] 0 0 0 1 0 0 1 0 0 0 ...
##  $ Sedative.hypnotics: num [1:175] 0 0 0 0 0 0 1 0 0 0 ...
##  $ Opioids           : num [1:175] 0 0 0 0 0 0 0 0 1 0 ...
##  $ Court.order       : num [1:175] 1 0 0 0 1 0 0 0 0 0 ...
##  $ Education         : num [1:175] 11 14 12 12 9 11 12 16 12 9 ...
##  $ Hx.of.Violence    : num [1:175] 0 0 0 0 1 0 1 1 1 0 ...
##  $ Disorderly.Conduct: num [1:175] 1 0 0 0 1 1 1 1 1 1 ...
##  $ Suicide           : num [1:175] 1 1 0 1 1 1 0 0 0 0 ...
##  $ Abuse             : num [1:175] 0 4 6 7 0 2 4 0 0 2 ...
##  $ Non.subst.Dx      : num [1:175] 2 1 2 2 2 0 1 2 1 1 ...
##  $ Subst.Dx          : num [1:175] 0 0 0 0 0 0 0 1 0 2 ...
##  $ Psych.meds.       : num [1:175] 2 1 1 2 0 0 1 2 1 0 ...

Data Processing Steps

I employed a couple strategies into transforming the data before we can apply models.

  • Imputing only when necessary. Imputing by applying the right method for the right variable type.

  • Make sure to make the fields into factors when they are categorical variables.

aggr_plot <- aggr(adhd_data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##            Variable      Count
##         Psych.meds. 0.67428571
##            Subst.Dx 0.13142857
##        Non.subst.Dx 0.12571429
##               Abuse 0.08000000
##             Suicide 0.07428571
##      Hx.of.Violence 0.06285714
##  Disorderly.Conduct 0.06285714
##           Education 0.05142857
##         Court.order 0.02857143
##             Alcohol 0.02285714
##                 THC 0.02285714
##             Cocaine 0.02285714
##          Stimulants 0.02285714
##  Sedative.hypnotics 0.02285714
##             Opioids 0.02285714
##             Initial 0.00000000
##                 Age 0.00000000
##                 Sex 0.00000000
##                Race 0.00000000
##             ADHD.Q1 0.00000000
##             ADHD.Q2 0.00000000
##             ADHD.Q3 0.00000000
##             ADHD.Q4 0.00000000
##             ADHD.Q5 0.00000000
##             ADHD.Q6 0.00000000
##             ADHD.Q7 0.00000000
##             ADHD.Q8 0.00000000
##             ADHD.Q9 0.00000000
##            ADHD.Q10 0.00000000
##            ADHD.Q11 0.00000000
##            ADHD.Q12 0.00000000
##            ADHD.Q13 0.00000000
##            ADHD.Q14 0.00000000
##            ADHD.Q15 0.00000000
##            ADHD.Q16 0.00000000
##            ADHD.Q17 0.00000000
##            ADHD.Q18 0.00000000
##          ADHD.Total 0.00000000
##              MD.Q1a 0.00000000
##              MD.Q1b 0.00000000
##              MD.Q1c 0.00000000
##              MD.Q1d 0.00000000
##              MD.Q1e 0.00000000
##              MD.Q1f 0.00000000
##              MD.Q1g 0.00000000
##              MD.Q1h 0.00000000
##              MD.Q1i 0.00000000
##              MD.Q1j 0.00000000
##              MD.Q1k 0.00000000
##              MD.Q1L 0.00000000
##              MD.Q1m 0.00000000
##               MD.Q2 0.00000000
##               MD.Q3 0.00000000
##            MD.TOTAL 0.00000000

Removing Initial as it provides no values to any models. Also, I removed Psych.meds. as it has 67% missing.

adhd_data_pre_imp <- adhd_data %>% select(-c('Initial','Psych.meds.'))
skim(adhd_data_pre_imp) %>% dplyr::filter( n_missing > 0 ) 
Data summary
Name adhd_data_pre_imp
Number of rows 175
Number of columns 52
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Alcohol 4 0.98 1.35 1.39 0 0 1 3.0 3 ▇▂▁▁▆
THC 4 0.98 0.81 1.27 0 0 0 1.5 3 ▇▁▁▁▃
Cocaine 4 0.98 1.09 1.39 0 0 0 3.0 3 ▇▁▁▁▅
Stimulants 4 0.98 0.12 0.53 0 0 0 0.0 3 ▇▁▁▁▁
Sedative.hypnotics 4 0.98 0.12 0.54 0 0 0 0.0 3 ▇▁▁▁▁
Opioids 4 0.98 0.39 0.99 0 0 0 0.0 3 ▇▁▁▁▁
Court.order 5 0.97 0.09 0.28 0 0 0 0.0 1 ▇▁▁▁▁
Education 9 0.95 11.90 2.17 6 11 12 13.0 19 ▁▅▇▂▁
Hx.of.Violence 11 0.94 0.24 0.43 0 0 0 0.0 1 ▇▁▁▁▂
Disorderly.Conduct 11 0.94 0.73 0.45 0 0 1 1.0 1 ▃▁▁▁▇
Suicide 13 0.93 0.30 0.46 0 0 0 1.0 1 ▇▁▁▁▃
Abuse 14 0.92 1.33 2.12 0 0 0 2.0 7 ▇▂▁▁▁
Non.subst.Dx 22 0.87 0.44 0.68 0 0 0 1.0 2 ▇▁▃▁▁
Subst.Dx 23 0.87 1.14 0.93 0 0 1 2.0 3 ▆▇▁▅▂
adhd_data_pre_imp <- adhd_data_pre_imp %>% mutate(NonSubstDx =  as.factor(Non.subst.Dx)  ,
                                                  SubstDx = as.factor(Subst.Dx) ,
                                                  Abuse = factor(Abuse, ordered = TRUE), 
                                                  Alcohol = as.factor(Alcohol)
)


adhd_data_pre_imp <- adhd_data_pre_imp %>%  mutate_if(is.numeric, factor)
skim(adhd_data_pre_imp)
Data summary
Name adhd_data_pre_imp
Number of rows 175
Number of columns 54
_______________________
Column type frequency:
factor 54
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Age 0 1.00 FALSE 42 44: 12, 51: 9, 26: 8, 33: 7
Sex 0 1.00 FALSE 2 1: 99, 2: 76
Race 0 1.00 FALSE 4 2: 100, 1: 72, 6: 2, 3: 1
ADHD.Q1 0 1.00 FALSE 5 2: 44, 1: 43, 0: 39, 3: 30
ADHD.Q2 0 1.00 FALSE 5 2: 47, 1: 46, 3: 33, 0: 25
ADHD.Q3 0 1.00 FALSE 5 1: 46, 2: 46, 3: 32, 0: 26
ADHD.Q4 0 1.00 FALSE 5 2: 50, 4: 36, 1: 31, 3: 31
ADHD.Q5 0 1.00 FALSE 6 3: 47, 4: 41, 0: 33, 2: 32
ADHD.Q6 0 1.00 FALSE 5 2: 45, 3: 45, 0: 36, 1: 29
ADHD.Q7 0 1.00 FALSE 5 2: 54, 1: 53, 3: 25, 0: 22
ADHD.Q8 0 1.00 FALSE 5 3: 42, 1: 40, 2: 40, 4: 32
ADHD.Q9 0 1.00 FALSE 5 1: 43, 3: 41, 2: 36, 0: 31
ADHD.Q10 0 1.00 FALSE 5 2: 49, 1: 46, 3: 33, 4: 32
ADHD.Q11 0 1.00 FALSE 5 2: 48, 3: 43, 4: 35, 1: 33
ADHD.Q12 0 1.00 FALSE 5 0: 55, 1: 55, 2: 37, 3: 15
ADHD.Q13 0 1.00 FALSE 5 3: 47, 2: 46, 4: 38, 1: 29
ADHD.Q14 0 1.00 FALSE 5 3: 47, 2: 40, 4: 37, 0: 27
ADHD.Q15 0 1.00 FALSE 5 0: 50, 1: 39, 2: 35, 3: 27
ADHD.Q16 0 1.00 FALSE 5 1: 49, 0: 40, 2: 39, 4: 30
ADHD.Q17 0 1.00 FALSE 5 0: 49, 2: 46, 1: 41, 3: 22
ADHD.Q18 0 1.00 FALSE 5 1: 52, 0: 49, 2: 35, 3: 20
ADHD.Total 0 1.00 FALSE 62 17: 8, 31: 7, 32: 7, 24: 6
MD.Q1a 0 1.00 FALSE 2 1: 96, 0: 79
MD.Q1b 0 1.00 FALSE 2 1: 100, 0: 75
MD.Q1c 0 1.00 FALSE 2 1: 95, 0: 80
MD.Q1d 0 1.00 FALSE 2 1: 102, 0: 73
MD.Q1e 0 1.00 FALSE 2 1: 97, 0: 78
MD.Q1f 0 1.00 FALSE 2 1: 122, 0: 53
MD.Q1g 0 1.00 FALSE 2 1: 126, 0: 49
MD.Q1h 0 1.00 FALSE 2 1: 98, 0: 77
MD.Q1i 0 1.00 FALSE 2 1: 103, 0: 72
MD.Q1j 0 1.00 FALSE 2 0: 107, 1: 68
MD.Q1k 0 1.00 FALSE 2 0: 90, 1: 85
MD.Q1L 0 1.00 FALSE 2 1: 102, 0: 73
MD.Q1m 0 1.00 FALSE 2 0: 89, 1: 86
MD.Q2 0 1.00 FALSE 2 1: 126, 0: 49
MD.Q3 0 1.00 FALSE 4 3: 76, 2: 49, 0: 25, 1: 25
MD.TOTAL 0 1.00 FALSE 18 11: 18, 15: 14, 10: 13, 13: 13
Alcohol 4 0.98 FALSE 4 0: 80, 3: 66, 1: 18, 2: 7
THC 4 0.98 FALSE 4 0: 116, 3: 40, 1: 12, 2: 3
Cocaine 4 0.98 FALSE 4 0: 101, 3: 56, 1: 9, 2: 5
Stimulants 4 0.98 FALSE 3 0: 160, 1: 6, 3: 5
Sedative.hypnotics 4 0.98 FALSE 4 0: 161, 3: 5, 1: 4, 2: 1
Opioids 4 0.98 FALSE 3 0: 146, 3: 21, 1: 4
Court.order 5 0.97 FALSE 2 0: 155, 1: 15
Education 9 0.95 FALSE 14 12: 67, 11: 23, 13: 15, 14: 14
Hx.of.Violence 11 0.94 FALSE 2 0: 124, 1: 40
Disorderly.Conduct 11 0.94 FALSE 2 1: 119, 0: 45
Suicide 13 0.93 FALSE 2 0: 113, 1: 49
Abuse 14 0.92 TRUE 8 0: 101, 2: 20, 5: 10, 1: 8
Non.subst.Dx 22 0.87 FALSE 3 0: 102, 1: 35, 2: 16
Subst.Dx 23 0.87 FALSE 4 1: 61, 0: 42, 2: 35, 3: 14
NonSubstDx 22 0.87 FALSE 3 0: 102, 1: 35, 2: 16
SubstDx 23 0.87 FALSE 4 1: 61, 0: 42, 2: 35, 3: 14
adhd_data_pre_imp1 <- adhd_data_pre_imp %>% mutate(Education = as.numeric(Education)  )
skim(adhd_data_pre_imp1)
Data summary
Name adhd_data_pre_imp1
Number of rows 175
Number of columns 54
_______________________
Column type frequency:
factor 53
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Age 0 1.00 FALSE 42 44: 12, 51: 9, 26: 8, 33: 7
Sex 0 1.00 FALSE 2 1: 99, 2: 76
Race 0 1.00 FALSE 4 2: 100, 1: 72, 6: 2, 3: 1
ADHD.Q1 0 1.00 FALSE 5 2: 44, 1: 43, 0: 39, 3: 30
ADHD.Q2 0 1.00 FALSE 5 2: 47, 1: 46, 3: 33, 0: 25
ADHD.Q3 0 1.00 FALSE 5 1: 46, 2: 46, 3: 32, 0: 26
ADHD.Q4 0 1.00 FALSE 5 2: 50, 4: 36, 1: 31, 3: 31
ADHD.Q5 0 1.00 FALSE 6 3: 47, 4: 41, 0: 33, 2: 32
ADHD.Q6 0 1.00 FALSE 5 2: 45, 3: 45, 0: 36, 1: 29
ADHD.Q7 0 1.00 FALSE 5 2: 54, 1: 53, 3: 25, 0: 22
ADHD.Q8 0 1.00 FALSE 5 3: 42, 1: 40, 2: 40, 4: 32
ADHD.Q9 0 1.00 FALSE 5 1: 43, 3: 41, 2: 36, 0: 31
ADHD.Q10 0 1.00 FALSE 5 2: 49, 1: 46, 3: 33, 4: 32
ADHD.Q11 0 1.00 FALSE 5 2: 48, 3: 43, 4: 35, 1: 33
ADHD.Q12 0 1.00 FALSE 5 0: 55, 1: 55, 2: 37, 3: 15
ADHD.Q13 0 1.00 FALSE 5 3: 47, 2: 46, 4: 38, 1: 29
ADHD.Q14 0 1.00 FALSE 5 3: 47, 2: 40, 4: 37, 0: 27
ADHD.Q15 0 1.00 FALSE 5 0: 50, 1: 39, 2: 35, 3: 27
ADHD.Q16 0 1.00 FALSE 5 1: 49, 0: 40, 2: 39, 4: 30
ADHD.Q17 0 1.00 FALSE 5 0: 49, 2: 46, 1: 41, 3: 22
ADHD.Q18 0 1.00 FALSE 5 1: 52, 0: 49, 2: 35, 3: 20
ADHD.Total 0 1.00 FALSE 62 17: 8, 31: 7, 32: 7, 24: 6
MD.Q1a 0 1.00 FALSE 2 1: 96, 0: 79
MD.Q1b 0 1.00 FALSE 2 1: 100, 0: 75
MD.Q1c 0 1.00 FALSE 2 1: 95, 0: 80
MD.Q1d 0 1.00 FALSE 2 1: 102, 0: 73
MD.Q1e 0 1.00 FALSE 2 1: 97, 0: 78
MD.Q1f 0 1.00 FALSE 2 1: 122, 0: 53
MD.Q1g 0 1.00 FALSE 2 1: 126, 0: 49
MD.Q1h 0 1.00 FALSE 2 1: 98, 0: 77
MD.Q1i 0 1.00 FALSE 2 1: 103, 0: 72
MD.Q1j 0 1.00 FALSE 2 0: 107, 1: 68
MD.Q1k 0 1.00 FALSE 2 0: 90, 1: 85
MD.Q1L 0 1.00 FALSE 2 1: 102, 0: 73
MD.Q1m 0 1.00 FALSE 2 0: 89, 1: 86
MD.Q2 0 1.00 FALSE 2 1: 126, 0: 49
MD.Q3 0 1.00 FALSE 4 3: 76, 2: 49, 0: 25, 1: 25
MD.TOTAL 0 1.00 FALSE 18 11: 18, 15: 14, 10: 13, 13: 13
Alcohol 4 0.98 FALSE 4 0: 80, 3: 66, 1: 18, 2: 7
THC 4 0.98 FALSE 4 0: 116, 3: 40, 1: 12, 2: 3
Cocaine 4 0.98 FALSE 4 0: 101, 3: 56, 1: 9, 2: 5
Stimulants 4 0.98 FALSE 3 0: 160, 1: 6, 3: 5
Sedative.hypnotics 4 0.98 FALSE 4 0: 161, 3: 5, 1: 4, 2: 1
Opioids 4 0.98 FALSE 3 0: 146, 3: 21, 1: 4
Court.order 5 0.97 FALSE 2 0: 155, 1: 15
Hx.of.Violence 11 0.94 FALSE 2 0: 124, 1: 40
Disorderly.Conduct 11 0.94 FALSE 2 1: 119, 0: 45
Suicide 13 0.93 FALSE 2 0: 113, 1: 49
Abuse 14 0.92 TRUE 8 0: 101, 2: 20, 5: 10, 1: 8
Non.subst.Dx 22 0.87 FALSE 3 0: 102, 1: 35, 2: 16
Subst.Dx 23 0.87 FALSE 4 1: 61, 0: 42, 2: 35, 3: 14
NonSubstDx 22 0.87 FALSE 3 0: 102, 1: 35, 2: 16
SubstDx 23 0.87 FALSE 4 1: 61, 0: 42, 2: 35, 3: 14

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Education 9 0.95 6.9 2.17 1 6 7 8 14 ▁▅▇▂▁

Imputation

More specifically, I separated the variables with missing values into the following categories:

  • Categorical Variables with more than 2 levels: Subst.Dx, NonSubstDx, Opioids, Sedative.hypnotics, Stimulants, Cocaine, THC, Alcohol - polyreg

  • Categorical Variables with 2 levels: Disorderly.Conduct, Hx.of.Violence, Court.order - logreg

  • Discrete variable (ordered) with more than 2 levels: Abuse - polr

  • Continuous Variables: Education - pmm -> norm

Note that all the imputation codes below are intentionally commented out just because we relied on saved adhd_data_2.rds and adhd_data_3.rds files as the process of fully paralleled processed mice methods still would take significant amount of time to run (for the RMD knitting). It might take approx. 5 mins to the upwards of 17 mins.

init <- mice(adhd_data_pre_imp1, maxit=0) 
## Warning: Number of logged events: 2
meth <- init$method
predM <- init$predictorMatrix
meth[c('Education')] <- 'pmm'
meth[c('Disorderly.Conduct', 'Hx.of.Violence', 'Court.order')] <- 'logreg'
meth[c('Subst.Dx', 'NonSubstDx' , 'Opioids', 'Sedative.hypnotics', 'Stimulants', 'Cocaine', 'THC', 'Alcohol')] <- 'polyreg'
meth[c('Abuse')] = 'polr'
meth[c('Age','Sex','Race','ADHD.Q1','ADHD.Q2','ADHD.Q3','ADHD.Q4','ADHD.Q5','ADHD.Q6','ADHD.Q7','ADHD.Q8','ADHD.Q9','ADHD.Q10','ADHD.Q11','ADHD.Q12','ADHD.Q13','ADHD.Q14','ADHD.Q15','ADHD.Q16','ADHD.Q17','ADHD.Q18','ADHD.Total','MD.Q1a','MD.Q1b','MD.Q1c','MD.Q1d','MD.Q1e','MD.Q1f','MD.Q1g','MD.Q1h','MD.Q1i','MD.Q1j','MD.Q1k','MD.Q1L','MD.Q1m','MD.Q2','MD.Q3','MD.TOTAL','Suicide')] = ''


# no_cores <- detectCores() - 1
# 
# cl<-makePSOCKcluster(no_cores)
# 
# registerDoParallel(cl)
# 
# start.time<-proc.time()
# 
# adhd_data_imp1 <- mice.par(adhd_data_pre_imp1, method=meth, predictorMatrix=predM, seed=301, m = 5)
# 
# stop.time<-proc.time()
# 
# run.time<-stop.time -start.time
# 
# print(run.time)
# 
# stopCluster(cl)
# user  system elapsed 
#   1.826   4.303 264.775 

Examining the imputations

Examining the imputations for Education. As the 50-th percentile before was 12, we picked an impute that is exactly that. Impute #4 fits the bill.

# str(adhd_data_imp1)
# skim(adhd_data_imp1) %>% dplyr::filter( n_missing > 0 )
# 
#Eduation
# adhd_data_imp1$imp$Education
# adhd_data_pre_imp1[140:142,]
# 
# 
# #Age
# adhd_data_imp1$imp$Alcohol
# adhd_data_2 <- complete(adhd_data_imp1, 4) # 2nd argument if not provided is defaulted to 1
# skim(adhd_data_2) %>% dplyr::filter( n_missing > 0 )
# # # saving a R data frame

# saveRDS(adhd_data_2, "adhd_data_2.rds")

Ended up dropping 29 records as even there are other means of imputing the missing records manually. I just don’t think it’s trustworthy when advanced algorithms did not end up imputing them. The need to have complete cases, except for the suicide variable for section of clustering and PCA where suicide variable is not the response variable, or variable of interest, is the ultimate reason why I had to drop any records with even a missing field from the dataset.

# loading rds object

# adhd_data_2 <- readRDS("adhd_data_2.rds")



# adhd_data_pre_imp1[complete.cases(adhd_data_pre_imp1[ , -49]),]

# saved 3 records
# adhd_data_3 <- adhd_data_2[complete.cases(adhd_data_2[ , -49]),]

Loading from adhd_data_3.rds

# Saving the records 
# saveRDS(adhd_data_3, "adhd_data_3.rds")
#Loading the records
adhd_data_3 <- readRDS("adhd_data_3.rds")

Count for each Education / suicide attempt

ggplot(adhd_data_3, aes(x = Education, fill = Suicide)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4"),
                    guide = F) +
  theme_minimal() +
  facet_wrap(~Suicide, ncol = 1) +
  coord_flip()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

mosaic(~ Education + Suicide, data = adhd_data_3)

Count for each Subst.Dx / suicide attempt

ggplot(adhd_data_3, aes(x = Subst.Dx, fill = Suicide)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4"),
                    guide = F) +
  theme_minimal() +
  facet_wrap(~Suicide, ncol = 1) +
  coord_flip()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

mosaic(~ Subst.Dx + Suicide, data = adhd_data_3)

Count for each Non.subst.Dx / suicide attempt

ggplot(adhd_data_3, aes(x = Non.subst.Dx, fill = Suicide)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4"),
                    guide = F) +
  theme_minimal() +
  facet_wrap(~Suicide, ncol = 1) +
  coord_flip()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

mosaic(~ Non.subst.Dx + Suicide, data = adhd_data_3)

Count for each Abuse / Suicide

ggplot(adhd_data_3, aes(x = Abuse, fill = Suicide)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4", "black"),
                    guide = F) +
  theme_minimal() +
  facet_wrap(~Suicide, ncol = 1) +
  coord_flip()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

mosaic(~ Abuse + Suicide, data = adhd_data_3)

Clustering Method

Here we are going to start our clustering techniques using hierarchical clustering. The benefits of this clustering type is that we don’t need to specify the number of clusters and results are reproducable.

Instead of using all the columns from our data, we decided to only use the column ADHD.Total an MD.Total to the represent the individual columns combined together. This will help our results to be easier to interpret.

The output from hierarchical clustering is a dendogram, which is a graph that looks like a tree. We can see from our dendogram below there are 4 distinct clusters.

hier_cluster_subset <- adhd_data_3 %>% select(c('Age','Sex','Race','ADHD.Total','MD.TOTAL','Alcohol','THC','Cocaine','Stimulants','Sedative.hypnotics','Opioids','Court.order','Education','Hx.of.Violence','Disorderly.Conduct','Suicide','Abuse','Non.subst.Dx','Subst.Dx','NonSubstDx','SubstDx'))
hier_cluster_complete <- hclust(dist(hier_cluster_subset),method="complete")
plot(hier_cluster_complete, hang = -1, cex = 0.6)
abline(h = 43.5, col = "red")

From the above dendogram, we can see there are 4 distinct clusters so we will cut our tree at this level using the function cutree(). This function returns a vector of cluster assignments, which can be see here:

vect_cut <- cutree(hier_cluster_complete,4)
hier_cluster_subset$cluster_pred <- vect_cut
vect_cut
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   2   3   1   1   1   1   2   1   1   1   3   3   1   4   1   2   2   4   2 
##  21  24  25  26  27  28  29  33  34  35  36  38  39  40  42  43  44  45  46  47 
##   2   2   2   2   2   1   2   4   1   3   4   1   1   1   2   4   1   3   4   1 
##  48  50  51  52  54  55  56  57  58  59  60  61  62  63  64  65  66  68  69  70 
##   4   3   3   4   2   2   1   4   1   4   1   1   2   1   1   4   3   2   1   4 
##  71  72  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91 
##   1   2   1   4   2   2   3   2   2   3   2   1   2   4   4   2   3   1   1   1 
##  92  93  94  96  97  98 100 101 109 110 111 113 114 115 116 118 119 120 121 123 
##   3   1   4   1   1   1   1   1   3   3   1   4   4   4   4   3   3   4   4   1 
## 124 125 126 127 128 129 130 133 
##   1   4   3   4   3   1   1   4 
##  [ reached getOption("max.print") -- omitted 38 entries ]

The above clustering included quite a few columns. Now we are going to make a simpler clustering dendogram using only age,ADHD.Total,MD.TOTAL. These columns represent our numeric data.

hier_cluster_subset_simple <- adhd_data_3 %>% select(c('Age','ADHD.Total','MD.TOTAL'))
hier_cluster_complete_simple <- hclust(dist(hier_cluster_subset_simple),method="complete")
plot(hier_cluster_complete_simple, hang = -1, cex = 0.6)
abline(h = 50, col = "red")

Based on the above dendogram, we can see there are 3 distinct clusters. Again, we will cut our tree and view the output cluster assignments.

num_cut = cutree(hier_cluster_complete_simple,3)
hier_cluster_subset_simple$cluster_pred <- num_cut
num_cut
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   2   1   1   2   2   2   1   2   2   1   1   1   1   1   1   1   2   1   2 
##  21  24  25  26  27  28  29  33  34  35  36  38  39  40  42  43  44  45  46  47 
##   2   2   2   2   1   2   2   1   2   1   1   1   1   2   2   1   1   1   3   1 
##  48  50  51  52  54  55  56  57  58  59  60  61  62  63  64  65  66  68  69  70 
##   1   1   1   3   2   2   1   1   1   1   1   2   2   1   1   3   1   1   2   1 
##  71  72  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91 
##   1   2   1   3   1   2   1   2   2   1   2   2   1   3   3   2   1   1   2   1 
##  92  93  94  96  97  98 100 101 109 110 111 113 114 115 116 118 119 120 121 123 
##   1   2   1   1   1   2   2   1   1   1   1   1   3   1   3   1   1   3   3   1 
## 124 125 126 127 128 129 130 133 
##   2   3   1   1   1   2   1   3 
##  [ reached getOption("max.print") -- omitted 38 entries ]

Here we can see that it appears that cluster 2 had a median and average age of 38. This cluster’s age fits right between cluster 1 & 3. Cluster 2 had the highest individual scores for reporting ADHD and reporting mood disorder. It would be interesting to see if there is any data to investigate if this age has any correlation with a mid-life crisis. Younger participants self reported the lower amount of ADHD and mood disorder.

results_simple <- hier_cluster_subset_simple %>% dplyr::group_by(cluster_pred) 
results_simple$Age <- as.numeric(as.character(results_simple$Age))
results_simple$ADHD.Total <- as.numeric(as.character(results_simple$ADHD.Total))
results_simple$MD.TOTAL <- as.numeric(as.character(results_simple$MD.TOTAL))
results_simple %>% dplyr::summarize(Avg_Age = mean(Age,na.rm=TRUE),
                                    Median_Age = median(Age,na.rm=TRUE),
                                    Avg_ADHD.Total = mean(ADHD.Total,na.rm=TRUE),
                                    Median_ADHD.Total = median(ADHD.Total,na.rm=TRUE),
                                    Avg_MD.TOTAL = mean(MD.TOTAL,na.rm=TRUE),
                                    Median_MD.TOTAL = median(MD.TOTAL,na.rm=TRUE))
## # A tibble: 3 × 7
##   cluster_pred Avg_Age Median_Age Avg_ADHD.Total Median_ADHD.Total Avg_MD.TOTAL
##          <int>   <dbl>      <dbl>          <dbl>             <dbl>        <dbl>
## 1            1    41.2         43          31.1                 31        10.3 
## 2            2    38.3         38          54.5                 52        12.5 
## 3            3    35.8         40           9.52                 9         6.29
## # … with 1 more variable: Median_MD.TOTAL <dbl>

Principal Component Analysis

# function that returns principal components up to the threshold variance 

rank.pca = function(data,threshold){
  rank = 1
  cum_var = 0
  while (cum_var <= threshold){
    pca = prcomp(data, rank. = rank)
    cum_var = summary(pca)$importance[rank*3]
    rank = rank + 1
  }
  return(pca)
}
data = adhd_data_3

# arbitrary threshold of .85

pca.thresh = .85

# change types back to int so we can use prcomp()

for (name in names(data)){
  data[name] = c(sapply(data[name], as.integer))
}

PCA for ADHD Questions

is.adhd_question = str_detect(names(data), '^ADHD\\.Q')
adhd_questions = data[is.adhd_question]

adhd.pca = rank.pca(adhd_questions,pca.thresh)
summary(adhd.pca)
## Importance of first k=9 (out of 18) components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.9567 1.58025 1.35806 1.21656 1.10910 1.04123 0.97067
## Proportion of Variance 0.5095 0.08127 0.06002 0.04817 0.04003 0.03528 0.03066
## Cumulative Proportion  0.5095 0.59077 0.65080 0.69897 0.73900 0.77429 0.80495
##                            PC8     PC9
## Standard deviation     0.91367 0.88988
## Proportion of Variance 0.02717 0.02577
## Cumulative Proportion  0.83212 0.85789
adhd.pca
## Standard deviations (1, .., p=18):
##  [1] 3.9566536 1.5802489 1.3580553 1.2165616 1.1091016 1.0412286 0.9706667
##  [8] 0.9136671 0.8898769 0.8500617 0.8083534 0.7991683 0.7318146 0.7095402
## [15] 0.6497302 0.5745030 0.5517886 0.5060870
## 
## Rotation (n x k) = (18 x 9):
##                PC1         PC2         PC3          PC4         PC5
## ADHD.Q1  0.2241604 -0.11850805  0.36711647 -0.305577054  0.40191812
## ADHD.Q2  0.2330749 -0.23317338  0.18537730 -0.181861210  0.37182989
## ADHD.Q3  0.2120604 -0.27150539  0.16297362 -0.175680568 -0.47947987
## ADHD.Q4  0.2583036 -0.25992257  0.22458338 -0.058665778 -0.01721544
## ADHD.Q5  0.2707830 -0.14874925 -0.48941003 -0.073054193 -0.30876278
## ADHD.Q6  0.2204925  0.05883647 -0.31113536 -0.422341320  0.08186676
## ADHD.Q7  0.2182837 -0.14805909  0.12990186 -0.087132649 -0.12346514
## ADHD.Q8  0.2678672 -0.14068584  0.17910137  0.232797556 -0.17814563
## ADHD.Q9  0.2671436 -0.06874345  0.11409744  0.290180080 -0.08141198
## ADHD.Q10 0.2516825 -0.07340875  0.07149119  0.120828234 -0.03173831
## ADHD.Q11 0.2317082 -0.12254454 -0.16989734  0.318381020  0.14585970
## ADHD.Q12 0.1960853  0.16854950  0.08940426  0.347522212 -0.10312727
##                   PC6         PC7          PC8         PC9
## ADHD.Q1  -0.113874775  0.09540069 -0.003783810  0.09079274
## ADHD.Q2   0.022970805  0.16767868 -0.024921227  0.05767276
## ADHD.Q3   0.030251125  0.31979641 -0.128954904  0.19928143
## ADHD.Q4   0.319833635 -0.22186035 -0.136238741 -0.04636797
## ADHD.Q5  -0.155266291  0.13097415 -0.213138796 -0.25882147
## ADHD.Q6  -0.515789910 -0.05485572 -0.017701053  0.30533854
## ADHD.Q7  -0.245563722 -0.47859916 -0.008230004 -0.22352587
## ADHD.Q8  -0.107495813 -0.35289985  0.087928401 -0.27553377
## ADHD.Q9  -0.041060556  0.06638797  0.488949353  0.32135227
## ADHD.Q10  0.135058805  0.29338930 -0.297591145 -0.19444432
## ADHD.Q11 -0.024138078  0.05752253  0.327773990 -0.03823873
## ADHD.Q12 -0.136111549  0.19145471 -0.031731274  0.29776948
##  [ reached getOption("max.print") -- omitted 6 rows ]

We were able to use PCA to reduce the results from 18 ADHD questions to 9 principal components while keeping 85% of the variance. For the first principal component (the component that explains the variance the most) we see that each question holds roughly the same amount of weight. The first component has a positive relationship with all 18 questions

PCA for Mood Disorder Questions

is.mood_disorder = str_detect(names(data), '^MD\\.Q')
mood_disorder = data[is.mood_disorder]

md.pca = rank.pca(mood_disorder, pca.thresh)
md.pca%>%
  summary()
## Importance of first k=9 (out of 15) components:
##                           PC1    PC2     PC3     PC4    PC5    PC6     PC7
## Standard deviation     1.2869 0.8901 0.54560 0.48373 0.4785 0.4338 0.41024
## Proportion of Variance 0.3775 0.1806 0.06785 0.05334 0.0522 0.0429 0.03836
## Cumulative Proportion  0.3775 0.5581 0.62593 0.67927 0.7315 0.7744 0.81272
##                           PC8     PC9
## Standard deviation     0.4035 0.37338
## Proportion of Variance 0.0371 0.03178
## Cumulative Proportion  0.8498 0.88160
md.pca
## Standard deviations (1, .., p=15):
##  [1] 1.2869221 0.8900716 0.5456005 0.4837312 0.4785414 0.4338412 0.4102361
##  [8] 0.4034524 0.3733821 0.3601925 0.3186736 0.2996987 0.2756148 0.2539722
## [15] 0.2405183
## 
## Rotation (n x k) = (15 x 9):
##               PC1          PC2         PC3         PC4         PC5          PC6
## MD.Q1a -0.2543278  0.147273718 -0.24046079  0.21487840 -0.18611904  0.212349417
## MD.Q1b -0.2664995  0.005885759 -0.26358013 -0.13618813  0.14553549 -0.003622287
## MD.Q1c -0.1213964  0.227754427  0.48408609 -0.06348524 -0.42280394  0.265412768
## MD.Q1d -0.1908988  0.170616678 -0.10195790 -0.39249062 -0.18001273  0.445830601
## MD.Q1e -0.1967763  0.220382086  0.11734202 -0.26568848 -0.31583754  0.016961076
## MD.Q1f -0.2048252  0.101118382 -0.23262634 -0.22701838 -0.11404800 -0.287946321
## MD.Q1g -0.2230530  0.050666663 -0.29802392 -0.15133759 -0.05760700 -0.297307077
## MD.Q1h -0.1598335  0.338358317  0.19262067 -0.05278491  0.20428980 -0.323267874
## MD.Q1i -0.1636045  0.263971910  0.37525806 -0.05413925  0.22144652 -0.393790752
## MD.Q1j -0.1565535  0.297886667  0.18385599  0.06788461  0.31891921  0.136242343
## MD.Q1k -0.1384274  0.314335407 -0.06093941  0.18863556  0.49400037  0.396372134
## MD.Q1L -0.2309599  0.171750504 -0.35738459  0.40855995 -0.04559609  0.038283582
##                PC7         PC8         PC9
## MD.Q1a -0.09902008  0.05234109 -0.54052879
## MD.Q1b -0.15206365  0.34294139  0.06523971
## MD.Q1c -0.05315031  0.42303144 -0.16822960
## MD.Q1d -0.38626121 -0.17653820  0.49880418
## MD.Q1e  0.45395747 -0.40255797 -0.18059534
## MD.Q1f  0.06729004 -0.27620910 -0.17278738
## MD.Q1g  0.30698957  0.19576366  0.19358553
## MD.Q1h -0.33874426 -0.29208952 -0.11849495
## MD.Q1i -0.27539112  0.17283502 -0.02038774
## MD.Q1j  0.54060417  0.25387498  0.18457578
## MD.Q1k  0.07471571 -0.27882103  0.02845118
## MD.Q1L -0.14496184  0.12595956 -0.11941519
##  [ reached getOption("max.print") -- omitted 3 rows ]

Similar to the PCA done to ADHD questions, we were able to reduce the 15 questions to 9 principal components. For the first component we see that MD.Q3 is weighted less than the other questions which leads us to believe that it is less important than the other questions. However, all of these questions seem to have a negative relationship with the first component. The second component is more varied, but question MD.Q3 is still the least weighted; which adds to the suspicion that it is less important than the others.

PCA for Drug Use

drug_use = data[c("Alcohol", "THC", "Cocaine", "Stimulants", "Sedative.hypnotics", "Opioids")]
drug_use = drug_use %>%
  drop_na()

drug.pca = rank.pca(drug_use, pca.thresh)
drug.pca%>%
  summary()
## Importance of first k=4 (out of 6) components:
##                           PC1    PC2    PC3     PC4
## Standard deviation     1.4779 1.3523 1.2004 0.74239
## Proportion of Variance 0.3399 0.2846 0.2242 0.08577
## Cumulative Proportion  0.3399 0.6245 0.8487 0.93449
drug.pca
## Standard deviations (1, .., p=6):
## [1] 1.4778602 1.3523390 1.2003990 0.7423910 0.5112899 0.3994609
## 
## Rotation (n x k) = (6 x 4):
##                            PC1         PC2         PC3         PC4
## Alcohol             0.59873741  0.79856286 -0.05767729 0.001194011
## THC                 0.35399506 -0.20137630  0.90900737 0.069000983
## Cocaine             0.71511220 -0.56504807 -0.40670175 0.058031197
## Stimulants          0.01080701  0.02137797  0.05531554 0.057781334
## Sedative.hypnotics -0.04377422  0.02060986 -0.03044406 0.503455684
## Opioids            -0.05275015  0.03979746 -0.03140007 0.857358449

We were able to reduce the 6 drug questions to 4 principal components. The first component (that explains 34% of the variance) gives significant weight to Cocaine, Alcohol and THC (in order of highest - lowest). The other drugs are not as significant. However, in the second component, that explains 28% of the variance, still gives Alcohol a high weighting but has a negative value for Cocaine and THC. In PC3 (22%) THC has high weighting but everything else has low or negative weights. This shows that the importance of variables can vary heavily in this set of questions

PCA on entire dataset

pca.all = rank.pca(data[names(data)!= 'Suicide'],pca.thresh)
pca.all%>%
  summary()
## Importance of first k=2 (out of 53) components:
##                            PC1     PC2
## Standard deviation     16.2562 11.0020
## Proportion of Variance  0.6025  0.2760
## Cumulative Proportion   0.6025  0.8784
pca.all
## Standard deviations (1, .., p=53):
##  [1] 1.625617e+01 1.100201e+01 4.298463e+00 2.259842e+00 2.064982e+00
##  [6] 1.694125e+00 1.519569e+00 1.404101e+00 1.312185e+00 1.195824e+00
## [11] 1.135492e+00 1.097761e+00 1.084214e+00 1.068334e+00 9.619584e-01
## [16] 9.123910e-01 8.769350e-01 8.255091e-01 8.129019e-01 7.527663e-01
## [21] 7.384661e-01 7.297601e-01 6.880044e-01 6.612435e-01 6.158500e-01
## [26] 5.817920e-01 5.759519e-01 5.648775e-01 5.239893e-01 4.922702e-01
## [31] 4.549436e-01 4.264624e-01 4.198509e-01 4.015978e-01 3.915262e-01
## [36] 3.666148e-01 3.624494e-01 3.415041e-01 3.238205e-01 3.158331e-01
## [41] 2.979210e-01 2.861042e-01 2.767981e-01 2.716943e-01 2.377636e-01
## [46] 2.320446e-01 2.190537e-01 2.021666e-01 1.889328e-01 1.713275e-01
## [51] 7.219188e-15 1.148027e-15 1.148027e-15
## 
## Rotation (n x k) = (53 x 2):
##                              PC1           PC2
## Age                 0.0620647877  0.9949747519
## Sex                -0.0072531154 -0.0013139544
## Race                0.0067183502 -0.0019756931
## ADHD.Q1            -0.0538018681  0.0052266022
## ADHD.Q2            -0.0555768024  0.0131437516
## ADHD.Q3            -0.0518974188 -0.0097617974
## ADHD.Q4            -0.0620688656 -0.0009269728
## ADHD.Q5            -0.0656749621  0.0060069875
## ADHD.Q6            -0.0537440595  0.0022319407
## ADHD.Q7            -0.0527610527  0.0089069799
## ADHD.Q8            -0.0644627667  0.0067179816
## ADHD.Q9            -0.0647007865 -0.0016581017
## ADHD.Q10           -0.0607581069  0.0039688935
## ADHD.Q11           -0.0553917002  0.0155686988
## ADHD.Q12           -0.0482533677 -0.0031498488
## ADHD.Q13           -0.0572811797  0.0133136559
## ADHD.Q14           -0.0599404181  0.0179432493
## ADHD.Q15           -0.0556512239  0.0037447369
## ADHD.Q16           -0.0569413377 -0.0038901384
## ADHD.Q17           -0.0473870752  0.0019770153
## ADHD.Q18           -0.0575764945 -0.0046722699
## ADHD.Total         -0.9544847405  0.0624423413
## MD.Q1a             -0.0117370617 -0.0031268922
## MD.Q1b             -0.0149508997 -0.0071006503
## MD.Q1c              0.0003816895  0.0037086057
## MD.Q1d             -0.0057049719  0.0045452572
## MD.Q1e             -0.0106591207  0.0037284249
## MD.Q1f             -0.0108417095 -0.0002406858
## MD.Q1g             -0.0169615453 -0.0005350969
## MD.Q1h             -0.0041460482 -0.0061396766
## MD.Q1i             -0.0029876057  0.0029566964
## MD.Q1j             -0.0050305928 -0.0042465388
## MD.Q1k             -0.0034854991 -0.0103054981
## MD.Q1L             -0.0095488677 -0.0013163299
## MD.Q1m             -0.0066222248  0.0046525494
## MD.Q2              -0.0127116545 -0.0015687173
## MD.Q3              -0.0332974875 -0.0044452927
## MD.TOTAL           -0.1483035996 -0.0194338451
## Alcohol            -0.0099280156  0.0179090034
## THC                 0.0022633833 -0.0412599771
## Cocaine             0.0215747464  0.0216090051
## Stimulants         -0.0041764724 -0.0060762925
## Sedative.hypnotics -0.0034086180 -0.0069054001
## Opioids            -0.0026294216 -0.0053229874
## Court.order        -0.0008875149 -0.0051871667
## Education           0.0056120120  0.0061943216
## Hx.of.Violence      0.0009504824 -0.0022935283
## Disorderly.Conduct  0.0050629122 -0.0030232603
## Abuse              -0.0276446538  0.0290582511
## Non.subst.Dx       -0.0142051845  0.0097436093
## Subst.Dx            0.0064502653 -0.0143900478
## NonSubstDx         -0.0142051845  0.0097436093
## SubstDx             0.0064502653 -0.0143900478

While we did not have much success reducing the groups of questions with PCA, we were able to reduce the entire dataset (53 components) to just two components while still explaining at least 85% of the variance

Xtreme Gradient Boosting (XGBoost)

Removing the lone record that Suicide is NA.

# need to remove Suicide null 
dim(adhd_data_3)
## [1] 146  54
# skim(adhd_data_3) 
adhd_data_4 <- adhd_data_3[complete.cases(adhd_data_3),]
dim(adhd_data_4)
## [1] 145  54
# Dropping the 2 repeated columns
adhd_data_4 <- adhd_data_4[, -c(51:52)]

Splitting the train and test dataset

set.seed(108)
sample_size <- floor(nrow(adhd_data_4)*0.8)
indices <- sample(1:nrow(adhd_data_4),sample_size)
data_train <- adhd_data_4[c(indices),]
data_test <- adhd_data_4[-c(indices),]

One-hot encoding of features, which is a requirement for XGBoost. At the end, you wanted to make sure the number of columns in data_train and data_test are the same.

# do the 1-hot encoding for the data_train
Suicide <- data_train[, 49]
dummy <- dummyVars(" ~ . ", data = data_train [, -49] )
newdata <- data.frame(predict (dummy, newdata = data_train[, -49]))
data_train <- cbind (newdata, Suicide)
options(max.print="309")
colnames(data_train)
##   [1] "Age.18"               "Age.19"               "Age.20"              
##   [4] "Age.21"               "Age.22"               "Age.23"              
##   [7] "Age.24"               "Age.25"               "Age.26"              
##  [10] "Age.27"               "Age.28"               "Age.29"              
##  [13] "Age.30"               "Age.31"               "Age.32"              
##  [16] "Age.33"               "Age.34"               "Age.35"              
##  [19] "Age.36"               "Age.37"               "Age.38"              
##  [22] "Age.39"               "Age.40"               "Age.41"              
##  [25] "Age.42"               "Age.43"               "Age.44"              
##  [28] "Age.45"               "Age.46"               "Age.47"              
##  [31] "Age.48"               "Age.49"               "Age.50"              
##  [34] "Age.51"               "Age.52"               "Age.53"              
##  [37] "Age.54"               "Age.55"               "Age.56"              
##  [40] "Age.57"               "Age.61"               "Age.69"              
##  [43] "Sex.1"                "Sex.2"                "Race.1"              
##  [46] "Race.2"               "Race.3"               "Race.6"              
##  [49] "ADHD.Q1.0"            "ADHD.Q1.1"            "ADHD.Q1.2"           
##  [52] "ADHD.Q1.3"            "ADHD.Q1.4"            "ADHD.Q2.0"           
##  [55] "ADHD.Q2.1"            "ADHD.Q2.2"            "ADHD.Q2.3"           
##  [58] "ADHD.Q2.4"            "ADHD.Q3.0"            "ADHD.Q3.1"           
##  [61] "ADHD.Q3.2"            "ADHD.Q3.3"            "ADHD.Q3.4"           
##  [64] "ADHD.Q4.0"            "ADHD.Q4.1"            "ADHD.Q4.2"           
##  [67] "ADHD.Q4.3"            "ADHD.Q4.4"            "ADHD.Q5.0"           
##  [70] "ADHD.Q5.1"            "ADHD.Q5.2"            "ADHD.Q5.3"           
##  [73] "ADHD.Q5.4"            "ADHD.Q5.5"            "ADHD.Q6.0"           
##  [76] "ADHD.Q6.1"            "ADHD.Q6.2"            "ADHD.Q6.3"           
##  [79] "ADHD.Q6.4"            "ADHD.Q7.0"            "ADHD.Q7.1"           
##  [82] "ADHD.Q7.2"            "ADHD.Q7.3"            "ADHD.Q7.4"           
##  [85] "ADHD.Q8.0"            "ADHD.Q8.1"            "ADHD.Q8.2"           
##  [88] "ADHD.Q8.3"            "ADHD.Q8.4"            "ADHD.Q9.0"           
##  [91] "ADHD.Q9.1"            "ADHD.Q9.2"            "ADHD.Q9.3"           
##  [94] "ADHD.Q9.4"            "ADHD.Q10.0"           "ADHD.Q10.1"          
##  [97] "ADHD.Q10.2"           "ADHD.Q10.3"           "ADHD.Q10.4"          
## [100] "ADHD.Q11.0"           "ADHD.Q11.1"           "ADHD.Q11.2"          
## [103] "ADHD.Q11.3"           "ADHD.Q11.4"           "ADHD.Q12.0"          
## [106] "ADHD.Q12.1"           "ADHD.Q12.2"           "ADHD.Q12.3"          
## [109] "ADHD.Q12.4"           "ADHD.Q13.0"           "ADHD.Q13.1"          
## [112] "ADHD.Q13.2"           "ADHD.Q13.3"           "ADHD.Q13.4"          
## [115] "ADHD.Q14.0"           "ADHD.Q14.1"           "ADHD.Q14.2"          
## [118] "ADHD.Q14.3"           "ADHD.Q14.4"           "ADHD.Q15.0"          
## [121] "ADHD.Q15.1"           "ADHD.Q15.2"           "ADHD.Q15.3"          
## [124] "ADHD.Q15.4"           "ADHD.Q16.0"           "ADHD.Q16.1"          
## [127] "ADHD.Q16.2"           "ADHD.Q16.3"           "ADHD.Q16.4"          
## [130] "ADHD.Q17.0"           "ADHD.Q17.1"           "ADHD.Q17.2"          
## [133] "ADHD.Q17.3"           "ADHD.Q17.4"           "ADHD.Q18.0"          
## [136] "ADHD.Q18.1"           "ADHD.Q18.2"           "ADHD.Q18.3"          
## [139] "ADHD.Q18.4"           "ADHD.Total.0"         "ADHD.Total.1"        
## [142] "ADHD.Total.3"         "ADHD.Total.5"         "ADHD.Total.6"        
## [145] "ADHD.Total.7"         "ADHD.Total.8"         "ADHD.Total.9"        
## [148] "ADHD.Total.10"        "ADHD.Total.11"        "ADHD.Total.12"       
## [151] "ADHD.Total.13"        "ADHD.Total.14"        "ADHD.Total.16"       
## [154] "ADHD.Total.17"        "ADHD.Total.18"        "ADHD.Total.19"       
## [157] "ADHD.Total.20"        "ADHD.Total.21"        "ADHD.Total.23"       
## [160] "ADHD.Total.24"        "ADHD.Total.25"        "ADHD.Total.26"       
## [163] "ADHD.Total.27"        "ADHD.Total.28"        "ADHD.Total.29"       
## [166] "ADHD.Total.30"        "ADHD.Total.31"        "ADHD.Total.32"       
## [169] "ADHD.Total.33"        "ADHD.Total.34"        "ADHD.Total.35"       
## [172] "ADHD.Total.36"        "ADHD.Total.37"        "ADHD.Total.38"       
## [175] "ADHD.Total.39"        "ADHD.Total.40"        "ADHD.Total.41"       
## [178] "ADHD.Total.42"        "ADHD.Total.43"        "ADHD.Total.44"       
## [181] "ADHD.Total.45"        "ADHD.Total.46"        "ADHD.Total.47"       
## [184] "ADHD.Total.48"        "ADHD.Total.49"        "ADHD.Total.50"       
## [187] "ADHD.Total.51"        "ADHD.Total.52"        "ADHD.Total.53"       
## [190] "ADHD.Total.54"        "ADHD.Total.55"        "ADHD.Total.56"       
## [193] "ADHD.Total.57"        "ADHD.Total.58"        "ADHD.Total.62"       
## [196] "ADHD.Total.63"        "ADHD.Total.65"        "ADHD.Total.67"       
## [199] "ADHD.Total.69"        "ADHD.Total.71"        "ADHD.Total.72"       
## [202] "MD.Q1a.0"             "MD.Q1a.1"             "MD.Q1b.0"            
## [205] "MD.Q1b.1"             "MD.Q1c.0"             "MD.Q1c.1"            
## [208] "MD.Q1d.0"             "MD.Q1d.1"             "MD.Q1e.0"            
## [211] "MD.Q1e.1"             "MD.Q1f.0"             "MD.Q1f.1"            
## [214] "MD.Q1g.0"             "MD.Q1g.1"             "MD.Q1h.0"            
## [217] "MD.Q1h.1"             "MD.Q1i.0"             "MD.Q1i.1"            
## [220] "MD.Q1j.0"             "MD.Q1j.1"             "MD.Q1k.0"            
## [223] "MD.Q1k.1"             "MD.Q1L.0"             "MD.Q1L.1"            
## [226] "MD.Q1m.0"             "MD.Q1m.1"             "MD.Q2.0"             
## [229] "MD.Q2.1"              "MD.Q3.0"              "MD.Q3.1"             
## [232] "MD.Q3.2"              "MD.Q3.3"              "MD.TOTAL.0"          
## [235] "MD.TOTAL.1"           "MD.TOTAL.2"           "MD.TOTAL.3"          
## [238] "MD.TOTAL.4"           "MD.TOTAL.5"           "MD.TOTAL.6"          
## [241] "MD.TOTAL.7"           "MD.TOTAL.8"           "MD.TOTAL.9"          
## [244] "MD.TOTAL.10"          "MD.TOTAL.11"          "MD.TOTAL.12"         
## [247] "MD.TOTAL.13"          "MD.TOTAL.14"          "MD.TOTAL.15"         
## [250] "MD.TOTAL.16"          "MD.TOTAL.17"          "Alcohol.0"           
## [253] "Alcohol.1"            "Alcohol.2"            "Alcohol.3"           
## [256] "THC.0"                "THC.1"                "THC.2"               
## [259] "THC.3"                "Cocaine.0"            "Cocaine.1"           
## [262] "Cocaine.2"            "Cocaine.3"            "Stimulants.0"        
## [265] "Stimulants.1"         "Stimulants.3"         "Sedative.hypnotics.0"
## [268] "Sedative.hypnotics.1" "Sedative.hypnotics.2" "Sedative.hypnotics.3"
## [271] "Opioids.0"            "Opioids.1"            "Opioids.3"           
## [274] "Court.order.0"        "Court.order.1"        "Education"           
## [277] "Hx.of.Violence.0"     "Hx.of.Violence.1"     "Disorderly.Conduct.0"
## [280] "Disorderly.Conduct.1" "Abuse.L"              "Abuse.Q"             
## [283] "Abuse.C"              "Abuse.4"              "Abuse.5"             
## [286] "Abuse.6"              "Abuse.7"              "NonSubstDx.0"        
## [289] "NonSubstDx.1"         "NonSubstDx.2"         "SubstDx.0"           
## [292] "SubstDx.1"            "SubstDx.2"            "SubstDx.3"           
## [295] "Suicide"
# repeat the same exercise for the data_test
Suicide1 <- data_test[, 49]
dummy1 <- dummyVars(" ~ . ", data = data_test [, -49] )
newdata1 <- data.frame(predict (dummy1, newdata = data_test[, -49]))
data_test <- cbind (newdata1, Suicide1)
colnames(data_test)
##   [1] "Age.18"               "Age.19"               "Age.20"              
##   [4] "Age.21"               "Age.22"               "Age.23"              
##   [7] "Age.24"               "Age.25"               "Age.26"              
##  [10] "Age.27"               "Age.28"               "Age.29"              
##  [13] "Age.30"               "Age.31"               "Age.32"              
##  [16] "Age.33"               "Age.34"               "Age.35"              
##  [19] "Age.36"               "Age.37"               "Age.38"              
##  [22] "Age.39"               "Age.40"               "Age.41"              
##  [25] "Age.42"               "Age.43"               "Age.44"              
##  [28] "Age.45"               "Age.46"               "Age.47"              
##  [31] "Age.48"               "Age.49"               "Age.50"              
##  [34] "Age.51"               "Age.52"               "Age.53"              
##  [37] "Age.54"               "Age.55"               "Age.56"              
##  [40] "Age.57"               "Age.61"               "Age.69"              
##  [43] "Sex.1"                "Sex.2"                "Race.1"              
##  [46] "Race.2"               "Race.3"               "Race.6"              
##  [49] "ADHD.Q1.0"            "ADHD.Q1.1"            "ADHD.Q1.2"           
##  [52] "ADHD.Q1.3"            "ADHD.Q1.4"            "ADHD.Q2.0"           
##  [55] "ADHD.Q2.1"            "ADHD.Q2.2"            "ADHD.Q2.3"           
##  [58] "ADHD.Q2.4"            "ADHD.Q3.0"            "ADHD.Q3.1"           
##  [61] "ADHD.Q3.2"            "ADHD.Q3.3"            "ADHD.Q3.4"           
##  [64] "ADHD.Q4.0"            "ADHD.Q4.1"            "ADHD.Q4.2"           
##  [67] "ADHD.Q4.3"            "ADHD.Q4.4"            "ADHD.Q5.0"           
##  [70] "ADHD.Q5.1"            "ADHD.Q5.2"            "ADHD.Q5.3"           
##  [73] "ADHD.Q5.4"            "ADHD.Q5.5"            "ADHD.Q6.0"           
##  [76] "ADHD.Q6.1"            "ADHD.Q6.2"            "ADHD.Q6.3"           
##  [79] "ADHD.Q6.4"            "ADHD.Q7.0"            "ADHD.Q7.1"           
##  [82] "ADHD.Q7.2"            "ADHD.Q7.3"            "ADHD.Q7.4"           
##  [85] "ADHD.Q8.0"            "ADHD.Q8.1"            "ADHD.Q8.2"           
##  [88] "ADHD.Q8.3"            "ADHD.Q8.4"            "ADHD.Q9.0"           
##  [91] "ADHD.Q9.1"            "ADHD.Q9.2"            "ADHD.Q9.3"           
##  [94] "ADHD.Q9.4"            "ADHD.Q10.0"           "ADHD.Q10.1"          
##  [97] "ADHD.Q10.2"           "ADHD.Q10.3"           "ADHD.Q10.4"          
## [100] "ADHD.Q11.0"           "ADHD.Q11.1"           "ADHD.Q11.2"          
## [103] "ADHD.Q11.3"           "ADHD.Q11.4"           "ADHD.Q12.0"          
## [106] "ADHD.Q12.1"           "ADHD.Q12.2"           "ADHD.Q12.3"          
## [109] "ADHD.Q12.4"           "ADHD.Q13.0"           "ADHD.Q13.1"          
## [112] "ADHD.Q13.2"           "ADHD.Q13.3"           "ADHD.Q13.4"          
## [115] "ADHD.Q14.0"           "ADHD.Q14.1"           "ADHD.Q14.2"          
## [118] "ADHD.Q14.3"           "ADHD.Q14.4"           "ADHD.Q15.0"          
## [121] "ADHD.Q15.1"           "ADHD.Q15.2"           "ADHD.Q15.3"          
## [124] "ADHD.Q15.4"           "ADHD.Q16.0"           "ADHD.Q16.1"          
## [127] "ADHD.Q16.2"           "ADHD.Q16.3"           "ADHD.Q16.4"          
## [130] "ADHD.Q17.0"           "ADHD.Q17.1"           "ADHD.Q17.2"          
## [133] "ADHD.Q17.3"           "ADHD.Q17.4"           "ADHD.Q18.0"          
## [136] "ADHD.Q18.1"           "ADHD.Q18.2"           "ADHD.Q18.3"          
## [139] "ADHD.Q18.4"           "ADHD.Total.0"         "ADHD.Total.1"        
## [142] "ADHD.Total.3"         "ADHD.Total.5"         "ADHD.Total.6"        
## [145] "ADHD.Total.7"         "ADHD.Total.8"         "ADHD.Total.9"        
## [148] "ADHD.Total.10"        "ADHD.Total.11"        "ADHD.Total.12"       
## [151] "ADHD.Total.13"        "ADHD.Total.14"        "ADHD.Total.16"       
## [154] "ADHD.Total.17"        "ADHD.Total.18"        "ADHD.Total.19"       
## [157] "ADHD.Total.20"        "ADHD.Total.21"        "ADHD.Total.23"       
## [160] "ADHD.Total.24"        "ADHD.Total.25"        "ADHD.Total.26"       
## [163] "ADHD.Total.27"        "ADHD.Total.28"        "ADHD.Total.29"       
## [166] "ADHD.Total.30"        "ADHD.Total.31"        "ADHD.Total.32"       
## [169] "ADHD.Total.33"        "ADHD.Total.34"        "ADHD.Total.35"       
## [172] "ADHD.Total.36"        "ADHD.Total.37"        "ADHD.Total.38"       
## [175] "ADHD.Total.39"        "ADHD.Total.40"        "ADHD.Total.41"       
## [178] "ADHD.Total.42"        "ADHD.Total.43"        "ADHD.Total.44"       
## [181] "ADHD.Total.45"        "ADHD.Total.46"        "ADHD.Total.47"       
## [184] "ADHD.Total.48"        "ADHD.Total.49"        "ADHD.Total.50"       
## [187] "ADHD.Total.51"        "ADHD.Total.52"        "ADHD.Total.53"       
## [190] "ADHD.Total.54"        "ADHD.Total.55"        "ADHD.Total.56"       
## [193] "ADHD.Total.57"        "ADHD.Total.58"        "ADHD.Total.62"       
## [196] "ADHD.Total.63"        "ADHD.Total.65"        "ADHD.Total.67"       
## [199] "ADHD.Total.69"        "ADHD.Total.71"        "ADHD.Total.72"       
## [202] "MD.Q1a.0"             "MD.Q1a.1"             "MD.Q1b.0"            
## [205] "MD.Q1b.1"             "MD.Q1c.0"             "MD.Q1c.1"            
## [208] "MD.Q1d.0"             "MD.Q1d.1"             "MD.Q1e.0"            
## [211] "MD.Q1e.1"             "MD.Q1f.0"             "MD.Q1f.1"            
## [214] "MD.Q1g.0"             "MD.Q1g.1"             "MD.Q1h.0"            
## [217] "MD.Q1h.1"             "MD.Q1i.0"             "MD.Q1i.1"            
## [220] "MD.Q1j.0"             "MD.Q1j.1"             "MD.Q1k.0"            
## [223] "MD.Q1k.1"             "MD.Q1L.0"             "MD.Q1L.1"            
## [226] "MD.Q1m.0"             "MD.Q1m.1"             "MD.Q2.0"             
## [229] "MD.Q2.1"              "MD.Q3.0"              "MD.Q3.1"             
## [232] "MD.Q3.2"              "MD.Q3.3"              "MD.TOTAL.0"          
## [235] "MD.TOTAL.1"           "MD.TOTAL.2"           "MD.TOTAL.3"          
## [238] "MD.TOTAL.4"           "MD.TOTAL.5"           "MD.TOTAL.6"          
## [241] "MD.TOTAL.7"           "MD.TOTAL.8"           "MD.TOTAL.9"          
## [244] "MD.TOTAL.10"          "MD.TOTAL.11"          "MD.TOTAL.12"         
## [247] "MD.TOTAL.13"          "MD.TOTAL.14"          "MD.TOTAL.15"         
## [250] "MD.TOTAL.16"          "MD.TOTAL.17"          "Alcohol.0"           
## [253] "Alcohol.1"            "Alcohol.2"            "Alcohol.3"           
## [256] "THC.0"                "THC.1"                "THC.2"               
## [259] "THC.3"                "Cocaine.0"            "Cocaine.1"           
## [262] "Cocaine.2"            "Cocaine.3"            "Stimulants.0"        
## [265] "Stimulants.1"         "Stimulants.3"         "Sedative.hypnotics.0"
## [268] "Sedative.hypnotics.1" "Sedative.hypnotics.2" "Sedative.hypnotics.3"
## [271] "Opioids.0"            "Opioids.1"            "Opioids.3"           
## [274] "Court.order.0"        "Court.order.1"        "Education"           
## [277] "Hx.of.Violence.0"     "Hx.of.Violence.1"     "Disorderly.Conduct.0"
## [280] "Disorderly.Conduct.1" "Abuse.L"              "Abuse.Q"             
## [283] "Abuse.C"              "Abuse.4"              "Abuse.5"             
## [286] "Abuse.6"              "Abuse.7"              "NonSubstDx.0"        
## [289] "NonSubstDx.1"         "NonSubstDx.2"         "SubstDx.0"           
## [292] "SubstDx.1"            "SubstDx.2"            "SubstDx.3"           
## [295] "Suicide1"
str(data_train)
## 'data.frame':    116 obs. of  295 variables:
##  $ Age.18              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.19              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.20              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.21              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.22              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.23              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.24              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.25              : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Age.26              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.27              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.28              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.29              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.30              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.31              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.32              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.33              : num  0 0 0 0 1 0 0 0 1 0 ...
##  $ Age.34              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.35              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.36              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.37              : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ Age.38              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.39              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.40              : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ Age.41              : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ Age.42              : num  0 1 0 0 0 0 0 1 0 0 ...
##  $ Age.43              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.44              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.45              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.46              : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Age.47              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.48              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.49              : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ Age.50              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.51              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.52              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.53              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.54              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.55              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.56              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.57              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.61              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.69              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Sex.1               : num  1 1 0 0 1 1 1 0 0 1 ...
##  $ Sex.2               : num  0 0 1 1 0 0 0 1 1 0 ...
##  $ Race.1              : num  0 0 1 1 1 1 0 0 0 1 ...
##  $ Race.2              : num  1 1 0 0 0 0 1 1 1 0 ...
##  $ Race.3              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Race.6              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q1.0           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ ADHD.Q1.1           : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q1.2           : num  0 0 0 1 0 1 0 0 0 1 ...
##  $ ADHD.Q1.3           : num  0 0 0 0 0 0 0 1 1 0 ...
##  $ ADHD.Q1.4           : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ ADHD.Q2.0           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q2.1           : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ ADHD.Q2.2           : num  0 1 0 0 0 1 1 0 1 1 ...
##  $ ADHD.Q2.3           : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ ADHD.Q2.4           : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ ADHD.Q3.0           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q3.1           : num  1 0 1 1 0 1 0 0 0 0 ...
##  $ ADHD.Q3.2           : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ ADHD.Q3.3           : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ ADHD.Q3.4           : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q4.0           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q4.1           : num  0 0 0 1 0 0 0 1 0 0 ...
##  $ ADHD.Q4.2           : num  1 0 0 0 0 1 1 0 1 0 ...
##  $ ADHD.Q4.3           : num  0 1 0 0 1 0 0 0 0 1 ...
##  $ ADHD.Q4.4           : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ ADHD.Q5.0           : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ ADHD.Q5.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q5.2           : num  0 0 1 0 0 0 0 1 0 0 ...
##  $ ADHD.Q5.3           : num  0 1 0 1 1 1 0 0 1 1 ...
##  $ ADHD.Q5.4           : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q5.5           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q6.0           : num  0 0 1 0 0 0 1 0 0 0 ...
##  $ ADHD.Q6.1           : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ ADHD.Q6.2           : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ ADHD.Q6.3           : num  1 1 0 1 0 1 0 1 0 0 ...
##  $ ADHD.Q6.4           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q7.0           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ ADHD.Q7.1           : num  1 0 0 1 0 1 0 0 0 0 ...
##  $ ADHD.Q7.2           : num  0 1 1 0 0 0 1 0 0 0 ...
##  $ ADHD.Q7.3           : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ ADHD.Q7.4           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ ADHD.Q8.0           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q8.1           : num  0 1 0 0 0 1 0 1 0 0 ...
##  $ ADHD.Q8.2           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ ADHD.Q8.3           : num  1 0 1 0 1 0 0 0 1 1 ...
##  $ ADHD.Q8.4           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q9.0           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ ADHD.Q9.1           : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ADHD.Q9.2           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ ADHD.Q9.3           : num  1 1 1 0 1 0 0 0 0 1 ...
##  $ ADHD.Q9.4           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q10.0          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ADHD.Q10.1          : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ ADHD.Q10.2          : num  0 1 0 0 0 1 1 0 1 0 ...
##  $ ADHD.Q10.3          : num  0 0 1 0 1 0 0 0 0 1 ...
##  $ ADHD.Q10.4          : num  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

As you can tell below, the target variable in the data_train is unbalanced and is skewed toward 0 (68% 0s).

table(data_train$Suicide)
## 
##  0  1 
## 79 37
train_upsample <- upSample(x=data_train[, -295], y = data_train$Suicide)
# There is a new variable created called Class so that's why I can drop the Suicide column
table(train_upsample$Class)
## 
##  0  1 
## 79 79

Preparing the grid for XGBoost

grid_tune <- expand.grid(
  nrounds = c(81), # of trees normal range is 1500 - 3000 depends on # of records
  max_depth = c(2, 4, 6), 
  eta = c(0.01, 0.1, 0.3), # Learning rate
  gamma = 0,   # pruning normal range [0, 1]
  colsample_bytree = 1, # subsample ratio for columns for tree
  min_child_weight = 1, # the larger the more conservative the model is, can be used as a stop
  subsample = 1 # used to prevent overfitting by sampling x% training dataset
)

The following defines the trainControl and actually train the XGBoost

trainControl <- trainControl (method = "cv", 
                              number = 3, 
                              verboseIter = T, 
                              allowParallel = T
                             )
xgb_tune <- train(x = train_upsample[, -295], 
                  y = train_upsample$Class,
                  trControl = trainControl, 
                  tuneGrid = grid_tune, 
                  method = "xgbTree", 
                  verbose = T
                  )
## + Fold1: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold1: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold1: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold1: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold1: eta=0.01, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold1: eta=0.01, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold1: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold1: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold1: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold1: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold1: eta=0.10, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold1: eta=0.10, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold1: eta=0.30, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold1: eta=0.30, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold1: eta=0.30, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold1: eta=0.30, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold1: eta=0.30, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold1: eta=0.30, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold2: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold2: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold2: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold2: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold2: eta=0.01, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold2: eta=0.01, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold2: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold2: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold2: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold2: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold2: eta=0.10, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold2: eta=0.10, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold2: eta=0.30, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold2: eta=0.30, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold2: eta=0.30, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold2: eta=0.30, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold2: eta=0.30, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold2: eta=0.30, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold3: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold3: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold3: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold3: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold3: eta=0.01, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold3: eta=0.01, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold3: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold3: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold3: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold3: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold3: eta=0.10, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold3: eta=0.10, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold3: eta=0.30, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold3: eta=0.30, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold3: eta=0.30, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold3: eta=0.30, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## + Fold3: eta=0.30, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## - Fold3: eta=0.30, max_depth=6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=81 
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 81, max_depth = 4, eta = 0.3, gamma = 0, colsample_bytree = 1, min_child_weight = 1, subsample = 1 on full training set
# Best tune
xgb_tune$bestTune
##   nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 8      81         4 0.3     0                1                1         1
# Creating the best model 
train_Control <- trainControl (method = "none", 
                              verboseIter = T, 
                              allowParallel = T
                             )
final_grid <- expand.grid(nrounds = xgb_tune$bestTune$nrounds, 
                          eta = xgb_tune$bestTune$eta,
                          max_depth = xgb_tune$bestTune$max_depth,
                          gamma = xgb_tune$bestTune$gamma,
                          colsample_bytree = xgb_tune$bestTune$colsample_bytree,
                          min_child_weight = xgb_tune$bestTune$min_child_weight,
                          subsample = xgb_tune$bestTune$subsample
)
xgb_model <- train(x= train_upsample[, -295], 
                   y = train_upsample$Class,
                   trControl = train_Control, 
                   tuneGrid = final_grid, 
                   method = "xgbTree", 
                   verbose = F
)                  
## Fitting nrounds = 81, eta = 0.3, max_depth = 4, gamma = 0, colsample_bytree = 1, min_child_weight = 1, subsample = 1 on full training set
# Creating prediction here
xgb.pred <- predict(xgb_model, data_test)

After creating xgb.pred, we run the Confusion Matrix to get the performance metrics.

confusionMatrix(xgb.pred, data_test$Suicide)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19  4
##          1  2  4
##                                           
##                Accuracy : 0.7931          
##                  95% CI : (0.6028, 0.9201)
##     No Information Rate : 0.7241          
##     P-Value [Acc > NIR] : 0.2735          
##                                           
##                   Kappa : 0.4387          
##                                           
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9048          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8261          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.7241          
##          Detection Rate : 0.6552          
##    Detection Prevalence : 0.7931          
##       Balanced Accuracy : 0.7024          
##                                           
##        'Positive' Class : 0               
## 

Accuracy is 79.31% while Balanced Accuracy is only 70.24%

Support Vector Machine (SVM)

SVM will help us decide on optimal decision boundary which can then help classify our labeled data. We’re going to model for the response variable suicide attempts with the adhd_data_3 dataset with both a radial and a linear kernel to try to find the decision boundary as we don’t know whether this is a non-linear problem or linear problem.

adhd_data_4_matx = as.matrix(adhd_data_4)
df.corr.p = as.data.frame(rcorr(adhd_data_4_matx)$P)
# removing Suicide, and repeated columns
correlation_table <- cbind(rownames(df.corr.p), df.corr.p[, 49])[-c(49) , ]
correlation_table %>%
  kbl(caption = "Correlation with Suicide")  %>%
  kable_material_dark()
Correlation with Suicide
Age 0.30027942497339
Sex 0.0471220935204864
Race 0.218309067324639
ADHD.Q1 0.0117740269942401
ADHD.Q2 0.11564807090846
ADHD.Q3 0.259154266743925
ADHD.Q4 0.112410726789768
ADHD.Q5 0.216255750653531
ADHD.Q6 0.483512910929568
ADHD.Q7 0.048634884292714
ADHD.Q8 0.151797641933531
ADHD.Q9 0.322485825735442
ADHD.Q10 0.291560841921094
ADHD.Q11 0.515184561690949
ADHD.Q12 0.395887015823113
ADHD.Q13 0.584780116748058
ADHD.Q14 0.249482531618938
ADHD.Q15 0.112873665348362
ADHD.Q16 0.201977696336559
ADHD.Q17 0.872794003748846
ADHD.Q18 0.546374843413306
ADHD.Total 0.102932484662113
MD.Q1a 0.0625907388295026
MD.Q1b 0.00317302701726341
MD.Q1c 0.960575599200187
MD.Q1d 0.0010368039318247
MD.Q1e 0.537896077431999
MD.Q1f 0.00712626183154863
MD.Q1g 0.000120152964638853
MD.Q1h 0.144730606824987
MD.Q1i 0.2592977551472
MD.Q1j 0.267581368524187
MD.Q1k 0.016734195148864
MD.Q1L 0.0156611407664968
MD.Q1m 0.403393642126824
MD.Q2 0.000857782522594164
MD.Q3 0.025029663425552
MD.TOTAL 0.000755726404960466
Alcohol 0.0138128897960139
THC 0.868985202967459
Cocaine 0.158974694425702
Stimulants 0.441880666140152
Sedative.hypnotics 0.0752635208018009
Opioids 0.016937963928505
Court.order 0.0805564613765424
Education 0.230822596373245
Hx.of.Violence 0.149606337250033
Disorderly.Conduct 0.704780193506212
Abuse 0.000396677403584178
NonSubstDx 0.693218272432247
SubstDx 0.0468228684891696

Here are the factors that has at least a 30% correlation with Suicide:

  • Age

  • ADHD.Q6

  • ADHD.Q9

  • ADHD.Q11

  • ADHD.Q12

  • ADHD.Q13

  • ADHD.Q17

  • ADHD.Q18

  • MD.Q1c

  • MD.Q1e

  • MD.Q1m

  • THC

  • Stimulants

  • Disorderly.Conduct

  • NonSubstDx

We further pear down the list of factors to that of correlation coefficient .500 or above.

svm_selected <- as.data.frame(cbind(rownames(df.corr.p[df.corr.p[, 49] >= .5,]), df.corr.p[df.corr.p[, 49]>= .5, 49 ]))
svm_selected <-  svm_selected[complete.cases(svm_selected), ] 
rownames(svm_selected) <- seq(length=nrow(svm_selected))
svm_selected
##                   V1                V2
## 1           ADHD.Q11 0.515184561690949
## 2           ADHD.Q13 0.584780116748058
## 3           ADHD.Q17 0.872794003748846
## 4           ADHD.Q18 0.546374843413306
## 5             MD.Q1c 0.960575599200187
## 6             MD.Q1e 0.537896077431999
## 7                THC 0.868985202967459
## 8 Disorderly.Conduct 0.704780193506212
## 9         NonSubstDx 0.693218272432247
adhd <- adhd_data_4 %>% select(c(svm_selected$V1, 'Suicide' )) 
set.seed(45)
indexes = createDataPartition(adhd$Suicide, p = .85, list = F) # results not in a List
train = adhd[indexes, ]
test = adhd[-indexes, ]
 
wts <- 100 / table(test$Suicide)
wts
## 
##         0         1 
##  6.666667 16.666667
radial_svm = svm(Suicide~., data=train, class.weights = wts, C = 13)
print(radial_svm)
## 
## Call:
## svm(formula = Suicide ~ ., data = train, class.weights = wts, C = 13)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  112
test$pred = predict(radial_svm, test)
 
confusionMatrix(test$pred, test$Suicide)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 11  5
##          1  4  1
##                                           
##                Accuracy : 0.5714          
##                  95% CI : (0.3402, 0.7818)
##     No Information Rate : 0.7143          
##     P-Value [Acc > NIR] : 0.9501          
##                                           
##                   Kappa : -0.1053         
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.7333          
##             Specificity : 0.1667          
##          Pos Pred Value : 0.6875          
##          Neg Pred Value : 0.2000          
##              Prevalence : 0.7143          
##          Detection Rate : 0.5238          
##    Detection Prevalence : 0.7619          
##       Balanced Accuracy : 0.4500          
##                                           
##        'Positive' Class : 0               
## 
linear_svm = svm(factor(Suicide)~., data=train, kernel = "linear", type = 'C-classification', cost = 10, scale = F, class.weights = wts)
print(linear_svm)
## 
## Call:
## svm(formula = factor(Suicide) ~ ., data = train, kernel = "linear", 
##     type = "C-classification", cost = 10, class.weights = wts, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  98
test$pred = predict(linear_svm, test)
 
confusionMatrix(test$pred, test$Suicide)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 12  4
##          1  3  2
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.4303, 0.8541)
##     No Information Rate : 0.7143          
##     P-Value [Acc > NIR] : 0.7705          
##                                           
##                   Kappa : 0.1404          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.4000          
##              Prevalence : 0.7143          
##          Detection Rate : 0.5714          
##    Detection Prevalence : 0.7619          
##       Balanced Accuracy : 0.5667          
##                                           
##        'Positive' Class : 0               
## 

Linear SVM performs better with a higher Balanced Accuracy. Thus, we will focus on improving the linear SVM.

We used correlation coefficients to select the model variables before. Let’s take a different approach and examine some charts of the predictor variables and target variable using the ggpairs function.

#remove individual ADHD and MD question columns, keeping only the total columns
adhd_data_31<- adhd_data_3[, -c(4:21, 23:37)]
#identify all factor columns
x <- sapply(adhd_data_31, is.factor)
#convert all factor columns to numeric
adhd_data_31[ , x] <- as.data.frame(apply(adhd_data_31[ , x], 2, as.numeric))
adhd_data_31$Suicide <- as.factor(adhd_data_3$Suicide)


ggpairs(adhd_data_31, columns = c(4,5,16,21), aes(colour = Suicide, alpha = 0.4))

From the plots above, particularly the box plots, it is evident that MD.TOTAL and SubstDx separate our target variable pretty well. However, note that there is some overlap between the target variable classes. We will use these two variables in our final model.

set.seed(443)
adhd_data_3_model <- subset(adhd_data_3, select=c(SubstDx, MD.TOTAL, Suicide))
indexes = createDataPartition(adhd_data_3_model$Suicide, p = .85, list = F) # results not in a List
train = adhd_data_3_model[indexes, ]
test = adhd_data_3_model[-indexes, ]
  
linear_svm_final = svm(factor(Suicide)~., data=train, kernel = "linear", type = 'C-classification', cost = 10, scale = F, class.weights = wts)
print(linear_svm_final)
## 
## Call:
## svm(formula = factor(Suicide) ~ ., data = train, kernel = "linear", 
##     type = "C-classification", cost = 10, class.weights = wts, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  99
test$pred = predict(linear_svm_final, test)
 
confusionMatrix(test$pred, test$Suicide)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 11  1
##          1  4  5
##                                           
##                Accuracy : 0.7619          
##                  95% CI : (0.5283, 0.9178)
##     No Information Rate : 0.7143          
##     P-Value [Acc > NIR] : 0.4181          
##                                           
##                   Kappa : 0.4928          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.7333          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 0.5556          
##              Prevalence : 0.7143          
##          Detection Rate : 0.5238          
##    Detection Prevalence : 0.5714          
##       Balanced Accuracy : 0.7833          
##                                           
##        'Positive' Class : 0               
## 

From the confusion matrix results, we can see that 6 cases were misclassified. Accuracy is 76.19% and balanced accuracy is 78.33%.


Modeling Comparison

Model Accuracy Balanced Accuracy
Xtreme Gradient Boosting (XGBoost) 79.31% 70.24%
Support Vector Machine (SVM) 76.19% 78.33%

While we can see basing on the accuracy, XGBoost was a better choice. We knew that for a dataset that has a response variable that is imbalanced, balanced accuracy is what we should be comparing performance off of. With that in mind, we see that SVM has a far greater balanced accuracy. Thus it’s the choice of model we recommend.

Conclusion

Having our recommendation given in the previous section, we can conclude that we have the needed variables or predictors to do the necessary modeling for Suicide. The team believe that if there are more variables, and more importantly, observations with lesser missing values, we can potentially enhance the balanced accuracy further.