library(GGally)
library(caret)
library(mice)
library(plyr)
library(tidyverse)
library(DataExplorer)
library(MASS)
library(naniar)
library(corrplot)
library(VIM)
library(cluster)
library(recipes)
library(factoextra)
knitr::opts_chunk$set(echo = TRUE)

Introduction

ML techniques can potentially offer new routes for learning patterns of human behavior; identifying mental health symptoms and risk factors; developing predictions about disease progression; and personalizing and optimizing therapies. In this project we work with a mental health dataset and apply unsupervised learning methods to cluster the data, provide new insights, discover patterns, and help structure the data.

Problem Statement

For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet has the data (the tab name “Data”). The data dictionary is given in the second tab. You can get as creative as you want. The assignment is designed to really get you to think about how you could use different methods.

Exploratory Data Analysis

Exploration of the data is always the first step in data analysis. The main priorities of exploration are data types, outliers, the overall distribution of the data points, and missing data. We will also check whether any transformations are warranted.

In this process we analyze and visualize the data to better understand it and glean insights from it. The process involves the following steps:

  • Import the data
  • Clean the data
  • Process the data
  • Visualize the data
#Load the data
raw_adhd <- readxl::read_xlsx('ADHD_data.xlsx')
glimpse(raw_adhd)
## Rows: 175
## Columns: 54
## $ Initial              <chr> "JA", "LA", "MD", "RD", "RB", "SB", "PK", "RJ", "~
## $ Age                  <dbl> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, 56, 5~
## $ Sex                  <dbl> 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2~
## $ Race                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ `ADHD Q1`            <dbl> 1, 3, 2, 3, 4, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 4, 3~
## $ `ADHD Q2`            <dbl> 1, 3, 1, 3, 4, 3, 2, 4, 3, 3, 3, 2, 3, 3, 0, 3, 2~
## $ `ADHD Q3`            <dbl> 4, 4, 2, 2, 2, 1, 1, 3, 3, 4, 2, 2, 0, 2, 1, 1, 2~
## $ `ADHD Q4`            <dbl> 2, 4, 1, 2, 4, 4, 3, 4, 4, 4, 3, 3, 2, 4, 2, 4, 3~
## $ `ADHD Q5`            <dbl> 3, 5, 3, 4, 4, 3, 4, 3, 4, 4, 3, 3, 3, 2, 2, 2, 4~
## $ `ADHD Q6`            <dbl> 1, 2, 3, 3, 2, 2, 4, 3, 3, 3, 3, 2, 3, 0, 3, 0, 1~
## $ `ADHD Q7`            <dbl> 1, 2, 3, 2, 3, 3, 2, 1, 3, 4, 3, 1, 2, 4, 0, 2, 2~
## $ `ADHD Q8`            <dbl> 3, 3, 2, 4, 4, 4, 3, 1, 4, 3, 3, 2, 2, 4, 0, 3, 3~
## $ `ADHD Q9`            <dbl> 2, 2, 0, 4, 4, 4, 3, 4, 3, 2, 2, 3, 1, 2, 2, 3, 1~
## $ `ADHD Q10`           <dbl> 4, 4, 1, 2, 2, 2, 4, 2, 4, 4, 3, 2, 1, 4, 2, 3, 2~
## $ `ADHD Q11`           <dbl> 2, 1, 2, 3, 4, 4, 3, 4, 3, 3, 2, 2, 3, 3, 1, 4, 3~
## $ `ADHD Q12`           <dbl> 4, 4, 0, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1~
## $ `ADHD Q13`           <dbl> 1, 2, 2, 3, 3, 4, 4, 4, 4, 3, 2, 3, 3, 1, 2, 3, 3~
## $ `ADHD Q14`           <dbl> 0, 4, 2, 3, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 3~
## $ `ADHD Q15`           <dbl> 3, 4, 3, 1, 1, 3, 4, 0, 3, 4, 1, 3, 2, 0, 1, 0, 3~
## $ `ADHD Q16`           <dbl> 1, 3, 2, 2, 2, 4, 4, 0, 2, 4, 2, 1, 1, 0, 1, 1, 3~
## $ `ADHD Q17`           <dbl> 3, 1, 1, 1, 1, 3, 2, 1, 4, 2, 1, 1, 0, 0, 3, 0, 2~
## $ `ADHD Q18`           <dbl> 4, 4, 1, 2, 1, 3, 4, 1, 3, 2, 2, 2, 1, 2, 1, 3, 2~
## $ `ADHD Total`         <dbl> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, 38, 3~
## $ `MD Q1a`             <dbl> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0~
## $ `MD Q1b`             <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0~
## $ `MD Q1c`             <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1~
## $ `MD Q1d`             <dbl> 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1~
## $ `MD Q1e`             <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1~
## $ `MD Q1f`             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1~
## $ `MD Q1g`             <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1~
## $ `MD Q1h`             <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0~
## $ `MD Q1i`             <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1~
## $ `MD Q1j`             <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ `MD Q1k`             <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0~
## $ `MD Q1L`             <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0~
## $ `MD Q1m`             <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0~
## $ `MD Q2`              <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1~
## $ `MD Q3`              <dbl> 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 0, 2, 2, 2, 2, 2~
## $ `MD TOTAL`           <dbl> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 11, 10~
## $ Alcohol              <dbl> 1, 0, 0, 1, 1, 1, 3, 0, 1, 0, 3, 0, 1, 0, 3, 0, 0~
## $ THC                  <dbl> 1, 0, 0, 1, 1, 0, 3, 0, 0, 3, 1, 0, 1, 0, 1, 0, 0~
## $ Cocaine              <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ Stimulants           <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ `Sedative-hypnotics` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ Opioids              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 3, 0~
## $ `Court order`        <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
## $ Education            <dbl> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18, 12,~
## $ `Hx of Violence`     <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0~
## $ `Disorderly Conduct` <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0~
## $ Suicide              <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ Abuse                <dbl> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0, 5, 1~
## $ `Non-subst Dx`       <dbl> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0, 2, 2~
## $ `Subst Dx`           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1, 1, 0~
## $ `Psych meds.`        <dbl> 2, 1, 1, 2, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0, 2, 1~

The data set is mostly composed of “dbl” columns, even though some of them, such as Sex and Race, are prime candidates for conversion to factors. The values of ‘Age’ and ‘Initial’ are largely unique. Since the column ‘Initial’ is essentially an identifier, we can safely remove it; the column is not necessary for the analysis.

It’s important to examine the data dictionary and see which columns might be combined or separated into different data sets.

The original dataframe contains 175 observations (i.e. survey participants) as rows and 54 variables as columns. The columns contain both qualitative and quantitative variables; moreover, some columns represent categorical data but are encoded as numerical values.

Data Dictionary:

  • Sex: Male-1, Female-2
  • Race: White-1, African American-2, Hispanic-3, Asian-4, Native American-5, Other or missing data-6
  • ADHD self-report scale: Never-0, Rarely-1, Sometimes-2, Often-3, Very often-4
  • Mood disorder questions: No-0, Yes-1; Question 3: No problem-0, Minor-1, Moderate-2, Serious-3
  • Individual substance misuse: No use-0, Use-1, Abuse-2, Dependence-3
  • Court Order: No-0, Yes-1
  • Education: 1-12 grade, 13+ college
  • History of Violence: No-0, Yes-1
  • Disorderly Conduct: No-0, Yes-1
  • Suicide attempt: No-0, Yes-1
  • Abuse Hx: No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7
  • Non-substance-related Dx: 0 - none; 1 - one; 2 - more than one
  • Substance-related Dx: 0 - none; 1 - one; 2 - two; 3 - three or more
  • Psychiatric Meds: 0 - none; 1 - one psychotropic med; 2 - more than one psychotropic med

Identifying the missing data from the data set:

plot_missing(
  raw_adhd,
  group = list(Good = 0.05, OK = 0.4, Bad = 0.8, Remove = 1),
  missing_only = TRUE,
  geom_label_args = list(),
  title = NULL,
  ggtheme = theme_gray(),
  theme_config = list(legend.position = c("bottom"))
)

Several columns are missing data. With the exception of ‘Psych meds.’, most of the missing data appears imputable. The ‘Psych meds.’ column is missing 67.43% of its values; imputing this column is not advisable, since most of its values would then be imputed rather than observed.
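
As a quick numeric cross-check of the plot above, the per-column share of missing values can be computed directly; a minimal sketch on raw_adhd:

# Proportion of missing values per column, sorted in decreasing order
miss_pct <- sort(colMeans(is.na(raw_adhd)), decreasing = TRUE)
round(head(miss_pct, 5) * 100, 2)  # 'Psych meds.' should sit near 67%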

gg_miss_upset(raw_adhd)

# Make column names syntactically valid (e.g. `ADHD Total` becomes ADHD.Total)
raw_adhd <- raw_adhd %>% dplyr::rename_all(make.names)
# Impute the remaining missing values using classification and regression trees (CART)
imputed_Data <- mice(raw_adhd, m=2, maxit = 2, method = 'cart', seed = 500)
## 
##  iter imp variable
##   1   1  Alcohol  THC  Cocaine  Stimulants  Sedative.hypnotics  Opioids  Court.order  Education  Hx.of.Violence  Disorderly.Conduct  Suicide  Abuse  Non.subst.Dx  Subst.Dx  Psych.meds.
##   1   2  Alcohol  THC  Cocaine  Stimulants  Sedative.hypnotics  Opioids  Court.order  Education  Hx.of.Violence  Disorderly.Conduct  Suicide  Abuse  Non.subst.Dx  Subst.Dx  Psych.meds.
##   2   1  Alcohol  THC  Cocaine  Stimulants  Sedative.hypnotics  Opioids  Court.order  Education  Hx.of.Violence  Disorderly.Conduct  Suicide  Abuse  Non.subst.Dx  Subst.Dx  Psych.meds.
##   2   2  Alcohol  THC  Cocaine  Stimulants  Sedative.hypnotics  Opioids  Court.order  Education  Hx.of.Violence  Disorderly.Conduct  Suicide  Abuse  Non.subst.Dx  Subst.Dx  Psych.meds.
# Use the second completed (imputed) data set going forward
raw_adhd <- complete(imputed_Data, 2)
library(skimr)
skim(raw_adhd)
Data summary
Name raw_adhd
Number of rows 175
Number of columns 54
_______________________
Column type frequency:
character 1
numeric 53
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Initial 0 1 2 3 0 108 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 39.47 11.17 18 29.5 42 48.0 69 ▆▅▇▅▁
Sex 0 1 1.43 0.50 1 1.0 1 2.0 2 ▇▁▁▁▆
Race 0 1 1.64 0.69 1 1.0 2 2.0 6 ▇▁▁▁▁
ADHD.Q1 0 1 1.70 1.29 0 1.0 2 3.0 4 ▇▇▇▆▃
ADHD.Q2 0 1 1.91 1.25 0 1.0 2 3.0 4 ▅▇▇▆▅
ADHD.Q3 0 1 1.91 1.27 0 1.0 2 3.0 4 ▅▇▇▆▅
ADHD.Q4 0 1 2.10 1.34 0 1.0 2 3.0 4 ▅▅▇▅▆
ADHD.Q5 0 1 2.26 1.44 0 1.0 3 3.0 5 ▇▅▇▆▁
ADHD.Q6 0 1 1.91 1.31 0 1.0 2 3.0 4 ▆▅▇▇▃
ADHD.Q7 0 1 1.83 1.19 0 1.0 2 3.0 4 ▃▇▇▃▃
ADHD.Q8 0 1 2.14 1.29 0 1.0 2 3.0 4 ▃▇▇▇▆
ADHD.Q9 0 1 1.91 1.32 0 1.0 2 3.0 4 ▆▇▇▇▅
ADHD.Q10 0 1 2.12 1.23 0 1.0 2 3.0 4 ▂▇▇▆▅
ADHD.Q11 0 1 2.27 1.24 0 1.0 2 3.0 4 ▂▆▇▇▆
ADHD.Q12 0 1 1.29 1.21 0 0.0 1 2.0 4 ▇▇▆▂▂
ADHD.Q13 0 1 2.37 1.23 0 1.5 2 3.0 4 ▂▅▇▇▆
ADHD.Q14 0 1 2.25 1.35 0 1.0 2 3.0 4 ▅▅▇▇▆
ADHD.Q15 0 1 1.63 1.39 0 0.0 1 3.0 4 ▇▆▆▅▃
ADHD.Q16 0 1 1.70 1.38 0 1.0 1 3.0 4 ▆▇▆▃▅
ADHD.Q17 0 1 1.53 1.29 0 0.0 1 2.0 4 ▇▇▇▃▃
ADHD.Q18 0 1 1.47 1.30 0 0.0 1 2.0 4 ▇▇▆▃▃
ADHD.Total 0 1 34.32 16.70 0 21.0 33 47.5 72 ▃▆▇▆▂
MD.Q1a 0 1 0.55 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1b 0 1 0.57 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1c 0 1 0.54 0.50 0 0.0 1 1.0 1 ▇▁▁▁▇
MD.Q1d 0 1 0.58 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1e 0 1 0.55 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1f 0 1 0.70 0.46 0 0.0 1 1.0 1 ▃▁▁▁▇
MD.Q1g 0 1 0.72 0.45 0 0.0 1 1.0 1 ▃▁▁▁▇
MD.Q1h 0 1 0.56 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1i 0 1 0.59 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1j 0 1 0.39 0.49 0 0.0 0 1.0 1 ▇▁▁▁▅
MD.Q1k 0 1 0.49 0.50 0 0.0 0 1.0 1 ▇▁▁▁▇
MD.Q1L 0 1 0.58 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1m 0 1 0.49 0.50 0 0.0 0 1.0 1 ▇▁▁▁▇
MD.Q2 0 1 0.72 0.45 0 0.0 1 1.0 1 ▃▁▁▁▇
MD.Q3 0 1 2.01 1.07 0 1.0 2 3.0 3 ▂▂▁▅▇
MD.TOTAL 0 1 10.02 4.81 0 6.5 11 14.0 17 ▃▃▆▇▇
Alcohol 0 1 1.33 1.39 0 0.0 1 3.0 3 ▇▂▁▁▆
THC 0 1 0.79 1.26 0 0.0 0 1.0 3 ▇▁▁▁▂
Cocaine 0 1 1.11 1.39 0 0.0 0 3.0 3 ▇▁▁▁▅
Stimulants 0 1 0.12 0.53 0 0.0 0 0.0 3 ▇▁▁▁▁
Sedative.hypnotics 0 1 0.12 0.54 0 0.0 0 0.0 3 ▇▁▁▁▁
Opioids 0 1 0.40 1.00 0 0.0 0 0.0 3 ▇▁▁▁▁
Court.order 0 1 0.09 0.28 0 0.0 0 0.0 1 ▇▁▁▁▁
Education 0 1 11.87 2.13 6 11.0 12 12.0 19 ▁▅▇▂▁
Hx.of.Violence 0 1 0.25 0.44 0 0.0 0 0.5 1 ▇▁▁▁▃
Disorderly.Conduct 0 1 0.74 0.44 0 0.0 1 1.0 1 ▃▁▁▁▇
Suicide 0 1 0.30 0.46 0 0.0 0 1.0 1 ▇▁▁▁▃
Abuse 0 1 1.26 2.07 0 0.0 0 2.0 7 ▇▂▁▁▁
Non.subst.Dx 0 1 0.44 0.68 0 0.0 0 1.0 2 ▇▁▂▁▁
Subst.Dx 0 1 1.11 0.93 0 0.0 1 2.0 3 ▆▇▁▅▂
Psych.meds. 0 1 0.46 0.71 0 0.0 0 1.0 2 ▇▁▂▁▂

Data Cleaning & Exploration

First, we are going to remove the ‘Initial’ and ‘Psych meds.’ columns from the data set.

# Working copy for cleaning and factor conversion
adhd_df <- raw_adhd[-c(1, 54)]
# Numeric-only copy kept aside for later use (NbClust, PCA, SVM)
jus_nums <- raw_adhd[-c(1, 54)]

The distributions of the responses to the ADHD questionnaire, along with the distributions of the binary and ordinal variables, are displayed below.

plot_histogram(adhd_df, ggtheme = theme_linedraw())

Based on the histograms plotted above, many variables, although numeric, behave as categorical features; this will need to be accounted for in the k-means clustering analysis. There do not appear to be any clearly distinguishable outliers, but some features have low variance, such as Stimulants, where the majority of recorded observations are 0.
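
To check the low-variance observation more formally, caret's nearZeroVar() can flag such features; a minimal sketch using the numeric copy jus_nums created above:

# Flag near-zero-variance features; saveMetrics = TRUE returns the full diagnostics table
nzv_metrics <- caret::nearZeroVar(jus_nums, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]  # Stimulants and Sedative-hypnotics are likely candidates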

Data Transformation / Further Data Exploration

The columns Sex and Race are going to be converted to factors. When building models later on, it is easier to read the different clusters or groups when categorical data is turned into factors with appropriate labels.

adhd_df$Sex <-factor(adhd_df$Sex, levels = c(1,2), labels=c('Male','Female'))

adhd_df$Race <-factor(adhd_df$Race, levels=c(1,2,3,4,5,6), labels = c('White', 'African American', 'Hispanic', 'Asian', 'Native American', 'Other'))

adhd <- adhd_df
adhd_df[sapply(adhd_df, is.double)] <- lapply(adhd_df[sapply(adhd_df, is.double)], as.factor)
glimpse(adhd_df)
## Rows: 175
## Columns: 52
## $ Age                <fct> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, 56, 53,~
## $ Sex                <fct> Male, Female, Female, Male, Male, Female, Female, M~
## $ Race               <fct> White, White, White, White, White, White, White, Wh~
## $ ADHD.Q1            <fct> 1, 3, 2, 3, 4, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 4, 3, ~
## $ ADHD.Q2            <fct> 1, 3, 1, 3, 4, 3, 2, 4, 3, 3, 3, 2, 3, 3, 0, 3, 2, ~
## $ ADHD.Q3            <fct> 4, 4, 2, 2, 2, 1, 1, 3, 3, 4, 2, 2, 0, 2, 1, 1, 2, ~
## $ ADHD.Q4            <fct> 2, 4, 1, 2, 4, 4, 3, 4, 4, 4, 3, 3, 2, 4, 2, 4, 3, ~
## $ ADHD.Q5            <fct> 3, 5, 3, 4, 4, 3, 4, 3, 4, 4, 3, 3, 3, 2, 2, 2, 4, ~
## $ ADHD.Q6            <fct> 1, 2, 3, 3, 2, 2, 4, 3, 3, 3, 3, 2, 3, 0, 3, 0, 1, ~
## $ ADHD.Q7            <fct> 1, 2, 3, 2, 3, 3, 2, 1, 3, 4, 3, 1, 2, 4, 0, 2, 2, ~
## $ ADHD.Q8            <fct> 3, 3, 2, 4, 4, 4, 3, 1, 4, 3, 3, 2, 2, 4, 0, 3, 3, ~
## $ ADHD.Q9            <fct> 2, 2, 0, 4, 4, 4, 3, 4, 3, 2, 2, 3, 1, 2, 2, 3, 1, ~
## $ ADHD.Q10           <fct> 4, 4, 1, 2, 2, 2, 4, 2, 4, 4, 3, 2, 1, 4, 2, 3, 2, ~
## $ ADHD.Q11           <fct> 2, 1, 2, 3, 4, 4, 3, 4, 3, 3, 2, 2, 3, 3, 1, 4, 3, ~
## $ ADHD.Q12           <fct> 4, 4, 0, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, ~
## $ ADHD.Q13           <fct> 1, 2, 2, 3, 3, 4, 4, 4, 4, 3, 2, 3, 3, 1, 2, 3, 3, ~
## $ ADHD.Q14           <fct> 0, 4, 2, 3, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 3, ~
## $ ADHD.Q15           <fct> 3, 4, 3, 1, 1, 3, 4, 0, 3, 4, 1, 3, 2, 0, 1, 0, 3, ~
## $ ADHD.Q16           <fct> 1, 3, 2, 2, 2, 4, 4, 0, 2, 4, 2, 1, 1, 0, 1, 1, 3, ~
## $ ADHD.Q17           <fct> 3, 1, 1, 1, 1, 3, 2, 1, 4, 2, 1, 1, 0, 0, 3, 0, 2, ~
## $ ADHD.Q18           <fct> 4, 4, 1, 2, 1, 3, 4, 1, 3, 2, 2, 2, 1, 2, 1, 3, 2, ~
## $ ADHD.Total         <fct> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, 38, 31,~
## $ MD.Q1a             <fct> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, ~
## $ MD.Q1b             <fct> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, ~
## $ MD.Q1c             <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, ~
## $ MD.Q1d             <fct> 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, ~
## $ MD.Q1e             <fct> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, ~
## $ MD.Q1f             <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, ~
## $ MD.Q1g             <fct> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, ~
## $ MD.Q1h             <fct> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, ~
## $ MD.Q1i             <fct> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, ~
## $ MD.Q1j             <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, ~
## $ MD.Q1k             <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, ~
## $ MD.Q1L             <fct> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, ~
## $ MD.Q1m             <fct> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, ~
## $ MD.Q2              <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, ~
## $ MD.Q3              <fct> 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 0, 2, 2, 2, 2, 2, ~
## $ MD.TOTAL           <fct> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 11, 10, ~
## $ Alcohol            <fct> 1, 0, 0, 1, 1, 1, 3, 0, 1, 0, 3, 0, 1, 0, 3, 0, 0, ~
## $ THC                <fct> 1, 0, 0, 1, 1, 0, 3, 0, 0, 3, 1, 0, 1, 0, 1, 0, 0, ~
## $ Cocaine            <fct> 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Stimulants         <fct> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Sedative.hypnotics <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Opioids            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 3, 0, ~
## $ Court.order        <fct> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Education          <fct> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18, 12, 1~
## $ Hx.of.Violence     <fct> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, ~
## $ Disorderly.Conduct <fct> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, ~
## $ Suicide            <fct> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Abuse              <fct> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0, 5, 1, ~
## $ Non.subst.Dx       <fct> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0, 2, 2, ~
## $ Subst.Dx           <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1, 1, 0, ~

Clustering

Clustering is the most common form of unsupervised learning. It is a machine learning approach used to draw inferences from unlabeled data by grouping data points into subsets. The goal is to create clusters that have minimal variance internally but maximal separation from other clusters.

There are two broad approaches to hierarchical clustering. One is agglomerative: each observation starts in its own cluster, and the closest clusters are merged until a stopping criterion is reached. The other is divisive: all observations start in a single cluster, which is split repeatedly until a stopping criterion is met.
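
As a small illustration of the two approaches (a sketch, using the numeric copy jus_nums with standard scaling; hclust() is agglomerative, cluster::diana() is divisive):

# Agglomerative (bottom-up): merge the closest clusters step by step
d <- dist(scale(jus_nums), method = "euclidean")
agglo <- hclust(d, method = "ward.D2")

# Divisive (top-down): start from one cluster and split repeatedly
divis <- cluster::diana(scale(jus_nums), metric = "euclidean")

# Cut both trees into two groups and cross-tabulate the assignments
table(cutree(agglo, k = 2), cutree(as.hclust(divis), k = 2))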

k-means clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Unsupervised algorithms make inferences from datasets using only the input vectors, without referring to known or labelled outcomes.

To perform a cluster analysis in R, generally, the data should be prepared as follows:

  • Remove totalized features (the ADHD and MD total scores)

  • Convert numeric codes to factors

  • Binarize the newly created categorical variables into 0/1 dummy columns: k-means cannot properly interpret Euclidean distances between classes that span more than two categories.

  • Normalization: features are centered and scaled so that the mean is 0 and the standard deviation is 1. This puts all variables on a common scale and allows k-means to place centroids and observations at appropriate distances.

  • Collinearity test: collinearity was tested, and no pair of variables was correlated strongly enough to warrant removal for this reason alone.

  • Remove low-variance features: from the Data Exploration section, Stimulants is a low-variance variable, with the majority of observations recorded as 0.

adhd_df %>% recipe(~.) %>% 
  step_rm(contains("total")) %>% 
  #step_mutate_at(-Age, fn = ~ as.factor(.)) %>% 
  step_dummy(all_nominal(), one_hot = T) %>% 
  step_normalize(all_predictors()) %>%
  step_nzv(all_predictors()) %>% 
  step_corr(all_predictors()) %>% 
  prep()
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##  predictor         52
## 
## Training data contained 175 data points and no missing data.
## 
## Operations:
## 
## Variables removed ADHD.Total, MD.TOTAL [trained]
## Dummy variables from Age, Sex, Race, ADHD.Q1, ADHD.Q2, ADHD.Q3, ... [trained]
## Centering and scaling for Age_X18, Age_X19, Age_X20, Age_X21, ... [trained]
## Sparse, unbalanced variable filter removed Age_X18, Age_X19, Age_X20, ... [trained]
## Correlation filter removed Race_African.American, ... [trained]

The classification of observations into groups requires a method for computing the distance or (dis)similarity between each pair of observations. The choice of distance measure is a critical step in clustering: it defines how the similarity of two elements (x, y) is calculated, it influences the shape of the clusters, and it has a strong influence on the clustering results.

The default distance measure is the Euclidean distance. However, depending on the type of the data and the research questions, other dissimilarity measures might be preferred, and you should be aware of the options.
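
Because most of the columns in adhd_df are factors at this point, a dissimilarity designed for mixed or categorical data may be preferable to plain Euclidean distance; a minimal sketch using Gower distance from the cluster package (already loaded):

# Gower dissimilarity handles factor and numeric columns together
gower_d <- cluster::daisy(adhd_df, metric = "gower")
fviz_dist(as.dist(gower_d), gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))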

Within R it is simple to compute and visualize the distance matrix using the functions get_dist and fviz_dist from the factoextra R package.

distance <- get_dist(adhd_df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

# Drop the Sex and Race factor columns before clustering
adhd_df <- adhd_df[, -c(2,3)]
df <- adhd_df

k2 <- kmeans(df, centers = 2, nstart = 25)
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)
str(k2)
## List of 9
##  $ cluster     : int [1:175] 2 2 1 2 2 2 2 2 2 2 ...
##  $ centers     : num [1:2, 1:50] 40.92 38.045 0.931 2.455 1.172 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:50] "Age" "ADHD.Q1" "ADHD.Q2" "ADHD.Q3" ...
##  $ totss       : num 83398
##  $ withinss    : num [1:2] 22761 23728
##  $ tot.withinss: num 46489
##  $ betweenss   : num 36909
##  $ size        : int [1:2] 87 88
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

We can also view our results using fviz_cluster, which provides a nice illustration of the clusters. If there are more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.

Because the number of clusters (k) must be set before we start the algorithm, it is often advantageous to use several different values of k and examine the differences in the results. We can execute the same process for 3, 4, and 5 clusters, and the results are shown in the figure:

# I had to convert df back to numeric to plot the clusters

df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)], as.numeric)

p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = df) + ggtitle("k = 5")

library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)

Determining Optimal Clusters

As you may recall, the analyst specifies the number of clusters to use; preferably, the analyst would like to use the optimal number of clusters. The following are the three most popular methods for determining the optimal number of clusters:

  • Elbow method
  • Silhouette method
  • Gap statistic

Recall that the basic idea behind cluster partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation or total within-cluster sum of squares) is minimized.

The total within-cluster sum of squares (wss) measures the compactness of the clustering, and we want it to be as small as possible. Thus, we can use the following algorithm to determine the optimal number of clusters (a manual sketch of this loop is shown after the list):

  • Compute the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters
  • For each k, calculate the total within-cluster sum of squares (wss)
  • Plot the curve of wss against the number of clusters k
  • The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters
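
A minimal manual sketch of that loop, using the numeric df prepared above:

# Total within-cluster sum of squares for k = 1..10 (manual elbow computation)
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")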

Fortunately, this process to compute the “Elbow method” has been wrapped up in a single function (fviz_nbclust):

set.seed(123)

fviz_nbclust(df, kmeans, method = "wss")

The results suggest that 4 is the optimal number of clusters, as that is where the bend (elbow) appears in the curve.
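
The silhouette and gap statistic methods listed earlier can be run in much the same way; a sketch (K.max, B, and nstart are illustrative choices):

# Average silhouette width across k = 2..10
fviz_nbclust(df, kmeans, method = "silhouette")

# Gap statistic with B bootstrap reference samples
set.seed(123)
gap_stat <- cluster::clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)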

Extracting Results

With most of these approaches suggesting 2 as the optimal number of clusters, we can perform the final analysis and extract the results using 2 clusters.

set.seed(123)
final <- kmeans(df, 2, nstart = 25)
fviz_cluster(final, data = df)

Hierarchical Clustering

Hierarchical clustering works by repeatedly combining the two nearest clusters into a larger cluster. The agglomerative (bottom-up) approach does not require the number of clusters to be specified beforehand.

The final structure of the clustering is represented by a dendrogram.

To perform hierarchical clustering we will carry out the following steps:

  1. Normalize the continuous numerical values. Each observation's feature values are treated as coordinates in n-dimensional space (n being the number of predictors/features), and the distances between coordinates are calculated in step 2 using a standard measure. If the coordinates are not normalized, features on larger scales dominate the distance calculation and can lead to misleading results.

The continuous variables in the data set are Age, ADHD.Total, and MD.TOTAL.

set.seed(300)
#scale the numerical continuous data

adhd3 <- adhd_df
# Convert the factor columns back to their numeric values before scaling
# (as.numeric() on a factor would return level codes, not the original values)
adhd3$Age <- as.numeric(as.character(adhd3$Age))
adhd3$MD.TOTAL <- as.numeric(as.character(adhd3$MD.TOTAL))
adhd3$ADHD.Total <- as.numeric(as.character(adhd3$ADHD.Total))
adhd3 %>% select_if(is.numeric) %>% lapply(scale) -> adhd3
adhd3 <- as.data.frame(adhd3)
adhd4 <- adhd_df
adhd4$Age <- adhd3$Age
adhd4$ADHD.Total <- adhd3$ADHD.Total
adhd4$MD.TOTAL <- adhd3$MD.TOTAL
  2. We will use the Euclidean measure of distance. The default of the dist function is Euclidean, which is the square root of the sum of squared differences between two vectors.
#compute the distance
dist_mat <- dist(adhd4, method = 'euclidean')
  3. The linkage method selected is ‘ward.D2’. The ‘average’ method was tested first, but the resulting clusters were not as well defined. The Ward method minimizes the total variance within the clusters.
#select linkage method

hclust_comp <- hclust(dist_mat, method = 'ward.D2')
plot(hclust_comp)

  4. Find the optimal number of clusters. We will use the NbClust library to determine and visualize the optimal number of clusters. The NbClust function requires the numeric-only data set.
library(NbClust)
adhd_clust <- NbClust(
                      data = scale(jus_nums),
                      distance = 'euclidean',
                      min.nc = 2, 
                      max.nc = 10,
                      method="ward.D2", 
                      index="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 10 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

Both the scaled and unscaled versions of the data set were tried, and the optimal number of clusters is 2 in both cases. That will be the k for our dendrogram.

#View the different clusters
plot(hclust_comp)
rect.hclust(hclust_comp , k = 2, border = 2:6)

#abline(h = 39, col = 'red')
hierarchyGroups <- cutree(hclust_comp, 2)
fviz_cluster(list(data = jus_nums, cluster = hierarchyGroups))

Looking at the cluster plots, we get two distinct clusters.
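
To characterize the two groups, a quick sketch comparing cluster sizes and the means of a few key variables on the numeric copy:

# Cluster sizes from the dendrogram cut
table(hierarchyGroups)

# Per-cluster means of the total scores and age
aggregate(cbind(ADHD.Total, MD.TOTAL, Age) ~ hierarchyGroups,
          data = cbind(jus_nums, hierarchyGroups), FUN = mean)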

Principal Component Analysis (PCA)

PCA is commonly used for dimensionality reduction: each data point is projected onto only the first few principal components (in most cases the first and second) to obtain lower-dimensional data while keeping as much of the data’s variation as possible.

Features that are strongly correlated with each other are more suitable for PCA than those that are loosely related. The correlation plots below, using the Spearman correlation for the ordinal variables, show that the features within the ADHD set are more strongly correlated than those in the MD set. In addition, there are too many missing values for the individual substance misuse items, so PCA is not performed on that set. For question 2, we conduct PCA on both the ADHD and MD sets, but ADHD proves to be more suitable for the task.
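
The correlation plots referred to above can be produced along these lines (a sketch on the numeric copy jus_nums; Spearman is used because the items are ordinal):

# Spearman correlations within the ADHD question set
adhd_items <- jus_nums %>% dplyr::select(dplyr::starts_with("ADHD.Q"))
corrplot(cor(adhd_items, method = "spearman"), method = "circle", tl.cex = 0.7)

# Spearman correlations within the mood-disorder question set
md_items <- jus_nums %>% dplyr::select(dplyr::starts_with("MD.Q"))
corrplot(cor(md_items, method = "spearman"), method = "circle", tl.cex = 0.7)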

PCA - ADHD

# PCA on the ADHD columns; note that contains("adhd.") matches the ADHD questions and ADHD.Total
PCA_adhd <- prcomp(df %>% dplyr::select(contains("adhd.")))
sd <- PCA_adhd$sdev
loadings <- PCA_adhd$rotation
rownames(loadings) <- colnames(df %>% dplyr::select(contains("adhd.")))
scores <- PCA_adhd$x

summary(PCA_adhd)
## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     15.9724 1.53903 1.33059 1.16044 1.09903 1.00812 0.95616
## Proportion of Variance  0.9453 0.00878 0.00656 0.00499 0.00448 0.00377 0.00339
## Cumulative Proportion   0.9453 0.95407 0.96063 0.96562 0.97009 0.97386 0.97724
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.91382 0.86184 0.84723 0.80836 0.79002 0.76135 0.69356
## Proportion of Variance 0.00309 0.00275 0.00266 0.00242 0.00231 0.00215 0.00178
## Cumulative Proportion  0.98034 0.98309 0.98575 0.98817 0.99048 0.99263 0.99441
##                           PC15    PC16   PC17    PC18    PC19
## Standard deviation     0.68103 0.60607 0.5683 0.52550 0.27813
## Proportion of Variance 0.00172 0.00136 0.0012 0.00102 0.00029
## Cumulative Proportion  0.99613 0.99749 0.9987 0.99971 1.00000

We will visualize the results using a scree plot and a cumulative variance plot. Because the data were not scaled before running prcomp, and ADHD.Total (the sum of the individual question scores) has a much larger variance than any single item, the first component alone captures roughly 94.5% of the variance and the remaining 18 components contribute very little. The scree plot (“elbow” method) and the cumulative variance plot below make this imbalance clear.

par(mfrow = c(1, 2))

# scree plot
screeplot(PCA_adhd, type = "l", npcs = 30, main = "Screeplot of the first 30 PCs")
abline(h = 1, col = "red", lty = 5)
legend("topright", legend = c("Eigenvalue = 1"), col = c("red"), lty = 5, cex = 0.6)

# cumulative variance plot
cumpro <- cumsum(PCA_adhd$sdev^2 / sum(PCA_adhd$sdev^2))
plot(cumpro[0:30], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")

Loading

Loadings are interpreted as the coefficients of the linear combination of the initial variables from which the principal components are constructed. From a numerical point of view, the loadings are equal to the coordinates of the variables divided by the square root of the eigenvalue associated with the component.
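
This relationship can be checked from the prcomp output: the rotation matrix holds the loadings and sdev holds the square roots of the eigenvalues, so their product gives the variable coordinates. A minimal sketch:

# Variable coordinates on PC1 = loading * sqrt(eigenvalue)
var_coord_pc1 <- PCA_adhd$rotation[, 1] * PCA_adhd$sdev[1]
head(sort(abs(var_coord_pc1), decreasing = TRUE), 5)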

ADHD.Total is the most important contributor to the first principal component, and it is the only variable whose PC1 loading exceeds the cutoff. This is expected: ADHD.Total is the sum of the individual question scores, so it carries far more variance than any single item, and PCA on the unscaled data concentrates almost all of the variation in a component aligned with it.

library(kableExtra)
cut_off <- sqrt(1/ncol(df %>% dplyr::select(contains("adhd.")))) 

loadingsDf <- loadings %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column() %>%
    dplyr::select(variables = rowname, everything())

pc1_important <- loadingsDf %>% 
    dplyr::filter(abs(PC1) > cut_off) %>%
    dplyr::select(variables, PC1) %>%
    arrange(desc(abs(PC1)))

pc1_important %>% 
  kable() %>%
  scroll_box()
variables PC1
ADHD.Total 0.9690882

The biplot displays the individuals and variables in the same plot featuring the first two principal components. It seems to suggest that the individuals can be visually clustered into 4 groups based on their responses to the ADHD questions.

biplot(PCA_adhd, scale = 0, cex = 0.5)

Gradient Boosting

We are going to use the gbm package to fit a gradient boosting model.

library(gbm)
library(MASS)

# Separate data set

df1 <- df
df1 <- na.omit(df1)

set.seed(1412)
trainIndex <- createDataPartition(df1$Suicide, p = .80) %>% unlist()
training <- df1[ trainIndex,]
testing  <- df1[-trainIndex,]
# The gaussian loss treats the 0/1 Suicide outcome as a continuous response
model.boost = gbm(Suicide ~ ., data = training, distribution = "gaussian", n.trees = 1350,
                  shrinkage = 0.01, interaction.depth = 4)
model.boost
## gbm(formula = Suicide ~ ., distribution = "gaussian", data = training, 
##     n.trees = 1350, interaction.depth = 4, shrinkage = 0.01)
## A gradient boosted model with gaussian loss function.
## 1350 iterations were performed.
## There were 49 predictors of which 47 had non-zero influence.
#Summary gives a table of Variable Importance and a plot of Variable Importance 
summary(model.boost) 

##                                   var   rel.inf
## Abuse                           Abuse 9.2775812
## Age                               Age 7.6979568
## Alcohol                       Alcohol 5.2531623
## ADHD.Q17                     ADHD.Q17 4.9948204
## ADHD.Q1                       ADHD.Q1 4.8860788
## MD.TOTAL                     MD.TOTAL 4.8026574
## Subst.Dx                     Subst.Dx 3.8608613
## ADHD.Total                 ADHD.Total 3.4710617
## Cocaine                       Cocaine 3.3130873
## Education                   Education 3.0853890
## Opioids                       Opioids 2.8369628
## ADHD.Q10                     ADHD.Q10 2.3637266
## MD.Q1d                         MD.Q1d 2.1464098
## ADHD.Q15                     ADHD.Q15 2.0952871
## ADHD.Q5                       ADHD.Q5 1.9335370
## ADHD.Q6                       ADHD.Q6 1.8805429
## ADHD.Q7                       ADHD.Q7 1.8261864
## ADHD.Q2                       ADHD.Q2 1.8096463
## ADHD.Q4                       ADHD.Q4 1.7470188
## MD.Q2                           MD.Q2 1.7248427
## ADHD.Q18                     ADHD.Q18 1.7062774
## ADHD.Q14                     ADHD.Q14 1.6971749
## MD.Q1g                         MD.Q1g 1.6724975
## THC                               THC 1.6422761
## ADHD.Q16                     ADHD.Q16 1.6004904
## ADHD.Q9                       ADHD.Q9 1.4659478
## ADHD.Q12                     ADHD.Q12 1.4586827
## ADHD.Q3                       ADHD.Q3 1.4300755
## MD.Q1b                         MD.Q1b 1.4024011
## ADHD.Q11                     ADHD.Q11 1.2378792
## MD.Q3                           MD.Q3 1.1962696
## Hx.of.Violence         Hx.of.Violence 1.1763532
## ADHD.Q8                       ADHD.Q8 1.1581247
## MD.Q1k                         MD.Q1k 1.1275679
## MD.Q1e                         MD.Q1e 1.1143728
## Disorderly.Conduct Disorderly.Conduct 1.0388760
## ADHD.Q13                     ADHD.Q13 1.0097709
## MD.Q1c                         MD.Q1c 0.9758807
## MD.Q1L                         MD.Q1L 0.9594112
## MD.Q1m                         MD.Q1m 0.8567218
## MD.Q1h                         MD.Q1h 0.8392388
## MD.Q1i                         MD.Q1i 0.5727816
## Non.subst.Dx             Non.subst.Dx 0.5566255
## MD.Q1j                         MD.Q1j 0.4080050
## MD.Q1a                         MD.Q1a 0.3937639
## MD.Q1f                         MD.Q1f 0.1523843
## Court.order               Court.order 0.1433331
## Stimulants                 Stimulants 0.0000000
## Sedative.hypnotics Sedative.hypnotics 0.0000000

Let us consider the top 10 variables of importance (plus the response, Suicide).

df3 <- subset(df1, select = c('Suicide', 'Abuse', 'Age', 'Alcohol', 'MD.TOTAL', 'ADHD.Q1', 'ADHD.Q17','Subst.Dx','ADHD.Total', 'Cocaine', 'Education'))
set.seed(7790)
trainIndex <- createDataPartition(df3$Suicide, p = .80) %>% unlist()
training <- df3[ trainIndex,]
testing  <- df3[-trainIndex,]
model.boost=gbm(Suicide ~ . ,data = training, distribution = "gaussian",n.trees = 1350,
                  shrinkage = 0.01, interaction.depth = 4)  
model.boost
## gbm(formula = Suicide ~ ., distribution = "gaussian", data = training, 
##     n.trees = 1350, interaction.depth = 4, shrinkage = 0.01)
## A gradient boosted model with gaussian loss function.
## 1350 iterations were performed.
## There were 10 predictors of which 10 had non-zero influence.
#Summary gives a table of Variable Importance and a plot of Variable Importance 
summary(model.boost) 

##                   var   rel.inf
## ADHD.Total ADHD.Total 16.443802
## Age               Age 15.306040
## MD.TOTAL     MD.TOTAL 14.907611
## Alcohol       Alcohol 10.240016
## ADHD.Q17     ADHD.Q17  9.187577
## Abuse           Abuse  8.643782
## ADHD.Q1       ADHD.Q1  7.983155
## Cocaine       Cocaine  6.072144
## Subst.Dx     Subst.Dx  5.633229
## Education   Education  5.582642

Plotting the partial dependence plots: partial dependence plots show the marginal relationship between a predictor X and the response variable Y after averaging out the other predictors.

#Partial dependence of Suicide on the Abuse variable
plot(model.boost,i="Abuse") 

#Partial dependence of Suicide on the Age variable

plot(model.boost,i="Age") 

The plots above show the relationship between the variable on the x-axis and the fitted function f(x) on the y-axis. The first plot shows that Abuse is somewhat positively related to the response Suicide, whereas the second shows that Age has little direct relationship with Suicide.

cor(training$Abuse, training$Suicide)
## [1] 0.3267058
cor(training$Age,training$Suicide)
## [1] -0.09346698

Prediction on Test Set

We will compute the error as a function of the number of trees. Note that the code below evaluates the model on the training partition (140 rows), so the curve reflects training error rather than true held-out error; a sketch of the same computation on the testing partition is shown after the plot.

n.trees = seq(from = 100, to = 5000, by = 100)  # number-of-trees values to evaluate (a vector of 50 values)

#Generating a prediction matrix, one column per n.trees value (computed on the training partition)
predmatrix <- predict(model.boost, training, n.trees = n.trees)
dim(predmatrix)  # dimensions of the prediction matrix
## [1] 140  50
#Calculating the mean squared error (note: computed on the training partition)
test.error <- with(training, apply((predmatrix - Suicide)^2, 2, mean))
head(test.error)  # mean squared error for each value of n.trees
##        100        200        300        400        500        600 
## 0.14529026 0.11622908 0.09815924 0.08642057 0.07763141 0.07074450
#Plotting the error vs number of trees

plot(n.trees, test.error, pch = 19, col = "blue", xlab = "Number of Trees", ylab = "Mean Squared Error", main = "Performance of Boosting on the Training Set")
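
The same curve on the held-out partition gives a true test error estimate; a sketch using the testing data frame created earlier:

# Prediction matrix and mean squared error on the held-out partition
predmatrix_test <- predict(model.boost, testing, n.trees = n.trees)
test.error.holdout <- with(testing, apply((predmatrix_test - Suicide)^2, 2, mean))

plot(n.trees, test.error.holdout, pch = 19, col = "red",
     xlab = "Number of Trees", ylab = "Mean Squared Error",
     main = "Performance of Boosting on the Held-out Set")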

Note from the plot above that if boosting is tuned properly, by selecting appropriate values for the shrinkage parameter lambda, the interaction depth (number of splits), and the number of trees n, it can generalize well and turn a weak learner into a strong learner. By building hundreds or thousands of shallow trees and combining them, boosting tends to outperform a single learner, which is more prone to overfitting or underfitting.
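
One standard way to pick those tuning parameters is gbm's built-in cross-validation; a sketch (the cv.folds value and the larger n.trees grid are illustrative choices, not the settings used above):

# Cross-validated selection of the number of trees
set.seed(1412)
cv.boost <- gbm(Suicide ~ ., data = training, distribution = "gaussian",
                n.trees = 3000, shrinkage = 0.01, interaction.depth = 4,
                cv.folds = 5)
best.iter <- gbm.perf(cv.boost, method = "cv")  # tree count with the lowest CV error
best.iter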

Support Vector Machine

SVMs (Support Vector Machines) are a supervised learning method that can be used for classification and regression analysis; typically they are used for classification. The basic idea of SVM is to construct an optimal hyperplane that linearly separates the classes, using support vectors. Support vectors are the data points that lie closest to the hyperplane; these are the points that are most difficult to classify. The goal of SVM is to find the hyperplane that maximizes the distance (margin) between itself and the support vectors. SVM also works with non-linear data by using kernel functions, which map the data points into a higher-dimensional space.
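
The cost/margin trade-off and the kernel parameters are usually tuned by grid search; a sketch using e1071's tune() (here 'training' stands for the training set built in the next subsection, where Suicide is a factor, and the grid values are illustrative):

# Grid search over cost and gamma for a radial-kernel SVM
library(e1071)
set.seed(100)
tuned <- tune(svm, Suicide ~ ., data = training,
              kernel = "radial",
              ranges = list(cost = 10^(-1:2), gamma = 10^(-3:0)))
summary(tuned)
tuned$best.model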

For the purposes of this question, we want to use SVM to classify suicide attempts. SVM tends to work better with fewer dimensions, so we will reduce the dimensionality of the data set with two approaches: first, eliminate variables with high levels of correlation; second, apply PCA to create new variables that account for most of the variance.

SVM with Dimension Reduction using Correlation

We’re going to create a new data set and convert Suicide into a factor.

adhd_svm_data <- adhd
adhd_svm_data$Suicide<- factor(adhd_svm_data$Suicide)

Selecting only the numerical columns, a correlation plot is developed to identify variables with high correlation values. Once they are identified, the total number of variables can be reduced. Isolating the numerical columns alone leaves 49 columns.

adhd_svm_data %>% select_if(is.numeric) %>% cor() -> svmCor
corrplot(svmCor, method = 'circle')

# Keep only the numerical predictors
svmNumerical <- adhd_svm_data %>% dplyr::select(-Race, -Suicide, -Sex)
# Drop near-zero-variance columns
svmNumerical <- svmNumerical[, -c(nearZeroVar(svmNumerical))]
# Note: these indices refer to the columns of svmCor, which was computed before the
# near-zero-variance filter, so this assumes the two column sets still line up
corr_data <- findCorrelation(svmCor, cutoff = 0.55)
svmNumerical2 <- svmNumerical[, -corr_data]

After applying a correlation cutoff of 0.55, 29 numerical variables remain. In total, we are moving forward with this SVM model with 33 variables.

svmNumerical2$Race <- adhd_svm_data$Race
svmNumerical2$Suicide <- adhd_svm_data$Suicide
svmNumerical2$Sex <- adhd_svm_data$Sex
set.seed(100)

#Create Data Partition 

initsplit <- createDataPartition(svmNumerical2$Suicide, p=0.8, list=FALSE)


#Create Training Data to tune the model
training <- svmNumerical2[initsplit,]


#Create testing data to evaluate the model
test <- svmNumerical2[-initsplit,]
library(e1071)
svmfit_s = svm(Suicide ~ ., data = training, kernel = "sigmoid", cost = 10, scale = TRUE)
svmfit_r = svm(Suicide ~ ., data = training, kernel = "radial", cost = 10, scale = TRUE)
svmfit_l =svm(Suicide ~ ., data = training, kernel = "linear", cost = 10, scale = TRUE)
svmPre1<-predict(svmfit_s,test)
svm1<-confusionMatrix(svmPre1,test$Suicide)
svm1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  7
##          1  6  3
##                                           
##                Accuracy : 0.6176          
##                  95% CI : (0.4356, 0.7783)
##     No Information Rate : 0.7059          
##     P-Value [Acc > NIR] : 0.9037          
##                                           
##                   Kappa : 0.0515          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.7500          
##             Specificity : 0.3000          
##          Pos Pred Value : 0.7200          
##          Neg Pred Value : 0.3333          
##              Prevalence : 0.7059          
##          Detection Rate : 0.5294          
##    Detection Prevalence : 0.7353          
##       Balanced Accuracy : 0.5250          
##                                           
##        'Positive' Class : 0               
## 
svmPre2<-predict(svmfit_r,test)
svm2<-confusionMatrix(svmPre2,test$Suicide)
svm2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19  5
##          1  5  5
##                                          
##                Accuracy : 0.7059         
##                  95% CI : (0.5252, 0.849)
##     No Information Rate : 0.7059         
##     P-Value [Acc > NIR] : 0.5843         
##                                          
##                   Kappa : 0.2917         
##                                          
##  Mcnemar's Test P-Value : 1.0000         
##                                          
##             Sensitivity : 0.7917         
##             Specificity : 0.5000         
##          Pos Pred Value : 0.7917         
##          Neg Pred Value : 0.5000         
##              Prevalence : 0.7059         
##          Detection Rate : 0.5588         
##    Detection Prevalence : 0.7059         
##       Balanced Accuracy : 0.6458         
##                                          
##        'Positive' Class : 0              
## 
svmPre3<-predict(svmfit_l,test)
svm3<-confusionMatrix(svmPre3,test$Suicide)
svm3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 15  8
##          1  9  2
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.3243, 0.6757)
##     No Information Rate : 0.7059          
##     P-Value [Acc > NIR] : 0.9966          
##                                           
##                   Kappa : -0.17           
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6250          
##             Specificity : 0.2000          
##          Pos Pred Value : 0.6522          
##          Neg Pred Value : 0.1818          
##              Prevalence : 0.7059          
##          Detection Rate : 0.4412          
##    Detection Prevalence : 0.6765          
##       Balanced Accuracy : 0.4125          
##                                           
##        'Positive' Class : 0               
## 

SVM - Dimension Reduction with PCA

For this approach we use the ‘jus_nums’ data set (all variables are still numerical, with imputed values).

jus_nums2 <- jus_nums %>% dplyr::select(-MD.TOTAL, -ADHD.Total)

jus_nums2$Suicide <- as.factor(jus_nums2$Suicide)
PCA_svm <- prcomp(jus_nums2 %>% dplyr::select(-Suicide), center = TRUE, scale. = TRUE)
summary(PCA_svm)
## Importance of components:
##                          PC1     PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     3.479 2.16135 1.50769 1.38209 1.3592 1.30149 1.22940
## Proportion of Variance 0.247 0.09534 0.04639 0.03898 0.0377 0.03457 0.03085
## Cumulative Proportion  0.247 0.34235 0.38874 0.42772 0.4654 0.49999 0.53084
##                            PC8     PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     1.20081 1.11978 1.09678 1.0640 1.03988 1.00251 0.97682
## Proportion of Variance 0.02943 0.02559 0.02455 0.0231 0.02207 0.02051 0.01947
## Cumulative Proportion  0.56027 0.58586 0.61040 0.6335 0.65558 0.67609 0.69556
##                           PC15    PC16   PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.96617 0.95067 0.9154 0.89627 0.87005 0.84068 0.82144
## Proportion of Variance 0.01905 0.01844 0.0171 0.01639 0.01545 0.01442 0.01377
## Cumulative Proportion  0.71461 0.73306 0.7502 0.76655 0.78200 0.79642 0.81019
##                           PC22   PC23    PC24    PC25   PC26    PC27    PC28
## Standard deviation     0.79734 0.7762 0.75049 0.73257 0.7207 0.69183 0.67952
## Proportion of Variance 0.01297 0.0123 0.01149 0.01095 0.0106 0.00977 0.00942
## Cumulative Proportion  0.82317 0.8355 0.84696 0.85791 0.8685 0.87828 0.88770
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.67792 0.64656 0.63508 0.61229 0.59429 0.58344 0.57641
## Proportion of Variance 0.00938 0.00853 0.00823 0.00765 0.00721 0.00695 0.00678
## Cumulative Proportion  0.89708 0.90561 0.91384 0.92149 0.92870 0.93565 0.94243
##                           PC36    PC37    PC38    PC39    PC40   PC41    PC42
## Standard deviation     0.55147 0.54357 0.52278 0.51033 0.48642 0.4591 0.45420
## Proportion of Variance 0.00621 0.00603 0.00558 0.00531 0.00483 0.0043 0.00421
## Cumulative Proportion  0.94864 0.95467 0.96024 0.96556 0.97039 0.9747 0.97890
##                          PC43    PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     0.4260 0.42290 0.40593 0.39464 0.37146 0.33741 0.31823
## Proportion of Variance 0.0037 0.00365 0.00336 0.00318 0.00282 0.00232 0.00207
## Cumulative Proportion  0.9826 0.98625 0.98962 0.99279 0.99561 0.99793 1.00000
fviz_pca_biplot(PCA_svm, label="var",
             habillage = jus_nums2$Suicide,
             addEllipses = TRUE, palette = "jco")

We can see that the first two principal components account for roughly 34% of the variance. We could retain around 36 components, which would cover roughly 95% of the variance, but the concern there is overfitting the model. It is necessary to reduce the number of PCA components before building the SVM model; the ‘elbow method’ can be used to choose the number of components.

fviz_eig(PCA_svm, addlabels = TRUE)

Looking at the scree plot, the optimal number of principal components is 3; the rate of decrease slows down after that point. Using the three principal components, the next step is to create a new data set with Suicide as the fourth column. After the new data set is created, we can split the data into training and testing sets. The new data set has 4 columns in total, a large reduction in dimensionality from the original data set.

#Select the first 3 principal components and assign them to a new variable
pca_3 <- PCA_svm$x[,1:3]

#create new data set

pca_new <- cbind(as.data.frame(pca_3), Suicide = jus_nums2$Suicide)
set.seed(100)

#Create Data Partition 

initsplit <- createDataPartition(pca_new$Suicide, p=0.8, list=FALSE)


#Create Training Data to tune the model
training2 <- pca_new[initsplit,]


#Create testing data to evaluate the model
test2 <- pca_new[-initsplit,]

We’re going to build several models using different kernels.

svmfit2 = svm(Suicide ~ ., data = training2, kernel = "radial", cost = 10, scale = FALSE)
svmfit3 = svm(Suicide ~ ., data = training2, kernel = "linear", cost = 10, scale = FALSE)
svmfit4 = svm(Suicide ~ ., data = training2, kernel = "sigmoid", cost = 10, scale = FALSE)

Radial Kernel Results

svmPre2<-predict(svmfit2,test2)
svm2<-confusionMatrix(svmPre2,test2$Suicide)
svm2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 20  6
##          1  4  4
##                                          
##                Accuracy : 0.7059         
##                  95% CI : (0.5252, 0.849)
##     No Information Rate : 0.7059         
##     P-Value [Acc > NIR] : 0.5843         
##                                          
##                   Kappa : 0.2478         
##                                          
##  Mcnemar's Test P-Value : 0.7518         
##                                          
##             Sensitivity : 0.8333         
##             Specificity : 0.4000         
##          Pos Pred Value : 0.7692         
##          Neg Pred Value : 0.5000         
##              Prevalence : 0.7059         
##          Detection Rate : 0.5882         
##    Detection Prevalence : 0.7647         
##       Balanced Accuracy : 0.6167         
##                                          
##        'Positive' Class : 0              
## 

Linear Kernel Results

svmPre3<-predict(svmfit3,test2)
svm3<-confusionMatrix(svmPre3,test2$Suicide)
svm3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24 10
##          1  0  0
##                                          
##                Accuracy : 0.7059         
##                  95% CI : (0.5252, 0.849)
##     No Information Rate : 0.7059         
##     P-Value [Acc > NIR] : 0.584308       
##                                          
##                   Kappa : 0              
##                                          
##  Mcnemar's Test P-Value : 0.004427       
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.0000         
##          Pos Pred Value : 0.7059         
##          Neg Pred Value :    NaN         
##              Prevalence : 0.7059         
##          Detection Rate : 0.7059         
##    Detection Prevalence : 1.0000         
##       Balanced Accuracy : 0.5000         
##                                          
##        'Positive' Class : 0              
## 

Sigmoid Kernel Results

svmPre4<-predict(svmfit4,test2)
svm4<-confusionMatrix(svmPre4,test2$Suicide)
svm4
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  7
##          1  8  3
##                                           
##                Accuracy : 0.5588          
##                  95% CI : (0.3789, 0.7281)
##     No Information Rate : 0.7059          
##     P-Value [Acc > NIR] : 0.9777          
##                                           
##                   Kappa : -0.0324         
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.3000          
##          Pos Pred Value : 0.6957          
##          Neg Pred Value : 0.2727          
##              Prevalence : 0.7059          
##          Detection Rate : 0.4706          
##    Detection Prevalence : 0.6765          
##       Balanced Accuracy : 0.4833          
##                                           
##        'Positive' Class : 0               
## 

Conclusion: Selecting a final model

Interestingly, three of the six models result in the same level of accuracy on the testing data. Among the correlation-reduction models, the one using the sigmoid kernel was the best of that approach, while the linear and radial kernels returned similar results using PCA to trim the variables. It is also important to look at the sensitivity and specificity of the top models. The specificity of two of the top models was 50% and 40% respectively, while the third model had a specificity of 0%. This matters because of the topic and our end goal: detecting suicide risk in patients. One of the models can predict suicide at the rate of a coin toss and the second at less than that, which is poor. I would rather go with the third model, which has no specificity: the model with 100% sensitivity can at least rule out non-suicidal people. In a real-world environment, it is better to be wrong about who you think is suicidal than to accidentally classify someone as non-suicidal who actually is suicidal. The PCA models also have much lower dimensionality, making them simpler models; in a professional setting, the downside of such models would be explaining them to a non-technical audience.
