library(GGally)
library(caret)
library(mice)
library(plyr)
library(tidyverse)
library(DataExplorer)
library(MASS)
library(naniar)
library(corrplot)
library(VIM)
library(cluster)
library(recipes)
library(factoextra)
knitr::opts_chunk$set(echo = TRUE)
ML techniques can potentially offer new routes for learning patterns of human behavior; identifying mental health symptoms and risk factors; developing predictions about disease progression; and personalizing and optimizing therapies. In this project we work with a mental health dataset and apply unsupervised learning methods to cluster the data, discover patterns, and provide new structure and insight.
For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet contains the data (in the tab named “Data”); the data dictionary is given in the second tab. You can get as creative as you want: the assignment is designed to get you to think about how you could use different methods.
Exploration of the data is always the first step in data analysis. The main priorities of exploration are the data types, outliers, the overall distribution of the data points, and missing data. We are also going to see whether any transformations are warranted.
In this process we will analyze and visualize the data to get a better understanding of it and glean insight from it. The process involves the following steps:
#Load the data
raw_adhd <- readxl::read_xlsx('ADHD_data.xlsx')
glimpse(raw_adhd)
## Rows: 175
## Columns: 54
## $ Initial <chr> "JA", "LA", "MD", "RD", "RB", "SB", "PK", "RJ", "~
## $ Age <dbl> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, 56, 5~
## $ Sex <dbl> 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2~
## $ Race <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ `ADHD Q1` <dbl> 1, 3, 2, 3, 4, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 4, 3~
## $ `ADHD Q2` <dbl> 1, 3, 1, 3, 4, 3, 2, 4, 3, 3, 3, 2, 3, 3, 0, 3, 2~
## $ `ADHD Q3` <dbl> 4, 4, 2, 2, 2, 1, 1, 3, 3, 4, 2, 2, 0, 2, 1, 1, 2~
## $ `ADHD Q4` <dbl> 2, 4, 1, 2, 4, 4, 3, 4, 4, 4, 3, 3, 2, 4, 2, 4, 3~
## $ `ADHD Q5` <dbl> 3, 5, 3, 4, 4, 3, 4, 3, 4, 4, 3, 3, 3, 2, 2, 2, 4~
## $ `ADHD Q6` <dbl> 1, 2, 3, 3, 2, 2, 4, 3, 3, 3, 3, 2, 3, 0, 3, 0, 1~
## $ `ADHD Q7` <dbl> 1, 2, 3, 2, 3, 3, 2, 1, 3, 4, 3, 1, 2, 4, 0, 2, 2~
## $ `ADHD Q8` <dbl> 3, 3, 2, 4, 4, 4, 3, 1, 4, 3, 3, 2, 2, 4, 0, 3, 3~
## $ `ADHD Q9` <dbl> 2, 2, 0, 4, 4, 4, 3, 4, 3, 2, 2, 3, 1, 2, 2, 3, 1~
## $ `ADHD Q10` <dbl> 4, 4, 1, 2, 2, 2, 4, 2, 4, 4, 3, 2, 1, 4, 2, 3, 2~
## $ `ADHD Q11` <dbl> 2, 1, 2, 3, 4, 4, 3, 4, 3, 3, 2, 2, 3, 3, 1, 4, 3~
## $ `ADHD Q12` <dbl> 4, 4, 0, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1~
## $ `ADHD Q13` <dbl> 1, 2, 2, 3, 3, 4, 4, 4, 4, 3, 2, 3, 3, 1, 2, 3, 3~
## $ `ADHD Q14` <dbl> 0, 4, 2, 3, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 3~
## $ `ADHD Q15` <dbl> 3, 4, 3, 1, 1, 3, 4, 0, 3, 4, 1, 3, 2, 0, 1, 0, 3~
## $ `ADHD Q16` <dbl> 1, 3, 2, 2, 2, 4, 4, 0, 2, 4, 2, 1, 1, 0, 1, 1, 3~
## $ `ADHD Q17` <dbl> 3, 1, 1, 1, 1, 3, 2, 1, 4, 2, 1, 1, 0, 0, 3, 0, 2~
## $ `ADHD Q18` <dbl> 4, 4, 1, 2, 1, 3, 4, 1, 3, 2, 2, 2, 1, 2, 1, 3, 2~
## $ `ADHD Total` <dbl> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, 38, 3~
## $ `MD Q1a` <dbl> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0~
## $ `MD Q1b` <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0~
## $ `MD Q1c` <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1~
## $ `MD Q1d` <dbl> 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1~
## $ `MD Q1e` <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1~
## $ `MD Q1f` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1~
## $ `MD Q1g` <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1~
## $ `MD Q1h` <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0~
## $ `MD Q1i` <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1~
## $ `MD Q1j` <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ `MD Q1k` <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0~
## $ `MD Q1L` <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0~
## $ `MD Q1m` <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0~
## $ `MD Q2` <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1~
## $ `MD Q3` <dbl> 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 0, 2, 2, 2, 2, 2~
## $ `MD TOTAL` <dbl> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 11, 10~
## $ Alcohol <dbl> 1, 0, 0, 1, 1, 1, 3, 0, 1, 0, 3, 0, 1, 0, 3, 0, 0~
## $ THC <dbl> 1, 0, 0, 1, 1, 0, 3, 0, 0, 3, 1, 0, 1, 0, 1, 0, 0~
## $ Cocaine <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ Stimulants <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ `Sedative-hypnotics` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ Opioids <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 3, 0~
## $ `Court order` <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
## $ Education <dbl> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18, 12,~
## $ `Hx of Violence` <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0~
## $ `Disorderly Conduct` <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0~
## $ Suicide <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ Abuse <dbl> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0, 5, 1~
## $ `Non-subst Dx` <dbl> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0, 2, 2~
## $ `Subst Dx` <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1, 1, 0~
## $ `Psych meds.` <dbl> 2, 1, 1, 2, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0, 2, 1~
The data set is mostly comprised of “dbl” columns, even though some of them, such as Sex and Race, are prime candidates to be converted to factors. Age and Initial take many distinct values; since the ‘Initial’ column is essentially an identifier, we can safely remove it, as it is not needed for the analysis.
It’s important to examine the data dictionary and see which columns might be combined or separated into different data sets.
The original dataframe contains 175 observations (i.e., survey participants) as rows and 54 variables as columns. The columns contain both qualitative and quantitative variables; moreover, some columns represent categorical data but are encoded as numeric values.
Data Dictionary:

| Variable | Coding |
|---|---|
| Sex | Male-1, Female-2 |
| Race | White-1, African American-2, Hispanic-3, Asian-4, Native American-5, Other or missing data-6 |
| ADHD self-report scale | Never-0, Rarely-1, Sometimes-2, Often-3, Very often-4 |
| Mood disorder questions | No-0, Yes-1; Question 3: No problem-0, Minor-1, Moderate-2, Serious-3 |
| Individual substances misuse | No use-0, Use-1, Abuse-2, Dependence-3 |
| Court order | No-0, Yes-1 |
| Education | 1-12 grade, 13+ college |
| History of violence | No-0, Yes-1 |
| Disorderly conduct | No-0, Yes-1 |
| Suicide attempt | No-0, Yes-1 |
| Abuse Hx | No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7 |
| Non-substance-related Dx | 0 - none; 1 - one; 2 - more than one |
| Substance-related Dx | 0 - none; 1 - one; 2 - two; 3 - three or more |
| Psychiatric meds | 0 - none; 1 - one psychotropic med; 2 - more than one psychotropic med |
Identifying the missing data from the data set:
plot_missing(
raw_adhd,
group = list(Good = 0.05, OK = 0.4, Bad = 0.8, Remove = 1),
missing_only = TRUE,
geom_label_args = list(),
title = NULL,
ggtheme = theme_gray(),
theme_config = list(legend.position = c("bottom"))
)
Several columns are missing data. With the exception of Psych.meds., most of the missing data appears imputable. The Psych.meds. column is missing 67.43% of its data points; imputing this column is not advisable, since most of its values would be imputed.
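As a quick numeric check (a minimal sketch using base R, separate from the plot above), the share of missing values per column can be computed directly:
# Proportion of missing values per column, keeping only columns with any NAs
miss_share <- sort(colMeans(is.na(raw_adhd)), decreasing = TRUE)
round(miss_share[miss_share > 0], 4)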
gg_miss_upset(raw_adhd)
# rename data column names
raw_adhd <- raw_adhd %>% dplyr::rename_all(make.names)
imputed_Data <- mice(raw_adhd, m=2, maxit = 2, method = 'cart', seed = 500)
##
## iter imp variable
## 1 1 Alcohol THC Cocaine Stimulants Sedative.hypnotics Opioids Court.order Education Hx.of.Violence Disorderly.Conduct Suicide Abuse Non.subst.Dx Subst.Dx Psych.meds.
## 1 2 Alcohol THC Cocaine Stimulants Sedative.hypnotics Opioids Court.order Education Hx.of.Violence Disorderly.Conduct Suicide Abuse Non.subst.Dx Subst.Dx Psych.meds.
## 2 1 Alcohol THC Cocaine Stimulants Sedative.hypnotics Opioids Court.order Education Hx.of.Violence Disorderly.Conduct Suicide Abuse Non.subst.Dx Subst.Dx Psych.meds.
## 2 2 Alcohol THC Cocaine Stimulants Sedative.hypnotics Opioids Court.order Education Hx.of.Violence Disorderly.Conduct Suicide Abuse Non.subst.Dx Subst.Dx Psych.meds.
raw_adhd <- complete(imputed_Data,2)
library(skimr)
skim(raw_adhd)
| Name | raw_adhd |
| Number of rows | 175 |
| Number of columns | 54 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 53 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Initial | 0 | 1 | 2 | 3 | 0 | 108 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | ▆▅▇▅▁ |
| Sex | 0 | 1 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | ▇▁▁▁▆ |
| Race | 0 | 1 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | ▇▁▁▁▁ |
| ADHD.Q1 | 0 | 1 | 1.70 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▇▇▇▆▃ |
| ADHD.Q2 | 0 | 1 | 1.91 | 1.25 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD.Q3 | 0 | 1 | 1.91 | 1.27 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD.Q4 | 0 | 1 | 2.10 | 1.34 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▅▆ |
| ADHD.Q5 | 0 | 1 | 2.26 | 1.44 | 0 | 1.0 | 3 | 3.0 | 5 | ▇▅▇▆▁ |
| ADHD.Q6 | 0 | 1 | 1.91 | 1.31 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▅▇▇▃ |
| ADHD.Q7 | 0 | 1 | 1.83 | 1.19 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▃▃ |
| ADHD.Q8 | 0 | 1 | 2.14 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▇▆ |
| ADHD.Q9 | 0 | 1 | 1.91 | 1.32 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▇▇▇▅ |
| ADHD.Q10 | 0 | 1 | 2.12 | 1.23 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▇▇▆▅ |
| ADHD.Q11 | 0 | 1 | 2.27 | 1.24 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▆▇▇▆ |
| ADHD.Q12 | 0 | 1 | 1.29 | 1.21 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▂▂ |
| ADHD.Q13 | 0 | 1 | 2.37 | 1.23 | 0 | 1.5 | 2 | 3.0 | 4 | ▂▅▇▇▆ |
| ADHD.Q14 | 0 | 1 | 2.25 | 1.35 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▇▆ |
| ADHD.Q15 | 0 | 1 | 1.63 | 1.39 | 0 | 0.0 | 1 | 3.0 | 4 | ▇▆▆▅▃ |
| ADHD.Q16 | 0 | 1 | 1.70 | 1.38 | 0 | 1.0 | 1 | 3.0 | 4 | ▆▇▆▃▅ |
| ADHD.Q17 | 0 | 1 | 1.53 | 1.29 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▇▃▃ |
| ADHD.Q18 | 0 | 1 | 1.47 | 1.30 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▃▃ |
| ADHD.Total | 0 | 1 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | ▃▆▇▆▂ |
| MD.Q1a | 0 | 1 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD.Q1b | 0 | 1 | 0.57 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD.Q1c | 0 | 1 | 0.54 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▇▁▁▁▇ |
| MD.Q1d | 0 | 1 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD.Q1e | 0 | 1 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD.Q1f | 0 | 1 | 0.70 | 0.46 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD.Q1g | 0 | 1 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD.Q1h | 0 | 1 | 0.56 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD.Q1i | 0 | 1 | 0.59 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD.Q1j | 0 | 1 | 0.39 | 0.49 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▅ |
| MD.Q1k | 0 | 1 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD.Q1L | 0 | 1 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD.Q1m | 0 | 1 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD.Q2 | 0 | 1 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD.Q3 | 0 | 1 | 2.01 | 1.07 | 0 | 1.0 | 2 | 3.0 | 3 | ▂▂▁▅▇ |
| MD.TOTAL | 0 | 1 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | ▃▃▆▇▇ |
| Alcohol | 0 | 1 | 1.33 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | ▇▂▁▁▆ |
| THC | 0 | 1 | 0.79 | 1.26 | 0 | 0.0 | 0 | 1.0 | 3 | ▇▁▁▁▂ |
| Cocaine | 0 | 1 | 1.11 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | ▇▁▁▁▅ |
| Stimulants | 0 | 1 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Sedative.hypnotics | 0 | 1 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Opioids | 0 | 1 | 0.40 | 1.00 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Court.order | 0 | 1 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| Education | 0 | 1 | 11.87 | 2.13 | 6 | 11.0 | 12 | 12.0 | 19 | ▁▅▇▂▁ |
| Hx.of.Violence | 0 | 1 | 0.25 | 0.44 | 0 | 0.0 | 0 | 0.5 | 1 | ▇▁▁▁▃ |
| Disorderly.Conduct | 0 | 1 | 0.74 | 0.44 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| Suicide | 0 | 1 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▃ |
| Abuse | 0 | 1 | 1.26 | 2.07 | 0 | 0.0 | 0 | 2.0 | 7 | ▇▂▁▁▁ |
| Non.subst.Dx | 0 | 1 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | ▇▁▂▁▁ |
| Subst.Dx | 0 | 1 | 1.11 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | ▆▇▁▅▂ |
| Psych.meds. | 0 | 1 | 0.46 | 0.71 | 0 | 0.0 | 0 | 1.0 | 2 | ▇▁▂▁▂ |
First, we are going to remove the Psych.meds. and Initial columns from the data set.
adhd_df <- raw_adhd[-c(1, 54)]
jus_nums <- raw_adhd[-c(1, 54)]
The distributions of the responses to the ADHD questionnaire, along with the distributions of the binary and ordinal variables, are displayed below.
plot_histogram(adhd_df, ggtheme = theme_linedraw())
Based on the histograms plotted above, many variables, although numeric, behave as categorical features; this will need to be accounted for when performing the k-means clustering analysis. There do not seem to be any clearly distinguishable outliers, but some features show low variance, such as Stimulants, where the majority of recorded observations are 0.
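As a sketch of how such low-variance features could be flagged programmatically (using caret, which is already loaded; the thresholds are caret’s defaults, not tuned here):
# Flag near-zero-variance features such as Stimulants and Sedative.hypnotics
nzv_metrics <- caret::nearZeroVar(adhd_df, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, c("freqRatio", "percentUnique", "nzv")]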
The columns Sex and Race are going to be converted to factors with appropriate labels. When building later models, labeled factors make the resulting clusters or groups easier to read.
adhd_df$Sex <-factor(adhd_df$Sex, levels = c(1,2), labels=c('Male','Female'))
adhd_df$Race <-factor(adhd_df$Race, levels=c(1,2,3,4,5,6), labels = c('White', 'African American', 'Hispanic', 'Asian', 'Native American', 'Other'))
adhd <- adhd_df
adhd_df[sapply(adhd_df, is.double)] <- lapply(adhd_df[sapply(adhd_df, is.double)], as.factor)
glimpse(adhd_df)
## Rows: 175
## Columns: 52
## $ Age <fct> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, 56, 53,~
## $ Sex <fct> Male, Female, Female, Male, Male, Female, Female, M~
## $ Race <fct> White, White, White, White, White, White, White, Wh~
## $ ADHD.Q1 <fct> 1, 3, 2, 3, 4, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 4, 3, ~
## $ ADHD.Q2 <fct> 1, 3, 1, 3, 4, 3, 2, 4, 3, 3, 3, 2, 3, 3, 0, 3, 2, ~
## $ ADHD.Q3 <fct> 4, 4, 2, 2, 2, 1, 1, 3, 3, 4, 2, 2, 0, 2, 1, 1, 2, ~
## $ ADHD.Q4 <fct> 2, 4, 1, 2, 4, 4, 3, 4, 4, 4, 3, 3, 2, 4, 2, 4, 3, ~
## $ ADHD.Q5 <fct> 3, 5, 3, 4, 4, 3, 4, 3, 4, 4, 3, 3, 3, 2, 2, 2, 4, ~
## $ ADHD.Q6 <fct> 1, 2, 3, 3, 2, 2, 4, 3, 3, 3, 3, 2, 3, 0, 3, 0, 1, ~
## $ ADHD.Q7 <fct> 1, 2, 3, 2, 3, 3, 2, 1, 3, 4, 3, 1, 2, 4, 0, 2, 2, ~
## $ ADHD.Q8 <fct> 3, 3, 2, 4, 4, 4, 3, 1, 4, 3, 3, 2, 2, 4, 0, 3, 3, ~
## $ ADHD.Q9 <fct> 2, 2, 0, 4, 4, 4, 3, 4, 3, 2, 2, 3, 1, 2, 2, 3, 1, ~
## $ ADHD.Q10 <fct> 4, 4, 1, 2, 2, 2, 4, 2, 4, 4, 3, 2, 1, 4, 2, 3, 2, ~
## $ ADHD.Q11 <fct> 2, 1, 2, 3, 4, 4, 3, 4, 3, 3, 2, 2, 3, 3, 1, 4, 3, ~
## $ ADHD.Q12 <fct> 4, 4, 0, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, ~
## $ ADHD.Q13 <fct> 1, 2, 2, 3, 3, 4, 4, 4, 4, 3, 2, 3, 3, 1, 2, 3, 3, ~
## $ ADHD.Q14 <fct> 0, 4, 2, 3, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 3, ~
## $ ADHD.Q15 <fct> 3, 4, 3, 1, 1, 3, 4, 0, 3, 4, 1, 3, 2, 0, 1, 0, 3, ~
## $ ADHD.Q16 <fct> 1, 3, 2, 2, 2, 4, 4, 0, 2, 4, 2, 1, 1, 0, 1, 1, 3, ~
## $ ADHD.Q17 <fct> 3, 1, 1, 1, 1, 3, 2, 1, 4, 2, 1, 1, 0, 0, 3, 0, 2, ~
## $ ADHD.Q18 <fct> 4, 4, 1, 2, 1, 3, 4, 1, 3, 2, 2, 2, 1, 2, 1, 3, 2, ~
## $ ADHD.Total <fct> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, 38, 31,~
## $ MD.Q1a <fct> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, ~
## $ MD.Q1b <fct> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, ~
## $ MD.Q1c <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, ~
## $ MD.Q1d <fct> 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, ~
## $ MD.Q1e <fct> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, ~
## $ MD.Q1f <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, ~
## $ MD.Q1g <fct> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, ~
## $ MD.Q1h <fct> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, ~
## $ MD.Q1i <fct> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, ~
## $ MD.Q1j <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, ~
## $ MD.Q1k <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, ~
## $ MD.Q1L <fct> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, ~
## $ MD.Q1m <fct> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, ~
## $ MD.Q2 <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, ~
## $ MD.Q3 <fct> 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 0, 2, 2, 2, 2, 2, ~
## $ MD.TOTAL <fct> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 11, 10, ~
## $ Alcohol <fct> 1, 0, 0, 1, 1, 1, 3, 0, 1, 0, 3, 0, 1, 0, 3, 0, 0, ~
## $ THC <fct> 1, 0, 0, 1, 1, 0, 3, 0, 0, 3, 1, 0, 1, 0, 1, 0, 0, ~
## $ Cocaine <fct> 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Stimulants <fct> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Sedative.hypnotics <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Opioids <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 3, 0, ~
## $ Court.order <fct> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Education <fct> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18, 12, 1~
## $ Hx.of.Violence <fct> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, ~
## $ Disorderly.Conduct <fct> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, ~
## $ Suicide <fct> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Abuse <fct> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0, 5, 1, ~
## $ Non.subst.Dx <fct> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0, 2, 2, ~
## $ Subst.Dx <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1, 1, 0, ~
Clustering is the most common form of unsupervised learning. It is a machine learning approach used to draw inferences from unlabeled data by grouping a set of data points into subsets. The goal is to create clusters that have minimal variance internally but maximal separation from other clusters.
There are two broad approaches to hierarchical clustering. One is agglomerative: each observation starts in its own cluster, and clusters are merged until a stopping criterion is reached. The second is divisive: all observations start in a single cluster, which is split repeatedly until a stopping criterion is met.
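As a small illustration of the two approaches (a sketch on the numeric copy jus_nums, not part of the main pipeline), the cluster package loaded above provides both: agnes() is agglomerative and diana() is divisive, and their coefficients give a rough sense of how strong the clustering structure is.
# Agglomerative (bottom-up) hierarchical clustering
ag <- cluster::agnes(scale(jus_nums), method = "ward")
# Divisive (top-down) hierarchical clustering
dv <- cluster::diana(scale(jus_nums))
# Coefficients closer to 1 indicate stronger clustering structure
c(agnes = ag$ac, diana = dv$dc)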
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes.
To perform a cluster analysis in R, generally, the data should be prepared as follows:
Remove totalized features (ADHD.Total and MD.TOTAL).
Convert numeric codes to factors.
Binarize the newly created categorical variables into 0/1 dummy variables; k-means cannot properly handle Euclidean distances between classes that span more than two categories.
Normalization: features are centered and scaled so the mean is 0 and the standard deviation is 1; this puts all variables on a comparable scale so that k-means can place centroids and observations at appropriate distances.
Collinearity test: collinearity was tested, and no pair of variables was correlated strongly enough to require removal for this reason alone.
Remove low-variance features: from the data exploration section, Stimulants appears to be a low-variance variable, with the majority of responses recorded as 0.
adhd_df %>% recipe(~.) %>%
step_rm(contains("total")) %>%
#step_mutate_at(-Age, fn = ~ as.factor(.)) %>%
step_dummy(all_nominal(), one_hot = T) %>%
step_normalize(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_corr(all_predictors()) %>%
prep() #%>%
## Data Recipe
##
## Inputs:
##
## role #variables
## predictor 52
##
## Training data contained 175 data points and no missing data.
##
## Operations:
##
## Variables removed ADHD.Total, MD.TOTAL [trained]
## Dummy variables from Age, Sex, Race, ADHD.Q1, ADHD.Q2, ADHD.Q3, ... [trained]
## Centering and scaling for Age_X18, Age_X19, Age_X20, Age_X21, ... [trained]
## Sparse, unbalanced variable filter removed Age_X18, Age_X19, Age_X20, ... [trained]
## Correlation filter removed Race_African.American, ... [trained]
The classification of observations into groups requires a method for computing the distance or (dis)similarity between each pair of observations. The choice of distance measure is a critical step in clustering: it defines how the similarity of two elements (x, y) is calculated, it influences the shape of the clusters, and it has a strong influence on the clustering results.
The default distance measure is the Euclidean distance. However, depending on the type of data and the research questions, other dissimilarity measures might be preferred, and you should be aware of the options.
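For example, for a data frame that mixes numeric and categorical columns, the Gower dissimilarity is a common alternative; a minimal sketch using cluster::daisy on the factor-coded adhd_df:
# Gower dissimilarity handles factor and numeric columns together
gower_d <- cluster::daisy(adhd_df, metric = "gower")
summary(gower_d)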
Within R it is simple to compute and visualize the distance matrix using the functions get_dist and fviz_dist from the factoextra R package.
distance <- get_dist(adhd_df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
adhd_df <- adhd_df[, -c(2,3)]
df <- adhd_df
k2 <- kmeans(df, centers = 2, nstart = 25)
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)
str(k2)
## List of 9
## $ cluster : int [1:175] 2 2 1 2 2 2 2 2 2 2 ...
## $ centers : num [1:2, 1:50] 40.92 38.045 0.931 2.455 1.172 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:50] "Age" "ADHD.Q1" "ADHD.Q2" "ADHD.Q3" ...
## $ totss : num 83398
## $ withinss : num [1:2] 22761 23728
## $ tot.withinss: num 46489
## $ betweenss : num 36909
## $ size : int [1:2] 87 88
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
We can also view our results by using fviz_cluster. This provides a nice illustration of the clusters. If there are more than two dimensions (variables) fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance.
Because the number of clusters (k) must be set before we start the algorithm, it is often advantageous to use several different values of k and examine the differences in the results. We can execute the same process for 3, 4, and 5 clusters, and the results are shown in the figure:
# I had to convert df back to numeric to plot the clusters
df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)], as.numeric)
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
Determining Optimal Clusters
As you may recall, the analyst specifies the number of clusters to use, and ideally that should be the optimal number. Several methods exist for determining the optimal number of clusters; here we use the elbow method.
Recall that the basic idea behind partitioning methods such as k-means is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation or total within-cluster sum of squares) is minimized.
The total within-cluster sum of squares (wss) measures the compactness of the clustering, and we want it to be as small as possible. Thus, we can compute the wss for a range of values of k and look for the point where adding another cluster no longer meaningfully reduces it.
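A minimal sketch of that idea (our own loop, not part of the original analysis) computes the total within-cluster sum of squares for k = 1 through 10 on the numeric data frame df:
# Total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")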
Fortunately, this whole elbow-method computation has been wrapped up in a single function (fviz_nbclust):
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
The elbow plot suggests that 4 could be the optimal number of clusters, as that is where the bend (elbow) appears.
Extracting Results
Although the elbow plot could support 4 clusters, most of the approaches we considered (notably the NbClust analysis later in this report) suggest 2 as the optimal number of clusters, so we perform the final analysis and extract the results using 2 clusters.
set.seed(123)
final <- kmeans(df, 2, nstart = 25)
fviz_cluster(final, data = df)
Hierarchical (agglomerative) clustering works by repeatedly combining the two nearest clusters into a larger cluster. It is a bottom-up approach that does not require the number of clusters to be specified beforehand.
The final structure of the cluster is represented by a dendrogram diagram.
In order to perform hierarchical clustering we will scale the continuous variables, compute the distance matrix, and select a linkage method.
The continuous variables in the data set are Age, ADHD.Total, and MD.TOTAL.
set.seed(300)
#scale the numerical continuous data
adhd3 <- adhd_df
adhd3$Age <- as.numeric(adhd3$Age)
adhd3$MD.TOTAL <- as.numeric(adhd3$MD.TOTAL)
adhd3$ADHD.Total <- as.numeric(adhd3$ADHD.Total)
adhd3 %>% select_if(is.numeric) %>% lapply(scale) ->adhd3
adhd3 <- as.data.frame(adhd3)
adhd4 <- adhd_df
adhd4$Age <- adhd3$Age
adhd4$ADHD.Total <- adhd3$ADHD.Total
adhd4$MD.TOTAL <- adhd3$MD.TOTAL
#compute the distance
dist_mat <- dist(adhd4, method = 'euclidean')
#select linkage method
hclust_comp <- hclust(dist_mat, method = 'ward.D2')
plot(hclust_comp)
library(NbClust)
adhd_clust <- NbClust(
data = scale(jus_nums),
distance = 'euclidean',
min.nc = 2,
max.nc = 10,
method="ward.D2",
index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 10 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
I tried both the scaled and unscaled versions of the data set; in both cases the optimal number of clusters is 2, so that is the k we will use in the dendrogram.
#View the different clusters
plot(hclust_comp)
rect.hclust(hclust_comp , k = 2, border = 2:6)
#abline(h = 39, col = 'red')
hierarchyGroups <- cutree(hclust_comp, 2)
fviz_cluster(list(data = jus_nums, cluster = hierarchyGroups))
Looking at the cluster plots, we get two distinct clusters.
PCA is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components (in most cases the first and second dimensions) to obtain lower-dimensional data while keeping as much of the data’s variation as possible.
Features that are strongly correlated with each other are more suitable for PCA than those that are only loosely related. A corrplot using the Spearman correlation for the categorical variables shows that the features within the ADHD set are more strongly correlated than those in the MD set. In addition, there are too many missing values for the individual substance misuse items, so PCA is not performed on that set. For question 2, we conducted PCA on both ADHD and MD, but ADHD proved more suitable for the task.
# pca - ADHD
PCA_adhd <- prcomp(df %>% dplyr::select(contains("adhd.")))
sd <- PCA_adhd$sdev
loadings <- PCA_adhd$rotation
rownames(loadings) <- colnames(df %>% dplyr::select(contains("adhd.")))
scores <- PCA_adhd$x
summary(PCA_adhd)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 15.9724 1.53903 1.33059 1.16044 1.09903 1.00812 0.95616
## Proportion of Variance 0.9453 0.00878 0.00656 0.00499 0.00448 0.00377 0.00339
## Cumulative Proportion 0.9453 0.95407 0.96063 0.96562 0.97009 0.97386 0.97724
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.91382 0.86184 0.84723 0.80836 0.79002 0.76135 0.69356
## Proportion of Variance 0.00309 0.00275 0.00266 0.00242 0.00231 0.00215 0.00178
## Cumulative Proportion 0.98034 0.98309 0.98575 0.98817 0.99048 0.99263 0.99441
## PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.68103 0.60607 0.5683 0.52550 0.27813
## Proportion of Variance 0.00172 0.00136 0.0012 0.00102 0.00029
## Cumulative Proportion 0.99613 0.99749 0.9987 0.99971 1.00000
We will visualize the results using a scree plot and a cumulative variance plot. The scree plot (“elbow” method) indicates that the first 4 components are the most important ones, although they captured only roughly 32% of the variance in the data. The first 29 components have standard deviations above 1, and together they captured 77% of the variance; the number of dimensions is reduced significantly, from 91 to 29, in this case.
par(mfrow = c(1, 2))
# scree plot
screeplot(PCA_adhd, type = "l", npcs = 30, main = "Screeplot of the first 30 PCs")
abline(h = 1, col = "red", lty = 5)
legend("topright", legend = c("Eigenvalue = 1"), col = c("red"), lty = 5, cex = 0.6)
# cumulative variance plot
cumpro <- cumsum(PCA_adhd$sdev^2 / sum(PCA_adhd$sdev^2))
plot(cumpro[1:30], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
Loadings
Loadings are interpreted as the coefficients of the linear combination of the initial variables from which the principal components are constructed. From a numerical point of view, the loadings are equal to the coordinates of the variables divided by the square root of the eigenvalue associated with the component.
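As a small numeric check of that relationship (a sketch; for prcomp the component standard deviations are the square roots of the eigenvalues, so variable coordinates equal loadings multiplied by the sdev):
# Recover the PC1 variable coordinates from the loadings and the component sdev
var_coord_pc1 <- PCA_adhd$rotation[, 1] * PCA_adhd$sdev[1]
head(round(cbind(loading = PCA_adhd$rotation[, 1], coord = var_coord_pc1), 3))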
ADHD.Total is the most important contributor to the first principal component. In the dummy-encoded version of the data, PC1 is dominated by “_X4” variables, i.e., responses of “very often” to the ADHD questions, whereas PC2 is dominated by “_X0” variables, i.e., responses of “never”. As expected, PCA successfully extracts components from features that are not strongly associated with each other.
library(kableExtra)
cut_off <- sqrt(1/ncol(df %>% dplyr::select(contains("adhd."))))
loadingsDf <- loadings %>%
as.data.frame() %>%
tibble::rownames_to_column() %>%
dplyr::select(variables = rowname, everything())
pc1_important <- loadingsDf %>%
dplyr::filter(abs(PC1) > cut_off) %>%
dplyr::select(variables, PC1) %>%
arrange(desc(abs(PC1)))
pc1_important %>%
kable() %>%
scroll_box()
| variables | PC1 |
|---|---|
| ADHD.Total | 0.9690882 |
The biplot displays the individuals and variables in the same plot featuring the first two principal components. It seems to suggest that the individuals can be visually clustered into 4 groups based on their responses to the ADHD questions.
biplot(PCA_adhd, scale = 0, cex = 0.5)
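As an additional sketch (not part of the original analysis), the individuals in this PCA space can also be colored by the two k-means clusters found earlier (final$cluster) to see how the groupings line up with the first two components:
# Overlay the earlier k-means assignment on the PCA individuals plot
fviz_pca_ind(PCA_adhd, geom = "point",
             habillage = as.factor(final$cluster),
             addEllipses = TRUE, palette = "jco")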
We are going to use gbm to fit a gradient boosting model.
library(gbm)
library(MASS)
# Separate data set
df1 <- df
df1 <- na.omit(df1)
set.seed(1412)
trainIndex <- createDataPartition(df1$Suicide, p = .80) %>% unlist()
training <- df1[ trainIndex,]
testing <- df1[-trainIndex,]
model.boost=gbm(Suicide ~ . ,data = training, distribution = "gaussian",n.trees = 1350,
shrinkage = 0.01, interaction.depth = 4)
model.boost
## gbm(formula = Suicide ~ ., distribution = "gaussian", data = training,
## n.trees = 1350, interaction.depth = 4, shrinkage = 0.01)
## A gradient boosted model with gaussian loss function.
## 1350 iterations were performed.
## There were 49 predictors of which 47 had non-zero influence.
#Summary gives a table of Variable Importance and a plot of Variable Importance
summary(model.boost)
## var rel.inf
## Abuse Abuse 9.2775812
## Age Age 7.6979568
## Alcohol Alcohol 5.2531623
## ADHD.Q17 ADHD.Q17 4.9948204
## ADHD.Q1 ADHD.Q1 4.8860788
## MD.TOTAL MD.TOTAL 4.8026574
## Subst.Dx Subst.Dx 3.8608613
## ADHD.Total ADHD.Total 3.4710617
## Cocaine Cocaine 3.3130873
## Education Education 3.0853890
## Opioids Opioids 2.8369628
## ADHD.Q10 ADHD.Q10 2.3637266
## MD.Q1d MD.Q1d 2.1464098
## ADHD.Q15 ADHD.Q15 2.0952871
## ADHD.Q5 ADHD.Q5 1.9335370
## ADHD.Q6 ADHD.Q6 1.8805429
## ADHD.Q7 ADHD.Q7 1.8261864
## ADHD.Q2 ADHD.Q2 1.8096463
## ADHD.Q4 ADHD.Q4 1.7470188
## MD.Q2 MD.Q2 1.7248427
## ADHD.Q18 ADHD.Q18 1.7062774
## ADHD.Q14 ADHD.Q14 1.6971749
## MD.Q1g MD.Q1g 1.6724975
## THC THC 1.6422761
## ADHD.Q16 ADHD.Q16 1.6004904
## ADHD.Q9 ADHD.Q9 1.4659478
## ADHD.Q12 ADHD.Q12 1.4586827
## ADHD.Q3 ADHD.Q3 1.4300755
## MD.Q1b MD.Q1b 1.4024011
## ADHD.Q11 ADHD.Q11 1.2378792
## MD.Q3 MD.Q3 1.1962696
## Hx.of.Violence Hx.of.Violence 1.1763532
## ADHD.Q8 ADHD.Q8 1.1581247
## MD.Q1k MD.Q1k 1.1275679
## MD.Q1e MD.Q1e 1.1143728
## Disorderly.Conduct Disorderly.Conduct 1.0388760
## ADHD.Q13 ADHD.Q13 1.0097709
## MD.Q1c MD.Q1c 0.9758807
## MD.Q1L MD.Q1L 0.9594112
## MD.Q1m MD.Q1m 0.8567218
## MD.Q1h MD.Q1h 0.8392388
## MD.Q1i MD.Q1i 0.5727816
## Non.subst.Dx Non.subst.Dx 0.5566255
## MD.Q1j MD.Q1j 0.4080050
## MD.Q1a MD.Q1a 0.3937639
## MD.Q1f MD.Q1f 0.1523843
## Court.order Court.order 0.1433331
## Stimulants Stimulants 0.0000000
## Sedative.hypnotics Sedative.hypnotics 0.0000000
Let’s consider the top 10 variables of importance (plus the response, Suicide).
df3 <- subset(df1, select = c('Suicide', 'Abuse', 'Age', 'Alcohol', 'MD.TOTAL', 'ADHD.Q1', 'ADHD.Q17','Subst.Dx','ADHD.Total', 'Cocaine', 'Education'))
set.seed(7790)
trainIndex <- createDataPartition(df3$Suicide, p = .80) %>% unlist()
training <- df3[ trainIndex,]
testing <- df3[-trainIndex,]
model.boost=gbm(Suicide ~ . ,data = training, distribution = "gaussian",n.trees = 1350,
shrinkage = 0.01, interaction.depth = 4)
model.boost
## gbm(formula = Suicide ~ ., distribution = "gaussian", data = training,
## n.trees = 1350, interaction.depth = 4, shrinkage = 0.01)
## A gradient boosted model with gaussian loss function.
## 1350 iterations were performed.
## There were 10 predictors of which 10 had non-zero influence.
#Summary gives a table of Variable Importance and a plot of Variable Importance
summary(model.boost)
## var rel.inf
## ADHD.Total ADHD.Total 16.443802
## Age Age 15.306040
## MD.TOTAL MD.TOTAL 14.907611
## Alcohol Alcohol 10.240016
## ADHD.Q17 ADHD.Q17 9.187577
## Abuse Abuse 8.643782
## ADHD.Q1 ADHD.Q1 7.983155
## Cocaine Cocaine 6.072144
## Subst.Dx Subst.Dx 5.633229
## Education Education 5.582642
Plotting the partial dependence plots: partial dependence plots show the relationship between a predictor variable X and the response variable Y after averaging over the other predictors.
#Partial dependence of the response on the Abuse variable
plot(model.boost,i="Abuse")
#Inverse relation with Age variable
plot(model.boost,i="Age")
The plots above show the relationship between the variable on the x-axis and the fitted function f(x) on the y-axis. The first plot shows that Abuse is somewhat positively related to the response Suicide, whereas the second shows that Age is only weakly (and slightly negatively) related to Suicide.
cor(training$Abuse, training$Suicide)
## [1] 0.3267058
cor(training$Age,training$Suicide)
## [1] -0.09346698
Prediction on Test Set
We will compute the mean squared error as a function of the number of trees.
n.trees = seq(from = 100, to = 5000, by = 100) #number-of-trees grid: a vector of 50 values
#Generating a Prediction matrix for each Tree
predmatrix<-predict(model.boost,training,n.trees = n.trees)
dim(predmatrix) #dimentions of the Prediction Matrix
## [1] 140 50
#Calculating the mean squared error (computed here on the training partition)
test.error<-with(training,apply( (predmatrix-Suicide)^2,2,mean))
head(test.error) #mean squared error for each of the 50 tree counts
## 100 200 300 400 500 600
## 0.14529026 0.11622908 0.09815924 0.08642057 0.07763141 0.07074450
#Plotting the test error vs number of trees
plot(n.trees , test.error , pch=19,col="blue",xlab="Number of Trees",ylab="Test Error", main = "Performance of Boosting on Test Set")
The plot above shows that if boosting is done properly, by selecting appropriate tuning parameters such as the shrinkage parameter (lambda), the interaction depth (number of splits), and the number of trees, it can generalize very well, converting many weak learners into a strong one. Boosting tends to outperform a single learner, which is prone to overfitting or underfitting, by building hundreds or thousands of small trees and combining them into a better, stronger model.
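If one wanted to select those tuning parameters by cross-validation rather than fixing them, caret’s train() can wrap gbm; the sketch below uses an illustrative grid (these values are assumptions, not the settings used above):
# Illustrative 5-fold CV grid search over gbm tuning parameters
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(n.trees = c(500, 1000, 1350),
                    interaction.depth = c(2, 4),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = 10)
set.seed(1412)
gbm_cv <- train(Suicide ~ ., data = training, method = "gbm",
                distribution = "gaussian", trControl = ctrl,
                tuneGrid = grid, verbose = FALSE)
gbm_cv$bestTune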
SVMs (support vector machines) are a supervised learning method that can be used for classification and regression analysis, though they are typically used for classification. The basic idea of SVM is to construct an optimal hyperplane that linearly separates the classes, using support vectors. Support vectors are the data points that are closest to the hyperplane; these are the points that are most difficult to classify. The goal of SVM is to find the hyperplane that maximizes the distance (margin) to the support vectors. SVM can also handle non-linear data by using kernel functions, which map the data points into a higher-dimensional space.
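To make the hyperplane, margin, and support-vector idea concrete, here is a tiny self-contained sketch on synthetic two-dimensional data (not the study data):
library(e1071)
set.seed(1)
# Two roughly separable point clouds in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
toy <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  y = factor(rep(c("A", "B"), each = 20)))
fit <- svm(y ~ ., data = toy, kernel = "linear", cost = 10, scale = TRUE)
fit$index   # row indices of the support vectors (points nearest the hyperplane)
plot(fit, toy)  # decision regions with support vectors marked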
For the purposes of this question, we want to use SVM to classify suicide attempts. To use SVM effectively, we need to reduce the number of dimensions in the data set, since SVM works best with fewer dimensions. We apply two dimension-reduction approaches: first, eliminating variables with unusually high correlations, and second, using PCA to create new variables that account for most of the variance.
We’re going to create a new data set and convert Suicide into a factor.
adhd_svm_data <- adhd
adhd_svm_data$Suicide<- factor(adhd_svm_data$Suicide)
Selecting only the numerical columns, a correlation plot is developed to identify variables with high correlations. Once they are identified, the total number of variables can be reduced. Isolating the numerical columns alone leaves 49 columns.
adhd_svm_data %>% select_if(is.numeric) %>% cor() -> svmCor
corrplot(svmCor, method = 'circle')
svmNumerical <- adhd_svm_data %>% dplyr::select( -Race, -Suicide, -Sex)
svmNumerical <- svmNumerical[,-c(nearZeroVar(svmNumerical))]
corr_data <- findCorrelation(svmCor, cutoff = 0.55)
svmNumerical2 <- svmNumerical[, -corr_data]
After applying a correlation cutoff of 0.55, 29 continuous variables remain. In total, we move forward with 33 variables for this SVM model.
svmNumerical2$Race <- adhd_svm_data$Race
svmNumerical2$Suicide <- adhd_svm_data$Suicide
svmNumerical2$Sex <- adhd_svm_data$Sex
set.seed(100)
#Create Data Partition
initsplit <- createDataPartition(svmNumerical2$Suicide, p=0.8, list=FALSE)
#Create Training Data to tune the model
training <- svmNumerical2[initsplit,]
#Create testing data to evaluate the model
test <- svmNumerical2[-initsplit,]
library(e1071)
svmfit_s = svm(Suicide ~ ., data = training, kernel = "sigmoid", cost = 10, scale = TRUE)
svmfit_r = svm(Suicide ~ ., data = training, kernel = "radial", cost = 10, scale = TRUE)
svmfit_l =svm(Suicide ~ ., data = training, kernel = "linear", cost = 10, scale = TRUE)
svmPre1<-predict(svmfit_s,test)
svm1<-confusionMatrix(svmPre1,test$Suicide)
svm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18 7
## 1 6 3
##
## Accuracy : 0.6176
## 95% CI : (0.4356, 0.7783)
## No Information Rate : 0.7059
## P-Value [Acc > NIR] : 0.9037
##
## Kappa : 0.0515
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7500
## Specificity : 0.3000
## Pos Pred Value : 0.7200
## Neg Pred Value : 0.3333
## Prevalence : 0.7059
## Detection Rate : 0.5294
## Detection Prevalence : 0.7353
## Balanced Accuracy : 0.5250
##
## 'Positive' Class : 0
##
svmPre2<-predict(svmfit_r,test)
svm2<-confusionMatrix(svmPre2,test$Suicide)
svm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 19 5
## 1 5 5
##
## Accuracy : 0.7059
## 95% CI : (0.5252, 0.849)
## No Information Rate : 0.7059
## P-Value [Acc > NIR] : 0.5843
##
## Kappa : 0.2917
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7917
## Specificity : 0.5000
## Pos Pred Value : 0.7917
## Neg Pred Value : 0.5000
## Prevalence : 0.7059
## Detection Rate : 0.5588
## Detection Prevalence : 0.7059
## Balanced Accuracy : 0.6458
##
## 'Positive' Class : 0
##
svmPre3<-predict(svmfit_l,test)
svm3<-confusionMatrix(svmPre3,test$Suicide)
svm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 15 8
## 1 9 2
##
## Accuracy : 0.5
## 95% CI : (0.3243, 0.6757)
## No Information Rate : 0.7059
## P-Value [Acc > NIR] : 0.9966
##
## Kappa : -0.17
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.6250
## Specificity : 0.2000
## Pos Pred Value : 0.6522
## Neg Pred Value : 0.1818
## Prevalence : 0.7059
## Detection Rate : 0.4412
## Detection Prevalence : 0.6765
## Balanced Accuracy : 0.4125
##
## 'Positive' Class : 0
##
Using the ‘jus_nums’ data set (all the variables are still numerical but with imputed values).
jus_nums2 <- jus_nums %>% dplyr::select(-MD.TOTAL, -ADHD.Total)
jus_nums2$Suicide <- as.factor(jus_nums2$Suicide)
PCA_svm <- prcomp(jus_nums2 %>%dplyr :: select(-Suicide), center = TRUE, scale. = TRUE)
summary(PCA_svm)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.479 2.16135 1.50769 1.38209 1.3592 1.30149 1.22940
## Proportion of Variance 0.247 0.09534 0.04639 0.03898 0.0377 0.03457 0.03085
## Cumulative Proportion 0.247 0.34235 0.38874 0.42772 0.4654 0.49999 0.53084
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.20081 1.11978 1.09678 1.0640 1.03988 1.00251 0.97682
## Proportion of Variance 0.02943 0.02559 0.02455 0.0231 0.02207 0.02051 0.01947
## Cumulative Proportion 0.56027 0.58586 0.61040 0.6335 0.65558 0.67609 0.69556
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.96617 0.95067 0.9154 0.89627 0.87005 0.84068 0.82144
## Proportion of Variance 0.01905 0.01844 0.0171 0.01639 0.01545 0.01442 0.01377
## Cumulative Proportion 0.71461 0.73306 0.7502 0.76655 0.78200 0.79642 0.81019
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.79734 0.7762 0.75049 0.73257 0.7207 0.69183 0.67952
## Proportion of Variance 0.01297 0.0123 0.01149 0.01095 0.0106 0.00977 0.00942
## Cumulative Proportion 0.82317 0.8355 0.84696 0.85791 0.8685 0.87828 0.88770
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.67792 0.64656 0.63508 0.61229 0.59429 0.58344 0.57641
## Proportion of Variance 0.00938 0.00853 0.00823 0.00765 0.00721 0.00695 0.00678
## Cumulative Proportion 0.89708 0.90561 0.91384 0.92149 0.92870 0.93565 0.94243
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.55147 0.54357 0.52278 0.51033 0.48642 0.4591 0.45420
## Proportion of Variance 0.00621 0.00603 0.00558 0.00531 0.00483 0.0043 0.00421
## Cumulative Proportion 0.94864 0.95467 0.96024 0.96556 0.97039 0.9747 0.97890
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.4260 0.42290 0.40593 0.39464 0.37146 0.33741 0.31823
## Proportion of Variance 0.0037 0.00365 0.00336 0.00318 0.00282 0.00232 0.00207
## Cumulative Proportion 0.9826 0.98625 0.98962 0.99279 0.99561 0.99793 1.00000
fviz_pca_biplot(PCA_svm, label="var",
habillage = jus_nums2$Suicide,
addEllipses = TRUE, palette = "jco")
We can see that the two most important components account for 34.78% of the variance. We could use 34-35 components, which cover a little over 95% of the variance, but the fear is overfitting the model. It is necessary to reduce the number of principal components before continuing to build our SVM model; the ‘elbow method’ can be used to choose the number of components.
fviz_eig(PCA_svm, addlabels = TRUE)
Looking at the scree plot, the optimal number of principal components is 3, since the rate of decrease slows down after the third component. Using the three principal components, the next step is to create a new data set with Suicide as the fourth column. After the new data set is created, we can split the data into training and testing sets. The new data set has 4 columns in total, a large reduction in dimensions from the original data set.
#Select the first 3 principal components and assign them to a new variable
pca_3 <- PCA_svm$x[,1:3]
#create new data set
pca_new <- cbind(as.data.frame(pca_3), Suicide = jus_nums2$Suicide)
set.seed(100)
#Create Data Partition
initsplit <- createDataPartition(pca_new$Suicide, p=0.8, list=FALSE)
#Create Training Data to tune the model
training2 <- pca_new[initsplit,]
#Create testing data to evaluate the model
test2 <- pca_new[-initsplit,]
We’re going to build several models using different kernels.
svmfit2 = svm(Suicide ~ ., data = training2, kernel = "radial", cost = 10, scale = FALSE)
svmfit3 = svm(Suicide ~ ., data = training2, kernel = "linear", cost = 10, scale = FALSE)
svmfit4 = svm(Suicide ~ ., data = training2, kernel = "sigmoid", cost = 10, scale = FALSE)
Radial Kernel Results
svmPre2<-predict(svmfit2,test2)
svm2<-confusionMatrix(svmPre2,test2$Suicide)
svm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 20 6
## 1 4 4
##
## Accuracy : 0.7059
## 95% CI : (0.5252, 0.849)
## No Information Rate : 0.7059
## P-Value [Acc > NIR] : 0.5843
##
## Kappa : 0.2478
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.8333
## Specificity : 0.4000
## Pos Pred Value : 0.7692
## Neg Pred Value : 0.5000
## Prevalence : 0.7059
## Detection Rate : 0.5882
## Detection Prevalence : 0.7647
## Balanced Accuracy : 0.6167
##
## 'Positive' Class : 0
##
Linear Kernel Results
svmPre3<-predict(svmfit3,test2)
svm3<-confusionMatrix(svmPre3,test2$Suicide)
svm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24 10
## 1 0 0
##
## Accuracy : 0.7059
## 95% CI : (0.5252, 0.849)
## No Information Rate : 0.7059
## P-Value [Acc > NIR] : 0.584308
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.004427
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7059
## Neg Pred Value : NaN
## Prevalence : 0.7059
## Detection Rate : 0.7059
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
Sigmoid Kernel Results
svmPre4<-predict(svmfit4,test2)
svm4<-confusionMatrix(svmPre4,test2$Suicide)
svm4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16 7
## 1 8 3
##
## Accuracy : 0.5588
## 95% CI : (0.3789, 0.7281)
## No Information Rate : 0.7059
## P-Value [Acc > NIR] : 0.9777
##
## Kappa : -0.0324
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.6667
## Specificity : 0.3000
## Pos Pred Value : 0.6957
## Neg Pred Value : 0.2727
## Prevalence : 0.7059
## Detection Rate : 0.4706
## Detection Prevalence : 0.6765
## Balanced Accuracy : 0.4833
##
## 'Positive' Class : 0
##
Interestingly, three of the six models result in the same level of accuracy (70.6%) on the testing data. The model using the radial kernel with correlation-based variable reduction was the best of that approach, while the radial and linear kernels returned similar results when PCA was used to trim the variables. It is also important to look at the sensitivity and specificity of the top models. The specificity of two of the top models was 50% and 40% respectively, while the third model had a specificity of 0%. This matters because of the topic and our end goal: detecting suicide risk in patients. One of the top models can predict suicide at the rate of a coin toss and the second at less than that, which is poor. I would rather go with the third model, which has no specificity; the model with 100% sensitivity can at least rule out non-suicidal people. In a real-world environment, it is better to be wrong about who you think is suicidal than to accidentally classify someone as non-suicidal who actually is suicidal. Also, the PCA models have much lower dimensionality, making them simpler models; in a professional setting, the downside of such models would be explaining them to a non-technical audience.
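One way to line the surviving models up side by side (a sketch pulling accuracy, sensitivity, and specificity from the confusionMatrix objects above; note that the radial and linear correlation-filtered results were overwritten by the PCA fits, so only these four remain):
# Collect headline metrics from the stored confusionMatrix objects
models <- list(`Sigmoid (corr-filtered)` = svm1,
               `Radial (PCA)` = svm2,
               `Linear (PCA)` = svm3,
               `Sigmoid (PCA)` = svm4)
comparison <- t(sapply(models, function(m)
  c(Accuracy = unname(m$overall["Accuracy"]),
    Sensitivity = unname(m$byClass["Sensitivity"]),
    Specificity = unname(m$byClass["Specificity"]))))
round(comparison, 3)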
Sources: