CUNY SPS DATA 622 - Machine Learning and Big Data

Overview

This analysis uses clustering, principal component analysis (PCA), and the support vector machine (SVM) algorithm to derive insights from a mental health dataset. It begins with exploratory data analysis (EDA) and pre-processing, followed by clustering and PCA. It finishes with SVM modeling of drug- and mental-health-related features to predict whether a patient attempted suicide.

All references and a technical appendix of all R code are available at the end of this report.

PROJECT SECTIONS

EDA

EDA seeks to understand the data and its nuances. It combines summary statistics with univariate and bivariate visualizations to summarize the dataset, its response, and the initial group of possible features. This information informs pre-processing of the dataset prior to modeling.

Understanding the Data

Sourced from an actual research project and deidentified, the dataset contains individual samples describing patients’ experiences with ADHD, mood disorders, substance use and misuse, and related mental health behaviors. It consists of 175 samples and 54 columns, including an identifier, Initial, to be dropped and a response, Suicide, to be modeled via SVM. The remaining features for modeling range from demographic information (e.g., Age) and various questionnaire responses (e.g., ADHD.Q1) to used/abused substances (e.g., THC) and medications (e.g., Psych.meds.). All columns in the dataset are listed and described below.

Mental Health Dataset Column Definition

Columns	Variable	Description
C	Sex	Male-1, Female-2
D	Race	White-1, African American-2, Hispanic-3, Asian-4, Native American-5, Other or missing data -6
E - W	ADHD self-report scale	Never-0, rarely-1, sometimes-2, often-3, very often-4
X - AM	Mood disorder questions	No-0, yes-1; question 3: no problem-0, minor-1, moderate-2, serious-3
AN - AS	Individual substances misuse	no use-0, use-1, abuse-2, dependence-3
AT	Court Order	No-0, Yes-1
AU	Education	
AV	History of Violence	No-0, Yes-1
AW	Disorderly Conduct	No-0, Yes-1
AX	Suicide attempt	No-0, Yes-1
AY	Abuse Hx	No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7
AZ	Non-substance-related Dx	0 - none; 1 - one; 2 - More than one
BA	Substance-related Dx	0 - none; 1 - one Substance-related; 2 - two; 3 - three or more
BB	Psychiatric Meds	0 - none; 1 - one psychotropic med; 2 - more than one psychotropic med

Basic Statistical Summary

## Data Frame Summary  
## df  
## Dimensions: 175 x 54  
## Duplicates: 0  
## 
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | No | Variable            | Stats / Values           | Freqs (% of Valid) | Valid    | Missing  |
## +====+=====================+==========================+====================+==========+==========+
## | 1  | Initial             | 1. DB                    |   5 ( 2.9%)        | 175      | 0        |
## |    | [character]         | 2. CM                    |   4 ( 2.3%)        | (100%)   | (0%)     |
## |    |                     | 3. DJ                    |   4 ( 2.3%)        |          |          |
## |    |                     | 4. JM                    |   4 ( 2.3%)        |          |          |
## |    |                     | 5. RD                    |   4 ( 2.3%)        |          |          |
## |    |                     | 6. SH                    |   4 ( 2.3%)        |          |          |
## |    |                     | 7. AH                    |   3 ( 1.7%)        |          |          |
## |    |                     | 8. DH                    |   3 ( 1.7%)        |          |          |
## |    |                     | 9. JL                    |   3 ( 1.7%)        |          |          |
## |    |                     | 10. LC                   |   3 ( 1.7%)        |          |          |
## |    |                     | [ 98 others ]            | 138 (78.9%)        |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 2  | Age                 | Mean (sd) : 39.5 (11.2)  | 42 distinct values | 175      | 0        |
## |    | [numeric]           | min < med < max:         |                    | (100%)   | (0%)     |
## |    |                     | 18 < 42 < 69             |                    |          |          |
## |    |                     | IQR (CV) : 18.5 (0.3)    |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 3  | Sex                 | Min  : 1                 | 1 : 99 (56.6%)     | 175      | 0        |
## |    | [numeric]           | Mean : 1.4               | 2 : 76 (43.4%)     | (100%)   | (0%)     |
## |    |                     | Max  : 2                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 4  | Race                | Mean (sd) : 1.6 (0.7)    | 1 :  72 (41.1%)    | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 2 : 100 (57.1%)    | (100%)   | (0%)     |
## |    |                     | 1 < 2 < 6                | 3 :   1 ( 0.6%)    |          |          |
## |    |                     | IQR (CV) : 1 (0.4)       | 6 :   2 ( 1.1%)    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 5  | ADHD.Q1             | Mean (sd) : 1.7 (1.3)    | 0 : 39 (22.3%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 43 (24.6%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 44 (25.1%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.8)       | 3 : 30 (17.1%)     |          |          |
## |    |                     |                          | 4 : 19 (10.9%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 6  | ADHD.Q2             | Mean (sd) : 1.9 (1.3)    | 0 : 25 (14.3%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 46 (26.3%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 47 (26.9%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.7)       | 3 : 33 (18.9%)     |          |          |
## |    |                     |                          | 4 : 24 (13.7%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 7  | ADHD.Q3             | Mean (sd) : 1.9 (1.3)    | 0 : 26 (14.9%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 46 (26.3%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 46 (26.3%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.7)       | 3 : 32 (18.3%)     |          |          |
## |    |                     |                          | 4 : 25 (14.3%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 8  | ADHD.Q4             | Mean (sd) : 2.1 (1.3)    | 0 : 27 (15.4%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 31 (17.7%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 50 (28.6%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.6)       | 3 : 31 (17.7%)     |          |          |
## |    |                     |                          | 4 : 36 (20.6%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 9  | ADHD.Q5             | Mean (sd) : 2.3 (1.4)    | 0 : 33 (18.9%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 21 (12.0%)     | (100%)   | (0%)     |
## |    |                     | 0 < 3 < 5                | 2 : 32 (18.3%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.6)       | 3 : 47 (26.9%)     |          |          |
## |    |                     |                          | 4 : 41 (23.4%)     |          |          |
## |    |                     |                          | 5 :  1 ( 0.6%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 10 | ADHD.Q6             | Mean (sd) : 1.9 (1.3)    | 0 : 36 (20.6%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 29 (16.6%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 45 (25.7%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.7)       | 3 : 45 (25.7%)     |          |          |
## |    |                     |                          | 4 : 20 (11.4%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 11 | ADHD.Q7             | Mean (sd) : 1.8 (1.2)    | 0 : 22 (12.6%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 53 (30.3%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 54 (30.9%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.6)       | 3 : 25 (14.3%)     |          |          |
## |    |                     |                          | 4 : 21 (12.0%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 12 | ADHD.Q8             | Mean (sd) : 2.1 (1.3)    | 0 : 21 (12.0%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 40 (22.9%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 40 (22.9%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.6)       | 3 : 42 (24.0%)     |          |          |
## |    |                     |                          | 4 : 32 (18.3%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 13 | ADHD.Q9             | Mean (sd) : 1.9 (1.3)    | 0 : 31 (17.7%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 43 (24.6%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 36 (20.6%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.7)       | 3 : 41 (23.4%)     |          |          |
## |    |                     |                          | 4 : 24 (13.7%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 14 | ADHD.Q10            | Mean (sd) : 2.1 (1.2)    | 0 : 15 ( 8.6%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 46 (26.3%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 49 (28.0%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.6)       | 3 : 33 (18.9%)     |          |          |
## |    |                     |                          | 4 : 32 (18.3%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 15 | ADHD.Q11            | Mean (sd) : 2.3 (1.2)    | 0 : 16 ( 9.1%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 33 (18.9%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 48 (27.4%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.5)       | 3 : 43 (24.6%)     |          |          |
## |    |                     |                          | 4 : 35 (20.0%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 16 | ADHD.Q12            | Mean (sd) : 1.3 (1.2)    | 0 : 55 (31.4%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 55 (31.4%)     | (100%)   | (0%)     |
## |    |                     | 0 < 1 < 4                | 2 : 37 (21.1%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.9)       | 3 : 15 ( 8.6%)     |          |          |
## |    |                     |                          | 4 : 13 ( 7.4%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 17 | ADHD.Q13            | Mean (sd) : 2.4 (1.2)    | 0 : 15 ( 8.6%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 29 (16.6%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 46 (26.3%)     |          |          |
## |    |                     | IQR (CV) : 1.5 (0.5)     | 3 : 47 (26.9%)     |          |          |
## |    |                     |                          | 4 : 38 (21.7%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 18 | ADHD.Q14            | Mean (sd) : 2.2 (1.3)    | 0 : 27 (15.4%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 24 (13.7%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 4                | 2 : 40 (22.9%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.6)       | 3 : 47 (26.9%)     |          |          |
## |    |                     |                          | 4 : 37 (21.1%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 19 | ADHD.Q15            | Mean (sd) : 1.6 (1.4)    | 0 : 50 (28.6%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 39 (22.3%)     | (100%)   | (0%)     |
## |    |                     | 0 < 1 < 4                | 2 : 35 (20.0%)     |          |          |
## |    |                     | IQR (CV) : 3 (0.9)       | 3 : 27 (15.4%)     |          |          |
## |    |                     |                          | 4 : 24 (13.7%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 20 | ADHD.Q16            | Mean (sd) : 1.7 (1.4)    | 0 : 40 (22.9%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 49 (28.0%)     | (100%)   | (0%)     |
## |    |                     | 0 < 1 < 4                | 2 : 39 (22.3%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.8)       | 3 : 17 ( 9.7%)     |          |          |
## |    |                     |                          | 4 : 30 (17.1%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 21 | ADHD.Q17            | Mean (sd) : 1.5 (1.3)    | 0 : 49 (28.0%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 41 (23.4%)     | (100%)   | (0%)     |
## |    |                     | 0 < 1 < 4                | 2 : 46 (26.3%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.8)       | 3 : 22 (12.6%)     |          |          |
## |    |                     |                          | 4 : 17 ( 9.7%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 22 | ADHD.Q18            | Mean (sd) : 1.5 (1.3)    | 0 : 49 (28.0%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 52 (29.7%)     | (100%)   | (0%)     |
## |    |                     | 0 < 1 < 4                | 2 : 35 (20.0%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.9)       | 3 : 20 (11.4%)     |          |          |
## |    |                     |                          | 4 : 19 (10.9%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 23 | ADHD.Total          | Mean (sd) : 34.3 (16.7)  | 62 distinct values | 175      | 0        |
## |    | [numeric]           | min < med < max:         |                    | (100%)   | (0%)     |
## |    |                     | 0 < 33 < 72              |                    |          |          |
## |    |                     | IQR (CV) : 26.5 (0.5)    |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 24 | MD.Q1a              | Min  : 0                 | 0 : 79 (45.1%)     | 175      | 0        |
## |    | [numeric]           | Mean : 0.5               | 1 : 96 (54.9%)     | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 25 | MD.Q1b              | Min  : 0                 | 0 :  75 (42.9%)    | 175      | 0        |
## |    | [numeric]           | Mean : 0.6               | 1 : 100 (57.1%)    | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 26 | MD.Q1c              | Min  : 0                 | 0 : 80 (45.7%)     | 175      | 0        |
## |    | [numeric]           | Mean : 0.5               | 1 : 95 (54.3%)     | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 27 | MD.Q1d              | Min  : 0                 | 0 :  73 (41.7%)    | 175      | 0        |
## |    | [numeric]           | Mean : 0.6               | 1 : 102 (58.3%)    | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 28 | MD.Q1e              | Min  : 0                 | 0 : 78 (44.6%)     | 175      | 0        |
## |    | [numeric]           | Mean : 0.6               | 1 : 97 (55.4%)     | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 29 | MD.Q1f              | Min  : 0                 | 0 :  53 (30.3%)    | 175      | 0        |
## |    | [numeric]           | Mean : 0.7               | 1 : 122 (69.7%)    | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 30 | MD.Q1g              | Min  : 0                 | 0 :  49 (28.0%)    | 175      | 0        |
## |    | [numeric]           | Mean : 0.7               | 1 : 126 (72.0%)    | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 31 | MD.Q1h              | Min  : 0                 | 0 : 77 (44.0%)     | 175      | 0        |
## |    | [numeric]           | Mean : 0.6               | 1 : 98 (56.0%)     | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 32 | MD.Q1i              | Min  : 0                 | 0 :  72 (41.1%)    | 175      | 0        |
## |    | [numeric]           | Mean : 0.6               | 1 : 103 (58.9%)    | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 33 | MD.Q1j              | Min  : 0                 | 0 : 107 (61.1%)    | 175      | 0        |
## |    | [numeric]           | Mean : 0.4               | 1 :  68 (38.9%)    | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 34 | MD.Q1k              | Min  : 0                 | 0 : 90 (51.4%)     | 175      | 0        |
## |    | [numeric]           | Mean : 0.5               | 1 : 85 (48.6%)     | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 35 | MD.Q1L              | Min  : 0                 | 0 :  73 (41.7%)    | 175      | 0        |
## |    | [numeric]           | Mean : 0.6               | 1 : 102 (58.3%)    | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 36 | MD.Q1m              | Min  : 0                 | 0 : 89 (50.9%)     | 175      | 0        |
## |    | [numeric]           | Mean : 0.5               | 1 : 86 (49.1%)     | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 37 | MD.Q2               | Min  : 0                 | 0 :  49 (28.0%)    | 175      | 0        |
## |    | [numeric]           | Mean : 0.7               | 1 : 126 (72.0%)    | (100%)   | (0%)     |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 38 | MD.Q3               | Mean (sd) : 2 (1.1)      | 0 : 25 (14.3%)     | 175      | 0        |
## |    | [numeric]           | min < med < max:         | 1 : 25 (14.3%)     | (100%)   | (0%)     |
## |    |                     | 0 < 2 < 3                | 2 : 49 (28.0%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.5)       | 3 : 76 (43.4%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 39 | MD.TOTAL            | Mean (sd) : 10 (4.8)     | 18 distinct values | 175      | 0        |
## |    | [numeric]           | min < med < max:         |                    | (100%)   | (0%)     |
## |    |                     | 0 < 11 < 17              |                    |          |          |
## |    |                     | IQR (CV) : 7.5 (0.5)     |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 40 | Alcohol             | Mean (sd) : 1.3 (1.4)    | 0 : 80 (46.8%)     | 171      | 4        |
## |    | [numeric]           | min < med < max:         | 1 : 18 (10.5%)     | (97.71%) | (2.29%)  |
## |    |                     | 0 < 1 < 3                | 2 :  7 ( 4.1%)     |          |          |
## |    |                     | IQR (CV) : 3 (1)         | 3 : 66 (38.6%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 41 | THC                 | Mean (sd) : 0.8 (1.3)    | 0 : 116 (67.8%)    | 171      | 4        |
## |    | [numeric]           | min < med < max:         | 1 :  12 ( 7.0%)    | (97.71%) | (2.29%)  |
## |    |                     | 0 < 0 < 3                | 2 :   3 ( 1.8%)    |          |          |
## |    |                     | IQR (CV) : 1.5 (1.6)     | 3 :  40 (23.4%)    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 42 | Cocaine             | Mean (sd) : 1.1 (1.4)    | 0 : 101 (59.1%)    | 171      | 4        |
## |    | [numeric]           | min < med < max:         | 1 :   9 ( 5.3%)    | (97.71%) | (2.29%)  |
## |    |                     | 0 < 0 < 3                | 2 :   5 ( 2.9%)    |          |          |
## |    |                     | IQR (CV) : 3 (1.3)       | 3 :  56 (32.8%)    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 43 | Stimulants          | Mean (sd) : 0.1 (0.5)    | 0 : 160 (93.6%)    | 171      | 4        |
## |    | [numeric]           | min < med < max:         | 1 :   6 ( 3.5%)    | (97.71%) | (2.29%)  |
## |    |                     | 0 < 0 < 3                | 3 :   5 ( 2.9%)    |          |          |
## |    |                     | IQR (CV) : 0 (4.3)       |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 44 | Sedative.hypnotics  | Mean (sd) : 0.1 (0.5)    | 0 : 161 (94.2%)    | 171      | 4        |
## |    | [numeric]           | min < med < max:         | 1 :   4 ( 2.3%)    | (97.71%) | (2.29%)  |
## |    |                     | 0 < 0 < 3                | 2 :   1 ( 0.6%)    |          |          |
## |    |                     | IQR (CV) : 0 (4.4)       | 3 :   5 ( 2.9%)    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 45 | Opioids             | Mean (sd) : 0.4 (1)      | 0 : 146 (85.4%)    | 171      | 4        |
## |    | [numeric]           | min < med < max:         | 1 :   4 ( 2.3%)    | (97.71%) | (2.29%)  |
## |    |                     | 0 < 0 < 3                | 3 :  21 (12.3%)    |          |          |
## |    |                     | IQR (CV) : 0 (2.5)       |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 46 | Court.order         | Min  : 0                 | 0 : 155 (91.2%)    | 170      | 5        |
## |    | [numeric]           | Mean : 0.1               | 1 :  15 ( 8.8%)    | (97.14%) | (2.86%)  |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 47 | Education           | Mean (sd) : 11.9 (2.2)   | 14 distinct values | 166      | 9        |
## |    | [numeric]           | min < med < max:         |                    | (94.86%) | (5.14%)  |
## |    |                     | 6 < 12 < 19              |                    |          |          |
## |    |                     | IQR (CV) : 2 (0.2)       |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 48 | Hx.of.Violence      | Min  : 0                 | 0 : 124 (75.6%)    | 164      | 11       |
## |    | [numeric]           | Mean : 0.2               | 1 :  40 (24.4%)    | (93.71%) | (6.29%)  |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 49 | Disorderly.Conduct  | Min  : 0                 | 0 :  45 (27.4%)    | 164      | 11       |
## |    | [numeric]           | Mean : 0.7               | 1 : 119 (72.6%)    | (93.71%) | (6.29%)  |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 50 | Suicide             | Min  : 0                 | 0 : 113 (69.8%)    | 162      | 13       |
## |    | [numeric]           | Mean : 0.3               | 1 :  49 (30.2%)    | (92.57%) | (7.43%)  |
## |    |                     | Max  : 1                 |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 51 | Abuse               | Mean (sd) : 1.3 (2.1)    | 0 : 101 (62.7%)    | 161      | 14       |
## |    | [numeric]           | min < med < max:         | 1 :   8 ( 5.0%)    | (92%)    | (8%)     |
## |    |                     | 0 < 0 < 7                | 2 :  20 (12.4%)    |          |          |
## |    |                     | IQR (CV) : 2 (1.6)       | 3 :   4 ( 2.5%)    |          |          |
## |    |                     |                          | 4 :   6 ( 3.7%)    |          |          |
## |    |                     |                          | 5 :  10 ( 6.2%)    |          |          |
## |    |                     |                          | 6 :   4 ( 2.5%)    |          |          |
## |    |                     |                          | 7 :   8 ( 5.0%)    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 52 | Non.subst.Dx        | Mean (sd) : 0.4 (0.7)    | 0 : 102 (66.7%)    | 153      | 22       |
## |    | [numeric]           | min < med < max:         | 1 :  35 (22.9%)    | (87.43%) | (12.57%) |
## |    |                     | 0 < 0 < 2                | 2 :  16 (10.5%)    |          |          |
## |    |                     | IQR (CV) : 1 (1.5)       |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 53 | Subst.Dx            | Mean (sd) : 1.1 (0.9)    | 0 : 42 (27.6%)     | 152      | 23       |
## |    | [numeric]           | min < med < max:         | 1 : 61 (40.1%)     | (86.86%) | (13.14%) |
## |    |                     | 0 < 1 < 3                | 2 : 35 (23.0%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.8)       | 3 : 14 ( 9.2%)     |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+
## | 54 | Psych.meds.         | Mean (sd) : 1 (0.8)      | 0 : 19 (33.3%)     | 57       | 118      |
## |    | [numeric]           | min < med < max:         | 1 : 21 (36.8%)     | (32.57%) | (67.43%) |
## |    |                     | 0 < 1 < 2                | 2 : 17 (29.8%)     |          |          |
## |    |                     | IQR (CV) : 2 (0.8)       |                    |          |          |
## +----+---------------------+--------------------------+--------------------+----------+----------+

Summary statistics reveal useful information to inform pre-processing and modeling. First, the demographic and questionnaire features are entirely complete, though missingness is present elsewhere. Notably, SVM response Suicide is approximately 93 percent complete, and Psych.meds. is only approximately 33 percent complete. Second, the feature distributions could require transformation prior to modeling. Some of the questionnaire features show clear skewness (e.g., ADHD.Q12) or imbalance (e.g., MD.Q1g), while many of the substance use/abuse features appear heavily imbalanced towards “no use”. Response Suicide is likewise imbalanced towards “no attempt”, which accounts for approximately 70 percent of its non-missing values. Third, there are many features, which makes the dataset a good candidate for the unsupervised methods to follow.
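
The completeness and balance figures discussed above can be reproduced with a few lines of base R. The sketch below is illustrative only: `toy` is a made-up stand-in for the real dataset (the actual file and loading code are not shown in this section), and only the column names are borrowed from the report.

```r
# Hypothetical toy data standing in for the real dataset; values are made up.
toy <- data.frame(
  Age         = c(25, 40, 33, 51, 47, 29),
  Suicide     = c(0, 1, NA, 0, 0, 1),
  Psych.meds. = c(NA, 1, NA, NA, 2, NA)
)

# Percentage of non-missing values in each column.
pct_complete <- round(100 * colMeans(!is.na(toy)), 1)
print(pct_complete)

# Class balance of the response among its non-missing values.
suicide_balance <- prop.table(table(toy$Suicide))
print(round(suicide_balance, 3))
```

Applied to the full data frame, the same two calls would give the per-column completeness and the share of “no attempt” versus “attempt” in Suicide.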

Beyond the to-be-dropped Initial, all dataset features are loaded in numeric format, though many of them are categorical if not ordinal in nature. These features thus require conversion to factors.

EDA continues with sets of cross-tabulations between SVM modeling response Suicide and the ADHD and MD questionnaire groups, respectively. Suicide is coded as “1” if an individual has attempted suicide and “0” if not. The ADHD features are coded on a five-point Likert-type scale from never (“0”) through very often (“4”). The MD features are coded Yes/No (“1”/“0”), with the exception of MD.Q3, which is coded from no problem (“0”) through serious (“3”).

For most of the ADHD features, individuals having attempted suicide (Suicide = “1”) skew towards the higher end of the scale (e.g., towards often or very often) relative to their non-attempting counterparts. Minimal information is available on the questionnaire itself, but presumably, these features describe how frequently an individual exhibits or partakes in certain behaviors indicative of ADHD. Later analysis may reveal specific question features to be particularly important.

Like the ADHD features, the MD question features display clear differences between classes of Suicide. Individuals having attempted suicide (Suicide = “1”) tend towards question answers of yes (“1”) in often substantially greater proportions than non-attempters. The only exception is MD.Q1c, where both classes of Suicide show yes proportions of approximately 0.55.

Categorical Features

Next comes a focus on Suicide and its relationships with the categorical demographic and education features as well as the categorical Abuse.

Sex: For this dataset, the proportion of females having attempted suicide (38.9%) is roughly 1.7 times the same proportion among males (23.3%).

Race: For this dataset, approximately half of individuals indicating race/ethnicity as “Other/Missing” have attempted suicide (~50.0%). That proportion compares with approximately 41.2% among “White” individuals and approximately 22.0% among “Black” individuals. Zero “Hispanic” individuals in this dataset have attempted suicide.

Education: For this dataset, zero individuals with more than fifteen years of formal education have attempted suicide. Level “15”, with a proportion having attempted suicide of approximately 100%, is an outlier relative to lower levels of education, which all show proportions of approximately 50% or below.

Abuse: For this dataset, individuals having been sexually and/or emotionally abused are particularly likely to have attempted suicide. Approximately 75% of individuals in each of the “Emotional (E)” and “Physical & Sexual & Emotional” categories have attempted suicide. These proportions compare to roughly one in five individuals in each of the “No” and “Physical & Emotional” categories.

Feature Correlation

The next group of features for analysis are the numerics–Age, ADHD.Total, and MD.TOTAL–each of which is related to Suicide using distribution plots and correlations. For clarity, the plots below exclude observations missing a value for Suicide; these were included in an initial run of plots, but their relationships did not appear to differ in nature from those of the other levels.

Having attempted suicide (Suicide = “1”) may be related to these numeric features in this dataset. There is a clear difference between the class distributions of MD.TOTAL, with the distribution for Suicide = “1” shifted higher than its counterpart for Suicide = “0”. The former’s median is roughly equal to the 75th percentile for the latter. Regarding collinearity, overall correlations are essentially non-existent between Age and each of ADHD.Total and MD.TOTAL, though the class-specific correlation values for Suicide = “1” are roughly twice the magnitude of those for Suicide = “0”. The overall correlation between ADHD.Total and MD.TOTAL is approximately 0.482, suggesting the two features move in somewhat similar directions. Here again, there is a difference by class of Suicide: the value among individuals not having attempted suicide is approximately 0.562 compared with approximately 0.41 among individuals who have attempted it.

There could be differences in proportion between levels of Suicide relative to Court.order, Hx.of.Violence, and Disorderly.Conduct. Both Court.order and Hx.of.Violence skew towards “No”–that is, no court order and no history of violence–but individuals having attempted suicide (Suicide = “1”) appear to show “Yes” in greater proportions than their non-attempting counterparts. By contrast, most individuals in the dataset have some history of disorderly conduct (Disorderly.Conduct = “1”), and the proportions appear relatively similar between classes of Suicide.

Distributions for Non.subst.Dx, Subst.Dx, and Psych.meds. are relatively similar regardless of class of Suicide. Most individuals in the dataset show zero use of non-substance-related drugs, with fewer individuals at greater levels of use. Use of a single substance-related drug (Subst.Dx = “1”) is most prevalent in the dataset, followed by zero use and use of two. As noted previously, Psych.meds. is primarily missing, and in general, individuals missing values for Non.subst.Dx and Subst.Dx are missing a value for Psych.meds..

Regarding the substance-misuse features, individuals in this dataset typically either show no use (* = “0”) or dependence (* = “3”). Use (* = “1”) and abuse (* = “2”) are most common for Alcohol, THC, and Cocaine. Differences in proportions between classes of Suicide are most notable within the dependence group for Alcohol.

Reliability of Questions

Likert-type scales are generally used to quantify aspects of an individual’s environment and behavior that are not directly measurable. Several items in a questionnaire try to assess the same condition, so the answers should possess some level of internal consistency. For example, if a survey on alcoholism is given to a random individual who answers “never” to a question on alcohol usage, then the answer to a question on the number of alcoholic drinks consumed should be never/none/etc.; any other response would make the score unreliable. Thus, reliability estimates for the overall scale provide better insight into the data, whereas single-question reliabilities are generally very low.

Cronbach’s alpha is a test reliability technique that requires only a single test administration to provide a unique estimate of the reliability for a given questionnaire (Gravetter, et al, 2013). Cronbach’s alpha is the average value of the reliability coefficients one would obtain for all possible combinations of items when split into two half-tests:

\[\alpha = \frac{N*\bar c}{\bar v + (N-1)\bar c}\],

where \(N\) is the number of items, \(\bar c\) is the average inter-item covariance, and \(\bar v\) is the average item variance. This typically gives a value from 0 to 1. If Cronbach’s alpha is \(< 0.7\), the questions are generally not considered internally consistent and may not capture the concept they are supposed to.
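To make the formula concrete, the following is a minimal sketch in Python (rather than the R used for the report itself) that computes \(\alpha\) directly from \(N\), \(\bar c\), and \(\bar v\). The function name `cronbach_alpha` and the toy scores are illustrative assumptions, not the report's code or data.

```python
from statistics import mean

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, all of equal length."""
    n_items = len(items)
    n_obs = len(items[0])
    means = [mean(col) for col in items]

    def cov(i, j):
        # Sample covariance (n - 1 denominator) between items i and j.
        return sum(
            (items[i][k] - means[i]) * (items[j][k] - means[j])
            for k in range(n_obs)
        ) / (n_obs - 1)

    # v_bar: average item variance; c_bar: average covariance over distinct item pairs.
    v_bar = mean(cov(i, i) for i in range(n_items))
    c_bar = mean(
        cov(i, j) for i in range(n_items) for j in range(n_items) if i != j
    )
    return (n_items * c_bar) / (v_bar + (n_items - 1) * c_bar)

# Hypothetical responses of four people to three items (not from the dataset).
toy_items = [[1, 2, 3, 4], [2, 1, 4, 3], [1, 3, 2, 4]]
print(round(cronbach_alpha(toy_items), 3))
```

When every item carries identical scores, \(\bar c = \bar v\) and the expression reduces to 1, the upper end of the usual range.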

This is the most frequently used measure of reliability; however, it assumes that scale items are repeated measurements, and this analysis keeps that assumption. Guttman’s Lambda 6 (G6) is another measure, one that evaluates the reliability of individual items. It thus provides information about how well individual questions reflect the concept being measured.

The reliability analysis for the ADHD questions shows a Cronbach’s alpha of 0.94 with 95% confidence boundaries of (0.93, 0.96). Discarding any single item would not increase the reliability, suggesting that all the items should be kept. Looking at the individual items, G6 is also \(\ge 0.7\) throughout, suggesting that each question provides insight on the concept being assessed to an acceptable level. The individual item correlations are also positive and high.

##       Items alpha    G6   cor
## 1   ADHD.Q1 0.941 0.953 0.683
## 2   ADHD.Q2 0.941 0.952 0.694
## 3   ADHD.Q3 0.942 0.953 0.654
## 4   ADHD.Q4 0.941 0.953 0.719
## 5   ADHD.Q5 0.941 0.952 0.715
## 6   ADHD.Q6 0.942 0.954 0.628
## 7   ADHD.Q7 0.941 0.952 0.722
## 8   ADHD.Q8 0.939 0.951 0.792
## 9   ADHD.Q9 0.939 0.952 0.782
## 10 ADHD.Q10 0.939 0.952 0.778
## 11 ADHD.Q11 0.941 0.953  0.72
## 12 ADHD.Q12 0.941 0.954 0.666
## 13 ADHD.Q13  0.94 0.951 0.754
## 14 ADHD.Q14 0.941 0.953 0.701
## 15 ADHD.Q15 0.943 0.955 0.616
## 16 ADHD.Q16 0.942 0.952 0.665
## 17 ADHD.Q17 0.942 0.954 0.636
## 18 ADHD.Q18 0.941 0.953 0.698

As for the reliability analysis for the MD questions, the Cronbach’s alpha is 0.86 with 95% confidence boundaries of (0.83, 0.89). Although this is relatively high, the results suggest that an alpha of 0.88 can be achieved by removing MD.Q1c and MD.Q1k, or MD.Q3 only. Looking at the individual items, G6 is also \(\ge 0.7\) throughout, suggesting that each question provides insight on the concept being assessed to an acceptable level. Lastly, the individual item correlations are moderate, and notably lower for the items whose removal can improve the reliability.

##     Items alpha    G6   cor
## 1  MD.Q1a 0.848 0.888 0.655
## 2  MD.Q1b  0.85  0.89 0.602
## 3  MD.Q1c  0.86 0.897  0.42
## 4  MD.Q1d 0.855 0.895  0.52
## 5  MD.Q1e 0.851  0.89 0.612
## 6  MD.Q1f  0.85  0.89 0.621
## 7  MD.Q1g  0.85 0.889 0.642
## 8  MD.Q1h 0.853 0.887 0.609
## 9  MD.Q1i 0.853 0.888 0.575
## 10 MD.Q1j 0.853 0.891 0.573
## 11 MD.Q1k 0.858 0.893 0.484
## 12 MD.Q1L 0.848 0.885 0.681
## 13 MD.Q1m 0.853 0.893  0.53
## 14  MD.Q2 0.847 0.886 0.703
## 15  MD.Q3 0.877 0.894 0.471

Factor Analysis

Typically in behavioral science studies where a test component is a questionnaire, individual items may represent a common, underlying factor. To understand this, factor analysis is a method that allows us to find commonalities in data. This method is particularly useful when dealing with many variables. Unlike PCA, which is a linear combination of variables, factor analysis is a measurement model of hidden latent variables that affect several variables at once.

The reliability test has already highlighted that the ADHD questions are strongly correlated; therefore, an oblique rotation, namely promax rotation, is used to search for a clearer association between individual factors and the various variables. The test of the hypothesis that 3 factors are sufficient gives a chi-square statistic of 197.3 on 102 degrees of freedom, p-value \(\le 0.05\). Strictly speaking, a p-value this small is evidence against three factors fully reproducing the correlations, so the three-factor solution is best read as a working approximation. Moreover, the factor loadings are sufficiently high, and from the plot of the results, the ADHD questions can be grouped into 3 sets, each reflecting the same underlying factor. These items can be summed into a new factor item to help reduce the dimensions of the data set.

Looking at the MD questions, the data labeling suggests that there are 3 questions, with question 1 having multiple follow-up questions. The reliability test highlighted that the variables are moderately correlated, so the promax rotation is used. The test of the hypothesis that 3 factors are sufficient gives a chi-square statistic of 88.82 on 63 degrees of freedom, p-value \(\le 0.05\); a p-value this small is formally evidence against sufficiency, so the three-factor solution is treated as a working approximation. For most items, the factor loadings are sufficiently high, and from the plot of the results, the MD questions can be grouped into 3 sets, each reflecting the same underlying factor. It should be noted that MD.Q1L and MD.Q1m were not strongly correlated with the other questions that aim to extract specific information about the respondents. In such cases, it may be necessary to remove these questions from the survey.

Observation-specific factor scores are calculated for each of the six found factors–three for the ADHD features and three for the MD features–using Thomson’s regression method. These scores are then combined with the larger data set, replacing the specific question features.

Pre-processing of Data

Missing Data Imputation

Data pre-processing starts with addressing missingness. Several features in the dataset, including Suicide, are missing values. The most notable of this subset, by a wide margin, is Psych.meds., which is missing values for approximately 67.4% of all observations; it will be dropped. Next up are Subst.Dx at approximately 13.1% missingness and Non.subst.Dx at approximately 12.6% missingness, with several more features falling between roughly 5.0 and 10.0%. Regarding patterns of missingness across the subset of features displaying missingness, the vast majority of observations are missing either no values or a value for Psych.meds. only. There are also groups of observations missing values across the entire subset, across all but one feature, or for related features.

## 
##  Variables sorted by number of missings: 
##            Variable      Count
##         Psych.meds. 0.67428571
##            Subst.Dx 0.13142857
##        Non.subst.Dx 0.12571429
##               Abuse 0.08000000
##             Suicide 0.07428571
##      Hx.of.Violence 0.06285714
##  Disorderly.Conduct 0.06285714
##           Education 0.05142857
##         Court.order 0.02857143
##             Alcohol 0.02285714
##                 THC 0.02285714
##             Cocaine 0.02285714
##          Stimulants 0.02285714
##  Sedative.hypnotics 0.02285714
##             Opioids 0.02285714

Imputing meaning for missing values–meaning where it may not exist–can be problematic, particularly with limited domain expertise. The patterns noted above suggest that, in this dataset, the data are missing at random (MAR)–not missing completely at random (MCAR)–given missingness in a particular feature may relate to the values in another feature. There is insufficient information about the dataset to support an assumption of MCAR.

This analysis assumes MAR and employs the multivariate imputation by chained equations (MICE) method to perform multiple imputation for each missing value. MICE can account for the different types of data present in the dataset. Here, MICE imputes using logistic regression for the binary factors, proportional odds models for the ordered factors, and multinomial regression for the non-binary and non-ordered factor (Abuse). Then, for each missing value, the most common imputation estimate from the five MICE imputation runs is imputed. And finally, regardless of imputation, the substantially missing Psych.meds. is dropped.
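The final pooling step, taking the most common estimate across the five imputation runs, can be sketched as follows. This is an illustrative Python snippet rather than the report's R code; the function name and the example values are hypothetical.

```python
from collections import Counter

def most_common_imputation(imputed_values):
    """Return the value imputed most often across runs.

    Ties are resolved in favor of the value that appears first.
    """
    return Counter(imputed_values).most_common(1)[0][0]

# Hypothetical estimates for one missing categorical entry across five runs.
five_runs = [0, 2, 0, 5, 0]
print(most_common_imputation(five_runs))
```

A majority vote of this kind suits the categorical features described above; for a continuous feature, a mean or median across runs would be the analogous choice.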

Feature Transformation

Feature transformation begins by assessing the numeric features for possible power transformation. Below are skewness statistics for each feature, with negative values reflecting left skewness and positive values reflecting right skewness. Larger values are associated with greater levels of skewness. None of these features show skew large enough to warrant power transformation. These will, however, undergo centering and scaling to facilitate clustering, PCA, and SVM modeling.

Feature Skewness
	Skewness
Age	-0.1040228
ADHD.Total	0.0533291
MD.TOTAL	-0.4793695
ADHD_f1	0.2807466
ADHD_f2	0.1011600
ADHD_f3	-0.0179774
MD_f1	-0.2185621
MD_f2	-0.2442031
MD_f3	-0.6081189
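For reference, a skewness statistic of this kind can be computed from the second and third central moments. The sketch below uses the simple moment-based estimator in Python; the exact estimator behind the table above is not stated in this section, so small numerical differences from the R output would be expected. The sample values are made up.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((v - center) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - center) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical samples: symmetric, right-skewed, and left-skewed.
print(skewness([1, 2, 3, 4, 5]))
print(skewness([1, 1, 1, 10]))
print(skewness([1, 10, 10, 10]))
```

A symmetric sample gives zero, a long right tail gives a positive value, and a long left tail gives a negative value, matching the sign convention described above.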

Most of the dataset’s features are stored as factors and must be addressed before moving forward. The non-ordered categorical features (e.g., Sex and Race) are converted to sets of dummy variables, one dummy for each categorical level. By contrast, the ordered factors (e.g., Alcohol and Education) are converted to sets of polynomial scores. These scores capture the possible effects–linear, quadratic, cubic, etc.–that can be fit using the ordinal information in the original factor. Following factor conversion, the resulting set of features undergoes centering and scaling.

Clustering of Patients

Clustering is the partitioning of data into groups. This can be done in a number of ways, the two most popular being K-means and hierarchical clustering. In terms of a data frame, a clustering algorithm finds out which rows are similar to each other. Rows that are grouped together have high similarity to each other and low similarity to rows outside the grouping.

Clustering - Dataset

To perform a cluster analysis in R, the data should be prepared so that rows are observations and columns are variables. Any missing value in the data must be removed. The variables must also be scaled to make them comparable, i.e., transformed to have a mean of zero and a standard deviation of one.
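The scaling step amounts to a z-score transformation of each column. Below is a minimal Python sketch of that transformation (the report itself works in R); the helper name and the example ages are assumptions for illustration only.

```python
from statistics import mean, stdev

def standardize(column):
    """Center a column to mean 0 and scale it to standard deviation 1."""
    mu = mean(column)
    sd = stdev(column)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sd for v in column]

# Hypothetical ages, not taken from the dataset.
ages = [25, 39, 41, 60]
scaled = standardize(ages)
print([round(z, 2) for z in scaled])
```

After this step, a one-unit difference means "one standard deviation" for every variable, so no single variable dominates the distance calculations simply because of its units.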

K-means Clustering

One of the more popular algorithms for clustering is K-means. It divides the observations into a predefined number of discrete groups based on some distance metric. K-means clustering consists of defining clusters so that the total intra-cluster variation is minimized. Classifying observations into groups requires a method for computing the distance between each pair of observations; the result of this computation is known as a distance matrix. The standard algorithm proceeds as follows:

1. Specify the number of clusters (K) to be used.
2. Randomly select k objects from the data set as the initial cluster centers or means.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in the kth cluster; p is the number of variables.
5. Iteratively minimize the total within-cluster sum of squares: repeat steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached.
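The steps above can be sketched in a few lines of plain Python. This is an illustrative implementation of the basic assign-and-update loop, not the algorithm variant that R's kmeans function uses by default; the `init` parameter and the toy points are assumptions added for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, init=None, max_iter=100, seed=0):
    # Steps 1-2: k is given; pick k observations at random as starting centers,
    # unless explicit starting centers are passed via the hypothetical `init`.
    centers = list(init) if init is not None else random.Random(seed).sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign every observation to its closest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # Step 5: stop once nothing changes.
            break
        assignment = new_assignment
        # Step 4: move each center to the mean of its assigned observations.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    tot_withinss = sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))
    return assignment, centers, tot_withinss

# Hypothetical two-dimensional observations forming two obvious groups.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, tot_withinss = kmeans(toy, k=2)
print(labels, centers, round(tot_withinss, 3))
```

Because the result depends on the random starting centers, it is common to repeat the procedure from several starts and keep the run with the smallest total within-cluster sum of squares.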

First, the distance matrix of the observations is created.

Next, the data are grouped into two clusters (centers = 2). The kmeans function also has an nstart option that attempts multiple initial configurations and reports on the best one. For example, adding nstart = 25 will generate 25 initial configurations.

## List of 9
##  $ cluster     : int [1:55] 1 1 2 1 1 1 1 2 1 1 ...
##  $ centers     : num [1:2, 1:53] -0.0812 0.1218 0.0242 -0.0363 0.0538 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:53] "Age" "Sex" "Race" "ADHD.Q1" ...
##  $ totss       : num 2862
##  $ withinss    : num [1:2] 1414 1002
##  $ tot.withinss: num 2416
##  $ betweenss   : num 446
##  $ size        : int [1:2] 33 22
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

The components of the kmeans output are as follows:

cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centers.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
betweenss: The between-cluster sum of squares, i.e. \(totss-tot.withinss\).
size: The number of points in each cluster.
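As a quick check, these components satisfy a simple identity, which the two-cluster output shown earlier obeys:

\[\text{totss} = \text{tot.withinss} + \text{betweenss}: \qquad 2862 = 2416 + 446, \qquad \text{tot.withinss} = 1414 + 1002 = 2416.\]

The ratio \(\text{betweenss}/\text{totss} = 446/2862 \approx 0.156\) is one rough gauge of separation: in that run, about 16% of the total variation lies between the two clusters.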

A scatter plot is used to compare suicide attempts against age. There does not seem to be any relationship between age and suicide attempts. However, there seem to be more suicide attempts by patients in cluster 2.

The kmeans function is then run with 3, 4, and 5 clusters to see how the results look. However, this does not identify the optimal number of clusters; it is strictly for observation.

There are many methods for finding the optimal number of clusters for k-means clustering. The silhouette method is used here. The average silhouette method measures the quality of a clustering by determining how well each observation lies within its cluster. A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
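To illustrate what the average silhouette width measures, here is a small Python sketch (not the report's R code). For each observation it compares the mean distance to its own cluster, a, with the mean distance to the nearest other cluster, b, and scores it as (b - a) / max(a, b). The function name, the zero score for single-member clusters, and the toy points are assumptions for the example.

```python
from math import dist
from statistics import mean

def average_silhouette(points, labels):
    """Average silhouette width of a clustering (requires at least two clusters)."""
    widths = []
    for i, p in enumerate(points):
        own = [
            dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:
            widths.append(0.0)  # convention assumed here for single-member clusters
            continue
        a = mean(own)  # cohesion: mean distance within the observation's own cluster
        b = min(       # separation: mean distance to the nearest other cluster
            mean(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            for other in set(labels)
            if other != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return mean(widths)

# Hypothetical points: the first labeling matches the two groups, the second does not.
toy = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(average_silhouette(toy, [0, 0, 1, 1]))
print(average_silhouette(toy, [0, 1, 0, 1]))
```

Choosing k then means computing this average for each candidate clustering and keeping the k with the largest value.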

The optimal number of clusters appears to be 2.

The final clustering is updated with the optimal number of clusters chosen by the average silhouette method.

The final clusters are plotted.

The means of each variable are observed for each of the two clusters.

Cluster	Age	Sex	Race	ADHD.Q1	ADHD.Q2	ADHD.Q3	ADHD.Q4	ADHD.Q5	ADHD.Q6	ADHD.Q7	ADHD.Q8	ADHD.Q9	ADHD.Q10	ADHD.Q11	ADHD.Q12	ADHD.Q13	ADHD.Q14	ADHD.Q15	ADHD.Q16	ADHD.Q17	ADHD.Q18	ADHD.Total	MD.Q1a	MD.Q1b	MD.Q1c	MD.Q1d	MD.Q1e	MD.Q1f	MD.Q1g	MD.Q1h	MD.Q1i	MD.Q1j	MD.Q1k	MD.Q1L	MD.Q1m	MD.Q2	MD.Q3	MD.TOTAL	Alcohol	THC	Cocaine	Stimulants	Sedative.hypnotics	Opioids	Court.order	Education	Hx.of.Violence	Disorderly.Conduct	Suicide	Abuse	Non.subst.Dx	Subst.Dx	Psych.meds.
1	39.52	1.58	1.61	2.36	2.67	2.55	2.97	3.39	2.82	2.76	3.30	2.88	3.24	3.12	2.15	3.33	3.36	2.45	2.67	2.33	2.52	50.88	0.70	0.94	0.48	0.79	0.79	0.88	0.97	0.70	0.64	0.52	0.58	0.70	0.58	1.00	2.61	12.85	1.55	0.70	0.67	0.15	0.06	0.21	0.12	11.79	0.30	0.64	0.45	1.88	1.09	0.48	0.88
2	41.73	1.55	1.50	1.73	1.77	1.59	2.36	2.23	1.82	1.50	1.95	1.59	1.82	2.45	0.68	2.27	2.18	1.27	1.27	0.73	1.23	30.45	0.18	0.27	0.36	0.18	0.36	0.64	0.64	0.32	0.45	0.27	0.18	0.27	0.27	0.55	1.73	6.68	0.91	0.55	0.59	0.00	0.00	0.41	0.05	13.77	0.23	0.45	0.09	1.14	1.09	0.50	1.09

Clustering Conclusion

By exploring the different clusterings of the data set, it can be established that:

Cluster 1 has a mean age of 39.5 compared to a mean age of 41.7 for Cluster 2.
For the ADHD questions, Cluster 1 patients typically answered between sometimes and often, while Cluster 2 patients typically answered between rarely and sometimes. ADHD.Total is higher for Cluster 1 than Cluster 2.
For the mood questions, more Cluster 1 patients than Cluster 2 patients answered yes to the mood disorder questions. For mood question 3, Cluster 1 patients answered between moderate and serious, while Cluster 2 patients answered between minor and moderate. MD.TOTAL is higher in Cluster 1 than Cluster 2.
Patients in Cluster 1 show higher average use of Alcohol, THC, Cocaine, Stimulants, and Sedative.hypnotics than patients in Cluster 2. However, average opioid use in Cluster 2 is roughly double that in Cluster 1.
Cluster 1 patients are about twice as likely to have a court order as Cluster 2 patients.
Cluster 1 patients average about 11.8 years of education (grade-school level), while Cluster 2 patients average about 13.8 years (some college).
Cluster 1 shows about five times the rate of suicide attempts of Cluster 2.

PCA

Apart from the factor analysis, which can assist in the dimensional reduction of the dataset, another technique is Principal Component Analysis. PCA is a method that summarizes information in datasets which contain correlated numerical variables. Each feature is considered a dimension, and the goal of PCA is to reduce dimensionality while preserving the information contained in the original dataset. The resulting features (called principal components) are linear combinations of the original features and contain the maximum variation (information) in the dataset.

PCA is valuable because it can: (1) identify correlated variables and outliers, (2) discover hidden patterns and trends and (3) remove redundancy and noise.

Traditional PCA works with continuous, numerical features and creates a Pearson correlation matrix behind the scenes. Pearson correlations are best suited to variables that are approximately normally distributed; in other words, variables that are continuous, quantitative, symmetric, and bell shaped. Although PCA can technically be applied to the raw data here and produce an output, most of the features are dichotomous (i.e., binary) or ordinal, so that output would not hold much meaning.

The alternative is to conduct PCA on the polychoric correlation matrix. These correlations assume that the variables are ordered measures of an underlying continuum, so the variables need not be continuous or symmetric. Like a Pearson correlation, a polychoric correlation ranges from -1 to 1, and its magnitude and sign measure the strength and direction of the relationship.

PCA - Dataset

Broadly speaking, the base dataset contains 3 main categories of features (demographics, questionnaire responses, and drug use) and a few additional miscellaneous features. Off the bat, the Psych.meds variable is eliminated because it has more than 50% of its values missing. Demographic information (Age, Sex, and Race) is also left out of the PCA in order to preserve the information contained in these variables.

The approach for this analysis is to identify the optimal reductions by first performing PCA on the 2 questionnaire sets and drug variables independently and then on a combination of these categories and the remaining miscellaneous categories together. The independent PCA of questionnaire answers & drug variables assumes that the features within these categories are correlated, but that the categories are independent of one another. The results of these analyses are compared to the final results with all variables included together.

PCA only works with non-null values, so any records that include missing values are dropped. Additionally, since the data dictionary indicates that ADHD questions should be on a scale of 0-4, records in the dataset labeled ‘5’ are eliminated, as this is likely a data-entry error.

The base dataset for the PCA includes 142 rows and 45 total features.

ADHD Questions

First, for the 18 unlabeled self-reported ADHD questions, the polychoric correlation matrix is passed into the principal function to calculate the principal components. The components are arranged in descending order of contribution to variance, with RC1 explaining the largest share of variance and RC3 the smallest. Each component is a linear combination of the variables, and the resulting eigenvectors hold the weights by which the features are multiplied to calculate the component score for an observation. The resulting eigenvalues are the variances of the components. These can be visualized as percentages in a scree plot, as shown below.

The results show that 65% of the variance is explained by principal components 1 and 2 alone. Since an SVM model is fitted to classify the Suicide variable, an investigation of any patterns between this variable and the first two principal components is carried out. By using the loadings matrix to calculate the final PCA scores for each observation in the dataset, the plot below can be used to identify any meaningful relationships with the Suicide variable. It is clear that there is still quite a bit of overlap between the two classes.

Next, using a 90% cutoff, the total number of principal components to which the data can be reduced is determined. In other words, the smallest number of components that together account for 90% of the variability in the data is accepted. This threshold is identified by looking at a cumulative plot of the variances.
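The cutoff rule can be expressed as a short calculation on the eigenvalues. The Python sketch below (the report's own analysis is in R) returns the smallest number of components whose cumulative share of variance reaches a threshold; the function name and the eigenvalues are hypothetical and are not the values from this analysis.

```python
def components_for_threshold(eigenvalues, threshold=0.90):
    """Smallest number of components whose cumulative variance share meets the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    # Components are taken in descending order of variance explained.
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(eigenvalues)

# Hypothetical eigenvalues: cumulative shares are 50%, 70%, 85%, 95%, 100%.
toy_eigenvalues = [5.0, 2.0, 1.5, 1.0, 0.5]
print(components_for_threshold(toy_eigenvalues))
```

Reading the same number off a cumulative variance plot corresponds to finding where the curve first crosses the 90% line.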

The final analysis shows that the number of ADHD features can be reduced to 9 principal components which retain 90% of the variability in the data.

Mood Disorder Questions

Taking a similar approach with the Mood Disorder questions, the polychoric correlation matrix is calculated and passed to the principal function. The scree plot highlights that 66% of the variance is explained by the first two principal components.

Once again, take a look at these components with respect to the Suicide variable. There is still quite a bit of overlap, nonetheless there is some distinction between the two classes (lower PC1 is more indicative of non-Suicide individuals).

With the 90% variability cutoff, the final analysis shows that the number of Mood Disorder features can be reduced to 7 principal components.

Drug Features

Next, the drug features subset of data will include the following variables: Alcohol, THC, Cocaine, Stimulants, Sedative.hypnotics, Opioids, and Subst.Dx.

The principal components derived from the drug features are a bit more balanced than the other two sets of features. The first and second components account for only 34% and 24% (respectively) of the variability in the dataset. Nonetheless, they are visualized against the Suicide variable to see if any patterns can be identified.

The 0 class is embedded almost completely in the 1 class, so the first two components do not give as much insight as the other two categories of features.

To meet the 90% threshold, 5 principal components are needed. Since there are only 7 drug features, this reduction might not be worth pursuing.

All Numeric Features

Lastly, PCA is used on all of the features combined. This approach assumes dependence of features across categories (questionnaires, drugs, and miscellaneous).

The final scree plot shows that only about 40% of the variability is explained by the first two components. Despite this, plotting the two components against the Suicide variable highlights that lower values of PC1 are associated with fewer suicide attempts.

One final cumulative scree plot suggests that to retain 90% of the variance in the dataset, 17 principal components should be kept.

PCA Conclusions

By exploring the differences between completing PCA independently on the 3 categories of features (ADHD questionnaire, MD questionnaire, and drug features) and across all categories, it can be established that:

The ADHD question PCA analysis resulted in the first principal component accounting for almost 60% of the variance in the data. To retain 90% of the variance, 9 of the principal components should be kept. This results in a reduction of ADHD features by 50%.
The Mood disorder question PCA analysis resulted in the first principal component accounting for 53% of the variance in the data. When visualizing this against the Suicide feature, there is evidence of some separation of the 2 classes based on the first two components alone. To retain 90% of the variance, 7 of the principal components are needed. This results in a reduction of Mood Disorder Features by a little over 50%.
The drug feature PCA analysis resulted in the first principal component accounting for about 34% of the variance in the data. To retain 90% of the variance, 5 of the principal components should be kept. However, since there are only 7 drug features in the dataset, this reduction might not be necessary.
The final PCA analysis with all features resulted in the first principal component accounting for 30% of the variance in the data. To retain 90% of the variance, 17 principal components are needed.

Suicide Predictions using SVM

SVM is a supervised machine learning algorithm that can be used for classification or regression problems. It relies on a kernel, a technique for transforming the data; based on these transformations, an optimal boundary is found between the possible outputs. Although the data transformations can be quite complex, the end result is a separation of the data based on the labels or outputs that have been defined.

SVM can be run using the e1071 package or the caret package. In the e1071 package, the kernels are linear, radial, polynomial, or sigmoid. In the caret package, the kernels are listed as svmLinear (for linear boundaries), svmRadial (for nonlinear), and svmPoly (for nonlinear). The two packages also specify the kernel differently: in e1071 it is set through the kernel argument of the svm function, while in caret it is set through the method argument.

For the purpose of this project, the caret package is used to fit the SVM models. svmLinear, svmRadial, and svmPoly are each tried to see which has the best outcome. The e1071 package is used for the SVM plot, with the kernel set to whichever kernel performed best in prediction.
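For intuition about what these kernels compute, the sketch below writes out the four kernel functions in plain Python, following the forms documented for e1071's svm function. It is an illustration only, not the report's R code; the default values for gamma, coef0, and degree here are arbitrary choices for the example.

```python
from math import exp, tanh

def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    return (gamma * dot(x, z) + coef0) ** degree

def radial_kernel(x, z, gamma=1.0):
    # Depends only on the squared distance between the two observations.
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return tanh(gamma * dot(x, z) + coef0)

# Two hypothetical observations with two scaled features each.
x, z = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, z))
print(polynomial_kernel(x, z, degree=2, coef0=1.0))
print(radial_kernel(x, z, gamma=0.1))
print(sigmoid_kernel(x, z, gamma=0.01))
```

Each kernel acts as a similarity score between two observations; the linear kernel yields a straight-line boundary in the original feature space, while the other three allow curved boundaries.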

SVM - Dataset

The data used are the train and test sets that were already created, in which the factors have been converted to dummy variables. Some features of the train and test data are removed in order to decrease the size of the data. ‘Race’, ‘Sex’, and ‘Age’ are removed because they may be highly correlated. ‘Education’ is also removed since it is also demographic. Finally, ‘Psych.meds’ is not used since most of its values are missing in the original data.

The features are then separated into the same groups used in the PCA analysis to see whether they lead to the same outcome. Boruta is also used to determine the most important features, which are needed to create the SVM plot: since the dataset includes more than three columns, two features must be selected in order to plot the graph, and the top two features are selected. This is done only for the drug features and for all features. Because it would not make sense to run Boruta feature selection on dummy variables, a second dataset without dummy variables is created for the feature selection. This second dataset is used for the SVM prediction and SVM plot sections. The same features are removed from both datasets.

Lastly, the dataset is split 70/30 into a training set (n = 124) and a test set (n = 51). The latter will be held out for validation.

The best SVM models for the drug features, the ADHD question features, the MD question features, and all features are then compared to see which gives the best prediction of whether a patient attempted suicide.

Drug Features

A separate dataset is created in order to single out the drug features.

This dataset includes Alcohol, THC, Subst.Dx, Cocaine, Stimulants, Sedative.hypnotics, Opioids, and the response Suicide.

Next, SVM on the drug features is used to model whether a patient attempted suicide. The caret package is used for all of the models, and each includes a preProcess step that centers and scales the features.

The svmPoly ends up achieving the best accuracy rate based on the drug features.

Model Validation

After tuning the svmPoly model on the drug features, the accuracy on the test data was 0.73.
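
For readers unfamiliar with how a test accuracy like this is obtained, the sketch below computes a confusion table and the accuracy rate from observed and predicted classes. The two vectors are hypothetical and are not the predictions from this report; the accuracy line mirrors the calculation used in the code appendix.

```r
# Hypothetical observed and predicted classes for a held-out set
# (0 = no attempt, 1 = attempt); not the report's actual predictions.
observed  <- factor(c(0, 0, 0, 1, 1, 0, 1, 0, 0, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0), levels = c(0, 1))

# Confusion table: rows are predictions, columns are observed classes.
conf_tab <- table(Predicted = predicted, Observed = observed)
print(conf_tab)

# Accuracy is the share of test samples classified correctly.
accuracy <- mean(predicted == observed)
print(accuracy)
```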

Drug Feature Importance plot svm

The SVM plot is created using the top two features of the drug features dataset. The kernel of the best drug features model, svmPoly, is used for the plot; because the plot is created with the e1071 package, it is specified as “polynomial”.

Feature Importance of Drugs Data
	meanImp	decision
Alcohol	8.719750	Confirmed
Cocaine	5.143454	Confirmed
Opioids	4.864598	Confirmed
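
The top-two selection can be sketched directly from the importance values reported in the table above. The data frame below is rebuilt by hand for illustration, and the `ranked` and `top_two` names are hypothetical.

```r
# Importance values as reported in the table above.
importance <- data.frame(
  meanImp  = c(8.719750, 5.143454, 4.864598),
  decision = c("Confirmed", "Confirmed", "Confirmed"),
  row.names = c("Alcohol", "Cocaine", "Opioids")
)

# Order by mean importance (descending) and keep the two highest.
ranked  <- importance[order(-importance$meanImp), ]
top_two <- rownames(ranked)[1:2]
print(top_two)
```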

Based on the feature importance, alcohol, cocaine, and opioid use are the drug features most informative for predicting whether a patient attempted suicide. Now to make the SVM plot.

Judging by the decision region in the SVM plot, it seems that a model built on only these top substance features predicts that a patient is unlikely to have attempted suicide.
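
A minimal sketch of how such a two-feature plot might be produced with e1071 follows. It runs on simulated data, so the `toy` data frame, its coefficients, and the resulting decision regions are hypothetical and do not reproduce the plot discussed here.

```r
# Simulated substance-use codes (0 = no use ... 3 = dependence) and a
# simulated response; hypothetical stand-ins for the real training data.
set.seed(622)
n <- 100
toy <- data.frame(
  Alcohol = sample(0:3, n, replace = TRUE),
  Cocaine = sample(0:3, n, replace = TRUE)
)
toy$Suicide <- factor(rbinom(n, 1, plogis(-1.5 + 0.5 * toy$Alcohol)))

if (requireNamespace("e1071", quietly = TRUE)) {
  # "polynomial" is the e1071 counterpart of caret's svmPoly.
  fit <- e1071::svm(Suicide ~ Alcohol + Cocaine, data = toy,
                    kernel = "polynomial")
  # Shaded regions show the predicted class across the two features.
  plot(fit, toy, Alcohol ~ Cocaine)
}
```

If a single shaded region covers the whole plot, the model predicts the same class for every combination of the two features.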

ADHD Questions

The original transformed train data are used here because the ADHD questions were condensed to three factors.

It appears that the svmRadial and svmPoly models result in the same accuracy rate, while svmLinear performs slightly better; thus svmLinear is tuned and used to make predictions on the test set.

Model Validation

Tuning the svmLinear model on the ADHD features resulted in a test-set accuracy of 0.73.

MD Questions

As with the ADHD questions, the original transformed train data are used here because the MD questions were condensed to three factors.

Here, all three models result in the same accuracy rate.

All Features

All of the features will be used for these models.

svmPoly gave the best prediction results out of all of them.

Model Validation

Tuning the svmPoly model on all features resulted in a test-set accuracy of 0.73.

Most Important feature

The SVM plot is created using the top two features of the all-features dataset. The kernel of the best all-features model, svmPoly, is used for the plot; because the plot is created with the e1071 package, it is specified as “polynomial”.

Feature Importance of all of the Features
	meanImp	decision
Abuse	10.492681	Confirmed
MD_f1	7.651798	Confirmed
Alcohol	6.962397	Confirmed
Opioids	4.640637	Confirmed

Based on the feature importance, these top features are the most informative for predicting whether a patient attempted suicide. Now to make the SVM plot.

Looking at the decision region, and given that the points lie along the zero mark, it seems that the model built on these top features predicts that patients would not attempt suicide.

SVM Conclusion

The features were split into the same groups used in the PCA: the drug features, the condensed ADHD questions, the condensed MD questions, and all features combined.

The drug feature models: svmPoly gave the best prediction. Because it only narrowly outperformed svmLinear (0.73 versus 0.71 accuracy), the drug features appear to be close to linear.
The ADHD questions: svmLinear gave the best prediction, which suggests that these features are linear.
The MD questions: all of the models gave the same outcome, so it is not known whether the features are more linear or nonlinear.
All features: the outcome differed from that of the other feature groups. svmLinear gave the worst outcome, while svmRadial and svmPoly gave better predictions, with svmPoly the best. This suggests that the all-features data are more nonlinear.
The SVM plot: the plots did not give the outcome one would expect. Their decision regions suggest that a model based on only the top two features would not predict well whether a patient attempted suicide.
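
One reason a polynomial kernel can perform well on data that is close to linear is that a polynomial kernel of degree 1 reduces to a linear kernel. The sketch below checks this numerically using the form (scale * <x, y> + offset)^degree. The helper functions and vectors are hypothetical, and the degree selected by the tuned svmPoly models is not reported here, so this illustrates the general point only.

```r
# Polynomial kernel: (scale * <x, y> + offset)^degree (hypothetical helper).
poly_kernel <- function(x, y, degree = 1, scale = 1, offset = 0) {
  (scale * sum(x * y) + offset)^degree
}

# Linear kernel: the plain dot product.
linear_kernel <- function(x, y) sum(x * y)

x <- c(1, 2, 0)
y <- c(3, 1, 2)

lin <- linear_kernel(x, y)            # 1*3 + 2*1 + 0*2 = 5
p1  <- poly_kernel(x, y, degree = 1)  # equals the dot product
p2  <- poly_kernel(x, y, degree = 2)  # 5^2 = 25, no longer linear
print(c(linear = lin, poly_deg1 = p1, poly_deg2 = p2))
```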

As in the PCA, the drug features gave a better outcome than the ADHD questions and MD questions. Unlike in the PCA, however, the all-features model gave the best outcome.

To predict whether a patient attempted suicide, the best model would be svmPoly using all features.

Final Discussion & Conclusions

From this analysis, the clustering suggests that there are two major subgroups. Patients in Cluster 1 tend to have higher ADHD and Mood Disorder total scores than patients in Cluster 2. While patients in Cluster 1 report frequent use of most of the listed drugs, patients in Cluster 2 reported more use of opioids in particular. Lastly, Cluster 1 has five times the number of suicide attempts of Cluster 2. Without reading too much into the groups, these distinctions lead to Cluster 1 being named the “highly at-risk for mental health disorder with 1-12 Grade education-level” patients and Cluster 2 the “moderately at-risk for mental health disorder with College education-level” patients.

PCA focused on the three general categories of features (the ADHD questionnaire, MD questionnaire, and drug features) separately and then together. The optimal number of principal components to retain 90% of the variance differed across categories, with nine for the ADHD features, seven for the MD features, and five for the drug features. The categories’ first principal components accounted for approximately 60% of the variance in the ADHD features, approximately 53% of the variance in the MD features, and approximately 34% of the variance in the drug features. Combining the three categories, a final PCA of all features showed seventeen principal components to be necessary to meet the 90% threshold, with the first component accounting for approximately 30% of the total variance.
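
The 90% threshold can be read off the cumulative share of variance, as in the scree plots in the code appendix. The sketch below shows the calculation on a made-up set of eigenvalues; the values and the resulting component count are hypothetical and are not taken from this analysis.

```r
# Hypothetical eigenvalues from a PCA (not the report's actual values).
eigenvalues <- c(4.2, 1.1, 0.8, 0.5, 0.3, 0.1)

# Share of variance per component and its running total.
var_explained  <- eigenvalues / sum(eigenvalues)
cumulative_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 90%.
n_components <- which(cumulative_var >= 0.90)[1]
print(round(cumulative_var, 2))
print(n_components)
```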

SVM modeling to predict Suicide followed the same course as PCA, considering the general feature categories separately first. It compared the performance of linear, radial, and polynomial kernels for each category. The model using a linear kernel performed the best for the ADHD features, and the model with a polynomial kernel performed the best for the drug features, though only narrowly ahead of the linear kernel. These results suggest approximately linear data for these categories of features. Notably, all kernels performed similarly with the MD features, leaving the linearity of the associated data unclear. Considering all features combined, the model using the polynomial kernel performed the best and showed a test set accuracy of approximately 0.73 when predicting Suicide. This value matched or exceeded the accuracy measures reported for the category-specific models.

Works Cited

Gravetter, Frederick J., and Larry B. Wallnau. Statistics for the Behavioral Sciences. 2013. Print.
Kuhn, M. (2019). The caret package. Accessed April 21, 2021, from https://topepo.github.io/caret/.
Kuhn, M., and Johnson, K. (2013). The basics of encoding categorical data for predictive models. Applied Predictive Modeling. Accessed April 21, 2021.
Clustering in R. Data Science Blog by Domino, 31 Mar. 2021.
Alboukadel, et al. 5 Amazing Types of Clustering Methods You Should Know. Datanovia, 25 Dec. 2019.
SVM Model: Support Vector Machine Essentials. Statistical tools for high-throughput data analysis by kassambara, 11/03/2018.
Basic data analysis with palmerpenguins. RBloggers by jhk0530, July 10, 2020.

Code Appendix

The code chunks below represent the R code called in order during the analysis. These are reproduced in the appendix for review and comment.

df <- read_csv("https://raw.githubusercontent.com/greeneyefirefly/DATA622-Group3/main/Project_4/ADHD_data.csv")
df <- df %>% rename_all(make.names)

df[,c(3,4,24:37,46,48:51)] <- lapply(df[,c(3,4,24:37,46,48:51)], 
                                     function(x) factor(x, ordered=FALSE))
df[,c(5:22,38,40:45,47,52:54)] <- lapply(df[,c(5:22,38,40:45,47,52:54)], 
                                         function(x) factor(x, ordered=TRUE))
df <- df %>% select(-Initial)

lapply(subset(df, select=ADHD.Q1:ADHD.Q18), 
       function(col) ctable(x = df$Suicide, y = col, prop = 'r'))

lapply(subset(df, select=MD.Q1a:MD.Q3), function(col) ctable(x = df$Suicide, y = col, prop = 'r'))

tab <- with(df, table(Sex, Suicide))
tab <- as.data.frame(prop.table(tab, margin = 1)) %>%
  filter(Suicide == '1')

tab %>%
  mutate(Sex = fct_recode(Sex, Male='1', Female='2')) %>%
  ggplot() +  
  geom_col(aes(x=Sex, y=Freq, fill=Sex)) +
  geom_label(aes(x=Sex, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme(legend.position='none') +
  labs(title = 'Proportion attempting suicide, by sex',
       y = 'Proportion')

tab <- with(df, table(Race, Suicide))
tab <- as.data.frame(prop.table(tab, margin = 1)) %>%
  filter(Suicide == '1')

tab %>%
  mutate(Race = fct_recode(Race,
                           White='1',
                           Black='2',
                           Hispanic='3',
                           Asian='4',
                           'Native American'='5',
                           'Other/Missing' = '6')) %>%
  ggplot() +  
  geom_col(aes(x=Race, y=Freq, fill=Race)) +
  geom_label(aes(x=Race, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme(legend.position='none') +
  labs(title = 'Proportion attempting suicide, by race/ethnicity',
       y = 'Proportion')

tab <- with(df, table(Education, Suicide))
tab <- as.data.frame(prop.table(tab, margin = 1)) %>%
  filter(Suicide == '1')

tab %>%
  ggplot() +  
  geom_col(aes(x=Education, y=Freq)) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme(legend.position='none') +
  labs(title = 'Proportion attempting suicide, by education level',
       y = 'Proportion')

tab <- with(df, table(Abuse, Suicide))
tab <- as.data.frame(prop.table(tab, margin = 1)) %>%
  filter(Suicide == '1')

tab %>%
  mutate(Abuse = fct_recode(Abuse,
                         No='0',
                         'Physical (P)'='1',
                         'Sexual (S)'='2',
                         'Emotional (E)'='3',
                         'P & S'='4',
                         'P & E'='5',
                         'S & E'='6',
                         'P & S & E'='7')) %>%
  ggplot() +  
  geom_col(aes(x=Abuse, y=Freq, fill=Abuse)) +
  geom_label(aes(x=Abuse, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme(legend.position='none') +
  labs(title = 'Proportion attempting suicide, by abuse category',
       y = 'Proportion') +
  scale_x_discrete(guide = guide_axis(angle = 45))

df %>%
  filter(!is.na(Suicide)) %>%
  mutate(Suicide = fct_recode(Suicide,
                              'No attempt' = '0',
                              Attempt = '1')) %>%
  ggpairs(
    columns = c('Suicide', 'Age', 'ADHD.Total', 'MD.TOTAL'),
    title = "Correlogram of response 'Suicide' and numeric features",
    ggplot2::aes(color = Suicide),
    progress = FALSE,
    lower = list(continuous = wrap(
      "smooth", alpha = 0.3, size = 0.1
    ))
  )

# calculate cronbach's alpha - ADHD
temp = as.data.frame(sapply(df[,c(4:21)], factor))
Cronbach.ADHD <- psych::alpha(sapply(temp,as.numeric), check.keys=F)
as.data.frame(cbind(Items = names(temp),
                    alpha = round(Cronbach.ADHD[["alpha.drop"]][["raw_alpha"]],3),
                    G6 = round(Cronbach.ADHD[["alpha.drop"]][["G6(smc)"]],3),
                    cor = round(Cronbach.ADHD[["item.stats"]][["r.cor"]],3)))

# calculate cronbach's alpha - MD
temp = as.data.frame(sapply(df[,c(23:37)], factor))
Cronbach.MD <- psych::alpha(sapply(temp,as.numeric), check.keys=F)
as.data.frame(cbind(Items = names(temp),
                    alpha = round(Cronbach.MD[["alpha.drop"]][["raw_alpha"]],3),
                    G6 = round(Cronbach.MD[["alpha.drop"]][["G6(smc)"]],3),
                    cor = round(Cronbach.MD[["item.stats"]][["r.cor"]],3)))

# Omitting NAs
temp = na.omit(as.data.frame(sapply(df[,c(4:21)], factor)))
# Run factor analysis
factoranalysis1 <- factanal(sapply(temp,as.numeric), 3, rotation="promax", 
                            scores = "regression")
#print(factoranalysis1, digits=2, cutoff=.2, sort=TRUE)

load <- factoranalysis1$loadings[,1:2]
plot(load, type="n", xlim = c(-1.5, 1.5)) 
text(load, labels=names(temp), cex=.7)

loads <- factoranalysis1$loadings
fa.diagram(loads)

# Omitting NAs
temp = na.omit(as.data.frame(sapply(df[,c(23:37)], factor)))
# Run factor analysis
factoranalysis2 <- factanal(sapply(temp,as.numeric), 3, rotation="promax", 
                            scores = "regression")
#print(factoranalysis2, digits=2, cutoff=.2, sort=TRUE)

load <- factoranalysis2$loadings[,1:2]
plot(load, type="n", xlim = c(-1.5, 1.5)) 
text(load, labels=names(temp), cex=.7)

loads <- factoranalysis2$loadings
fa.diagram(loads)

ADHD_scores <- as.data.frame(factoranalysis1$scores) %>% 
  rename(ADHD_f1 = Factor1, ADHD_f2 = Factor2, ADHD_f3 = Factor3)
MD_scores <- as.data.frame(factoranalysis2$scores) %>% 
  rename(MD_f1 = Factor1, MD_f2 = Factor2, MD_f3 = Factor3)
df_factors <- df %>% select(-c(starts_with("ADHD.Q"), starts_with("MD.Q")))
df_factors <- cbind(df_factors, ADHD_scores, MD_scores)

col_missing <- colnames(df_factors)[colSums(is.na(df_factors)) > 0]
aggr(df_factors[,col_missing], col=c('navyblue','red'), numbers=TRUE, 
     sortVars=TRUE, labels=names(df_factors[,col_missing]), 
     cex.axis=.7, oma=c(10,5,3,3), ylab=c("Histogram","Patterns"))

kable(sapply(df_impute[c('Age','ADHD.Total','MD.TOTAL','ADHD_f1','ADHD_f2',
                         'ADHD_f3','MD_f1','MD_f2','MD_f3')], skewness), 
      col.names = c('Skewness'),
      caption = 'Feature Skewness')  %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)

set.seed(622)
df_dummy <- dummyVars(Suicide~ ., data = df_impute)
df_dummy <- data.frame(predict(df_dummy, newdata = df_impute))
df_transform <- df_dummy %>% preProcess(method = c('center','scale')) %>% 
  predict(df_dummy) %>% cbind(df_impute$Suicide) %>% 
  rename(Suicide = 'df_impute$Suicide')

ADHD_df <- read_csv(
  "https://raw.githubusercontent.com/greeneyefirefly/DATA622-Group3/main/Project_4/ADHD_data.csv")
ADHD_df <- ADHD_df %>% rename_all(make.names)
ADHD_df <- na.omit(ADHD_df)
ADHD_df <- ADHD_df[, -which(names(ADHD_df) %in% c("Initial"))]
dfc <- scale(ADHD_df)

distance <- get_dist(dfc)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

k2 <- kmeans(dfc, centers = 2, nstart = 25)
str(k2)

fviz_cluster(k2, data = dfc)

ADHD_df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster) %>%
  ggplot(aes(Age, Suicide, color = factor(cluster), label = Suicide)) +
  geom_text()

k3 <- kmeans(dfc, centers = 3, nstart = 25)
k4 <- kmeans(dfc, centers = 4, nstart = 25)
k5 <- kmeans(dfc, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = dfc) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = dfc) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = dfc) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = dfc) + ggtitle("k = 5")
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)

# function to compute average silhouette for k clusters
avg_sil <- function(k) {
  km.res <- kmeans(dfc, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(dfc))
  mean(ss[, 3])
}
# Compute and plot average silhouette for k = 2 to k = 15
k.values <- 2:15
# extract avg silhouette for 2-15 clusters
avg_sil_values <- map_dbl(k.values, avg_sil)
plot(k.values, avg_sil_values,
       type = "b", pch = 19, frame = FALSE, 
       xlab = "Number of clusters K",
       ylab = "Average Silhouettes")

fviz_nbclust(dfc, kmeans, method = "silhouette")

# Compute k-means clustering with k = 2
set.seed(123)
final <- kmeans(dfc, 2, nstart = 25)
#print(final)

fviz_cluster(final, data = dfc)

as.data.frame(ADHD_df) %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean") %>% 
  kable(digits=2L) %>% kable_styling() %>%
  scroll_box(width = "100%", height = "100%")

pca_base <- df %>%
  select(-Age, -Sex, -Race, -ADHD.Total, -MD.TOTAL, -Education, -Psych.meds.) %>%
  filter(ADHD.Q5 != 5) %>%
  drop_na()

df_adhd <- pca_base %>% 
  select(starts_with('ADHD'), -Suicide) %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)

poly_adhd <- polychoric(df_adhd)
adhd_rho <- poly_adhd$rho

adhd_pca <- principal(r = adhd_rho, nfactors = 2, covar = TRUE, rotate="none")

adhd_var <- adhd_pca$values/sum(adhd_pca$values)
rounded_adhd_var <- round(adhd_var,2)

p1 = qplot(c(1:length(adhd_pca$values)), adhd_var) + 
  geom_line() + 
  geom_text(aes(label=rounded_adhd_var),hjust=0, vjust=-1) +
  labs(x = "Principal Component", 
       y = "Variance Explained",
       title = "Scree Plot: PCA on ADHD Questions") +
  ylim(0, 1)

adhd_pca$scores <- factor.scores(df_adhd, adhd_pca) 
df_adhd_final <- cbind(pca_base %>% select(Suicide), adhd_pca$scores$scores)

p2 = ggplot(df_adhd_final, aes(PC1, PC2, col = Suicide, fill = Suicide)) +
  stat_ellipse(geom = 'polygon', col = 'black', alpha = 0.5) +
  geom_point(shape = 21, col = 'black')

gridExtra::grid.arrange(p1,p2, nrow = 2)

adhd_cumulative_var <- cumsum(adhd_pca$values/sum(adhd_pca$values))
rounded_adhd_cumulative_var <- round(adhd_cumulative_var,2)

qplot(c(1:length(adhd_pca$values)), adhd_cumulative_var) + 
  geom_line() + 
  geom_text(aes(label=rounded_adhd_cumulative_var),hjust=0, vjust=1) +
  labs(x = "Principal Component", 
       y = "Cumulative Variance Explained",
       title = "Scree Plot: PCA on ADHD Questions") +
  ylim(0, 1)

df_mood <- pca_base %>% 
  select(starts_with('MD'), -Suicide) %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)

poly_mood <- polychoric(df_mood)
mood_rho <- poly_mood$rho

mood_pca <- principal(r = mood_rho, nfactors = 2, covar = TRUE,  rotate="none")

mood_var <- mood_pca$values/sum(mood_pca$values)
rounded_mood_var <- round(mood_var,2)

p1 = qplot(c(1:length(mood_pca$values)), mood_var) + 
  geom_line() + 
  geom_text(aes(label=rounded_mood_var),hjust=0, vjust=-1) +
  labs(x = "Principal Component", 
       y = "Variance Explained",
       title = "Scree Plot: PCA on Mood Disorder Questions") +
  ylim(0, 1)

mood_pca$scores <- factor.scores(df_mood,mood_pca) 
df_mood_final <- cbind(pca_base %>% select(Suicide), mood_pca$scores$scores)

p2 = ggplot(df_mood_final, aes(PC1, PC2, col = Suicide, fill = Suicide)) +
  stat_ellipse(geom = 'polygon', col = 'black', alpha = 0.5) +
  geom_point(shape = 21, col = 'black')

gridExtra::grid.arrange(p1,p2, nrow = 2)

mood_cumulative_var <- cumsum(mood_pca$values/sum(mood_pca$values))
rounded_mood_cumulative_var <- round(mood_cumulative_var,2)

qplot(c(1:length(mood_pca$values)), mood_cumulative_var) + 
  geom_line() + 
  geom_text(aes(label=rounded_mood_cumulative_var),hjust=0, vjust=1) +
  labs(x = "Principal Component", 
       y = "Cumulative Variance Explained",
       title = "Scree Plot: PCA on Mood Disorder Questions") +
  ylim(0, 1)

df_drug <- pca_base %>% 
  select(Alcohol, THC, Cocaine, Stimulants, Sedative.hypnotics, Opioids, Subst.Dx) %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)

poly_drug <- polychoric(df_drug)
drug_rho <- poly_drug$rho

drug_pca <- principal(r = drug_rho, nfactors = 2, covar = TRUE, rotate= 'none')

drug_var <- drug_pca$values/sum(drug_pca$values)
rounded_drug_var <- round(drug_var,2)

p1= qplot(c(1:length(drug_pca$values)), drug_var) + 
  geom_line() + 
  geom_text(aes(label=rounded_drug_var),hjust=0, vjust=-1) +
  labs(x = "Principal Component", 
       y = "Variance Explained",
       title = "Scree Plot: PCA on Drug Features") +
  ylim(0, 1)

drug_pca$scores <- factor.scores(df_drug, drug_pca) 
df_drug_final <- cbind(pca_base %>% select(Suicide), drug_pca$scores$scores)

p2 = ggplot(df_drug_final, aes(PC1, PC2, col = Suicide, fill = Suicide)) +
  stat_ellipse(geom = 'polygon', col = 'black', alpha = 0.5) +
  geom_point(shape = 21, col = 'black')

gridExtra::grid.arrange(p1,p2, nrow = 2)

drug_cumulative_var <- cumsum(drug_pca$values/sum(drug_pca$values))
rounded_drug_cumulative_var <- round(drug_cumulative_var,2)

qplot(c(1:length(drug_pca$values)), drug_cumulative_var) + 
  geom_line() + 
  geom_text(aes(label=rounded_drug_cumulative_var),hjust=0, vjust=1) +
  labs(x = "Principal Component", 
       y = "Cumulative Variance Explained",
       title = "Scree Plot: PCA on Drug Features") +
  ylim(0, 1)

df_all <- pca_base %>% 
  select(-Suicide, -Sedative.hypnotics) %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)

poly_all <- polychoric(df_all)
all_rho <- poly_all$rho

all_pca <- principal(r = all_rho, nfactors = 2, covar = TRUE, rotate = 'none')

all_var <- all_pca$values/sum(all_pca$values)
rounded_all_var <- round(all_var,1)

p1 = qplot(c(1:length(all_pca$values)), all_var) + 
  geom_line() + 
  geom_text(aes(label=rounded_all_var),hjust=0, vjust=-1) +
  labs(x = "Principal Component", 
       y = "Variance Explained",
       title = "Scree Plot: PCA on Selected Features") +
  ylim(0, 1)

all_pca$scores <- factor.scores(df_all, all_pca) 
df_all_final <- cbind(pca_base %>% select(Suicide), all_pca$scores$scores)

p2 = ggplot(df_all_final, aes(PC1, PC2, col = Suicide, fill = Suicide)) +
  stat_ellipse(geom = 'polygon', col = 'black', alpha = 0.5) +
  geom_point(shape = 21, col = 'black')

gridExtra::grid.arrange(p1,p2, nrow = 2)

all_cumulative_var <- cumsum(all_pca$values/sum(all_pca$values))
rounded_all_cumulative_var <- round(all_cumulative_var,1)

qplot(c(1:length(all_pca$values)), all_cumulative_var) + 
  geom_line() + 
  geom_text(aes(label=rounded_all_cumulative_var),hjust=0, vjust=1) +
  labs(x = "Principal Component", 
       y = "Cumulative Variance Explained",
       title = "Scree Plot: PCA on Selected Features") +
  ylim(0, 1)

set.seed(622)
index <- as.vector(createDataPartition(df_impute$Suicide, p = .70, list = FALSE))
svm_train <- df_impute[index,] # 124 observations
svm_test <- df_impute[-index,] # 51 observations

index <- as.vector(createDataPartition(df_transform$Suicide, p = .70, list = FALSE))
train <- df_transform[index,] # 124 observations
test <- df_transform[-index,] # 51 observations

svm_train <- svm_train %>%
  select(-Age, -Sex, -Race, -ADHD.Total, -MD.TOTAL, -Education, -Psych.meds.) %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)

svm_test <- svm_test %>%
  select(-Age, -Sex, -Race, -ADHD.Total, -MD.TOTAL, -Education, -Psych.meds.) %>%
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)%>%
  drop_na()

svm_train$Suicide<-as.factor(svm_train$Suicide)
svm_test$Suicide<-as.factor(svm_test$Suicide)

svm_train2<- train%>%
  select(-Race.1,-Race.2,-Race.3, -Race.6, -Age, -Sex.1,-Sex.2,
         -ADHD.Total, -MD.TOTAL, -starts_with('Education'), 
         -Psych.meds..L, -Psych.meds..Q)

svm_test2<- test%>%
   select(-Race.1,-Race.2,-Race.3, -Race.6, -Age, -Sex.1,-Sex.2,
          -ADHD.Total, -MD.TOTAL, -starts_with('Education'), 
          -Psych.meds..L, -Psych.meds..Q)

drugs_df<-svm_train%>%
    select(Alcohol, THC, Cocaine, Stimulants, Sedative.hypnotics, Opioids, Subst.Dx, Suicide)

set.seed(525)
# Name the response column directly so that '.' excludes Suicide from the predictors
output = Boruta(Suicide ~ ., data = drugs_df, doTrace = 0)
# Resolve tentative features once and reuse the result
roughFixMod = TentativeRoughFix(output)
importance = attStats(roughFixMod)
importance = importance[importance$decision != 'Rejected', c('meanImp', 'decision')]
kable(head(importance[order(-importance$meanImp), ]), 
      caption = "Feature Importance of Drugs Data") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)

set.seed(123)
drugs_model <- train(
  Suicide ~., data = drugs_df, method = "svmLinear",
  tuneLength = 5,
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

drug_pred <-predict(drugs_model, svm_test)

# confusionMatrix(table(drug_pred, svm_test$Suicide))

pCol <- c("#37004D", '#ff8301', '#bf5ccb')
svm_test1<-svm_test
svm_test1$drug_pred<- drug_pred

p1 = ggplot(svm_test1, aes(x = Suicide, y = drug_pred, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol) +
  labs(title = 'svmLinear Accuracy : 0.71')

set.seed(123)
drugs_model_rad <- train(
  Suicide ~., data = drugs_df, method = "svmRadial",
  tuneLength = 5,
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

drug_pred_rad <-predict(drugs_model_rad, svm_test)

# confusionMatrix(table(drug_pred_rad, svm_test$Suicide))

svm_test1$drug_pred_rad<- drug_pred_rad

p2 = ggplot(svm_test1, aes(x = Suicide, y = drug_pred_rad, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol) +
  labs(title = 'svmRadial Accuracy : 0.69')

set.seed(123)
drugs_model_poly <- train(
  Suicide ~., data = drugs_df, method = "svmPoly",
  tuneLength = 5,
  metric="Accuracy",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

drug_pred_poly <-predict(drugs_model_poly, svm_test)

# confusionMatrix(table(drug_pred_poly, svm_test$Suicide))

svm_test1$drug_pred_poly<- drug_pred_poly

p3 = ggplot(svm_test1, aes(x = Suicide, y = drug_pred_poly, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmPoly Accuracy : 0.73')

set.seed(123)
drugs_tuned <- train(
  Suicide ~. , data = drugs_df, method = "svmPoly",
  trControl = trainControl("cv", number = 10),
  tuneLength= 5,
  preProcess = c("center","scale")
  )

# Make predictions on the test data
predicted.classes <- drugs_tuned %>% predict(svm_test)
# Compute model accuracy rate
acc = mean(predicted.classes == svm_test$Suicide)

ADHD_df<-svm_train2 %>%
  select(starts_with("ADHD_"), Suicide)

set.seed(123)
ADHD_model <- train(
  Suicide ~., data = ADHD_df, method = "svmLinear",
   tuneLength = 14,
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

ADHD_pred <-predict(ADHD_model, svm_test2)

# confusionMatrix(table(ADHD_pred, svm_test2$Suicide))

pCol <- c( '#12B79E', '#92EED4')
svm_test3<-svm_test2
svm_test3$ADHD_pred<- ADHD_pred

p1 = ggplot(svm_test3, aes(x = Suicide, y = ADHD_pred, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmLinear Accuracy : 0.71')

set.seed(123)
ADHD_model_rad <- train(
  Suicide ~., data = ADHD_df, method = "svmRadial",
   tuneLength = 14,
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )
ADHD_pred_rad <-predict(ADHD_model_rad, svm_test2)

# confusionMatrix(table(ADHD_pred_rad, svm_test2$Suicide))

svm_test3$ADHD_pred_rad<- ADHD_pred_rad

p2 = ggplot(svm_test3, aes(x = Suicide, y = ADHD_pred_rad, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmRadial Accuracy : 0.69')

set.seed(123)
ADHD_model_poly <- train(
  Suicide ~., data = ADHD_df, method = "svmPoly",
   tuneLength = 5,
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

ADHD_pred_poly <-predict(ADHD_model_poly, svm_test2)

# confusionMatrix(table(ADHD_pred_poly, svm_test2$Suicide))

svm_test3$ADHD_pred_poly<- ADHD_pred_poly

p3 = ggplot(svm_test3, aes(x = Suicide, y = ADHD_pred_poly, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmPoly Accuracy : 0.69')

set.seed(123)
ADHD_tuned <- train(
  Suicide ~. , data = ADHD_df, method = "svmLinear",
  trControl = trainControl("cv", number = 10),
  # The cost C must be positive, so the grid starts above zero
  tuneGrid = expand.grid(C = seq(0.1, 2, length = 20)),
  preProcess = c("center","scale")
  )

# Make predictions on the test data with the tuned ADHD model
predicted.classes <- ADHD_tuned %>% predict(svm_test2)
# Compute model accuracy rate
acc = mean(predicted.classes == svm_test2$Suicide)

MD_df<-svm_train2%>%
  select(starts_with("MD_"), Suicide)

set.seed(123)
MD_model <- train(
  Suicide ~., data = MD_df, method = "svmLinear",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

MD_pred <-predict(MD_model, svm_test2)

# confusionMatrix(table(MD_pred, svm_test2$Suicide))

pCol <- c( '#ff8301', '#bf5ccb')
svm_test3<-svm_test2
svm_test3$MD_pred<- MD_pred

p1 = ggplot(svm_test3, aes(x = Suicide, y = MD_pred, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmLinear Accuracy : 0.69')

set.seed(123)
MD_model_rad <- train(
  Suicide ~., data = MD_df, method = "svmRadial",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

MD_pred_rad <-predict(MD_model_rad, svm_test2)

# confusionMatrix(table(MD_pred_rad, svm_test2$Suicide))

svm_test3$MD_pred_rad<- MD_pred_rad

p2 = ggplot(svm_test3, aes(x = Suicide, y = MD_pred_rad, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmRadial Accuracy : 0.69')

set.seed(123)
MD_model_poly <- train(
  Suicide ~., data = MD_df, method = "svmPoly",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

MD_pred_poly <-predict(MD_model_poly, svm_test2)

# confusionMatrix(table(MD_pred_poly, svm_test2$Suicide))

svm_test3$MD_pred_poly<- MD_pred_poly

p3 = ggplot(svm_test3, aes(x = Suicide, y = MD_pred_poly, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmPoly Accuracy : 0.69')

gridExtra::grid.arrange(p1,p2,p3, nrow=3)

set.seed(525)
# Name the response column directly so that '.' excludes Suicide from the predictors
output = Boruta(Suicide ~ ., data = svm_train, doTrace = 0)
# Resolve tentative features once and reuse the result
roughFixMod = TentativeRoughFix(output)
importance = attStats(roughFixMod)
importance = importance[importance$decision != 'Rejected', c('meanImp', 'decision')]
kable(head(importance[order(-importance$meanImp), ]), 
      caption = "Feature Importance of all of the Features") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)

set.seed(123)
all_model <- train(
  Suicide ~., data = svm_train, method = "svmLinear",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

all_pred <-predict(all_model, svm_test)

# confusionMatrix(table(all_pred, svm_test$Suicide))

pCol <- c( "#999999", "#56B4E9")

svm_test1$all_pred<- all_pred

p1 = ggplot(svm_test1, aes(x = Suicide, y = all_pred, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmLinear Accuracy : 0.71')

set.seed(123)
all_model_rad <- train(
  Suicide ~., data = svm_train, method = "svmRadial",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )


all_pred_rad <-predict(all_model_rad, svm_test)

# confusionMatrix(table(all_pred_rad, svm_test$Suicide))

svm_test1$all_pred_rad<- all_pred_rad

p2 = ggplot(svm_test1, aes(x = Suicide, y = all_pred_rad, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmRadial Accuracy : 0.69')

set.seed(123)
all_model_poly <- train(
  Suicide ~., data = svm_train, method = "svmPoly",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale")
  )

all_pred_poly <-predict(all_model_poly, svm_test)

# confusionMatrix(table(all_pred_poly, svm_test$Suicide))

svm_test1$all_pred_poly<- all_pred_poly

p3 = ggplot(svm_test1, aes(x = Suicide, y = all_pred_poly, color = Suicide)) +
  geom_jitter(size = 3) +
  scale_color_manual(values = pCol)+
  labs(title = 'svmPoly Accuracy : 0.75')

gridExtra::grid.arrange(p1,p2,p3, nrow=3)

set.seed(123)
aamodel <- train(
  Suicide ~., data = svm_train, method = "svmPoly",
  trControl = trainControl("cv", number = 10),
  tuneLength = 5,
  preProcess = c("center","scale")
  )

all_pred_best <-predict(aamodel, svm_test)

# confusionMatrix(table(all_pred_best, svm_test$Suicide))

Spring 2021 - Group 3 - Homework 4

Maryluz Cruz, Samantha Deokinanan, Amber Ferger, Tony Mei, and Charlie Rosemond

6th May, 2021
