1 Prompt

For this assignment, we will be working with a very interesting mental health data set from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet contains the data (in the tab named “Data”), and the data dictionary is given in the second tab. You can get as creative as you want; the assignment is designed to get you thinking about how you could use different methods.

  1. Please use a clustering method to find clusters of patients here. Whether you choose to use k-means clustering or hierarchical clustering is up to you as long as you reason through your work. Can you come up with creative names for the profiles you found? (60)

  2. Let’s explore using Principal Component Analysis on this data set. You will note that there are different types of questions in the data set: columns E-W: ADHD self-report; columns X-AM: mood disorders questionnaire; columns AN-AS: Individual Substance Misuse; etc. Please reason through your work as you decide which sets of variables you want to use to conduct Principal Component Analysis. (60)

  3. Assume you are modeling whether a patient attempted suicide (column AX). Please use a support vector machine to model this. You might want to consider reducing the number of variables or somehow using extracted information from the variables. This can be a really fun modeling task! (80)

2 Data Extraction and Variable Selection

In researching the specifics of the attention deficit hyperactivity disorder (ADHD) questionnaire (https://add.org/wp-content/uploads/2015/03/adhd-questionnaire-ASRS111.pdf) and the mood disorder (MD) questionnaire (https://www.ohsu.edu/sites/default/files/2019-06/cms-quality-bipolar_disorder_mdq_screener.pdf), derived variables were created and assessed in an attempt to better characterize the diagnosis of the patient providing the survey answers. Through trial and error, the mood disorder questionnaire proved the most valuable for predicting a suicide attempt. According to the scoring instructions, answering Yes (1) to seven or more of the events in question 1, answering Yes to question 2, and giving a response of two or three to question 3 warrants a recommendation for further medical assessment for bipolar disorder. Based on this scoring rule, a count of the Yes answers to question 1 is calculated; that count, along with the responses to questions 2 and 3, is used as input to the SVM model. Based on the PCA, abuse and sex were also identified as significant to the determination of a suicide attempt. The ADHD total also showed significance, but after several trials it was removed because it did not improve the model’s test results.

A variable is created for the count of Yes answers to the events in MD question 1. Sex, Abuse, and MD.Q2 are converted to factor variables rather than left as character variables, and the dependent variable, Suicide, is also defined as a factor. For classification in tidymodels the outcome must be a factor, and the nominal predictors are kept as factors so they can later be converted to dummy variables in the recipe, alongside the numeric variables.
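
As a rough illustration, the derived count and the factor conversions might be created as below. This is a sketch only: the data frame name (mh_data), the 0/1 coding of Suicide, and the assumption that the 13 question-1 items are adjacent columns come from the data dictionary rather than from code shown in this report.

    library(dplyr)

    mh_data <- mh_data %>%
      mutate(
        # count of Yes (1) answers across the 13 MD question-1 items
        MD.Q1.Count = rowSums(across(MD.Q1a:MD.Q1m)),
        # categorical fields as factors for the modeling recipe
        MD.Q2  = factor(MD.Q2),
        fSex   = factor(Sex, labels = c("M", "F")),  # coding assumed from the data dictionary
        fAbuse = factor(Abuse, ordered = TRUE),
        # dependent variable, assuming 0 = No and 1 = Yes
        fSuic  = factor(Suicide, levels = c(0, 1), labels = c("No", "Yes"))
      )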

Data summary
Name data
Number of rows 171
Number of columns 53
_______________________
Column type frequency:
character 4
factor 4
numeric 45
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
fRace 0 1 2 2 0 4 0
fCO 0 1 2 3 0 2 0
fHViol 0 1 2 3 0 2 0
fDCond 0 1 2 3 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
MD.Q2 0 1 FALSE 2 1: 125, 0: 46
fSex 0 1 FALSE 2 M: 95, F: 76
fSuic 0 1 FALSE 2 No: 122, Yes: 49
fAbuse 0 1 TRUE 8 0: 111, 2: 20, 5: 10, 1: 8

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 39.19 11.03 18 29.0 41 48.0 61 ▅▅▅▇▃
ADHD.Q1 0 1 1.70 1.30 0 1.0 2 3.0 4 ▇▇▇▆▃
ADHD.Q2 0 1 1.91 1.25 0 1.0 2 3.0 4 ▅▇▇▆▃
ADHD.Q3 0 1 1.92 1.27 0 1.0 2 3.0 4 ▅▇▇▆▅
ADHD.Q4 0 1 2.12 1.34 0 1.0 2 3.0 4 ▅▅▇▅▆
ADHD.Q5 0 1 2.26 1.45 0 1.0 3 3.0 5 ▇▅▇▆▁
ADHD.Q6 0 1 1.91 1.31 0 1.0 2 3.0 4 ▆▅▇▇▃
ADHD.Q7 0 1 1.84 1.19 0 1.0 2 3.0 4 ▃▇▇▃▃
ADHD.Q8 0 1 2.16 1.29 0 1.0 2 3.0 4 ▃▇▇▇▆
ADHD.Q9 0 1 1.92 1.32 0 1.0 2 3.0 4 ▆▇▆▇▅
ADHD.Q10 0 1 2.14 1.23 0 1.0 2 3.0 4 ▂▇▇▅▅
ADHD.Q11 0 1 2.29 1.24 0 1.0 2 3.0 4 ▂▆▇▇▆
ADHD.Q12 0 1 1.29 1.20 0 0.0 1 2.0 4 ▇▇▆▂▂
ADHD.Q13 0 1 2.37 1.24 0 1.5 2 3.0 4 ▂▅▇▇▆
ADHD.Q14 0 1 2.26 1.35 0 1.0 2 3.0 4 ▅▅▆▇▆
ADHD.Q15 0 1 1.64 1.40 0 0.0 1 3.0 4 ▇▆▆▅▃
ADHD.Q16 0 1 1.73 1.38 0 1.0 1 3.0 4 ▆▇▆▃▅
ADHD.Q17 0 1 1.53 1.29 0 0.0 1 2.0 4 ▇▇▇▃▃
ADHD.Q18 0 1 1.49 1.31 0 0.0 1 2.0 4 ▇▇▆▃▃
ADHD.Total 0 1 34.51 16.72 0 21.0 34 48.0 72 ▃▆▇▆▂
MD.Q1a 0 1 0.55 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1b 0 1 0.57 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1c 0 1 0.54 0.50 0 0.0 1 1.0 1 ▇▁▁▁▇
MD.Q1d 0 1 0.58 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1e 0 1 0.56 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1f 0 1 0.70 0.46 0 0.0 1 1.0 1 ▃▁▁▁▇
MD.Q1g 0 1 0.73 0.45 0 0.0 1 1.0 1 ▃▁▁▁▇
MD.Q1h 0 1 0.57 0.50 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1i 0 1 0.60 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1j 0 1 0.39 0.49 0 0.0 0 1.0 1 ▇▁▁▁▅
MD.Q1k 0 1 0.48 0.50 0 0.0 0 1.0 1 ▇▁▁▁▇
MD.Q1L 0 1 0.58 0.49 0 0.0 1 1.0 1 ▆▁▁▁▇
MD.Q1m 0 1 0.49 0.50 0 0.0 0 1.0 1 ▇▁▁▁▇
MD.Q3 0 1 2.02 1.06 0 1.0 2 3.0 3 ▂▃▁▅▇
MD.TOTAL 0 1 10.08 4.75 0 7.0 11 14.0 17 ▃▃▆▇▇
Alcohol 0 1 1.35 1.39 0 0.0 1 3.0 3 ▇▂▁▁▆
THC 0 1 0.81 1.27 0 0.0 0 1.5 3 ▇▁▁▁▃
Cocaine 0 1 1.09 1.39 0 0.0 0 3.0 3 ▇▁▁▁▅
Stimulants 0 1 0.12 0.53 0 0.0 0 0.0 3 ▇▁▁▁▁
Sedative.hypnotics 0 1 0.12 0.54 0 0.0 0 0.0 3 ▇▁▁▁▁
Opioids 0 1 0.39 0.99 0 0.0 0 0.0 3 ▇▁▁▁▁
Education 0 1 11.91 2.14 6 11.0 12 12.5 19 ▁▅▇▂▁
fNonDx 0 1 0.39 0.65 0 0.0 0 1.0 2 ▇▁▂▁▁
fSubsDx 0 1 1.35 1.05 0 1.0 1 2.0 3 ▆▇▁▅▅
MD.Q1.Count 0 1 7.35 3.91 0 4.0 8 11.0 13 ▅▅▅▇▇

The final imputed data set contains 171 instances, of which 49 represent suicide attempts.

## 
##  No Yes 
## 122  49

The data set is initially split into training and test sets, with 75% of the data used for training (129 of the 171 instances); the remaining 42 instances form the test set.

## <Analysis/Assess/Total>
## <129/42/171>
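
A minimal sketch of that split with rsample (the seed value is illustrative; the report does not show the one actually used):

    library(tidymodels)

    set.seed(2021)                                 # illustrative seed only
    suic_split <- initial_split(mh_data, prop = 0.75, strata = fSuic)
    train_data <- training(suic_split)             # 129 rows
    test_data  <- testing(suic_split)              # 42 rows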

3 Model Definition

Using tidymodels, the recipe function defines the independent variables, Sex, Abuse, MD.Q1.Count, MD.Q2, and MD.Q3, in relation to the dependent variable, Suicide. To properly prepare the numeric and factor data, the recipe includes a step to normalize the numeric predictors and a step to create dummy variables from the factor predictors. These steps put all predictors on comparable scales, which matters for the distance calculations underlying the SVM.
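
A sketch of that recipe, using the factor names from the data summary above:

    svm_recipe <- recipe(fSuic ~ fSex + fAbuse + MD.Q1.Count + MD.Q2 + MD.Q3,
                         data = train_data) %>%
      step_normalize(all_numeric_predictors()) %>%
      step_dummy(all_nominal_predictors())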

To evaluate the different versions of the SVM model, the metrics ROC AUC, accuracy, and kappa are used.
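
With yardstick, those three metrics can be bundled into a single metric set for tuning:

    svm_metrics <- metric_set(roc_auc, accuracy, kap)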

Given that the test data set contains 42 instances and, with the current seed value, 31 of the 42 are No for Suicide, simply predicting No for every instance would achieve an accuracy of 73.8%. This accuracy provides a baseline against which to judge the model.
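
The baseline is simply the majority-class proportion in the test set, for example:

    test_data %>% count(fSuic) %>% mutate(prop = n / sum(n))
    # predicting "No" for every instance: 31 / 42 = 0.738 (rounded)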

With tidymodels, the SVM model is available with either a polynomial kernel or a radial basis function (RBF) kernel. The RBF kernel was selected because it measures similarity between feature vectors: the kernel value ranges from zero to one and decreases as the distance between vectors grows. With the small size of the data set and the limited number of variables, the RBF kernel outperformed the polynomial kernel.

tidymodels also allows the model parameters to be tuned; for the RBF kernel, the two tunable parameters are cost and rbf_sigma.

With the opportunity to tune the parameters, cross-validation with five folds and three repeats (15 resamples in total) proved the most effective in terms of the final result while keeping the model as simple as possible.
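
A sketch of the model specification, resamples, and tuning run under those settings; the grid size is an assumption, since the report does not state it:

    svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
      set_engine("kernlab") %>%
      set_mode("classification")

    svm_wflow <- workflow() %>%
      add_recipe(svm_recipe) %>%
      add_model(svm_spec)

    set.seed(2021)
    folds <- vfold_cv(train_data, v = 5, repeats = 3, strata = fSuic)

    svm_tuned <- tune_grid(
      svm_wflow,
      resamples = folds,
      grid      = 20,            # illustrative grid size
      metrics   = svm_metrics
    )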

4 Model Evaluation

The parameter tuning ran a five-fold cross-validation repeated three times. The marginal plot shows each tuning parameter plotted against performance. Both parameters are presented on log scales so that the values along the x-axis are spread evenly for display. As the plot shows, performance in terms of ROC AUC is consistently above 0.650, with several points above 0.700.
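
A marginal plot of this kind can be produced with tune’s autoplot method, for example:

    autoplot(svm_tuned, metric = "roc_auc")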

The raw metric output shows very poor performance for kappa, with only a couple of the tuned models producing a value above zero. The best accuracy results sit at a consistent 0.705, which realistically corresponds to selecting “No” for every instance.

Finally, the output for the best ROC AUC indicates good performance, with the top five results greater than 0.67, and suggests that ROC AUC ranks the candidate models differently than the accuracy metric does.
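
These summaries correspond to tune’s standard helpers:

    collect_metrics(svm_tuned)                        # raw resampled metrics
    show_best(svm_tuned, metric = "accuracy", n = 5)
    show_best(svm_tuned, metric = "roc_auc",  n = 5)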

In previous modeling attempts, the kappa metric was used to choose the best tuned model before making the final predictions. Because the kappa values were consistently low, the decision was made to base model selection on the ROC AUC metric instead. In repeated attempts, ROC AUC proved better at selecting a model with higher accuracy on the test data set. The RBF-kernel SVM with the best ROC AUC has a cost parameter of 0.501 and an rbf_sigma value of 0.0426.
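
Selecting and finalizing that model might look like the following, reusing the object names from the earlier sketches:

    best_auc <- select_best(svm_tuned, metric = "roc_auc")

    final_wflow <- finalize_workflow(svm_wflow, best_auc)
    final_fit   <- fit(final_wflow, data = train_data)
    final_fit                      # prints the trained workflow shown below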

The summary of the final SVM model object reiterates the findings and decisions described above: an SVM model with an RBF kernel and two recipe steps, normalization of the numeric data and creation of dummy variables from the factor data. The model performs classification with a cost parameter of 0.5007 and an RBF sigma of 0.0426, and it contains 83 support vectors.

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## ● step_normalize()
## ● step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 0.500710232351587 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0426425309933349 
## 
## Number of Support Vectors : 83 
## 
## Objective Function Value : -34.83 
## Training error : 0.255814 
## Probability model included.

From the final fit produced by the tidymodels workflow, the selected model, trained on the training data and evaluated on the held-out test data, achieves an accuracy of 85.7% and an ROC AUC of 87.5%.
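
One way to obtain those test-set figures is last_fit, which refits the finalized workflow on the training data and evaluates it on the held-out split (a sketch; the report may have called predict directly):

    final_res <- last_fit(final_wflow, split = suic_split, metrics = svm_metrics)
    collect_metrics(final_res)            # accuracy and roc_auc on the test set

    test_preds <- collect_predictions(final_res)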

5 Model Predictions

The plot of the ROC curves for the training and test splits shows one curve for each. Based on the plot, the test data appears to perform at least as well as the training data, indicating that the model did not overfit the training data, as so often happens in model building.
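
A sketch of that train-versus-test ROC comparison, assuming augment() is used to attach class and probability predictions from the fitted workflow:

    bind_rows(
      augment(final_fit, new_data = train_data) %>% mutate(split = "train"),
      augment(final_fit, new_data = test_data)  %>% mutate(split = "test")
    ) %>%
      group_by(split) %>%
      roc_curve(truth = fSuic, .pred_No) %>%
      autoplot()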

For clarity, the confusion matrix of the test-data predictions shows 36 correct predictions out of the 42 instances. As noted previously, predicting “No” for every instance yields an accuracy of 73.8%; this model clearly outperforms that baseline with an accuracy above 85%. The model’s ability to correctly predict 6 “Yes” instances shows that it learned a genuine separating margin in the feature space rather than defaulting to the majority class.

##           Truth
## Prediction No Yes
##        No  30   5
##        Yes  1   6
## [1] 0.8571429
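
The confusion matrix and accuracy above can be reproduced with yardstick on the collected test predictions:

    conf_mat(test_preds, truth = fSuic, estimate = .pred_class)
    accuracy(test_preds, truth = fSuic, estimate = .pred_class)   # 0.857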

The density plot shows the distribution of the predicted probability of “Yes” across the test data set. For instances whose true value is “Yes”, the density is fairly uniform across the range of predicted values. Even so, identifying 6 of the 11 true “Yes” cases indicates the model can be relied upon to some degree for flagging individuals who have attempted suicide. The large spike in density for true “No” instances shows the model has no difficulty predicting “No”. Given the tendency of many of the tuned candidates to simply predict “No” every time, the final tuned model succeeds in using the SVM to genuinely separate the two classes.
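
A sketch of that density plot from the test predictions:

    library(ggplot2)

    ggplot(test_preds, aes(x = .pred_Yes, fill = fSuic)) +
      geom_density(alpha = 0.5) +
      labs(x = "Predicted probability of a suicide attempt (Yes)",
           fill = "True value")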

The final five plots display the predicted probability of “Yes” against each of the individual independent variables, separated by the true values for suicide attempt.
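
Plots of this kind could be built, for example, by attaching predictions to the test data and faceting by variable (a sketch; variable names as above):

    augment(final_fit, new_data = test_data) %>%
      select(.pred_Yes, fSuic, fSex, fAbuse, MD.Q1.Count, MD.Q2, MD.Q3) %>%
      tidyr::pivot_longer(cols = -c(.pred_Yes, fSuic),
                          names_to = "variable", values_to = "value",
                          values_transform = list(value = as.character)) %>%
      ggplot(aes(x = value, y = .pred_Yes, colour = fSuic)) +
      geom_jitter(width = 0.15, alpha = 0.7) +
      facet_wrap(~ variable, scales = "free_x")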
