1 Prompt

For this assignment, we will be working with a very interesting mental health data set from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet has the data (the tab name “Data”). The data dictionary is given in the second tab. You can get as creative as you want. The assignment is designed to really get you to think about how you could use different methods.

Please use a clustering method to find clusters of patients here. Whether you choose to use k-means clustering or hierarchical clustering is up to you as long as you reason through your work. Can you come up with creative names for the profiles you found? (60)
Let’s explore using Principal Component Analysis on this data set. You will note that there are different types of questions in the data set: column: E-W: ADHD self-report; column X – AM: mood disorders questionnaire, column AN-AS: Individual Substance Misuse; etc. Please reason through your work as you decide on which sets of variables you want to use to conduct Principal Component Analysis. (60)
Assume you are modeling whether a patient attempted suicide (columnAX). Please use support vector machine to model this. You might want to consider reducing the number of variables or somehow use extracted information from the variables. This can be a really fun modeling task! (80)

# Import required R libraries
library(kernlab)
library(caret)
theme_set(theme_classic())

library(tidyverse)
library(tidymodels)
library(skimr)
library(vip)
library(rpart.plot)

2 Data Extraction and Variable Selection

In researching the specifics of the attention deficit hyperactivity disorder (ADHD) questionnaire (https://add.org/wp-content/uploads/2015/03/adhd-questionnaire-ASRS111.pdf) and the mood disorder (MD) questionnaire (https://www.ohsu.edu/sites/default/files/2019-06/cms-quality-bipolar_disorder_mdq_screener.pdf), derived variables were created and assessed in an attempt to better define the diagnosis of the patient providing the survey answers. Through trial and error the mood disorder questionnaire provided value in determining a suicide attempt. According to the survey explanation, an answer of Yes (1) to seven or more events in question 1, a Yes answer to question 2 and finally a two or three response to question 3 warrants a recommendation for further medical assessment for bipolar disorder. Based on this relationship in the question, a count of the Yes answers to question 1 is calculated. That calculation along with the results of question two and question three are used as inputs to the SVM model. Based on the PCA, abuse and sex were also identified as significant to the determination of suicide attempt. The ADHD total also showed significance to the SVM model, but after several trials, the ADHD total variable was removed as the variable did not improve the model test results.

Create a variable for the count of Yes answers to the events from MD question 1. Also, Sex, Abuse, and MD.Q2 are defined as factor variables instead of character variables. Finally, the dependent variable, Suicide, is also defined as a factor data type. The SVM model requires character or factor variables along with numeric variables.

data <- read_csv("New_adhd.csv")

data$MD.Q1.Count <- 0
data$MD.Q1.Count <- ifelse(data$MD.Q1a == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1b == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1c == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1d == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1e == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1f == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1g == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1h == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1i == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1j == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1k == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1L == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)
data$MD.Q1.Count <- ifelse(data$MD.Q1m == 1, data$MD.Q1.Count+1, data$MD.Q1.Count)

data$fSex <- as.factor(data$fSex)
data$fSuic <- as.factor(data$fSuic)
data$fAbuse <- factor(data$fAbuse, levels=c("0","1","2","3","4","5","6","7"), ordered=TRUE)
data$MD.Q2 <- as.factor(data$MD.Q2)

skim(data)

Data summary
Name	data
Number of rows	171
Number of columns	53
_______________________
Column type frequency:
character	4
factor	4
numeric	45
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
fRace	1	2	2	4
fCO	1	2	3	2
fHViol	1	2	3	2
fDCond	1	2	3	2

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
MD.Q2	1	FALSE	2	1: 125, 0: 46
fSex	1	FALSE	2	M: 95, F: 76
fSuic	1	FALSE	2	No: 122, Yes: 49
fAbuse	1	TRUE	8	0: 111, 2: 20, 5: 10, 1: 8

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Age	1	39.19	11.03	18	29.0	41	48.0	61	▅▅▅▇▃
ADHD.Q1	1	1.70	1.30	0	1.0	2	3.0	4	▇▇▇▆▃
ADHD.Q2	1	1.91	1.25	0	1.0	2	3.0	4	▅▇▇▆▃
ADHD.Q3	1	1.92	1.27	0	1.0	2	3.0	4	▅▇▇▆▅
ADHD.Q4	1	2.12	1.34	0	1.0	2	3.0	4	▅▅▇▅▆
ADHD.Q5	1	2.26	1.45	0	1.0	3	3.0	5	▇▅▇▆▁
ADHD.Q6	1	1.91	1.31	0	1.0	2	3.0	4	▆▅▇▇▃
ADHD.Q7	1	1.84	1.19	0	1.0	2	3.0	4	▃▇▇▃▃
ADHD.Q8	1	2.16	1.29	0	1.0	2	3.0	4	▃▇▇▇▆
ADHD.Q9	1	1.92	1.32	0	1.0	2	3.0	4	▆▇▆▇▅
ADHD.Q10	1	2.14	1.23	0	1.0	2	3.0	4	▂▇▇▅▅
ADHD.Q11	1	2.29	1.24	0	1.0	2	3.0	4	▂▆▇▇▆
ADHD.Q12	1	1.29	1.20	0	0.0	1	2.0	4	▇▇▆▂▂
ADHD.Q13	1	2.37	1.24	0	1.5	2	3.0	4	▂▅▇▇▆
ADHD.Q14	1	2.26	1.35	0	1.0	2	3.0	4	▅▅▆▇▆
ADHD.Q15	1	1.64	1.40	0	0.0	1	3.0	4	▇▆▆▅▃
ADHD.Q16	1	1.73	1.38	0	1.0	1	3.0	4	▆▇▆▃▅
ADHD.Q17	1	1.53	1.29	0	0.0	1	2.0	4	▇▇▇▃▃
ADHD.Q18	1	1.49	1.31	0	0.0	1	2.0	4	▇▇▆▃▃
ADHD.Total	1	34.51	16.72	0	21.0	34	48.0	72	▃▆▇▆▂
MD.Q1a	1	0.55	0.50	0	0.0	1	1.0	1	▆▁▁▁▇
MD.Q1b	1	0.57	0.50	0	0.0	1	1.0	1	▆▁▁▁▇
MD.Q1c	1	0.54	0.50	0	0.0	1	1.0	1	▇▁▁▁▇
MD.Q1d	1	0.58	0.50	0	0.0	1	1.0	1	▆▁▁▁▇
MD.Q1e	1	0.56	0.50	0	0.0	1	1.0	1	▆▁▁▁▇
MD.Q1f	1	0.70	0.46	0	0.0	1	1.0	1	▃▁▁▁▇
MD.Q1g	1	0.73	0.45	0	0.0	1	1.0	1	▃▁▁▁▇
MD.Q1h	1	0.57	0.50	0	0.0	1	1.0	1	▆▁▁▁▇
MD.Q1i	1	0.60	0.49	0	0.0	1	1.0	1	▆▁▁▁▇
MD.Q1j	1	0.39	0.49	0	0.0	0	1.0	1	▇▁▁▁▅
MD.Q1k	1	0.48	0.50	0	0.0	0	1.0	1	▇▁▁▁▇
MD.Q1L	1	0.58	0.49	0	0.0	1	1.0	1	▆▁▁▁▇
MD.Q1m	1	0.49	0.50	0	0.0	0	1.0	1	▇▁▁▁▇
MD.Q3	1	2.02	1.06	0	1.0	2	3.0	3	▂▃▁▅▇
MD.TOTAL	1	10.08	4.75	0	7.0	11	14.0	17	▃▃▆▇▇
Alcohol	1	1.35	1.39	0	0.0	1	3.0	3	▇▂▁▁▆
THC	1	0.81	1.27	0	0.0	0	1.5	3	▇▁▁▁▃
Cocaine	1	1.09	1.39	0	0.0	0	3.0	3	▇▁▁▁▅
Stimulants	1	0.12	0.53	0	0.0	0	0.0	3	▇▁▁▁▁
Sedative.hypnotics	1	0.12	0.54	0	0.0	0	0.0	3	▇▁▁▁▁
Opioids	1	0.39	0.99	0	0.0	0	0.0	3	▇▁▁▁▁
Education	1	11.91	2.14	6	11.0	12	12.5	19	▁▅▇▂▁
fNonDx	1	0.39	0.65	0	0.0	0	1.0	2	▇▁▂▁▁
fSubsDx	1	1.35	1.05	0	1.0	1	2.0	3	▆▇▁▅▅
MD.Q1.Count	1	7.35	3.91	0	4.0	8	11.0	13	▅▅▅▇▇

Based on the final imputed data set, a total of 171 instances exist with Suicide attempts accounting for 49 of the total.

# of the 171, Suicide Yes=49, Suicide No=122
table(data$fSuic)

## 
##  No Yes 
## 122  49

The data set is initially split into a training and test with 75% of the data used in the training set which equals 129 of the 171 overall instances. The test set consists of 42 instances.

# Split into training and test data
set.seed(123)
# split the data into training (75%) and testing (25%)
data_split <- initial_split(data, prop = .75)
data_split

## <Analysis/Assess/Total>
## <129/42/171>

data_train <- training(data_split)
data_test <- testing(data_split)

data_train

# Create CV object from training data
data_cv <- vfold_cv(data_train, v=5, repeats=3)

3 Model Definition

Using tidymodels, the recipe function is used to define the independent variables, Sex, Abuse, MD.Q1.Count, MD.Q2, and Md.Q3 in relationship to the dependent variable, Suicide. In order to properly prepare the numeric and factor data, the recipe includes a step to normalize the numeric data and create dummy variables for the factor data. These steps will ensure proper balance in the measurements of the SVM model.

# https://www.tidymodels.org/learn/work/tune-svm/
# 0.8571429 accuracy
data_recipe <- recipe(fSuic ~ fSex + MD.Q1.Count + MD.Q2 + MD.Q3 + fAbuse, data = data) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())

In order to evaluate the different versions of the SVM models defined, the metrics of ROC AUC, accuracy, and kappa are used.

# using roc, accuracy and kappa
roc_vals <- metric_set(roc_auc, accuracy, kap)

# Verbose turned off, save out-of-sample predictions is turned on
ctrl <- control_grid(verbose=F, save_pred=T)

# SVM Model with Polynomial
svm_model <- svm_poly(cost = tune(),
                      degree = tune()) %>% 
  set_engine("kernlab") %>% 
  set_mode("classification") %>% 
  translate()

Given the test data set contains 42 instances, and with the current seed value, 31 of the 42 are No for Suicide. By simply picking No, for every selection, an accuracy of 73.8% would be achieved. This accuracy gives a baseline in which to find a more accurate model.

With tidymodels, the SVM model is available using a polynomial kernel or a radial basis function (RBF) kernel. The RBF kernel approach was selected as the function measures the distance between the feature vectors. The kernel value decreases with distance and measures on a scale from zero to one. With the small size of the data set and limited number of variables, the RBF kernel outperformed the polynomial kernel option.

The tidymodels approach allowed for parameter tuning, which in the case of the RBF kernel, the two parameters to the model are cost and rbf_sigma.

cost: The cost of predicting an instance within or outside of the margin
rbf_sigma: The precision parameter for the radial basis function

With the opportunity to tune the parameters, a cross-fold validations setting of five folds and three repeats, resulting in 15 separate model combinations proved the most effective in terms of the final result while also keeping the model as simple as possible.

# SVM Model with Radial Basis Function
svm_model <- svm_rbf(cost = tune(),
                     rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab") %>% 
  translate()

# Generate formula
formula_res <- svm_model %>%
  tune_grid(data_recipe,
    resamples = data_cv,
    metrics = roc_vals,
    control = ctrl
  )

#formula_res

4 Model Evaluation

The parameter tuning ran a five-fold cross-validation repeated three times. The marginal plot shows each predictor plotted against performance. Both parameters are presented in different log scales to ensure the values along the x-axis are spread evenly for display purposes. As the plot shows, the performance of the two parameters based on the ROC AUC is consistently above 0.650 with several points above 0.700.

The output of the raw metrics shows very poor performance for kappa, with only a couple tuned models resulting in a value above zero. The best results for accuracy show a consistent 0.705, which realistically represents the selection of “No” for every instance.

Finally, the output for the best ROC AUC indicates good performance with the top five results greater than 0.67 and indicate a potential varying selection method as compared to the accuracy metric.

# Execute with a recipe
set.seed(123)
recipe_res <- svm_model %>%
  tune_grid(data_recipe,
    resamples = data_cv,
    metrics = roc_vals,
    control = ctrl
  )
#recipe_res

# https://www.tidyverse.org/blog/2020/07/tune-0-1-1/
autoplot(recipe_res, metric="roc_auc") +
  ggtitle("Parameter Tuning Results")

# Display top combinations
show_best(recipe_res, metric = "kap")

# Display top combinations
show_best(recipe_res, metric = "accuracy")

# Best setting
show_best(recipe_res, metric = "roc_auc")

# Set the workflow
svm_wf <- workflow() %>%
  add_recipe(data_recipe) %>%
  add_model(svm_model)

Based on previous modeling attempts, the kappa metric was used to determine the best tuned model before performing the final predictions. As the kappa results consistently resulted in low values, the decision was made to base the model selection on the ROC AUC metric. In repeated model attempts, the ROC AUC metric proved better at selecting a model with higher accuracy on the test data set. The SVM model based on the RBF kernel with the best ROC AUC contains a cost parameter of 0.501 and an rbf_sigma value of 0.0426.

# Select best model based on roc_auc
best_svm <- recipe_res %>%
  select_best(metric = 'roc_auc')

# view the best svm parameters
best_svm

# finalize workflow
final_svm_wf <- svm_wf %>%
  finalize_workflow(best_svm)

The summary of the final SVM model object reiterates the findings and decisions previously described. An SVM model based on the RBF kernel with two recipe steps: normalization of numeric data and creation of dummy variables for factor data. The SVM model will perform classification with a cost parameter value of 0.5007 and an RBF sigma parameter value of 0.0426. The model contains 83 support vectors.

# fit the model
svm_wf_fit <- final_svm_wf %>%
  fit(data = data_train)

#https://stackoverflow.com/questions/62772397/integration-of-variable-importance-plots-within-the-tidy-modelling-framework
svm_wf_fit

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## ● step_normalize()
## ● step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 0.500710232351587 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0426425309933349 
## 
## Number of Support Vectors : 83 
## 
## Objective Function Value : -34.83 
## Training error : 0.255814 
## Probability model included.

From the final model fit using the tidymodels workflow process, the fit of the selected model on the trained data results in an accuracy of 85.7% and an ROC AUC of 87.5%.

# train and evaluate
svm_last_fit <- final_svm_wf %>%
  last_fit(data_split)

svm_last_fit %>% collect_metrics()

5 Model Predictions

The ROC AUC of the train and test plot indicate the curves for each data split. Based on the plot, the test data appears to outperform the training data. This result shows overfitting did not occur on the training data as often is the case in model creation.

# https://fahadtaimur.wordpress.com/2020/07/19/tuning-svm-in-r-2/
scored_train <- predict(svm_wf_fit, data_train, type="prob") %>%
    bind_cols(predict(svm_wf_fit, data_train, type="class")) %>%
    bind_cols(.,data_train)

scored_test <- predict(svm_wf_fit, data_test, type="prob") %>%
      bind_cols(predict(svm_wf_fit, data_test, type="class")) %>%
      bind_cols(., data_test) 

scored_train %>%
  mutate(model = "train") %>%
  bind_rows(scored_test %>%
              mutate(model="test")) %>%
  group_by(model) %>%
  roc_curve(fSuic, .pred_No) %>%
  autoplot() %>%
    print()

For clarity, the confusion matrix of the test data predictions indicates a total of 36 correct selections of the 42 instances. As noted previously, a prediction of “No” for every instance results in an accuracy of 73.8%. This model clearly outperforms the baseline model with an accuracy greater than 85%. The ability of the model to correctly predict 6 “Yes” instances shows the model truly created a maximum margin between the variables in the vector planes.

svm_predictions <- svm_last_fit %>% collect_predictions()

conf_mat(svm_predictions, truth = fSuic, estimate = .pred_class)

##           Truth
## Prediction No Yes
##        No  30   5
##        Yes  1   6

svm_predictions

mean(svm_predictions$.pred_class == svm_predictions$fSuic)

## [1] 0.8571429

The density plot shows the frequency of the predicted “Yes” values across the test data set. The plot does indicate for those true values of “Yes”, the density was quite consistent across the predicted “Yes” values. Given this realization, the ability to identify 6 of the 11 true “Yes” results indicates the model can be relied upon to some degree in identifying individuals who have attempted suicide. With the large spike in density for true “No” instances, the model does not suffer in predicting “No” values. Given the propensity of the tuned models to simply always predict “No”, the final tuned model achieved success in utilizing the SVM method to properly separate instances based on the classification approach.

svm_predictions %>%
  ggplot() +
  geom_density(aes(x = .pred_Yes, fill = fSuic),
               alpha = 0.5)

The final five plots display the performance of the predicted “Yes” value against the individual independent variables separated by the true values for Suicide Attempt.

The MD.Q1.Count plot shows the values of seven or greater are a reasonable predictor of Suicide Attempt and the SVM model reflects the same.
The MD.Q2 plot indicates that for a true value of Yes, the variable value is more likely to be 1. A variable value of 1 does not directly indicate a suicide attempt, but a value of 0 very likely indicates no suicide attempt.
The MD.Q3 plot shows a similar understanding as MD.Q1.Count, in which, for those with a suicide attempt, the value is more likely to be higher, 2 or 3 for this variable.
The fSex plot does show for suicide attempt a higher volume of female instances. Assessing the true No side of the plot, male instances due result in low predicted Yes values.
The fAbuse plot clearly shows for those with higher levels of abuse, a predicted “Yes” can result.

# https://www.tidymodels.org/learn/work/tune-svm/
augment(recipe_res) %>%
  ggplot(aes(MD.Q1.Count, .pred_Yes, color = fSuic)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~fSuic)

augment(recipe_res) %>%
  ggplot(aes(MD.Q2, .pred_Yes, color = fSuic)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~fSuic)

augment(recipe_res) %>%
  ggplot(aes(MD.Q3, .pred_Yes, color = fSuic)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~fSuic)

augment(recipe_res) %>%
  ggplot(aes(fSex, .pred_Yes, color = fSuic)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~fSuic)

augment(recipe_res) %>%
  ggplot(aes(fAbuse, .pred_Yes, color = fSuic)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~fSuic)

http://www.rebeccabarter.com/blog/2020-03-25_machine_learning/ https://stackoverflow.com/questions/62772397/integration-of-variable-importance-plots-within-the-tidy-modelling-framework https://github.com/tidymodels/parsnip/issues/311 https://www.tidymodels.org/learn/work/tune-svm/ https://stackoverflow.com/questions/8287344/cannot-plot-graph-for-an-svm-model-in-r https://stackoverflow.com/questions/23613952/support-vector-machine-visualization-in-r https://stackoverflow.com/questions/1142294/how-do-i-plot-a-classification-graph-of-a-svm-in-r

Data 622 Homework 4: Mental Health Data Modeling

Group 6: Alexander Ng, Scott Reed, Philip Tanofsky, Randall Thompson

Submitted by 05/07/2021

1 Prompt

2 Data Extraction and Variable Selection

3 Model Definition

4 Model Evaluation

5 Model Predictions