Although some models have used more than a single risk factor, most research relies on traditional statistical approaches that restrict the number of variables that can be simultaneously examined, creating overly simplistic models (Franklin et al. 2017)
A shift in research is needed to capture the complexities behind adolescent suicide morbidity
Significance of the study
Theoretically, the processes that facilitate suicide morbidity are complex and entail multiple interactions; therefore, any risk factor considered in isolation will be an inaccurate predictor
Using methods with better predictive performance
A shift in the analysis from single risk factors to risk algorithms
Machine learning in Suicidology
35 independent studies used ML to predict suicide-related events
More accurate levels of performance in predictions over traditional statistical methodology
Few studies have used adolescent populations
Research aims
Identify the critical risk factors for adolescent suicide morbidity from a set of 99 risk behavior predictors with machine learning classification algorithms
Identify the best machine learning methodology to classify adolescents who attempted and considered suicide according to its classification performance (Receiver Operating Characteristic Curve, overall accuracy, and the Kappa value)
Compare the performance of an a priori-determined model to models informed by feature selection from LASSO (least absolute shrinkage and selection operator method)
Identify whether the critical risk factors differ between suicide ideation and suicide attempts
Conceives human development as the constant interaction between the individual and the changing environment in which it lives and grows (Bronfenbrenner 1977).
Ontogenic
Sex, race, age
Microsystem
Family members, friends, school
Exosystem
The media, neighborhood
Macrosystem
Economic, social, educational, legal, and political systems
Allows the study of adolescent suicide morbidity as the interaction of multiple risk factors at multiple levels of the adolescent system (Perkins and Hartless 2002)
Moves beyond the tendency to evaluate only individualistic characteristics of adolescents
Will find the function that maps the predictors to the outcomes (Stewart 2020)
The result is an algorithm representing the closest possible match to the behavior of the data, satisfying certain constraints and summarizing what we see in the data
Before the algorithm is tested, model tuning is performed to achieve more accurate predictions
Surveys that monitor health behaviors and experiences among high school students in grades 9–12 attending U.S. public and private schools since 1991 (Underwood et al. 2020)
Combined YRBS High School Dataset (1991-2019)
tidyYRBS
Outcomes:
(Q26) During the past 12 months, did you ever seriously consider attempting suicide?
(Q28) During the past 12 months, how many times did you actually attempt suicide?
The total weighted sample for the Combined YRBS High School Dataset is 14,395,146 cases
From these, 7,159,104 are female, and 7,141,727 are male
The proportion of students who reported attempting suicide in this data is 8%
The proportion of students who considered suicide is 15%
Predictors:
Demographic variables (age, sex, grade, race, sexual identity, site, year)
Questionnaire items (q8-q99)
The main categories included in the survey
Behaviors that contribute to unintentional injury and violence
Tobacco use
Alcohol and other drug use
Sexual behaviors that contribute to unintended pregnancy and STD/HIV infection
Dietary behaviors
Physical inactivity
Logistic Regression, Lasso, K-Nearest Neighbors, Random Forest, Classification and Regression Trees, and Extreme Gradient Boosting will be used to generate the predictive models
The complete dataset will be divided into two datasets: 75% for training and 25% for testing (Kuhn and Silge 2022)
10-fold cross-validation on the training dataset will be used to tune the relevant hyperparameters for each technique (Kuhn and Silge 2022)
The best model will be selected according to the highest area under the receiver operating characteristic curve, overall accuracy, and Kappa value (Kuhn and Silge 2022)
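The split-and-tune workflow above can be sketched in plain Python. This is only an illustration under stated assumptions: the data are synthetic (not YRBS), the classifier is a hypothetical one-predictor nearest-neighbor rule, and the helper names (`train_test_split`, `k_folds`, `tune`) are invented for this sketch rather than taken from any library.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows, then split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_folds(rows, k=10):
    """Yield (analysis, assessment) pairs; every row is held out exactly once."""
    for i in range(k):
        assessment = rows[i::k]
        analysis = [row for j, row in enumerate(rows) if j % k != i]
        yield analysis, assessment

def fit_knn(analysis, k):
    """Hypothetical one-predictor classifier: majority vote of the k nearest rows."""
    def predict(x):
        nearest = sorted(analysis, key=lambda row: abs(row[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > k else 0
    return predict

def accuracy(model, rows):
    """Fraction of rows whose predicted class matches the observed class."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def tune(train, candidates, folds=10):
    """Return the candidate with the highest mean accuracy across the folds."""
    mean_scores = {}
    for k in candidates:
        scores = [accuracy(fit_knn(analysis, k), assessment)
                  for analysis, assessment in k_folds(train, folds)]
        mean_scores[k] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

# Synthetic data (not YRBS): one noisy predictor and a binary outcome.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, 1 if x + rng.gauss(0, 0.15) > 0.5 else 0))

train, test = train_test_split(data)
best_k, cv_scores = tune(train, candidates=(1, 5, 15))
final_model = fit_knn(train, best_k)          # refit on the full training set
test_accuracy = accuracy(final_model, test)   # the test set is used only once
```

The point of the design is that hyperparameters are chosen using only the training portion, so the 25% test set stays untouched until the final evaluation.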
Accuracy:
Is the fraction of predictions our model got right
Kappa:
How closely the instances classified by the machine learning classifier matched the data labeled as the truth.
It adjusts for the fact that some agreement between the raters may occur by chance
Recommended when there is no equal balance between the classes
A high kappa value indicates that the model is making accurate predictions
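A small worked example may help show why Kappa is recommended for unbalanced classes. The counts below are hypothetical (1,000 students, 8% of whom attempted suicide, mirroring the proportion reported for this dataset), and the function name is an assumption for this sketch.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Compute overall accuracy and Cohen's kappa from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # overall accuracy
    # Agreement expected by chance, from the row and column totals.
    chance_yes = ((tp + fp) / n) * ((tp + fn) / n)
    chance_no = ((fn + tn) / n) * ((fp + tn) / n)
    expected = chance_yes + chance_no
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical classifier that predicts "no attempt" for every student.
acc_naive, kappa_naive = accuracy_and_kappa(tp=0, fp=0, fn=80, tn=920)

# Hypothetical classifier that identifies half of the true cases.
acc_model, kappa_model = accuracy_and_kappa(tp=40, fp=30, fn=40, tn=890)
```

The first classifier reaches 92% accuracy while its Kappa is 0, because all of its agreement with the truth is what chance alone would produce. The second classifier has similar accuracy (93%) but a Kappa of roughly 0.50.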
ROC: A receiver operating characteristic curve is a graphical plot that illustrates the diagnostic ability of a binary classifier. It is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
The ROC space for a “better” and “worse” classifier. Source: Receiver operating characteristic. https://en.wikipedia.org/wiki/Receiver_operating_characteristic
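The construction of the ROC curve described above can be sketched as follows. The scores and labels are made up for illustration, and the function names are assumptions for this sketch; the area under the curve is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical predicted probabilities and true classes (1 = event occurred).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```

For these eight made-up cases the area is 0.8125, which equals the share of positive-negative pairs (13 of 16) in which the positive case received the higher score.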
Realistic Example of Machine Learning Results
Opioid Treatment Trials-CTN0094. Final Project. Data Science and Machine Learning for Health Research
To wrap up
Chapter 1: Introduction, literature review and theoretical model
Chapter 2: Methods
Chapter 3:
Results of critical risk factors
Results of best ML model
Results of the comparison of logistic regression and other ML methods
Results for differences of critical risk factors by outcome
Chapter 4: Discussion, limitations, future research
Model the outcome as a linear function of the predictors (Burkov 2019).
The sigmoid function is applied to adjust the predictions to stay between 0 and 1 (Burkov 2019)
The predictors will be selected from past literature modeling YRBSS data (Bae et al. 2005)
Logistic regression gif. Source: Laken, Paul van der. 2020. “Animated Machine Learning Classifiers.” Paulvanderlaken.com. https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/.
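The two steps described above (a linear function of the predictors, followed by the sigmoid) can be sketched as below. The intercept, coefficients, and predictor values are hypothetical placeholders, not estimates from YRBS data.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefficients, predictors):
    """Linear combination of the predictors, passed through the sigmoid."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return sigmoid(linear)

# Hypothetical model with two binary risk indicators.
intercept = -2.0
coefficients = [1.2, 0.8]

p_no_risk = predict_probability(intercept, coefficients, [0, 0])
p_both_risks = predict_probability(intercept, coefficients, [1, 1])
```

With these placeholder values the linear part moves from -2.0 to 0.0 when both indicators are present, so the predicted probability rises to exactly 0.5.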
Select the subset of variables that minimizes prediction error.
Adds a penalty to the residual sum of squares.
The beta coefficients shrink toward zero
This technique will select only relevant coefficients (James et al. 2013).
Lasso Regression. Source: Ridge and Lasso Regression: Insights into regularization techniques. https://medium.com/geekculture/ridge-and-lasso-regression-51705b608fb9
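The shrinkage and selection behavior described above can be illustrated with a minimal coordinate-descent sketch. It assumes a squared-error loss scaled as RSS/(2n) plus penalty times the sum of absolute coefficients; that scaling, the toy data, and the function names are choices made for this sketch. For a binary outcome the same penalty would be added to the logistic loss rather than to the residual sum of squares.

```python
def soft_threshold(value, penalty):
    """Shrink a value toward zero; values within the penalty become exactly zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, sweeps=100):
    """Minimize RSS/(2n) + penalty * sum(|beta_j|) one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0   # correlation of predictor j with the partial residuals
            norm = 0.0  # sum of squares of predictor j
            for row, target in zip(X, y):
                fitted_without_j = sum(row[k] * beta[k] for k in range(p) if k != j)
                rho += row[j] * (target - fitted_without_j)
                norm += row[j] ** 2
            beta[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return beta

# Toy data: the outcome depends strongly on the first predictor, weakly on the second.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2.1, -1.9, 1.9, -2.1]

unpenalized = lasso_coordinate_descent(X, y, penalty=0.0)
penalized = lasso_coordinate_descent(X, y, penalty=0.5)
```

Without a penalty the coefficients are about 2.0 and 0.1. With a penalty of 0.5 the first coefficient shrinks to about 1.5 and the second becomes exactly zero, which is how the weak predictor is dropped from the model.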
Tries to predict the correct class for the test data by calculating the distance between the test data and all the training points.
Gif source: Laken, Paul van der. 2020. “Animated Machine Learning Classifiers.” Paulvanderlaken.com. https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/.
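The distance-based rule described above can be sketched in a few lines. The points, labels, and the choice of Euclidean distance with k = 3 are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(training_points, training_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, new_point), label)
        for point, label in zip(training_points, training_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data with two predictors and a yes/no outcome.
training_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
training_labels = ["no", "no", "no", "yes", "yes", "yes"]

prediction_a = knn_predict(training_points, training_labels, (0.5, 0.5))
prediction_b = knn_predict(training_points, training_labels, (5.5, 5.5))
```

Because every prediction requires a distance to each training point, the number of neighbors k is the main hyperparameter to tune.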
Iterative process that splits the data into partitions or branches, and then continues splitting each partition into smaller groups (Greenwell 2022).
Gif source: Laken, Paul van der. 2020. “Animated Machine Learning Classifiers.” Paulvanderlaken.com. https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/.
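A single splitting step of the partitioning process above might look like the sketch below. It assumes Gini impurity as the splitting criterion, which the slides do not specify, and it uses one made-up predictor; a full tree would apply the same search recursively to each resulting partition.

```python
def gini(labels):
    """Gini impurity of a set of 0/1 labels (0 means the set is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Find the threshold on one predictor that minimizes weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    # The largest value is skipped because it would leave the right branch empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Hypothetical predictor values and binary outcomes.
values = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(values, labels)
```

Here the split at 3 separates the two classes perfectly, so both branches are pure and no further splitting would be needed.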
Random forest consists of hundreds or thousands of independently grown decision trees generated from different bootstrap samples from the training data (Greenwell 2022).
Uses hundreds of trees in the back end and thus results in a more flexible boundary
Gif source: Laken, Paul van der. 2020. “Animated Machine Learning Classifiers.” Paulvanderlaken.com. https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/.
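The bootstrap-and-vote idea above can be sketched with one-split trees. This is a simplification under stated assumptions: the data are synthetic, each tree is a single split, and the random selection of predictors at each split (used by real random forests) is omitted because there is only one predictor. All names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Fit a one-split tree on (x, y) rows by minimizing misclassifications."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_label = round(sum(left) / len(left))
        right_label = round(sum(right) / len(right)) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def fit_forest(rows, n_trees=25, seed=0):
    """Grow each tree independently on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sampling with replacement
        forest.append(fit_stump(sample))
    return forest

def forest_predict(forest, x):
    """Majority vote across all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Synthetic data: the outcome is 1 when the predictor is 10 or more.
rows = [(x, 1 if x >= 10 else 0) for x in range(20)]
forest = fit_forest(rows)
low_prediction = forest_predict(forest, 0)
high_prediction = forest_predict(forest, 19)
```

Because each tree sees a slightly different sample, the individual split points vary, and averaging their votes is what produces the more flexible boundary.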
Same concept as Random Forest, but the trees are built sequentially
Each additional tree added to the model partially corrects the errors made by the previous trees until the maximum number of trees is reached (Burkov 2019)
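The sequential error-correcting idea above can be sketched with a simplified gradient boosting loop. It assumes squared-error loss, one-split regression trees, and a fixed learning rate; it is not the XGBoost algorithm itself, which adds regularization and other refinements. The data and names are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """One-split tree predicting the mean residual on each side of a threshold."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Add trees one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    trees, training_errors = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_regression_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
        training_errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict, training_errors

# Hypothetical predictor and binary outcome with some overlap between classes.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

model, training_errors = boost(xs, ys)
```

The training error recorded after each tree never increases, which reflects the statement that every new tree partially corrects what the earlier trees got wrong. The number of trees and the learning rate are typical hyperparameters to tune by cross-validation.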