The use of A.I. and Machine Learning in Suicide Surveillance


Literature review

  • Suicide prediction models can distinguish high-risk individuals fairly well (OR = 7.7)
  • Machine learning may perform better than clinical judgement, but the evidence isn’t statistically conclusive
  • Using more risk factors improves machine learning models
  • Results varied widely between studies, so caution is needed in interpreting or applying any single model

These results suggest a possible quantitative approach to reducing suicide rates through interventions that target social vulnerability.


Literature Review: AI-Based Suicide Surveillance Using Aggregate Demographic and Socioeconomic Data

Artificial Intelligence (AI) and Machine Learning (ML) are increasingly being employed to enhance suicide surveillance and prevention by analysing large-scale, multidimensional datasets.

Recent research emphasizes the value of integrating aggregate-level demographic, socioeconomic, and law enforcement data into predictive models. This review synthesizes current approaches, data sources, and methodologies applied in this domain.


Community-Level Risk Modeling

Several studies have applied AI and statistical modeling to explore the relationship between socioeconomic vulnerability and suicide rates at the regional or county level.

  • Cottler-Casanova et al. (2023) used county-level U.S. data to assess the association between suicide mortality and two measures, the CDC’s Social Vulnerability Index and the Social Vulnerability Metric. Comparing counties in the lowest 10% and highest 10% of each measure, the suicide rate in the most vulnerable counties was 56% higher for the Social Vulnerability Index and 82% higher for the Social Vulnerability Metric.

(https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2804213)

  • Edwards et al. (2022) investigated New York State suicide rates using demographic and police-reported variables. The study found that lower income, race/ethnicity, disability prevalence, and household language proficiency were significant correlates of suicide and self-harm hospitalisation rates.

(https://www.sciencedirect.com/science/article/abs/pii/S016517812200021X?via%3Dihub)

  • Sayanti et al. employed machine learning (random forest, XGBoost) on U.S. county-level data to model suicide trends across urban–rural gradients. Socioeconomic features such as unemployment and median income showed strong predictive value in non-urban areas.

(https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0258824)
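The decile comparison reported for the social vulnerability measures above can be illustrated with a short R sketch. The function name, the toy data, and the use of unweighted county means are all assumptions made for illustration; the cited study may have computed its rates differently.

```r
# Percentage by which the mean rate in the top tail of an index exceeds
# the mean rate in the bottom tail. Unweighted county means are an
# assumption of this sketch, not the cited study's method.
pct_higher_top_vs_bottom <- function(index, rate, tail = 0.10) {
  cuts <- quantile(index, probs = c(tail, 1 - tail), names = FALSE)
  low_rate  <- mean(rate[index <= cuts[1]])
  high_rate <- mean(rate[index >= cuts[2]])
  100 * (high_rate - low_rate) / low_rate
}

# Hypothetical example: ten counties ranked by a vulnerability index
index <- 1:10
rate  <- c(10, rep(12, 8), 15.6)
pct_higher <- pct_higher_top_vs_bottom(index, rate)  # roughly 56 (% higher)
```

A figure such as "56% higher" is a relative difference between two groups of counties, so it says nothing about individual-level risk.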


AI with Linked Administrative and Health Data

AI techniques have also been applied to linked health and administrative datasets, incorporating regional socioeconomic indicators to forecast population-level suicide risk.

  • Kharrat et al. constructed synthetic suicide risk estimates in Quebec using administrative health data enriched with regional socioeconomic indicators (e.g., education, housing, income). This approach identified underserved areas and gaps in suicide prevention services.

(https://www.suicideinfo.ca/wp-content/uploads/2024/06/Explainable-artificial-intelligence-models-for-predicting-risk-of-suicide-using-health-administrative-data-in-Quebec.pdf)


Public Surveillance Dashboards Using Aggregate Data

Real-time suicide surveillance systems now leverage AI and statistical alerting on aggregated datasets.

  • The New York State Department of Health hosts a public dashboard integrating hospital discharge data, mortality records, and demographics to track self-harm and suicide trends. It enables threshold-based alerts by county and age group.

(https://nyshc.health.ny.gov/web/nyapd/suicides-in-new-york)

(https://tewhatuora.shinyapps.io/suicide-web-tool/)
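A threshold-based alert of the kind described for these dashboards might look like the following R sketch. The rule used here (baseline mean plus k standard deviations) and the function name flag_alert are assumptions for illustration; the dashboards' actual alerting rules are not documented in this review.

```r
# Flag the latest count when it exceeds the baseline mean plus k
# standard deviations. This is a generic rule, not the dashboards' own.
flag_alert <- function(baseline, latest, k = 2) {
  threshold <- mean(baseline) + k * sd(baseline)
  list(threshold = threshold, alert = latest > threshold)
}

# Hypothetical monthly counts for one county and age group
baseline <- c(10, 12, 11, 9, 13)
result <- flag_alert(baseline, latest = 15)
result$alert
```

In practice the same function could be applied separately to each county and age group, for example after splitting the data with split().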


Limitations

These studies demonstrate that incorporating aggregate-level socioeconomic and demographic data, especially when linked with health, police, and administrative sources, enhances the granularity and predictive accuracy of suicide surveillance models. AI methods such as machine learning, Bayesian modeling, and NLP can identify high-risk subpopulations and geographic regions, support targeted interventions, and improve resource allocation.

However, limitations include:

  • Lack of real-time access to high-quality police or crime data in some jurisdictions

  • Gaps in capturing intersectional variables (e.g. ethnicity, income and rurality)

  • Limited explainability of some machine learning models in public health contexts


Literature review summary

AI-enhanced suicide surveillance is a rapidly evolving field, with increasing integration of diverse data sources including police reports, social determinants, and regional socioeconomic indicators. Continued development of transparent models, responsible data integration, and public-facing dashboards holds promise for improving prevention strategies and health equity.



Synthetic data

  • Synthetic data are artificially generated data not produced by real-world events.
  • Used to validate mathematical models and to train machine learning models
  • Should not be used for analysis or to inform operational decisions

Synthetic data generation
  • To explore the potential uses of AI and machine learning for suicide surveillance and prediction, I’ve generated an aggregate time-series dataset covering a range of relevant socioeconomic and demographic indicators for the period 2015-2025, segmented by county and sex.

  • The factors were chosen based on previous work by XXX, XXX and XXX.

  • An alternative approach is to look at individual-level clinical data, as the greatest predictor of future suicide attempts is previous suicide attempts (XXX, XXX)
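A county by year by sex panel of this shape could be generated along the following lines. This is a minimal sketch: the three-county subset, the distributions, and their parameters are arbitrary assumptions and are not the values used to build the dataset in this report.

```r
set.seed(42)  # make the simulated values reproducible

# Panel structure: one row per county, year, and sex
counties <- c("Carlow", "Cavan", "Clare")  # subset for brevity
panel <- expand.grid(
  county = counties,
  year   = 2015:2025,
  sex    = c("Female", "Male"),
  stringsAsFactors = FALSE
)
n <- nrow(panel)

# Indicator values drawn from arbitrary distributions (illustrative only)
panel$median_income        <- round(rnorm(n, mean = 35000, sd = 3000))
panel$unemployment_rate    <- round(runif(n, min = 4, max = 12), 2)
panel$suicide_attempt_rate <- round(rnorm(n, mean = 180, sd = 35), 1)

head(panel)
```

In this sketch the outcome is drawn independently of the indicators, so any model fitted to it would be expected to find no real signal. A generator intended to test whether models can recover known effects would need to build those effects into the outcome.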


Synthetic data explorer

To get a sense of the data, you can use this data explorer to compare the different indicators by county and sex.

Explore the Synthetic Data Explorer Dashboard



Linear regression model

Overview

Explains variation in suicide attempt rates across Irish counties and years using known correlates like:

  • economic hardship

  • healthcare access

  • substance misuse

  • violence/crime exposure

Can be used to identify risk factors and inform policy (e.g., which factors most influence suicide risk)

Serves as a baseline model for forecasting or clustering in more advanced analyses


Model Summary

This analysis of the synthetic data estimates the relationship between suicide attempt rate and various socioeconomic, healthcare, and demographic factors in Ireland between 2015 and 2025 using multiple linear regression.


Model Equation

\[ \text{suicide attempt rate} = \beta_0 + \beta_1 \cdot \text{median income} + \beta_2 \cdot \text{unemployment rate} + \beta_3 \cdot \text{education level} + \ldots + \beta_{n} \cdot \text{year} + \epsilon \]

Where:

  • \( \beta_0 \) is the intercept,
  • \( \beta_1 \) to \( \beta_n \) are the coefficients of the explanatory variables,
  • \( \epsilon \) is the error term.
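Assuming the model was fitted in R (the sexMale term in the regression output suggests R's factor coding), an equation of this form can be estimated with lm(). The data frame below is simulated purely so that the example runs; its columns and values are hypothetical and only a subset of the predictors is included.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the synthetic dataset (hypothetical values)
d <- data.frame(
  median_income     = rnorm(n, 35000, 3000),
  unemployment_rate = runif(n, 4, 12),
  education_level   = runif(n, 0.6, 0.9),
  sex               = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  year              = sample(2015:2025, n, replace = TRUE)
)
d$suicide_attempt_rate <- rnorm(n, 180, 35)

# Multiple linear regression: one coefficient per term, plus an intercept
fit <- lm(
  suicide_attempt_rate ~ median_income + unemployment_rate +
    education_level + sex + year,
  data = d
)

# Estimates, standard errors, t values and p values for each term
coef_table <- summary(fit)$coefficients
coef_table
```

R expands a factor such as sex into indicator terms relative to a reference level, which is why the output contains sexMale rather than sex.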


Insights from the model

  • Only drug offences are statistically significant at the 5% level (p = 0.0346), with a negative estimated association with suicide attempt rate.
  • Longford shows a marginal effect (p = 0.0726), suggesting a potential positive regional association.
  • All other predictors—including median income, unemployment, education, GP density, alcohol use, and mental health access—were not statistically significant, indicating weak evidence of their individual influence on suicide attempt rate in this model.
  • The model has low explanatory power, with R² = 0.0615 and Adjusted R² = -0.0073, implying that it explains very little of the variance in the suicide attempt rate.
  • The overall model is not statistically significant (F(39, 532) = 0.8946, p = 0.655).
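The reported fit statistics are mutually consistent, as the following R check shows. It uses only the figures quoted above; the sample size is inferred from the degrees of freedom of the F test, which matches 26 counties, 11 years, and 2 sexes. Small differences from the reported values are expected because R² is rounded to four decimal places.

```r
r2       <- 0.0615  # reported R-squared
p        <- 39      # numerator degrees of freedom (number of predictors)
df_resid <- 532     # denominator (residual) degrees of freedom
n        <- df_resid + p + 1  # observations implied by F(39, 532)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# F statistic for the overall model, and its upper-tail p value
f_stat  <- (r2 / p) / ((1 - r2) / df_resid)
p_value <- pf(f_stat, p, df_resid, lower.tail = FALSE)

c(n = n, adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value)
```

A negative adjusted R² arises when the predictors explain less variance than would be expected from their number alone.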

Regression Output Summary

Summary of Key Regression Coefficients
Term                       Estimate  Std_Error  t_value  p_value
(Intercept)                 580.600    700.700    0.829    0.408
median_income                 0.000      0.000   -0.701    0.484
unemployment_rate             0.055      0.426    0.130    0.897
education_level               2.626     10.890    0.241    0.810
gp_density                    5.762      5.576    1.033    0.302
mental_health_access         -1.343      4.785   -0.281    0.779
deprivation_index            -0.076      0.121   -0.631    0.528
alcohol_consumption          -0.187      0.758   -0.247    0.805
drug_admissions               0.105      0.246    0.428    0.669
alcohol_harm_rate             0.037      0.028    1.323    0.187
domestic_violence_reports    -0.076      0.096   -0.788    0.431
drug_offences                -0.241      0.114   -2.119    0.035
assaults                      0.002      0.113    0.017    0.986
sexMale                       0.722      2.168    0.333    0.739
year                         -0.188      0.346   -0.542    0.588

Limitations

This model suggests that, after controlling for a broad range of socioeconomic and regional variables, most individual predictors are not statistically significant. The notable exception is drug-related offences, which exhibit a negative relationship with suicide attempt rates. This could reflect complex dynamics where higher drug offence rates signal more law enforcement activity or underreporting of mental health crises. However, with 39 predictors tested at the 5% level, about two significant results would be expected by chance alone, and the overall model is not significant, so this single finding should be treated with caution.

The low R² indicates that the model likely omits important predictors or that the relationship is nonlinear. The analysis may therefore benefit from alternative modeling approaches such as machine learning methods (e.g., neural networks, k-means clustering).



Neural networks

Overview

Neural networks are capable of learning and identifying patterns directly from data without pre-defined rules.


Model summary: layers in neural network architecture

Input Layer: Each input neuron corresponds to one feature in the input data.

Hidden Layer: Performs the computational processing.

Output Layer: Produces the final output of the model.

Purpose: Predict suicide rates or suicide attempts using socio-economic, health, and demographic features.
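The three layers can be made concrete with a forward pass written in base R. This is a sketch of a generic single-hidden-layer network with a logistic activation and a linear output; the activation function, the layer sizes, and all names here are assumptions, since the report does not state the architecture that was fitted.

```r
# Logistic (sigmoid) activation used in the hidden layer (assumed)
sigmoid <- function(z) 1 / (1 + exp(-z))

# x:        n x p matrix of input features (one column per input neuron)
# w_hidden: p x h matrix of input-to-hidden weights
# b_hidden: length-h vector of hidden biases
# w_out:    h x 1 matrix of hidden-to-output weights
# b_out:    single output bias
forward_pass <- function(x, w_hidden, b_hidden, w_out, b_out) {
  hidden <- sigmoid(sweep(x %*% w_hidden, 2, b_hidden, "+"))  # hidden layer
  hidden %*% w_out + b_out                                    # linear output
}

# Hypothetical example: 2 observations, 3 features, 2 hidden neurons
x        <- matrix(c(0.2, 0.5, 0.1, 0.9, 0.3, 0.7), nrow = 2, byrow = TRUE)
w_hidden <- matrix(0, nrow = 3, ncol = 2)
w_out    <- matrix(c(1, 1), ncol = 1)
pred     <- forward_pass(x, w_hidden, b_hidden = c(0, 0), w_out, b_out = 0)
```

Training consists of adjusting the weight matrices and biases so that the predictions move closer to the observed values.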


Model Evaluation Metrics

The following metrics were used to evaluate model performance:

Model Evaluation Metrics
Metric                           Value
Mean Squared Error (MSE)         1184.8000
Root Mean Squared Error (RMSE)   34.4200
R-squared                        0.0055

Plot neural network

plot(nn_model)

Predicted vs actual suicide rates

Evaluation of the model

MSE (1184.80) and RMSE (34.42) indicate a typical prediction error of about 34, which is large relative to suicide attempt rate values (the county-level means in the cluster summary table range from about 177 to 184).

R-squared (0.0055) is extremely low, suggesting the model explains less than 1% of the variance in the target variable. This implies poor predictive performance and may reflect either noise in the synthetic dataset or that important nonlinear dynamics or features are missing.
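For reference, the three metrics can be computed from observed and predicted values as in the R sketch below. The function name is hypothetical, and R-squared is computed here as one minus the ratio of residual to total sum of squares; the report does not state which definition was used.

```r
# Mean squared error, its square root, and R-squared for a set of predictions
regression_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)     # residual sum of squares
  sst <- sum((actual - mean(actual))^2)  # total sum of squares
  mse <- sse / length(actual)
  c(MSE = mse, RMSE = sqrt(mse), R2 = 1 - sse / sst)
}

# Hypothetical example: predicting the mean for every observation
actual  <- c(1, 2, 3, 4)
metrics <- regression_metrics(actual, rep(mean(actual), 4))
metrics
```

Predicting the mean for every observation gives an R-squared of exactly zero under this definition, which is a useful reference point for the value of 0.0055 reported above.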

Variable Importance (Garson’s Algorithm)

The importance of each predictor was calculated using Garson’s algorithm:

                    Variable Importance
13                  assaults 0.14728132
8        alcohol_consumption 0.14614815
12             drug_offences 0.09630123
9            drug_admissions 0.09285837
10         alcohol_harm_rate 0.08946781
4            education_level 0.07405959
3          unemployment_rate 0.07129090
6       mental_health_access 0.06406725
7          deprivation_index 0.06332271
5                 gp_density 0.06231656
11 domestic_violence_reports 0.04874986
2              median_income 0.03012530
1                        sex 0.01401095

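Assaults and alcohol consumption have the highest relative importance in the model (about 0.15 each). Because the model explains less than 1% of the variance, this ranking describes how the network distributes its weights rather than evidence of a strong association with suicide attempt rates in the synthetic dataset.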

Variables like sex and median income show relatively low influence in this model, though this may be due to their distribution or encoding in the synthetic data.

While the neural network identified a few dominant predictors (e.g., assaults, alcohol consumption), the model’s overall performance is weak. The low R-squared suggests the model is not suitable for reliable prediction and should be used only for exploratory purposes. Further refinement, feature engineering, or alternative modeling approaches may be necessary to achieve better predictive power.
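One commonly cited form of Garson's algorithm for a network with a single hidden layer and a single output can be written in a few lines of base R. This is a sketch under that assumption; the function name is hypothetical, bias weights are ignored, and the package implementation used to produce the table above may differ in detail.

```r
# w_in:  inputs x hidden matrix of input-to-hidden weights (no bias row)
# w_out: vector of hidden-to-output weights (no bias term)
garson_importance <- function(w_in, w_out) {
  # Share of each input in each hidden neuron's total absolute incoming weight
  share <- sweep(abs(w_in), 2, colSums(abs(w_in)), "/")
  # Scale each hidden neuron's shares by its absolute output weight
  contrib <- sweep(share, 2, abs(w_out), "*")
  # Sum over hidden neurons and normalise so the importances sum to 1
  importance <- rowSums(contrib)
  importance / sum(importance)
}

# Hypothetical weights: 2 inputs, 2 hidden neurons
w_in  <- matrix(c(3, 1,
                  1, 1), nrow = 2, byrow = TRUE)
w_out <- c(2, 1)
imp   <- garson_importance(w_in, w_out)
```

Because the method uses absolute weights, it ranks predictors by magnitude only and gives no information about the direction of an effect.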


Limitations to neural networks

Neural networks are “black boxes”, which means that their individual weights cannot be interpreted in the way regression coefficients can.

Time series data requires more sophisticated modeling approaches in order to be properly incorporated.

The relative rarity and complexity of suicide attempts could affect generalisability.


K-means clustering

Overview

K-means clustering is an unsupervised machine learning algorithm that partitions data into k distinct, non-overlapping clusters based on feature similarity. The objective is to minimise the within-cluster variance while maximising the between-cluster differences.

Each observation (here, an Irish county with values averaged over the 2015-2025 time period) is assigned to one of k clusters based on its proximity to the cluster centroids in a multidimensional space defined by the input variables.
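The procedure described above can be sketched with the kmeans() function in base R. The six counties and their values below are hypothetical, and standardising the features before clustering is an assumption of this sketch; the report does not state whether the inputs were scaled.

```r
set.seed(123)  # k-means uses random starting centroids

# Hypothetical county-level means over the study period
county_means <- data.frame(
  county               = paste("County", LETTERS[1:6]),
  suicide_attempt_rate = c(170, 172, 180, 181, 190, 192),
  median_income        = c(36000, 35900, 35000, 35100, 34000, 34100),
  unemployment_rate    = c(7.0, 7.1, 7.5, 7.6, 8.0, 8.1)
)

# Standardise so that no variable dominates the distance calculation
features <- scale(county_means[, -1])

# Partition the counties into k = 3 clusters, keeping the best of 25 starts
km <- kmeans(features, centers = 3, nstart = 25)
county_means$cluster <- km$cluster

# Cluster-level means, analogous to a cluster summary table
cluster_summary <- aggregate(. ~ cluster, data = county_means[, -1], FUN = mean)
cluster_summary
```

The cluster numbers returned by kmeans() are arbitrary labels, so they carry no ordering of risk by themselves.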

Cluster Summary Table

Cluster  Mean Suicide Attempt Rate  Median Income  Unemployment Rate  Education Level  Deprivation Index  Mental Health Access  Alcohol Consumption
1                            177.0        €35,221              7.88%            0.769              0.110                 0.480                 11.5
2                            178.0        €35,310              7.57%            0.794              0.659                 0.487                 11.6
3                            184.0        €34,518              7.29%            0.776             -0.988                 0.490                 11.2

Cluster Composition

  • Cluster 1 (Lower risk, moderate deprivation)
    • Counties: Cork, Galway, Mayo, Waterford, Westmeath
    • Characterised by moderate deprivation, the lowest mean suicide attempt rate (177.0), and typical income/unemployment levels.
    • Slightly lower alcohol harm, lower drug admissions, and balanced education levels.

  • Cluster 2 (Slightly higher education & income, moderate risk)
    • Counties: Carlow, Cavan, Clare, Dublin, Kildare, Laois, Limerick, Louth, Roscommon, Tipperary
    • Higher education levels and mental health service access, but still moderate to high suicide attempt rates.
    • Tends to have higher alcohol harm and drug use indicators, suggesting urban or suburban risk profiles.

  • Cluster 3 (Higher suicide risk, rural deprivation)
    • Counties: Donegal, Kerry, Kilkenny, Leitrim, Longford, Meath, Monaghan, Offaly, Sligo, Wexford, Wicklow
    • These counties show the highest mean suicide attempt rate (184.0), the lowest median income (€34,518), and more deprivation. Mental health access (0.490) is similar to the other clusters rather than more limited.
    • Education is close to the average of the three clusters; rural isolation, drug harm, and lower median income may be driving elevated risks.


Interpretation

The clustering suggests geographically and socioeconomically distinct profiles:

  • Cluster 3 appears to be the most vulnerable, with the highest mean suicide attempt rate, greater deprivation, and the lowest median income, although the differences between clusters are small (mean attempt rates of 184.0 versus 177.0 and 178.0).

  • Cluster 2, despite its better education and income profiles, may be affected by urban stressors such as substance use and domestic violence.

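  • Cluster 1 counties represent a middle ground, with moderate outcomes across most dimensions.

Because the data are synthetic, these patterns are illustrative only. If similar clusters were found in real data, they could guide targeted policy interventions, especially for: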

  • Improving rural mental health access (Cluster 3)

  • Addressing substance use and domestic violence in more urbanised settings (Cluster 2)