Dealing with different data sources in medicine

Challenges, solutions & opportunities

Andrea Francavilla

Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova

Randomized Controlled Trials

Randomized Controlled Trials (RCTs) are considered the gold standard for causal inference

  • Direct comparison of outcomes
  • Cause and effect relationships
  • Low risk of bias

RCTs always?

However, RCTs are not always feasible

  • High study costs
  • Long time requirements
  • Possible ethical issues

Observational studies

Conversely, observational studies are characterized by the observation and systematic recording of naturally occurring phenomena.

  • Observation of phenomena as they naturally exist.

  • Insights gained without disrupting the natural flow of events.

Data sources for observational studies

Multiple data sources for observational studies exist.
Among these, three important sources stand out:

  • Electronic Health Records (EHRs)

  • Aggregate Data

  • Individual Data Registries

EHRs

EHRs are digital repositories of patients’ health information and contain a wealth of data

  • Comprehensive view of real-world patient care
  • Useful for analyzing disease trends, treatment efficacy, and healthcare utilization.

Aggregate Data

Aggregate data involves summarized information collected from various sources. 

  • Broader insights into trends, prevalence rates, demographic characteristics, overall health indicators, etc.

Observational studies using aggregate data can focus on epidemiological analyses, public health surveillance, or studying trends across diverse populations or regions.

Individual data registries

Individual data registries contain detailed information about a specific population, condition, or disease.

Disadvantages

Addressing biases in observational studies involves multiple strategies:

  • Careful study design
  • Rigorous data collection methods
  • Statistical adjustments for confounders
  • Sensitivity analyses

While it’s challenging to completely eliminate biases in observational studies, researchers aim to minimize their effects to ensure the validity and reliability of their findings.

Disadvantages (2)

  • Individual data registries might have limited generalizability beyond that specific cohort.
  • In EHRs, data quality, standardization, and privacy concerns can be significant challenges.
  • Aggregate data might lack specific individual-level details.

N/P ratio

A solution for minimizing such biases must necessarily consider multiple aspects.

Within this scenario, the N/P ratio plays a fundamental role.

  • The “N/P ratio” refers to the ratio between the sample size (N) and the number of predictors or features (P) used in a statistical model.

  • The N/P ratio influences the model’s ability to generalize from the sample to the larger population and affects the stability and accuracy of the statistical estimates (see the toy simulation below).
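To make this concrete, here is a toy simulation (illustrative only, not from the thesis; all numbers are invented) showing how the stability of a regression coefficient estimate degrades as the N/P ratio shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

def coef_sd(n, p, n_sim=200):
    """Std. dev. of the first OLS coefficient estimate across
    simulated replications (lower = more stable estimate)."""
    beta = np.zeros(p)
    beta[0] = 1.0  # a single truly relevant predictor
    estimates = []
    for _ in range(n_sim):
        X = rng.standard_normal((n, p))
        y = X @ beta + rng.standard_normal(n)
        b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(b_hat[0])
    return np.std(estimates)

# Large, moderate, and near-degenerate N/P ratios
for n, p in [(1000, 10), (100, 10), (30, 25)]:
    print(f"N = {n:4d}, P = {p:3d}, N/P = {n/p:6.1f} -> "
          f"SD of estimate: {coef_sd(n, p):.3f}")
```

The spread of the estimates grows sharply as N approaches P, which is exactly the instability that small-N/P settings must contend with.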

EHRs and N/P ratio

In EHRs, the N/P ratio is usually large.

A large N/P ratio refers to a scenario in statistical analysis where the number of samples (N) is significantly larger than the number of predictors or features (P) in a dataset. This typically brings:

  • Good generalization
  • Robust estimations
  • Reduced overfitting
  • Statistical power

EHRs and N/P ratio (2)

However, a high N/P ratio might not always be optimal.

In particular:

  • High risk of “noise”
  • Many irrelevant predictors (a regularization sketch follows)
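When N is large but many of the P predictors are pure noise, penalized regression is one common countermeasure. Below is a minimal, hypothetical sketch (simulated data, scikit-learn assumed; not part of the thesis) in which a cross-validated lasso screens out irrelevant predictors:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)

# Large-N dataset in which only 3 of 100 predictors are informative.
n, p = 5000, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(n)

# Cross-validated lasso shrinks irrelevant coefficients to exactly zero.
model = LassoCV(cv=5, random_state=1).fit(X, y)
kept = np.flatnonzero(model.coef_ != 0)
print(f"Predictors retained: {len(kept)} of {p}")
```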

Aggregate data and N/P ratio

Conversely, the N/P ratio in aggregate data is often small.

A small N/P ratio refers to a scenario in statistical analysis where the number of samples (N) is relatively small compared to the number of predictors or features (P) in a dataset. In contrast to a high N/P ratio, this leads to:

  • Limited Generalization
  • Increased overfitting
  • Less reliable estimates
  • Reduced statistical power

Individual data registry and N/P ratio

Finally, the N/P ratio in individual data registries can be referred to as a “deep” N/P ratio.

Objective and agenda

The aim of this thesis is to explore the possibilities that arise from observational studies, using four different projects to illustrate the main concepts:

  • Systematic reviews
  • Machine learning
  • Bibliometric analysis and topic-modeling
  • Propensity Score

WORK N.1: Small N/P Ratio

Addressing a small N/P ratio involves strategic approaches to mitigate its limitations and enhance the reliability of analyses.

Systematic reviews with meta-analysis represent a possible statistical answer to a small N/P ratio (a pooling sketch follows).
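As an illustration of pooling under a small N/P ratio, here is a minimal sketch of inverse-variance random-effects meta-analysis using the DerSimonian-Laird estimator; the effect sizes and variances are invented for the example and are not results from the review:

```python
import numpy as np

# Hypothetical per-study log odds ratios and their variances
# (illustrative numbers only).
effects = np.array([0.40, 0.10, 0.65, 0.25])
variances = np.array([0.04, 0.09, 0.12, 0.05])

# Fixed-effect (inverse-variance) pooled estimate
w = 1.0 / variances
theta_fe = np.sum(w * effects) / np.sum(w)

# DerSimonian-Laird estimate of between-study variance tau^2
q = np.sum(w * (effects - theta_fe) ** 2)
df = len(effects) - 1
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects pooled estimate and 95% CI
w_re = 1.0 / (variances + tau2)
theta_re = np.sum(w_re * effects) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
print(f"Pooled effect: {theta_re:.3f} "
      f"(95% CI {theta_re - 1.96*se_re:.3f} to {theta_re + 1.96*se_re:.3f}, "
      f"tau^2 = {tau2:.3f})")
```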

Systematic Review on ECMO

In collaboration with the surgical team at the Padova hospital, we carried out a systematic review of existing predictive models for forecasting mortality in the intensive care unit after ECMO initiation.

Methods

PubMed, CINAHL, Embase, MEDLINE and Scopus were consulted. 

  • Articles published between January 2011 and February 2022
  • Studies of adults undergoing ECMO that reported a newly developed and validated predictive model for mortality
  • Risk of bias evaluated with the Prediction model Risk Of Bias ASsessment Tool (PROBAST)

Results

26 prognostic scores were identified.

  • Sample sizes ranging from 60 to 4,557 patients
  • Age, lactate, and creatinine were the most represented covariates
  • Validation reported for 8/26 (30%) scores
  • All scores were considered at moderate to high risk of bias (ROB)

Take-home message

Most models have not been validated externally, and uncertainty remains about whether ECMO should be initiated.

It has yet to be determined whether and to what extent a new methodological perspective may enhance the performance of predictive models for ECMO.

WORK N.2: High N/P Ratio

Having a large N/P ratio can be beneficial for machine learning techniques, especially in enhancing generalization and reducing overfitting.

In 2022, with the help of PEDIANET, we developed a machine learning approach to exploit electronic health records (EHRs) as a source of real-world health data.

This work discusses the development and application of a deep learning model, specifically a recurrent neural network with gated recurrent units (RNN-GRU), for the automatic extraction of information from EHRs to assess varicella-zoster virus (VZV) infections in children.

Introduction

In the presence of a high N/P ratio, deep recurrent neural network architectures offer several advantages (a minimal model sketch follows the list):

  • Reduction of input preprocessing and manipulation
  • Possibility to process the text as a data sequence
  • Automatic learning of the correlations between features without superimposed structure
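For illustration, a minimal sketch of a GRU-based text classifier (PyTorch assumed; vocabulary size, embedding and hidden dimensions are placeholders, not the configuration of the actual PEDIANET model):

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Minimal GRU text classifier: token ids -> embedding -> GRU -> logit."""
    def __init__(self, vocab_size=10000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # binary output: case vs. non-case

    def forward(self, token_ids):
        x = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, h_last = self.gru(x)              # h_last: (1, batch, hidden_dim)
        return self.head(h_last.squeeze(0))  # (batch, 1) raw logits

# Toy forward pass on a batch of two padded token sequences.
model = GRUClassifier()
batch = torch.randint(1, 10000, (2, 50))     # 2 records, 50 tokens each
probs = torch.sigmoid(model(batch))
print(probs.shape)  # torch.Size([2, 1])
```

Because the GRU consumes the note as a token sequence, no hand-crafted feature structure needs to be superimposed on the free text.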

Methods

Gold standard:

  • 3 researchers
  • 2 years
  • Manual assessment

ML approach:

  • NLP with RNN
  • Data augmentation (2-year latency)
  • Comparison with the gold standard (see the evaluation sketch below)
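The comparison step can be as simple as standard classification metrics computed against the manually assessed labels. A toy sketch (invented labels, scikit-learn assumed):

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 1 = VZV case, 0 = non-case (illustrative only).
gold = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]  # manual gold-standard assessment
pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]  # model predictions
print(classification_report(gold, pred, target_names=["non-case", "VZV case"]))
```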

Results

10 models were trained in total: