Predicting Vaccine-elicited Immune Responses Using Pre-vaccination Data
Outline
Introduction Section 1
Preliminary Results Section 2
Methods & Results from a Related Paper Section 3
Framework Review on Prediction for Longitudinal Data Section 4
Next steps Section 5
1 Introduction
1.1 Some biological background
The hemagglutination inhibition (HAI) assay is designed to assess the presence and concentration of antibodies in serum samples that can neutralize the virus.
Essentially, the HAI assay measures the immune system’s ability to block the virus’s HA protein, thus inhibiting hemagglutination and providing an estimate of the level of immunity against the influenza virus.
An HAI titer of 40 or greater is considered protective against influenza infection.
1.2 Cohort at UGA
In an ongoing study at the University of Georgia, Athens (UGA), a total of 690 participants were recruited over five seasons between 2016 and 2020 (UGA1-5).
Participants received the split-inactivated influenza vaccine Fluzone.
Participants provided blood samples on the day of vaccination (day 0, collected before the vaccination event) and 21 or 28 days post-vaccination (day 21/28). HAI titer levels were measured at baseline (D0) and post-vaccination (D21/28).
Demographic and clinical data were collected, including age, BMI, sex, race, comorbidities, prior vaccination status, month of vaccination within a flu season, and vaccine dose (standard/high).
Repeated measure count summary
1.3 Research question
Using human cohort data with annual influenza vaccination at UGA, we would like to explore predictive models for post-vaccination immune responses using pre-vaccination characteristics.
2 Preliminary Results
2.1 Cohort
Study cohort: UGA Cohort 2018-2019 and UGA Cohort 2019-2020, corresponding to two consecutive flu seasons (these have relatively large samples among the five seasons).
2.2 Outcome Assessment
Classification (Binary outcome)
- Sum of the log2 fold changes (post- vs. pre-vaccination) of each vaccine strain
- < 8: low responder
- >= 8: high responder
Regression (Continuous outcome)
- Geometric mean of the HAI titer values of the 4 vaccine strains
Both outcomes are illustrated in the sketch below.
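A minimal computational sketch of both outcome definitions, assuming the fold change is the post- over pre-vaccination titer ratio (the titer values and variable names here are toy illustrations, not cohort data):

```python
import numpy as np

# Toy HAI titers for one subject across the 4 vaccine strains
# (illustrative values only; not from the UGA cohort).
d0 = np.array([10.0, 20.0, 40.0, 10.0])    # baseline (day 0) titers
d21 = np.array([80.0, 40.0, 160.0, 40.0])  # post-vaccination (day 21/28) titers

# Classification outcome: sum of per-strain log2 fold changes.
sum_log2_fc = np.sum(np.log2(d21 / d0))
label = "high-responder" if sum_log2_fc >= 8 else "low-responder"

# Regression outcome: geometric mean of the 4 post-vaccination titers.
geo_mean = np.exp(np.mean(np.log(d21)))

print(sum_log2_fc, label, round(geo_mean, 1))  # 8.0 high-responder 67.3
```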
2.3 Predictor Assessment
Numerical variables:
- HAI titer value at day 0 (geometric mean and each of the 4 strains separately), Age, BMI
Categorical variables:
- Gender (Male, Female), Race (White, Black, Asian, Others), Dose (Standard, High), Vaccination month (Sep, Oct, Nov, Dec, Jan, Feb), Vaccination history (Yes, No; UGA Cohort 19-20)
2.4 Methods
2.5 Some selected results
3 Methods & Results from a Related Paper
A paper published in Molecular Systems Biology (Wu et al. 2022)
Evaluation of determinants of the serological response to the quadrivalent split-inactivated influenza vaccine
To identify determinants of vaccine efficacy, they used data from >1,300 vaccination events to predict the response to vaccination, measured as seroconversion as well as hemagglutination inhibition (HAI) titer levels one year later.
They evaluated the predictive capabilities of age, BMI, sex, race, comorbidities, vaccination history, and baseline HAI titers, as well as vaccination month and vaccine dose in multiple linear regression models.
3.1 Some potential limitations
Does not consider longitudinal assessment
Smoking status variable
Includes only homologous antibody responses as predictors, not heterologous antibody responses
Could also investigate prediction of vaccine breadth (i.e., the outcome is the heterologous HAI level)
Possible to conduct dynamic prediction
Definitions (introduced by Yao at the last meeting):
homologous: against influenza strains in the current vaccine formula
heterologous: against historic influenza strains not included in the vaccine formula
4 Machine Learning Methods Review on Prediction for Longitudinal Biomedical Data (Cascarano et al. 2023)
Longitudinal studies involve the analysis of repeated measures on the same subject over time.
GLMMs are an early example of widely used approaches for modelling the response of a repeated outcome over time. These methods can achieve satisfactory results when the focus is on the analysis of the statistical associations between a small number of variables.
However, when the goal is to make more advanced and complex clinical predictions, it is advantageous to use many variables to capture the phenomenon in question and to model non-linear relationships. Statistical methods such as GLMMs present some limitations.
4.1 Challenges remaining of ML on longitudinal data
Repeated measures for an individual tend to be correlated with each other (violating the i.i.d. assumption); not all ML algorithms are suitable for modelling such correlations, and ignoring them may lead to biased results.
There are often missing measurements or dropouts in longitudinal data cohorts, and the time intervals between measurements are not necessarily evenly spaced.
Longitudinal data trajectories may be highly complex and non-linear, with large variations between individuals.
The repeated measures can be subject to very different, and sometimes hard to estimate, uncertainties, which may also vary with time.
4.2 Problem formulation for longitudinal data
Data input
There are two types of input variables, namely:
longitudinal features are variables which are sampled many times, i.e., their values are recorded at different time points in a defined time period
static features (e.g., genetic or socio-demographic variables)
Formally, given a subject \(i \in \{1, ..., N\}\), where \(N\) is the number of subjects under study, and a time point \(t \in \{1, ..., T_i\}\), with \(T_i\) the total number of follow-ups for subject \(i\), let \(x_{it} = (x_{it1}, ..., x_{itn})\) be the vector of the \(n\) measures recorded, a realisation of a random vector \(X_t\). Then, scrolling through the time index, let \(x_{i} = (x_{i1}, ..., x_{iT_i})\) be the \(T_i \times n\) matrix of all the longitudinal variables recorded for the \(i\)-th subject. In addition, \(\{z_i,\ i = 1, ..., N\}\) represents the set of static features.
Data output
There are two scenarios, namely:
static output: the goal is the prediction of a single outcome at a pre-determined time (e.g. diagnosis or risk prediction at one time point);
longitudinal output: the goal is to predict multiple outcomes concerning different time points (e.g. disease progression over time).
Objective
With the set of data input–output introduced, the aim is to build a function \(f\) which given an example \((x_i, z_i, y_i)\) accurately estimates the output \(y_i\) as \(f(x_i, z_i)\).
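A minimal sketch of this data layout, with toy dimensions and our own variable names:

```python
import numpy as np

# N subjects, n longitudinal variables, subject-specific follow-up counts T_i.
N, n = 3, 2
T = [4, 2, 3]

rng = np.random.default_rng(0)
x = [rng.normal(size=(T_i, n)) for T_i in T]  # x[i]: T_i x n longitudinal matrix
z = rng.normal(size=(N, 5))                   # z[i]: 5 static features per subject
y = rng.integers(0, 2, size=N)                # y[i]: one (static) binary outcome

# A predictive model f maps (x[i], z[i]) to an estimate of y[i].
```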
4.3 Review Outline
Data formulation methodology (Supervised ML)
Summary features (SF) Section 4.3.1.1
Longitudinal features (LF) Non-Sequential; Sequential Section 4.3.1.2
Stacked-vertically Section 4.3.1.3
Multi-task Learning Section 4.3.1.4
Mixed-effect ML Section 4.3.1.5
Multiple instance Learning Section 4.3.1.6
Algorithm for estimating classifier
- Recurrent models Section 4.3.2.1
4.3.1 Data formulation methodology (Supervised ML)
4.3.1.1 Summary features (SF)
The simplest approach to handle longitudinal data is aggregating the repeated measures up to a certain instant into summary statistics and removing the time dimension.
By means of this approach, it is not necessary for each subject to have the same number of follow-ups, and thus it is robust to missing values.
Nevertheless, this approach can result in loss of significant information, especially in the context of clinical data where the variability of some variables may show underlying trends.
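For illustration, a minimal pandas sketch of the summary-feature idea (toy data; the column names are our own):

```python
import pandas as pd

# Long-format toy data: one row per visit, uneven follow-up counts.
long = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2],
    "time":    [0, 1, 2, 0, 1],
    "hai":     [10, 20, 40, 20, 20],
})

# Collapse the time dimension into per-subject summary statistics.
summary = long.groupby("subject")["hai"].agg(["mean", "std", "min", "max", "last"])
# `summary` is now an ordinary fixed-width feature table for any classifier,
# but trajectory shape (e.g., a rising trend) is largely lost.
```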
4.3.1.2 Longitudinal features (LF) Non-Sequential; Sequential
In an effort to better capture the information in the data, an alternative to summary features is the use of longitudinal features. By means of this approach, the feature space is expanded by treating every observation of each variable as a feature and stacking them horizontally.
Non-Sequential
An input of fixed length is required. This means that every subject in the sample needs to have the same number of observations per time interval and the same number of follow-ups.
A non-sequential classifier is then applied (i.e., one whose result is invariant to permuting the order of the features), such as a support vector machine (SVM) or logistic regression (LR); see the sketch below.
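A minimal sketch of the non-sequential setting (toy data; names are our own): each subject's repeated observations become the columns of one fixed-length row, and the classifier ignores their temporal order:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 4 subjects, 1 variable observed at 3 common follow-ups,
# stacked horizontally into a fixed-length feature vector per subject.
X_long = np.array([[10, 20, 40],
                   [20, 20, 20],
                   [ 5, 10, 80],
                   [40, 40, 40]])
y = np.array([1, 0, 1, 0])

# Permuting the columns would permute the learned coefficients, but the
# model itself has no notion of temporal order.
clf = LogisticRegression().fit(X_long, y)
```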
Sequential
The learning technique is aware of the temporal relationship between dynamic features at consecutive time steps. This category includes recurrent models and other adapted classifiers, which are introduced later.
4.3.1.3 Stacked-vertically
This approach builds a dataset where every visit of each patient is a separate instance, allowing a different number of visits across subjects. Features tracking the time component can be added. An output needs to be assigned to every instance: either the same value repeated for all instances, or different values in the case of longitudinal outputs.
This approach is robust to missing encounters, as every subject can have a different number of follow-ups.
Despite this, it does not take into account the correlated structure of the data and it violates the i.i.d. hypothesis. For this reason, it is not common and typically used as a baseline model in comparison to more sophisticated ones.
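A minimal stacked-vertically sketch (toy data; names our own): one row per visit, an explicit time feature, and the subject's static label repeated on every row:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

rows = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2],
    "time":    [0, 1, 2, 0, 1],   # feature tracking the time component
    "hai":     [10, 20, 40, 20, 20],
    "y":       [1, 1, 1, 0, 0],   # static outcome repeated for each visit
})

# Each visit is treated as an independent instance, which is exactly
# where the i.i.d. assumption is violated.
clf = LogisticRegression().fit(rows[["time", "hai"]], rows["y"])
```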
4.3.1.4 Multi-task Learning
This approach involves the generation of a separate model for each recorded instant, i.e., for each follow-up, resulting in as many time-specific models as there are follow-ups. The models are then trained jointly.
It is flexible, as it can handle a different number of follow-ups for different subjects, and it takes into account the sequential nature of the data through the joint learning of separate classifiers for each time \(t\). As a limitation, it requires a sufficient number of observations at every instant to train each model. One possible realisation is sketched below.
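One way (among several) to realise the joint training is to couple one linear model per follow-up through a shared penalty; the sketch below uses scikit-learn's MultiTaskLasso on toy data with our own names:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))   # baseline features for 50 subjects
Y = rng.normal(size=(50, 3))   # outcome at 3 follow-up times (one task each)

# One coefficient vector per time point, estimated jointly: the shared
# penalty selects features across all tasks. Note Y must be observed at
# every time point for every subject, echoing the limitation above.
model = MultiTaskLasso(alpha=0.1).fit(X, Y)
print(model.coef_.shape)  # (3, 6): tasks x features
```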
4.3.1.5 Mixed-effect ML
An obvious challenge here is the estimation of the subject-specific random-effects coefficients for unobserved (new) subjects. This limitation results in limited use of mixed-effects models in prediction scenarios.
A paper (Amiri et al. 2020) addressed the prediction of serum creatinine in hemodialysis patients using a kernel approach for longitudinal data.
They applied linear regression (LR), a linear mixed-effects model (LMM), least-squares support vector regression (LS-SVR), and mixed-effects least-squares support vector regression (MLS-SVR); MLS-SVR achieved the best prediction performance.
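MLS-SVR is not part of standard Python libraries; as a baseline sketch of the mixed-effects ingredient, a linear mixed model with a random intercept per subject can be fitted with statsmodels (simulated toy data, names our own):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(20), 4),  # 20 subjects, 4 visits each
    "time":    np.tile(np.arange(4), 20),
})
df["y"] = (1.0 + 0.5 * df["time"]
           + np.repeat(rng.normal(scale=0.8, size=20), 4)  # subject effects
           + rng.normal(scale=0.5, size=len(df)))          # visit-level noise

# A random intercept per subject models the within-subject correlation.
lmm = smf.mixedlm("y ~ time", df, groups=df["subject"]).fit()
print(lmm.summary())
```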
4.3.1.6 Multiple instance Learning
It takes a set of labelled bags, each containing many instances as input training examples. Note that, in contrast to traditional supervised learning, labels are assigned to a set of inputs (bags) rather than providing input/label pairs.
It is flexible in allowing different numbers of follow-ups, but it typically assumes that the instances inside a bag are i.i.d., so it does not model the sequential nature of the data.
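A minimal embedding-style MIL sketch (one of several MIL strategies; toy data and names our own): pool each bag's instances into a single order-invariant vector, then train an ordinary classifier on the bag labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# 4 bags (subjects) with different numbers of instances (visits), 3 features.
bags = [rng.normal(size=(T_i, 3)) for T_i in (4, 2, 5, 3)]
bag_labels = np.array([1, 0, 1, 0])   # one label per bag, not per instance

# Mean-pooling is order-invariant: the sequential structure is discarded.
X_bag = np.vstack([b.mean(axis=0) for b in bags])
clf = LogisticRegression().fit(X_bag, bag_labels)
```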
4.3.2 Algorithm for estimating classifier
4.3.2.1 Recurrent models
Recurrent neural networks (RNNs) (Rumelhart et al. 1986) are a class of artificial neural networks used with longitudinal features.
They are designed to model sequential data by using a transformation in the hidden-state depending not only on the current input but also on information from the past.
Hidden states serve as the memory of the network, such that the current state of the hidden layer depends on the previous time step.
A loop allows information to be passed from one step of the network to the next.
An important property of RNNs, derived from this recursive modelling, is the ability to handle inputs of different lengths, which is convenient in a longitudinal framework.
RNNs are also very flexible because there are different types of architectures based on the number of inputs and outputs:
one-to-many, where a single input yields multiple outputs;
many-to-one, where multiple inputs are needed to provide one output;
many-to-many, where both the input and output are sequential.
The majority of the works analysed adopted a many-to-one setting, i.e., given historical data, predicting an outcome at a future time point, or classifying the sequence.
Several works leveraged the many-to-many architecture in order to model a longitudinal outcome and predict several time points simultaneously.
One problem of standard RNNs is the vanishing gradient: because the error is back-propagated through every time step, the unrolled network is effectively very deep.
Two prominent variants designed to overcome this issue and to capture the effect of long-term dependencies are widely used:
long short-term memory (LSTM)
gated recurrent unit (GRU)
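A minimal many-to-one sketch in PyTorch (toy dimensions; names our own): an LSTM reads a subject's sequence of follow-ups, and the last hidden state feeds a classification head:

```python
import torch
import torch.nn as nn

class Seq2One(nn.Module):
    def __init__(self, n_features, hidden=16, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, time, features)
        _, (h_n, _) = self.lstm(x)     # h_n: (layers, batch, hidden)
        return self.head(h_n[-1])      # logits from the last hidden state

model = Seq2One(n_features=4)
logits = model(torch.randn(8, 5, 4))   # 8 subjects, 5 follow-ups, 4 variables
```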
A paper published in 2019 (Tabarestani et al. 2019) proposes an implementation of recurrent neural networks (RNNs) for predicting future Mini-Mental State Examination (MMSE) scores in a longitudinal study.
The multimodal data are fed into two well-studied RNN variants: long short-term memory (LSTM) and gated recurrent unit (GRU).
4.3.3 Summary
Distribution of the presented supervised learning approaches for longitudinal data in the reviewed literature.
The most commonly used approach is summary features, followed by longitudinal features with a sequential classifier (specifically, recurrent neural networks) and longitudinal features with a non-sequential classifier.
5 Next steps
Any data updates?
GLMMs + LS-SVR + MLS-SVR
Try some methods and see where to contribute something new
Multi-task learning
Recurrent neural network (Transfer learning)