Vaccine Status Prediction Using Linear Mixed Model and Machine Learning Methods

Aim

Use statistical and machine learning methods to predict post-vaccination status from a longitudinal vaccine study design.

Visualize how pre- and post-vaccination status varies across seasons.

ggplotly(plot)

Limitations of previous literature on the same study:

  • Did not account for the dependence among repeated measurements (longitudinal design).

  • Used simple linear regression (SLR), whose assumptions make it too inflexible to capture the relationship between predictors and outcomes.

Prediction outcome

Continuous:

1. Composite HAI titer value (homologous and heterologous)

2. Booster score (Future)

3. Waning score (Future)

Categorical:

Whether or not the participant gets vaccinated

Predictors

Age + Gender + BMI + (comp_pre_homo) + years_of_part + history(Yes/No)

Methods

Mixed Effect Machine Learning: a framework for predicting longitudinal change in hemoglobin A1c (J Biomed Inform, 2019) (Ngufor et al. 2019)

In this study, the authors formulate an analytic framework that integrates the random-effects structure of GLMM into non-linear machine learning models, which can then exploit the temporally heterogeneous effects and the sparse, varying-length patient characteristics inherent in longitudinal data.

Generalized Linear Mixed Effects Models

In GLMM, random-effects are used to account for the level-wise variabilities in longitudinal data.

Specifically, conditional on a vector \(b_i \in \mathbb{R}^q\) of subject-specific regression coefficients, the model assumes that the responses \(y_{it}\) for a single subject \(i\) are independent and follow a distribution from the exponential family with mean and variance specified as:

\[E[y_{it}|b_i]=\mu_{it}=h(\eta_{it})\] \[Var(y_{it}|b_i)=\phi^2V(\mu_{it})\] where \(\eta_{it}=\beta^Tx_{it}+b_i^Tz_{it}\) is the linear predictor, \(x_{it}\) and \(z_{it}\) are the fixed- and random-effects covariate vectors, \(V(\cdot)\) is the variance function, \(\phi^2\) is a dispersion parameter, and \(g(\cdot)=h^{-1}(\cdot)\) is a pre-specified link function.

\(\beta \in \mathbb{R}^p\) is the vector of population-level fixed-effect parameters; \(b_i\) is the vector of subject-specific random-effect parameters.

Estimation of the parameters \((\beta, b_i)\) can be done through penalized quasi-likelihood (PQL), which linearizes the model with a first-order Taylor expansion of \(h\) around the current estimate \(\hat{\eta}_{it}\):

\[y_{it}=\mu_{it}+\epsilon_{it}=h(\eta_{it})+\epsilon_{it}\\ \approx h(\hat{\eta}_{it})+h'(\hat{\eta}_{it})(\eta_{it}-\hat{\eta}_{it})+\epsilon_{it}\]

Using \(\hat{\mu}_{it}=h(\hat{\eta}_{it})\) and the relationship \(g'(\hat{\mu}_{it})=1/h'(\hat{\eta}_{it})\), and rearranging:

\[(y_{it}-\hat{\mu}_{it})g'(\hat{\mu}_{it})+g(\hat{\mu}_{it})\approx g(\mu_{it})+g'(\hat{\mu}_{it})\epsilon_{it}\]

Letting \(y^*_{it}=(y_{it}-\hat{\mu}_{it})g'(\hat{\mu}_{it})+g(\hat{\mu}_{it})\) and \(\epsilon^*_{it}=g'(\hat{\mu}_{it})\epsilon_{it}\), we obtain the linear mixed effects model

\[y^*_{it}=\beta^Tx_{it}+b_i^Tz_{it}+\epsilon^*_{it}\] with \(Var(\epsilon^*_{it})=\phi^2[g'(\hat{\mu}_{it})]^2V(\mu_{it})\).
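As a concrete illustration of the working response, the sketch below evaluates \(y^*_{it}\) and \(Var(\epsilon^*_{it})\) for a single binary observation under a logit link, where \(g(\mu)=\log(\mu/(1-\mu))\), \(g'(\mu)=1/(\mu(1-\mu))\) and \(V(\mu)=\mu(1-\mu)\). The function names and the choice to evaluate \(V\) at the current estimate \(\hat{\mu}_{it}\) are assumptions made for this example, not part of the paper.

```python
import math

def logit(mu):
    """Link function g(mu) = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse link h(eta) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def working_response(y, eta_hat, phi2=1.0):
    """Return (y_star, var_eps_star) for one observation under a logit link.

    Assumption: the variance function is evaluated at the current
    estimate mu_hat, since the true mean is unknown in practice.
    """
    mu_hat = inv_logit(eta_hat)
    g_prime = 1.0 / (mu_hat * (1.0 - mu_hat))   # g'(mu_hat)
    v_mu = mu_hat * (1.0 - mu_hat)              # V(mu_hat)
    y_star = (y - mu_hat) * g_prime + logit(mu_hat)
    var_eps_star = phi2 * g_prime ** 2 * v_mu
    return y_star, var_eps_star

# One observed success with current linear predictor 0 (mu_hat = 0.5)
y_star, var_star = working_response(y=1, eta_hat=0.0)
```

With \(\hat{\eta}_{it}=0\) the estimate is \(\hat{\mu}_{it}=0.5\), so \(g'(\hat{\mu}_{it})=4\) and the working response moves the observation two units above the current linear predictor.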

Random-Effects for Machine Learning Models

GLMM:

Limitations:

  • Assumes a parametric distribution and imposes a restrictive linear relationship between the link-transformed mean \(g(\mu_{it})\) and the covariates.

  • These assumptions are difficult to verify and are often not applicable to complex clinical settings.

Advanced non-linear machine learning methods can be applied to extract informative patterns from the data without a priori assumptions.

Potential limitations of machine learning algorithms:

  • Most assume that the training data are i.i.d.

    • Examples: random forest (RF), gradient boosted machine (GBM), support vector machine (SVM), and neural networks.
  • Several techniques have been proposed to extend tree-based algorithms, such as the classification and regression trees (CART) algorithm, to longitudinal data.

    • These extensions do not allow for modeling time-varying covariates.

RE-EM tree

The RE-EM tree takes the mixed-effects approach and extends the CART algorithm to incorporate random-effects.

The basic idea of the approach is to dissociate the fixed-effect component of an LMM from the random-effect component and to estimate each component iteratively, in an expectation-maximization manner.

Limitations:

Prone to overfitting and to selection bias towards variables with many possible splits.

Recurrent neural networks (RNNs) are designed to deal with sequential data:

Limitations:

  • Most RNN models are unable to model sparse and irregularly sampled sequential data.

  • Training these models can also be computationally expensive.

The proposed approach is a general framework that can incorporate different machine learning methods, including RF, SVM, and NN.

Formulation of Mixed-Effects Machine Learning

The proposed MEml framework estimates the fixed-effects component \((\beta^Tx_{it})\) using a powerful machine learning algorithm and the random-effects \(b_i\) using GLMM.

Tree-based algorithms (other supervised learning algorithms can also be used):

  • Random forest (RF)

  • Gradient boosted machine (GBM)

  • Model-based recursive partitioning

  • Conditional inference trees

RF and GBM have many strengths:

(i) the methods can easily handle large and high dimensional longitudinal data,

(ii) all variables, including those with weak effects and those that are highly correlated or interacting, have the potential to contribute to the model fit,

(iii) the models easily accommodate complex interactions between variables,

(iv) they can perform both simple and complex classification and regression accurately and are less prone to overfitting.

Despite these appealing properties, RF and GBM are not easily interpretable.

To overcome this limitation, they apply the inTrees (interpretable trees) algorithm to extract insights from the tree ensembles.

Algorithm for Mixed-effect Machine Learning

Algorithm 1
  • If the random-effects \(b_i\) are known, then estimate the population-level effects \(f(x_{it})\) using a machine learning model fitted to the modified response \(y^*_{it}-b_i^Tz_{it}\).

    • Assuming that the means \(\mu_{it}\) are also known, re-weight each observation in the training set by \(w_{it} = \phi^2[g'(\hat{\mu}_{it})]^2V(\mu_{it})\), which helps reduce the variability of the repeated measurements in the machine learning model.
  • If the population-level effects \(f(x_{it})\) are known, then estimate the random-effects using traditional GLMM with population-level effects corresponding to \(f(x_{it})\).
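The two steps can be alternated until the estimates stabilize. The sketch below shows that alternation in the simplest setting only: a Gaussian response with identity link (so \(y^*_{it}=y_{it}\) and the weights are constant), a random intercept (\(z_{it}=1\)), and a scalar predictor. `fit_stump` is a hypothetical one-split stand-in for RF or GBM, and the random intercepts are updated with a shrunken residual mean using a fixed variance ratio `ratio`, which a real GLMM fit would estimate. None of these names come from the paper.

```python
from collections import defaultdict

def fit_stump(x, target):
    """Hypothetical stand-in for RF/GBM: a one-split regression stump on scalar x."""
    overall = sum(target) / len(target)
    best = None
    for cut in sorted(set(x))[:-1]:
        left = [t for xi, t in zip(x, target) if xi <= cut]
        right = [t for xi, t in zip(x, target) if xi > cut]
        m_left, m_right = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - m_left) ** 2 for t in left)
               + sum((t - m_right) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, cut, m_left, m_right)
    if best is None:  # x is constant: predict the overall mean
        return lambda xi: overall
    _, cut, m_left, m_right = best
    return lambda xi: m_left if xi <= cut else m_right

def fit_meml(ids, x, y, n_iter=20, ratio=1.0):
    """Alternate between estimating f (given b_i) and b_i (given f).

    ratio is an assumed, fixed residual-to-random-effect variance ratio.
    """
    b = defaultdict(float)  # random intercepts, initialised at 0
    f = None
    for _ in range(n_iter):
        # Step 1: given b_i, fit f on the modified response y - b_i
        f = fit_stump(x, [yi - b[i] for i, yi in zip(ids, y)])
        # Step 2: given f, update each b_i from that subject's residuals
        sums, counts = defaultdict(float), defaultdict(int)
        for i, xi, yi in zip(ids, x, y):
            sums[i] += yi - f(xi)
            counts[i] += 1
        b = defaultdict(float, {i: sums[i] / (counts[i] + ratio) for i in sums})
    return f, dict(b)

# Toy data: subject "a" sits above the population curve, subject "b" below it
ids = ["a", "a", "b", "b"]
x = [0, 1, 0, 1]
y = [1.0, 11.0, -1.0, 9.0]
f_hat, b_hat = fit_meml(ids, x, y)
```

On this toy data the stump recovers the population-level step (about 0 at `x = 0` and about 10 at `x = 1`), while the subject offsets are shrunk towards zero by the assumed variance ratio.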

Experiment setup

Training and validation data structure for longitudinal data

To predict the outcome in the future, use information in the current and past visits and in the anticipated future visit to construct the training and evaluation data sets.

For example, if the goal is to predict the status of the \(i\)th patient at the next visit (\(t_1\)), we combine the information available in the current and past visits, \(\mathbb{X}_{it_0}=(x_{it_0}, z_{it_0}, y_{it_0})\), with the control status at the next visit, \(y_{it_1}\), to create the training and validation data \(\mathbb{D}_{it_1}=\{\mathbb{X}_{it_0}, y_{it_1}\}\).
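A minimal sketch of this pairing step is given below. It assumes each patient's visits are stored as `(time, features, outcome)` tuples and pairs every visit with the outcome observed `horizon` visits later. The function name, the record layout and the `age` feature are hypothetical, and for brevity only the current visit's features are carried forward; summaries of earlier visits could be appended in the same way.

```python
def make_training_pairs(visits, horizon=1):
    """Pair each visit t0 with the outcome `horizon` visits ahead.

    visits: dict mapping patient id -> list of (time, features_dict, y),
            in any order (hypothetical layout for this example).
    Returns a list of flat rows {id, t0, t1, <features>, y_t0, y_t1}.
    """
    rows = []
    for pid, records in visits.items():
        ordered = sorted(records, key=lambda r: r[0])  # sort by visit time
        for k in range(len(ordered) - horizon):
            t0, feats, y0 = ordered[k]
            t1, _, y1 = ordered[k + horizon]
            rows.append({"id": pid, "t0": t0, "t1": t1, **feats,
                         "y_t0": y0, "y_t1": y1})
    return rows

visits = {
    "p1": [(2019, {"age": 31}, 0), (2018, {"age": 30}, 1), (2020, {"age": 32}, 1)],
    "p2": [(2019, {"age": 45}, 0)],  # a single visit yields no pair
}
pairs = make_training_pairs(visits)
two_ahead = make_training_pairs(visits, horizon=2)
```

Patients with fewer than `horizon + 1` visits contribute no rows, so the usable sample shrinks as the prediction horizon grows.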

Example data structure:

Data structure

Table 2 shows the data structure for two visits in advance.

Bootstrap training and validation for longitudinal data

They performed 100 bootstrap resamples from \(\mathbb{D}_{it_\lambda}\), where in each bootstrap iteration the models are trained on approximately 63% of the data and the left-out samples not selected in the bootstrap are used for testing.
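The figure of roughly 63% can be checked with a short calculation, assuming \(n\) units are drawn with replacement from \(n\) available units. A given unit is missed by a single draw with probability \(1-1/n\), so the probability that it appears at least once in the resample is

\[1-\left(1-\frac{1}{n}\right)^n \xrightarrow{n\to\infty} 1-e^{-1}\approx 0.632\]

and the remaining share of about 37% is left out and available for testing.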

Simple bootstrap resampling with replacement from \(\mathbb{D}_{it_\lambda}\) treats the observations as independent and does not account for the dependence structure in the data, which may lead to invalid inference.

To preserve the hierarchical structure in the bootstrap resamples, use resampling in a nested fashion.

In multi-stage bootstrap:

  • First resample the highest level,

  • Then for each sampled unit, resample the next lower level, and so forth.

  • Each level may be resampled with or without replacement.

For the experiments, they implemented the double bootstrap procedure, where they first resample without replacement the individual patients, and then resample with replacement the visit times for each patient.
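The nested resampling can be sketched as follows. Each level has its own replacement flag, and the defaults mirror the description above (patients without replacement, visits with replacement). The function name and data layout are hypothetical, and the number of units drawn at each level is assumed to equal the number available, which the notes do not specify.

```python
import random

def double_bootstrap(visits, rng, patients_replace=False, visits_replace=True):
    """Two-stage resample: patients first, then visits within each patient.

    visits: dict mapping patient id -> list of visit records.
    rng:    a random.Random instance, passed in for reproducibility.
    Returns a list of (patient_id, resampled_visit_list).
    """
    ids = list(visits)
    if patients_replace:
        sampled_ids = [rng.choice(ids) for _ in ids]
    else:
        sampled_ids = rng.sample(ids, len(ids))  # a random ordering of all patients
    sample = []
    for pid in sampled_ids:
        records = visits[pid]
        if visits_replace:
            drawn = [rng.choice(records) for _ in records]
        else:
            drawn = rng.sample(records, len(records))
        sample.append((pid, drawn))
    return sample

cohort = {"a": [1, 2, 3], "b": [4, 5], "c": [6]}
resample = double_bootstrap(cohort, random.Random(0))
```

Under these defaults every patient appears exactly once, so any left-out observations come from visits that were not drawn within a patient.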

Prediction

For binary response, the mean \(\mu_{it}=E[y_{it}|b_i]\) is the conditional probability of success given the random-effects and covariate values:

\[\hat{\mu}_{it}=P(y_{it}=1|b_i, x_{it}, z_{it})=h(\hat{f}(x_{it})+\hat{b}_i^Tz_{it})\]

Two types of out-of-sample prediction:

Predictions for new visit times for patients in the training data

Predictions for new visit times for patients not in the training data

In this case, the random-effects for the new patient are not known, and we simply set \(\hat{b}_i=0\) in the equation above.
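Both prediction cases can be handled by one function, sketched below under the assumption that the predicted probability takes the form \(h(\hat{f}(x_{it})+\hat{b}_i^Tz_{it})\) with a logistic inverse link \(h\). The names `predict_probability`, `f_hat` and `b_hat` are hypothetical; `f_hat` stands for any fitted population-level model.

```python
import math

def predict_probability(f_hat, b_hat, patient_id, x, z):
    """Predicted success probability h(f(x) + b_i^T z), logistic inverse link.

    b_hat maps patient ids seen in training to random-effect vectors;
    a patient not in b_hat gets b_i = 0, i.e. a population-level prediction.
    """
    b_i = b_hat.get(patient_id, [0.0] * len(z))
    eta = f_hat(x) + sum(bk * zk for bk, zk in zip(b_i, z))
    return 1.0 / (1.0 + math.exp(-eta))

f_hat = lambda x: 0.0        # placeholder population-level model
b_hat = {"p1": [1.0]}        # random intercept estimated for patient p1
p_known = predict_probability(f_hat, b_hat, "p1", x=None, z=[1.0])
p_new = predict_probability(f_hat, b_hat, "p9", x=None, z=[1.0])
```

For the unseen patient the prediction falls back to the population-level value, while the known patient's prediction is shifted by the estimated random intercept.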

Next steps:

  • Apply the framework to create the training and test datasets for the vaccine cohort.

  • Implement the machine learning methods mentioned in the paper.

  • Add SVM and NN to the framework and compare performance.

  • Further develop the R package.

  • Create an R Shiny app with a user-friendly interface for immunologists to use.

Hybrid statistical and machine learning modeling of cognitive neuroscience data

(Cakar and Yavuz 2023)

References

Cakar, Serenay, and Fulya Gokalp Yavuz. 2023. “Hybrid Statistical and Machine Learning Modeling of Cognitive Neuroscience Data.” Journal of Applied Statistics 51 (6): 1076–97. https://doi.org/10.1080/02664763.2023.2176834.
Ngufor, Che, Holly Van Houten, Brian S. Caffo, Nilay D. Shah, and Rozalina G. McCoy. 2019. “Mixed Effect Machine Learning: A Framework for Predicting Longitudinal Change in Hemoglobin A1c.” Journal of Biomedical Informatics 89 (January): 56–67. https://doi.org/10.1016/j.jbi.2018.09.001.