Vaccine Status Prediction Using Linear Mixed Model and Machine Learning Methods

Aim

Use statistical and machine learning methods to predict post-vaccination status based on a longitudinal vaccine study design.

Methods

Linear mixed model

The formulation of the linear mixed effects model is as follows:

\[ Y_i = X_i\beta + Z_ib_i+\epsilon_i, \]

\[ b_i \sim N(0,\Psi), \epsilon_i \sim N(0, \sigma^2\Lambda_i). \]

The within-group errors \(\epsilon_i\) and the random effects \(b_i\) are assumed to be independent. Here \(Y_i\) is the response vector for subject \(i\), \(X_i\) and \(Z_i\) are the fixed- and random-effects design matrices, \(\beta\) contains the fixed-effect coefficients, and \(b_i\) contains the subject-specific random effects.
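
A minimal sketch of fitting such a model in R with the nlme package, assuming a hypothetical data frame vaccine with a response titer, covariates time and age, and a subject identifier id (random intercept per subject):

library(nlme)

# Random-intercept linear mixed model: titer ~ time + age is the fixed part X_i %*% beta,
# and ~ 1 | id specifies the subject-level random effect Z_i b_i
fit.lme <- lme(fixed = titer ~ time + age, random = ~ 1 | id, data = vaccine)
summary(fit.lme)

# Population-level (fixed-effects-only) predictions, e.g. for new subjects
predict(fit.lme, newdata = vaccine, level = 0)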

Tree-based Methods

Overview:

  • Includes CART, bagging, random forest, and boosting.

  • All of these methods build on recursive binary splitting of the data.

Decision Tree

Process:

  • Starts by splitting the training data at the root node.

  • For regression trees, each split is chosen to maximize the reduction in mean squared error (equivalently, the residual sum of squares).

Growing and Pruning:

  • Trees are grown to a large size and then pruned back with cost-complexity (weakest-link) pruning to avoid overfitting; a short example follows.
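
A minimal sketch of this grow-then-prune procedure with the rpart package, assuming the same hypothetical vaccine data frame with response titer:

library(rpart)

# Grow a deliberately large regression tree (small complexity parameter cp)
fit.tree <- rpart(titer ~ ., data = vaccine, method = "anova",
                  control = rpart.control(cp = 0.001))

# Weakest-link pruning: keep the subtree with the smallest cross-validated error
best.cp    <- fit.tree$cptable[which.min(fit.tree$cptable[, "xerror"]), "CP"]
fit.pruned <- prune(fit.tree, cp = best.cp)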

RE-EM trees with random intercept

  • Regression trees have been applied to longitudinal data since Segal (1992).

  • Mixed-effects tree model (RE-EM tree) introduced by Sela and Simonoff (2012).

The RE-EM Tree Model

Model Structure

\[ Y_i = f(X_i) + Z_i b_i + \epsilon_i \]

The fixed-effects component \(f(X_i)\), estimated by a regression tree rather than the linear form \(X_i\beta\), is combined with the random effects \(Z_i b_i\).

Estimating Mixed-effects Trees

Fixed and Random Effects

  • If random effects \(b_i\) are known, \(f\) can be estimated directly.

  • If \(b_i\) are unknown, a two-step iterative process is used.

Iterative Estimation Process

Initial Steps:

  • Start with \(b_i = 0\).

  • Use regression trees to estimate \(f\) from \(Y_i - Z_i b_i\).

Further Steps:

  • Update \(b_i\) by fitting a linear mixed model in which the tree's terminal nodes enter as the fixed effects.

  • Alternate between the two steps until the estimates of the random effects \(b_i\) converge; a short example using the REEMtree package follows.
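
A minimal sketch with the REEMtree package of Sela and Simonoff, again assuming the hypothetical vaccine data frame with subject identifier id:

library(REEMtree)

# RE-EM tree with a random intercept per subject; the package alternates between
# fitting the regression tree and re-estimating the random effects until convergence
fit.reem <- REEMtree(titer ~ time + age, data = vaccine, random = ~ 1 | id)
fit.reem   # prints the fitted tree and the estimated variance components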

Ensemble decision tree methods

Bagging

  • Concept: Bagging, or Bootstrap Aggregating, reduces variance in decision tree predictions by averaging multiple trees.

  • Steps:

    1. Draw bootstrap samples of the training dataset (random samples with replacement).

    2. Train a decision tree on each subsample.

    3. Average the predictions from all trees to obtain the final output on test data (see the sketch after this list).

  • Advantages:

    • Improves prediction accuracy compared to a single decision tree.

    • Reduces overfitting risk by averaging multiple models.
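
A minimal sketch of bagging, assuming hypothetical vaccine training data and test.vaccine test data; bagging is obtained here from the randomForest package by letting every split consider all predictors:

library(randomForest)

p <- ncol(vaccine) - 1                       # number of predictors (all columns except titer)
fit.bag <- randomForest(titer ~ ., data = vaccine,
                        mtry = p,            # mtry = p makes the forest equivalent to bagging
                        ntree = 500)
pred.bag <- predict(fit.bag, newdata = test.vaccine)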

Random forest

  • Overview: Extends Bagging by building a large collection of de-correlated trees.

  • Methodology:

    1. Considers only a random subset of the predictors at each split in each tree, which is what distinguishes it from bagging.

    2. Averages predictions for regression or uses majority voting for classification.

  • Parameters:

    • The number of trees, the number of observations and predictor variables, and the number of features randomly chosen per split (mtry).

  • Benefits:

    • Handles high-dimensional data well.

    • More robust against overfitting than single trees due to the feature randomness; a short example follows.
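
A minimal sketch with the randomForest package under the same hypothetical data; by default mtry is roughly one third of the predictors for regression:

library(randomForest)

fit.rf  <- randomForest(titer ~ ., data = vaccine, ntree = 500, importance = TRUE)
pred.rf <- predict(fit.rf, newdata = test.vaccine)
importance(fit.rf)   # variable importance measures from the fitted forest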

Boosting

  • Principle: Boosting builds trees sequentially, with each tree learning from the errors of its predecessors.

  • Procedure:

    1. Fit a regression tree to the data.

    2. Fit additional trees to the residuals of the previous trees.

    3. Combine predictions of all trees, adjusting with a learning rate (shrinkage parameter).

  • Characteristics:

    • Each tree corrects the previous tree’s errors, making the ensemble progressively better.

    • Often achieves higher prediction accuracy than bagging or random forests when the learning rate and number of trees are tuned well; a short example follows.
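
A minimal sketch with the gbm package; the hyperparameters below (number of trees, interaction depth, shrinkage) are illustrative assumptions rather than tuned values:

library(gbm)

fit.boost <- gbm(titer ~ ., data = vaccine,
                 distribution = "gaussian",   # squared-error loss for regression
                 n.trees = 1000,
                 interaction.depth = 2,       # depth of each small tree
                 shrinkage = 0.01)            # learning rate
pred.boost <- predict(fit.boost, newdata = test.vaccine, n.trees = 1000)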

Support vector regression

For classification, a support vector machine finds the separating hyperplane (decision boundary) that maximizes the margin between classes. Support vector regression (SVR) adapts this idea to a continuous outcome: it seeks a function that deviates from the observed responses by at most a tolerance \(\varepsilon\) while remaining as flat as possible.

Kernel in SVM

  • Enhancing SVM Capability:

    • The kernel trick allows SVM to fit the maximum-margin hyperplane in a transformed feature space.

    • Enables non-linear classification which broadens the applicability of SVM to more complex datasets.

  • Types of Kernels:

    • Linear, polynomial, radial basis function (RBF), and sigmoid kernels are commonly used, depending on the nature of the data; the sketch below uses the linear and polynomial kernels.
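
A minimal sketch of support vector regression with the e1071 package, using the two kernels compared in the results (hypothetical data as above):

library(e1071)

fit.svm.lin  <- svm(titer ~ ., data = vaccine, type = "eps-regression", kernel = "linear")
fit.svm.poly <- svm(titer ~ ., data = vaccine, type = "eps-regression", kernel = "polynomial")

pred.svm.lin  <- predict(fit.svm.lin,  newdata = test.vaccine)
pred.svm.poly <- predict(fit.svm.poly, newdata = test.vaccine)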

Neural network

Architecture:

  • Consists of layers of interconnected nodes (neurons), including input layers, hidden layers, and an output layer.

  • Each connection (synapse) transmits a signal from one neuron to another, weighted by parameters learned during training; a small example follows.
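
A minimal sketch of a single-hidden-layer network with the nnet package; the size, decay, and iteration settings are illustrative assumptions:

library(nnet)

fit.nn  <- nnet(titer ~ ., data = vaccine,
                size   = 5,        # hidden units
                linout = TRUE,     # linear output unit for a continuous response
                decay  = 0.01,     # weight-decay regularization
                maxit  = 500)
pred.nn <- predict(fit.nn, newdata = test.vaccine)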

Evaluation Metrics

Three Evaluation Methods by Sela and Simonoff (2012)

  • Future Observation:

    • Predicts the future 30% of observations using the previous 70% for the same subjects.

    • Applicable when historical data of subjects are available for forecasting.

  • New Object Prediction:

    • Involves predicting outcomes for entirely new subjects (K/2 new subjects) based on data from previously known subjects (K subjects).

    • Tests model’s ability to generalize to new, unseen subjects.

  • Future New Observation:

    • Combines elements of the first two methods: predicts future 30% of observations for new K/2 subjects using data from both the known K subjects and 70% of observations of the new subjects.

    • Assesses the model's comprehensive prediction capability across new subjects and future observations; a sketch of the first two data splits follows.
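
A minimal sketch of how the first two splits could be constructed, assuming each subject's rows in the hypothetical vaccine data frame are ordered by time and identified by id:

library(dplyr)

# Future observation: first 70% of each subject's visits for training, last 30% for testing
vaccine.split <- vaccine %>%
  group_by(id) %>%
  mutate(train = row_number() <= ceiling(0.7 * n())) %>%
  ungroup()

# New object prediction: hold out half of the subjects entirely
ids       <- unique(vaccine$id)
new.ids   <- sample(ids, floor(length(ids) / 2))
train.obj <- filter(vaccine, !id %in% new.ids)
test.obj  <- filter(vaccine,  id %in% new.ids)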

Real-time Prediction Methods

  • One-step Prediction:

    • Uses data from immediately preceding observations to predict the next one (e.g., first to predict second, first two to predict third, etc.).

    • Useful for sequential or time-series data where each point predicts the next.

  • Two-step Prediction:

    • Uses earlier observations to predict two steps ahead (e.g., the first observation to predict the third).

    • Tests the model's ability to forecast over longer intervals; a one-step sketch follows.
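
A minimal sketch of one-step-ahead prediction for a single subject, using a simple linear model refit at each step (hypothetical data frame subj ordered by time; the loop starts at the third visit so the model always has at least two points to fit):

# Fit on observations 1..(t-1) and predict observation t, for t = 3, 4, ...
one.step <- sapply(3:nrow(subj), function(t) {
  fit <- lm(titer ~ time, data = subj[seq_len(t - 1), ])
  predict(fit, newdata = subj[t, , drop = FALSE])
})
# Two-step prediction is analogous: fit on observations 1..(t-2) to predict observation t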

Comparison Results

  • New Object Prediction

# Collect the test-set RMSE of each method (the RMSE.* objects are computed earlier
# in the analysis) and display them in a single table
rmse_data <- data.frame(
  Method = c("LM", "LMM", "GEE",
             "REEM", "Decision Tree", "Bagging", "Random Forest",
             "Boosting", "SVM-Linear", "SVM-Polynomial",
             "Neural Network"),
  RMSE = c(RMSE.lm, RMSE.lme, RMSE.geeglm, RMSE.REEM, RMSE.tree, RMSE.bag,
           RMSE.rf, RMSE.boost, RMSE.svm, RMSE.svmk, RMSE.nn)
)

library(knitr)
kable(rmse_data, caption = "RMSE Values for Various Prediction Methods", format = "html")
RMSE Values for Various Prediction Methods

Method           RMSE
LM               3.630545
LMM              3.645843
GEE              3.658180
REEM             3.875498
Decision Tree    3.834614
Bagging          3.643460
Random Forest    3.941796
Boosting         4.210268
SVM-Linear       6.757044
SVM-Polynomial   6.757044
Neural Network   5.625740
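
The RMSE.* objects collected above are assumed to be test-set root mean squared errors computed with a helper along these lines:

# Root mean squared error of predictions against held-out outcomes
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# e.g., RMSE.rf <- rmse(test.vaccine$titer, pred.rf)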

Future steps

Prediction design:

  • Future new observation, future observation

  • One step and two step prediction

Methods:

  • Piecewise linear mixed model

  • 5-fold cross validation

Fine-tune the model:

  • Mixed effects: add a random slope (selected using AIC)

  • Fine-tune the machine learning models using the validation dataset (10%)

Variable importance