The within-groups errors \(\epsilon_i\) and the random effects \(b_i\) are assumed to be independent.
Tree-based Methods
Overview:
Includes CART, bagging, random forest, and boosting.
All these methods involve recursive splitting of data.
Decision Tree
Process:
Starts by splitting the training data at the root node.
Splits are made to maximize reduction in mean squared error.
Growing and Pruning:
Trees are grown to a large size and then pruned back using weakest link pruning to avoid overfitting.
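The following is a minimal scikit-learn sketch of this grow-then-prune workflow; the feature matrix X, response y, and the cross-validated choice of the pruning parameter ccp_alpha are illustrative assumptions rather than part of the original material.
```python
# Sketch: grow a large regression tree, then prune it back by
# cost-complexity (weakest-link) pruning. X, y are assumed to be a
# feature matrix and a continuous response.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def grow_and_prune(X, y):
    # Grow a deliberately large tree (no depth limit).
    full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
    # Candidate pruning strengths from the weakest-link sequence.
    alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas
    # Pick the alpha with the best cross-validated MSE.
    scores = [
        cross_val_score(
            DecisionTreeRegressor(ccp_alpha=a, random_state=0),
            X, y, cv=5, scoring="neg_mean_squared_error",
        ).mean()
        for a in alphas
    ]
    best_alpha = alphas[int(np.argmax(scores))]
    return DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```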
RE-EM trees with random intercept
Regression trees have been applied to longitudinal data since Segal (1992).
Mixed-effects tree model (RE-EM tree) introduced by Sela and Simonoff (2012).
The RE-EM Tree Model
Model Structure
\[
Y_i = f(X_i) + Z_i b_i + \epsilon_i
\]
Combines fixed effects \(f(X_i)\) and random effects \(Z_i b_i\).
Estimating Mixed-effects Trees
Fixed and Random Effects
If random effects \(b_i\) are known, \(f\) can be estimated directly.
If \(b_i\) are unknown, a two-step iterative process is used.
Iterative Estimation Process
Initial Steps:
Start with \(b_i = 0\).
Use regression trees to estimate \(f\) from \(Y_i - Z_i b_i\).
Further Steps:
Update \(b_i\) by fitting a mixed-effects model that takes the tree's fitted values as the fixed-effects component.
Repeat the two steps until the estimates of the random effects \(b_i\) converge.
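Below is a minimal sketch of the iteration, assuming a random intercept only (\(Z_i = 1\)) and, for simplicity, updating each \(b_i\) as a shrunken subject-mean residual rather than through the full linear mixed-effects fit used by Sela and Simonoff (2012); the function name and shrinkage constant are illustrative.
```python
# Simplified RE-EM-style iteration with a random intercept only.
# X, y are NumPy arrays; groups holds the subject id of each row.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def reem_tree(X, y, groups, n_iter=20, tol=1e-6, shrink=0.9):
    groups = np.asarray(groups)
    ids = np.unique(groups)
    b = {g: 0.0 for g in ids}                    # start with b_i = 0
    prev = np.zeros(len(ids))
    tree = DecisionTreeRegressor(min_samples_leaf=10)
    for _ in range(n_iter):
        # Step 1: fit the tree to the response with random effects removed.
        y_adj = y - np.array([b[g] for g in groups])
        tree.fit(X, y_adj)
        # Step 2: update each subject's intercept from its residuals
        # (shrunken mean residual stands in for the mixed-effects fit).
        resid = y - tree.predict(X)
        for g in ids:
            b[g] = shrink * resid[groups == g].mean()
        cur = np.array([b[g] for g in ids])
        if np.max(np.abs(cur - prev)) < tol:     # stop once the b_i converge
            break
        prev = cur
    return tree, b
```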
Ensemble decision tree methods
Bagging
Concept: Bagging, or Bootstrap Aggregating, reduces variance in decision tree predictions by averaging multiple trees.
Steps:
Generate random subsamples of the training dataset with replacement.
Train a decision tree on each subsample.
Use averaging of the predictions from all trees for final output on test data.
Advantages:
Improves prediction accuracy compared to a single decision tree.
Reduces overfitting risk by averaging multiple models.
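A hand-rolled sketch of these steps follows; the arrays X and y and the number of trees are placeholders, and in practice scikit-learn's BaggingRegressor would typically be used instead.
```python
# Sketch of bagging by hand: bootstrap-resample the training data, fit a
# tree on each resample, and average the trees' predictions on test data.
# X, y are assumed to be NumPy arrays.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_trees=200, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X_test):
    # Average the predictions of all trees for the final output.
    return np.mean([t.predict(X_test) for t in trees], axis=0)
```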
Random forest
Overview: Extends Bagging by building a large collection of de-correlated trees.
Methodology:
Considers only a random subset of features at each split, rather than all features as in bagging.
Averages predictions for regression or uses majority voting for classification.
Parameters:
Number of trees, number of observations sampled for each tree, number of predictor variables, and number of features randomly chosen at each split.
Benefits:
Handles high-dimensional data well.
More robust against overfitting than single trees due to feature randomness.
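A brief sketch of the scikit-learn configuration, where max_features controls the random subset of predictors considered at each split; all hyperparameter values shown (and X_train, y_train) are illustrative assumptions.
```python
# Sketch: a random forest differs from bagging mainly through max_features,
# which restricts each split to a random subset of the predictors.
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=500,      # number of trees
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=5,
    random_state=0,
)
# forest.fit(X_train, y_train); forest.predict(X_test)
```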
Boosting
Principle: Boosting builds trees sequentially, with each tree learning from the errors of its predecessors.
Procedure:
Fit a regression tree to the data.
Fit additional trees to the residuals of the previous trees.
Combine predictions of all trees, adjusting with a learning rate (shrinkage parameter).
Characteristics:
Each tree corrects the previous tree’s errors, making the ensemble progressively better.
Often achieves higher prediction accuracy than bagging or random forests.
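The residual-fitting procedure can be sketched directly; the number of trees, learning rate, and tree depth below are illustrative defaults, and a library implementation such as GradientBoostingRegressor would usually be preferred.
```python
# Hand-rolled boosting for squared-error loss: each shallow tree is fit to
# the residuals of the current ensemble and added with a learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    pred = np.full(len(y), y.mean())      # start from the mean response
    trees = []
    for _ in range(n_trees):
        resid = y - pred                  # residuals of the current ensemble
        t = DecisionTreeRegressor(max_depth=max_depth).fit(X, resid)
        pred += learning_rate * t.predict(X)
        trees.append(t)
    return y.mean(), trees

def boost_predict(init, trees, X, learning_rate=0.1):
    pred = np.full(X.shape[0], init)
    for t in trees:
        pred += learning_rate * t.predict(X)
    return pred
```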
Support vector regression
Support vector machines find the separating hyperplane (decision boundary) that maximizes the margin between classes; support vector regression adapts this idea, fitting a function so that most observations fall within an \(\epsilon\)-insensitive margin (tube) around it.
Kernel in SVM
Enhancing SVM Capability:
The kernel trick allows SVM to fit the maximum-margin hyperplane in a transformed feature space.
Enables non-linear decision boundaries, which broadens the applicability of SVMs to more complex datasets.
Types of Kernels:
Linear, Polynomial, Radial Basis Function (RBF), and Sigmoid are common kernels used depending on the nature of the data.
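A small sketch showing how these kernels are selected in scikit-learn's SVR; the hyperparameter values and the training data X_train, y_train are assumptions.
```python
# Sketch: support vector regression with the common kernel choices;
# which kernel is appropriate depends on the nature of the data.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

models = {
    "linear":  make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0, epsilon=0.1)),
    "poly":    make_pipeline(StandardScaler(), SVR(kernel="poly", degree=3, C=1.0)),
    "rbf":     make_pipeline(StandardScaler(), SVR(kernel="rbf", gamma="scale", C=1.0)),
    "sigmoid": make_pipeline(StandardScaler(), SVR(kernel="sigmoid", C=1.0)),
}
# for name, m in models.items(): m.fit(X_train, y_train)
```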
Neural network
Architecture:
Consists of layers of interconnected nodes (neurons), including input layers, hidden layers, and an output layer.
Each connection (synapse) can transmit a signal from one neuron to another.
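A minimal sketch of such a feed-forward network using scikit-learn's MLPRegressor; the layer sizes and other settings are illustrative only.
```python
# Sketch: input layer -> two hidden layers -> one output neuron.
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

net = make_pipeline(
    StandardScaler(),                 # scale inputs before the network
    MLPRegressor(
        hidden_layer_sizes=(64, 32),  # two hidden layers of 64 and 32 neurons
        activation="relu",
        max_iter=2000,
        random_state=0,
    ),
)
# net.fit(X_train, y_train); net.predict(X_test)
```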
Evaluation Metrics
Three Evaluation Methods by Sela and Simonoff (2012)
Future Observation:
Predicts the future 30% of observations using the previous 70% for the same subjects.
Applicable when historical data of subjects are available for forecasting.
New Object Prediction:
Involves predicting outcomes for entirely new subjects (K/2 new subjects) based on data from previously known subjects (K subjects).
Tests model’s ability to generalize to new, unseen subjects.
Future New Observation:
Combines elements of the first two methods: predicts future 30% of observations for new K/2 subjects using data from both the known K subjects and 70% of observations of the new subjects.
Assesses model’s comprehensive prediction capability across new subjects and future observations.
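A sketch of the first scheme, the per-subject 70/30 split; the column names subject and time and the use of a pandas DataFrame are assumptions for illustration.
```python
# Sketch of the "future observation" split: for each subject, the earlier
# 70% of its observations go to training and the remaining 30% to testing.
import pandas as pd

def future_observation_split(df, subject="subject", time="time", frac=0.7):
    train_parts, test_parts = [], []
    for _, g in df.sort_values(time).groupby(subject):
        cut = int(round(frac * len(g)))
        train_parts.append(g.iloc[:cut])   # earlier observations -> training
        test_parts.append(g.iloc[cut:])    # later observations  -> testing
    return pd.concat(train_parts), pd.concat(test_parts)
```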
Real-time Prediction Methods
One-step Prediction:
Uses data from immediately preceding observations to predict the next one (e.g., first to predict second, first two to predict third, etc.).
Useful for sequential or time-series data where each point predicts the next.
Two-step Prediction:
Uses the observations up to a given point to predict two steps ahead (e.g., the first observation to predict the third).
Tests the model’s ability to forecast over longer intervals.
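A sketch of both schemes as an expanding-window loop over one subject's ordered observations; fit_fn is an assumed helper that returns a fitted model with a predict method, and step = 1 or 2 selects one-step or two-step prediction.
```python
# Sketch of one-step (step=1) and two-step (step=2) real-time prediction:
# refit on the available history and predict `step` observations ahead.
# X, y are NumPy arrays ordered in time for a single subject.
import numpy as np

def rolling_predictions(X, y, fit_fn, step=1):
    preds, targets = [], []
    for t in range(1, len(y) - step + 1):
        model = fit_fn(X[:t], y[:t])                      # history up to time t
        preds.append(model.predict(X[t + step - 1 : t + step])[0])
        targets.append(y[t + step - 1])                   # observation step ahead
    return np.array(preds), np.array(targets)
```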