BART Functions: Comprehensive Technical Reference

This document provides a technical reference for each function in this R implementation of Bayesian Additive Regression Trees (BART), detailing inputs, outputs, mathematical expressions, and implementation specifics.

Core Data Structures

Node(depth = 0, data_indices = NULL)

Purpose: Creates a node object that represents a single node within a decision tree.

Inputs:

- depth (integer): The depth of this node in the tree, where 0 represents the root.
- data_indices (vector of integers): Indices of data points that flow to this node.

Outputs: A node object with the following fields:

- depth (integer): The depth level in the tree.
- data_indices (vector): Indices of data points at this node.
- is_leaf (logical): TRUE if this is a leaf node, FALSE if internal.
- mu (numeric): The prediction value (for leaf nodes).
- splitvar (integer): The variable index used for splitting (for internal nodes).
- splitpoint (numeric): The threshold value for the split (for internal nodes).
- left (node): The left child node (for internal nodes).
- right (node): The right child node (for internal nodes).

Mathematical Expression: A node \(\eta\) represents a subset of the feature space defined by:

- For the root node: the entire feature space \(\mathcal{X}\)
- For a left child: \(\mathcal{X}_{\eta_L} = \{x \in \mathcal{X}_{\eta} : x_{v_\eta} \leq c_\eta\}\)
- For a right child: \(\mathcal{X}_{\eta_R} = \{x \in \mathcal{X}_{\eta} : x_{v_\eta} > c_\eta\}\)

where \(v_\eta\) is the split variable and \(c_\eta\) is the split point.

Implementation Details:

- Creates a list structure with S3 class "bart_node"
- All nodes are initialized as leaf nodes (is_leaf = TRUE)
- For leaf nodes, mu holds the parameter value (prediction)
- For internal nodes, splitvar, splitpoint, left, and right define the decision rule
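A minimal sketch of a constructor consistent with this description is shown below (field names follow the documentation; the actual implementation may differ in details):

```r
# Sketch of the documented node constructor; field names follow the reference above.
Node <- function(depth = 0, data_indices = NULL) {
  node <- list(
    depth        = depth,        # depth in the tree (0 = root)
    data_indices = data_indices, # observations routed to this node
    is_leaf      = TRUE,         # every node starts as a leaf
    mu           = NULL,         # leaf parameter (prediction)
    splitvar     = NULL,         # split variable index (internal nodes)
    splitpoint   = NULL,         # split threshold (internal nodes)
    left         = NULL,         # left child (internal nodes)
    right        = NULL          # right child (internal nodes)
  )
  class(node) <- "bart_node"
  node
}
```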


Tree()

Purpose: Creates an empty tree object that will hold a hierarchical structure of nodes.

Inputs: None

Outputs: A tree object with the following field:

- root (node or NULL): Reference to the root node of the tree.

Mathematical Expression: A tree \(T\) represents a hierarchical partitioning of the feature space, where each leaf node \(\eta \in \text{leaves}(T)\) is associated with a parameter value \(\mu_\eta\).

Implementation Details:

- Creates a list structure with S3 class "bart_tree"
- The root is initially set to NULL
- Serves as a container for the tree structure


initialize_tree(data_indices)

Purpose: Creates a new tree with a single root node containing all data indices.

Inputs:

- data_indices (vector of integers): Indices of data points to include in the root node.

Outputs: A tree object with a single root node containing all specified data indices.

Mathematical Expression: Creates a tree \(T\) with a single node \(\eta_0\) such that:

- \(\mathcal{D}_{\eta_0} = \text{data\_indices}\)
- \(\mu_{\eta_0} \sim \mathcal{N}(0, 0.1^2)\) (small random initial value)

Implementation Details:

- Creates a new tree using Tree()
- Creates a root node using Node() with the provided data indices
- Initializes the root's mu value with a small random number to break symmetry
- Returns the initialized tree
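The container and its initializer can be sketched as follows (a minimal version consistent with the description above; rnorm(1, 0, 0.1) matches the \(\mathcal{N}(0, 0.1^2)\) initialization):

```r
# Sketch of the documented tree container and its initializer.
Tree <- function() {
  tree <- list(root = NULL)   # empty container; root is filled in later
  class(tree) <- "bart_tree"
  tree
}

initialize_tree <- function(data_indices) {
  tree <- Tree()
  tree$root <- Node(depth = 0, data_indices = data_indices)
  tree$root$mu <- rnorm(1, mean = 0, sd = 0.1)  # small random value to break symmetry
  tree
}
```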


Tree Operations

find_valid_splits(node, x)

Purpose: Identifies all possible split variables and split points for a given node.

Inputs:

- node (node object): The node to find potential splits for.
- x (matrix): The feature matrix, where rows are observations and columns are variables.

Outputs: A list with two components:

- variables (vector): Indices of variables that can be used for splitting.
- points (list of vectors): For each variable, a vector of possible split points.

Mathematical Expression: For a node \(\eta\) with data indices \(\mathcal{D}_\eta\), and for each variable \(j\):

1. Extract unique values: \(\mathcal{U}_j = \{x_{ij} : i \in \mathcal{D}_\eta\}\)
2. Sort values: \(\mathcal{S}_j = \text{sort}(\mathcal{U}_j) = \{s_1, s_2, \ldots, s_k\}\) where \(s_1 < s_2 < \ldots < s_k\)
3. Compute midpoints: \(\mathcal{M}_j = \{(s_i + s_{i+1})/2 : 1 \leq i < k\}\)
4. Valid split variables: \(\mathcal{V} = \{j : |\mathcal{M}_j| \geq 1\}\)

Implementation Details:

1. Extracts the subset of data corresponding to the node's data_indices
2. For each variable:
   a. Extracts unique values for that variable
   b. Sorts these unique values
   c. Computes midpoints between adjacent unique values as potential split points
3. Returns only variables with at least 2 unique values (at least 1 potential split point)
4. Returns both the valid variables and their corresponding potential split points
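A compact sketch of the midpoint computation, following steps 1-4 above (assuming the node and x layouts documented earlier):

```r
# Sketch: enumerate candidate split points (midpoints between sorted unique values).
find_valid_splits <- function(node, x) {
  x_node <- x[node$data_indices, , drop = FALSE]
  variables <- integer(0)
  points <- list()
  for (j in seq_len(ncol(x_node))) {
    s <- sort(unique(x_node[, j]))      # unique values, sorted ascending
    if (length(s) >= 2) {               # at least one midpoint exists
      variables <- c(variables, j)
      points[[length(points) + 1]] <- (head(s, -1) + tail(s, -1)) / 2
    }
  }
  list(variables = variables, points = points)
}
```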


apply_decision_rule(data_indices, x, split_var, split_point)

Purpose: Applies a binary decision rule to partition data into left and right subsets.

Inputs:

- data_indices (vector of integers): Indices of data points to partition.
- x (matrix): The feature matrix.
- split_var (integer): Index of the variable to split on.
- split_point (numeric): Threshold value for the split.

Outputs: A list with two components:

- left (vector): Indices of data points satisfying x[, split_var] <= split_point.
- right (vector): Indices of data points satisfying x[, split_var] > split_point.

Mathematical Expression: For data indices \(\mathcal{D}\), split variable \(v\), and split point \(c\):

- Left partition: \(\mathcal{L} = \{i \in \mathcal{D} : x_{iv} \leq c\}\)
- Right partition: \(\mathcal{R} = \{i \in \mathcal{D} : x_{iv} > c\}\)

Implementation Details:

1. Extracts the values of the split variable for the specified data indices
2. Applies the condition <= split_point to identify left indices
3. Applies the condition > split_point to identify right indices
4. Returns the partitioned indices
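The partition itself is short in R; a minimal sketch consistent with the description:

```r
# Sketch: logical indexing preserves the original observation indices.
apply_decision_rule <- function(data_indices, x, split_var, split_point) {
  vals <- x[data_indices, split_var]
  list(
    left  = data_indices[vals <= split_point],  # x[, split_var] <= split_point
    right = data_indices[vals >  split_point]   # x[, split_var] >  split_point
  )
}
```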


grow_node(tree, node_indices, x, partial_residuals)

Purpose: Splits a leaf node into two children by selecting a split variable and split point.

Inputs:

- tree (tree object): The current tree.
- node_indices (vector of integers): Indices of data points in the target node.
- x (matrix): The feature matrix.
- partial_residuals (vector): Current partial residuals.

Outputs: An updated tree object with the specified node split into two children.

Mathematical Expression: For a leaf node \(\eta\) with data indices \(\mathcal{D}_\eta\):

1. Select a split variable \(v \sim \text{Uniform}(\mathcal{V})\) from the valid variables.
2. Select a split point \(c \sim \text{Uniform}(\mathcal{M}_v)\) from the valid points.
3. Create a left child \(\eta_L\) with data \(\mathcal{D}_{\eta_L} = \{i \in \mathcal{D}_\eta : x_{iv} \leq c\}\).
4. Create a right child \(\eta_R\) with data \(\mathcal{D}_{\eta_R} = \{i \in \mathcal{D}_\eta : x_{iv} > c\}\).
5. Initialize \(\mu_{\eta_L}, \mu_{\eta_R} \sim \mathcal{N}(\mu_\eta, 0.1^2)\).

Implementation Details:

1. Finds the target leaf node in the tree by matching node_indices
2. Calls find_valid_splits() to get potential split variables and points
3. Randomly selects a split variable and split point
4. Applies the decision rule to partition the data indices
5. Verifies that both partitions are non-empty (valid split)
6. Creates two new child nodes with Node()
7. Initializes the child nodes with random values near the parent's value
8. Updates the parent node to be an internal node with the new split rule and children
9. Returns the modified tree


prune_node(tree, node_indices)

Purpose: Removes a split by turning an internal node into a leaf node, effectively pruning its children.

Inputs:

- tree (tree object): The current tree.
- node_indices (vector of integers): Indices of data points in the target node.

Outputs: An updated tree object with the specified internal node converted to a leaf node.

Mathematical Expression: For an internal node \(\eta\) with leaf children \(\eta_L\) and \(\eta_R\):

1. Convert \(\eta\) to a leaf node.
2. Set data indices \(\mathcal{D}_\eta = \mathcal{D}_{\eta_L} \cup \mathcal{D}_{\eta_R}\).
3. Set the parameter value \(\mu_\eta = (\mu_{\eta_L} + \mu_{\eta_R}) / 2\).
4. Remove the children \(\eta_L\) and \(\eta_R\).

Implementation Details:

1. Recursively searches the tree to find the parent of the node with matching node_indices
2. Checks that both children are leaf nodes (i.e., the node is prunable)
3. If prunable, combines the data indices from both children
4. Sets the new parameter value as the average of the children's values
5. Converts the internal node to a leaf node by setting is_leaf = TRUE
6. Removes the split rule and children
7. Returns the modified tree


change_node(tree, node_indices, x)

Purpose: Changes the split rule (variable and point) of an internal node.

Inputs:

- tree (tree object): The current tree.
- node_indices (vector of integers): Indices of data points in the target node.
- x (matrix): The feature matrix.

Outputs: An updated tree object with the specified internal node using a new split rule.

Mathematical Expression: For an internal node \(\eta\) with split rule \((v_\eta, c_\eta)\):

1. Select a new split variable \(v'_\eta\).
2. Select a new split point \(c'_\eta\).
3. Update the split rule: \((v_\eta, c_\eta) \rightarrow (v'_\eta, c'_\eta)\).
4. Redistribute data to the children based on the new rule:
   - \(\mathcal{D}_{\eta_L} = \{i \in \mathcal{D}_\eta : x_{iv'_\eta} \leq c'_\eta\}\)
   - \(\mathcal{D}_{\eta_R} = \{i \in \mathcal{D}_\eta : x_{iv'_\eta} > c'_\eta\}\)
5. Preserve the parameter values \(\mu_{\eta_L}\) and \(\mu_{\eta_R}\).

Implementation Details:

1. Recursively searches the tree to find the internal node with matching node_indices
2. Calls find_valid_splits() to get potential new split variables and points
3. Randomly selects a new split variable and split point
4. Verifies that the new split differs from the current one
5. Applies the new decision rule to redistribute the data indices
6. Verifies both partitions are non-empty (valid split)
7. Preserves the original child nodes' mu values
8. Updates the split rule and the children's data indices
9. Returns the modified tree


MCMC Sampling

update_leaf_parameters(tree, partial_residuals, sigma, sigma_mu)

Purpose: Updates all leaf node parameters by sampling from their posterior distributions.

Inputs:

- tree (tree object): The current tree.
- partial_residuals (vector): Residuals for this tree (response minus the predictions from all other trees).
- sigma (numeric): Current residual standard deviation.
- sigma_mu (numeric): Prior standard deviation for leaf parameters.

Outputs: An updated tree object with new leaf parameter values.

Mathematical Expression: For each leaf node \(\eta\) with data indices \(\mathcal{D}_\eta\):

1. Extract the partial residuals: \(\mathbf{r}_\eta = \{r_i : i \in \mathcal{D}_\eta\}\).
2. Compute the posterior precision: \(\tau_\eta = \frac{|\mathcal{D}_\eta|}{\sigma^2} + \frac{1}{\sigma_\mu^2}\).
3. Compute the posterior variance: \(\sigma^2_\eta = \frac{1}{\tau_\eta}\).
4. Compute the posterior mean: \(\mu_\eta = \sigma^2_\eta \cdot \frac{\sum_{i \in \mathcal{D}_\eta} r_i}{\sigma^2}\).
5. Sample the new parameter: \(\mu_\eta^{\text{new}} \sim \mathcal{N}(\mu_\eta, \sigma^2_\eta)\).

Implementation Details:

1. Gets all leaf nodes using get_leaf_nodes()
2. For each leaf node:
   a. Extracts the partial residuals for the data points in that node
   b. Calculates the posterior precision and variance
   c. Calculates the posterior mean
   d. Samples a new mu value from the posterior distribution
   e. Ensures the new mu is valid (not NA, NULL, or infinite)
3. Updates the leaf nodes in the tree using update_node_mu()
4. Returns the updated tree
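The conjugate draw for a single leaf (steps 2-5 above) can be sketched as follows; draw_leaf_mu is a hypothetical helper name, not part of the documented API:

```r
# Sketch of the conjugate normal draw for one leaf; draw_leaf_mu is hypothetical.
draw_leaf_mu <- function(r_leaf, sigma, sigma_mu) {
  precision <- length(r_leaf) / sigma^2 + 1 / sigma_mu^2  # posterior precision
  post_var  <- 1 / precision                              # posterior variance
  post_mean <- post_var * sum(r_leaf) / sigma^2           # posterior mean
  rnorm(1, mean = post_mean, sd = sqrt(post_var))         # sample new mu
}
```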


update_node_mu(node, target_indices, new_mu)

Purpose: Updates the parameter value of a specific leaf node identified by its data indices.

Inputs:

- node (node object): Current node (starting point for the search).
- target_indices (vector of integers): Indices of data points in the target node.
- new_mu (numeric): New parameter value for the node.

Outputs: The updated node with the new parameter value if it is the target node, or with updated children if not.

Mathematical Expression: For a tree \(T\) and target leaf node \(\eta\) with \(\mathcal{D}_\eta = \text{target\_indices}\), set \(\mu_\eta = \text{new\_mu}\).

Implementation Details:

1. Checks whether the current node is NULL (base case for the recursion)
2. If the current node is a leaf and its data indices match the target, updates its mu value
3. If it is not the target node and not a leaf, recursively searches both children
4. Ensures the new mu value is numeric
5. Returns the updated node structure


tree_likelihood(node, partial_residuals, sigma)

Purpose: Calculates the log-likelihood of the data given a tree structure and parameters.

Inputs:

- node (node object): Current node (starting point for the calculation).
- partial_residuals (vector): Residuals for this tree.
- sigma (numeric): Current residual standard deviation.

Outputs: A numeric value representing the log-likelihood of the data given the tree.

Mathematical Expression: For a tree \(T\) with leaf nodes \(\text{leaves}(T)\) and partial residuals \(\mathbf{r}\): \[\log p(\mathbf{r} | T, M, \sigma) = \sum_{\eta \in \text{leaves}(T)} \sum_{i \in \mathcal{D}_\eta} \log \mathcal{N}(r_i | \mu_\eta, \sigma^2)\]

where \(\mathcal{N}(r_i | \mu_\eta, \sigma^2)\) is the normal probability density function.

Implementation Details:

1. Handles NULL nodes by returning 0
2. For a leaf node:
   a. Gets the data indices and corresponding partial residuals
   b. Checks that the node's mu is valid
   c. Calculates the sum of log-likelihoods using dnorm(..., log = TRUE)
3. For an internal node:
   a. Recursively calculates the log-likelihoods of the left and right children
   b. Returns their sum
4. Includes error handling for edge cases (empty indices, non-numeric values)
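A minimal recursive sketch consistent with the description (error handling omitted):

```r
# Sketch: sum leaf-wise Gaussian log-densities over the whole tree.
tree_likelihood <- function(node, partial_residuals, sigma) {
  if (is.null(node)) return(0)
  if (node$is_leaf) {
    r <- partial_residuals[node$data_indices]
    return(sum(dnorm(r, mean = node$mu, sd = sigma, log = TRUE)))
  }
  tree_likelihood(node$left,  partial_residuals, sigma) +
    tree_likelihood(node$right, partial_residuals, sigma)
}
```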


mcmc_tree_step(tree, x, partial_residuals, sigma, sigma_mu)

Purpose: Performs one MCMC iteration to update a single tree via Metropolis-Hastings.

Inputs:

- tree (tree object): The current tree.
- x (matrix): The feature matrix.
- partial_residuals (vector): Residuals for this tree.
- sigma (numeric): Current residual standard deviation.
- sigma_mu (numeric): Prior standard deviation for leaf parameters.

Outputs: An updated tree object (either the original tree or an accepted proposal).

Mathematical Expression:

1. Propose a tree modification \(T \rightarrow T'\) via grow, prune, change, or stay.
2. Calculate the acceptance probability: \[\alpha = \min\left(1, \frac{p(\mathbf{r} | T', M', \sigma) \cdot p(T')}{p(\mathbf{r} | T, M, \sigma) \cdot p(T)}\right)\]
3. Accept the proposal with probability \(\alpha\).

The tree prior \(p(T)\) implicitly favors smaller trees via the splitting probability \[p(\text{split} \mid \text{depth} = d) = \alpha \cdot (1 + d)^{-\beta},\] where \(\alpha\) and \(\beta\) here denote the prior hyperparameters (distinct from the acceptance probability above); common defaults are \(\alpha = 0.95\) and \(\beta = 2\).

Implementation Details:

1. Stores the current tree as old_tree
2. Updates all leaf parameters using update_leaf_parameters()
3. Randomly selects a move type with the following probabilities:
   - "grow": 0.5 (split a leaf into two children)
   - "prune": 0.3 (remove a split, combining its children)
   - "change": 0.1 (change a split rule)
   - "stay": 0.1 (do nothing)
4. For "grow":
   a. Gets all leaf nodes
   b. Selects a leaf to split with probability proportional to depth
   c. Calls grow_node() to perform the split
5. For "prune":
   a. Gets all internal nodes whose children are both leaves
   b. Randomly selects one to prune
   c. Calls prune_node() to perform the pruning
6. For "change":
   a. Gets all internal nodes
   b. Randomly selects one to modify
   c. Calls change_node() to change its split rule
7. Calculates the log-likelihood of both trees using tree_likelihood()
8. Computes the log acceptance ratio
9. Accepts the proposal with probability exp(log_acceptance_ratio)
10. Returns either the new tree (if accepted) or the old tree (if rejected)
11. Includes extensive error handling for numerical stability
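The move selection and accept/reject step can be condensed as below; propose_move() is a hypothetical stand-in for the grow/prune/change/stay branches, and the tree-prior ratio is omitted for brevity:

```r
# Sketch of one Metropolis-Hastings step; propose_move() is hypothetical.
move <- sample(c("grow", "prune", "change", "stay"),
               size = 1, prob = c(0.5, 0.3, 0.1, 0.1))
new_tree <- propose_move(tree, move, x)  # dispatches to grow_node() etc.

log_ratio <- tree_likelihood(new_tree$root, partial_residuals, sigma) -
             tree_likelihood(tree$root,     partial_residuals, sigma)
if (is.finite(log_ratio) && log(runif(1)) < log_ratio) tree <- new_tree
```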


bart(x, y, num_trees = 50, num_iter = 1000, num_burn = 200, sigma_mu = 0.5, verbose = TRUE)

Purpose: Fits a BART model to data using MCMC sampling.

Inputs:

- x (matrix): The feature matrix, with rows as observations and columns as variables.
- y (vector): The response variable.
- num_trees (integer): Number of trees in the ensemble (default: 50).
- num_iter (integer): Number of MCMC iterations (default: 1000).
- num_burn (integer): Number of burn-in iterations to discard (default: 200).
- sigma_mu (numeric): Prior standard deviation for leaf parameters (default: 0.5).
- verbose (logical): Whether to print progress information (default: TRUE).

Outputs: A BART model object with the following components:

- posterior_trees (list of lists): For each posterior sample, a list of trees.
- posterior_sigma (vector): Posterior samples of the residual standard deviation.
- y_mean (numeric): Mean of the response variable (for normalization).
- y_sd (numeric): Standard deviation of the response variable (for normalization).
- num_trees (integer): Number of trees in the ensemble.

Mathematical Expression: The BART model is: \[y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)\]

where \(f(x)\) is approximated by: \[f(x) \approx \sum_{j=1}^{m} g(x; T_j, M_j)\]

The MCMC algorithm produces samples from the posterior: \[p(T_1, \ldots, T_m, M_1, \ldots, M_m, \sigma | y, X)\]

Implementation Details:

1. Centers and scales the response variable for numerical stability
2. Initializes num_trees trees with random values using initialize_tree()
3. Initializes the residual standard deviation sigma
4. For num_iter iterations:
   a. For each tree:
      i. Calculates the partial residuals (response minus the predictions from all other trees)
      ii. Updates the tree using mcmc_tree_step()
      iii. Updates the ensemble prediction
   b. Updates sigma by sampling from its posterior distribution
   c. Stores the posterior sample if past burn-in
5. Returns a model object with the posterior samples and normalization constants
6. Includes extensive error handling and progress reporting
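A typical call, assuming the functions in this reference have been sourced:

```r
# Fit a small model on synthetic data (see generate_complex_data() below).
set.seed(42)
dat <- generate_complex_data(n = 200, p = 10)
fit <- bart(dat$x, dat$y, num_trees = 20, num_iter = 500, num_burn = 100)
length(fit$posterior_sigma)  # number of retained posterior samples (500 - 100)
```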


Prediction Functions

predict_node(node, x, indices)

Purpose: Generates predictions for a subset of data using a single node.

Inputs:

- node (node object): The current node.
- x (matrix): The feature matrix.
- indices (vector of integers): Indices of data points to predict.

Outputs: A vector of predictions for the specified data points.

Mathematical Expression: For a node \(\eta\) and data points \(\{x_i : i \in \text{indices}\}\):

1. If \(\eta\) is a leaf node, return \(\mu_\eta\) for all data points.
2. If \(\eta\) is an internal node with split rule \((v_\eta, c_\eta)\), then for each data point \(i \in \text{indices}\):
   - If \(x_{i, v_\eta} \leq c_\eta\), predict using the left child.
   - If \(x_{i, v_\eta} > c_\eta\), predict using the right child.

Implementation Details:

1. Handles NULL nodes by returning zeros
2. For a leaf node, returns the node's mu value repeated for all indices
3. For an internal node:
   a. Checks that the split rule is valid
   b. Partitions the indices based on the split rule
   c. Recursively calls predict_node() for each partition
   d. Combines the predictions from both children
   e. Ensures the result has the same length as the input indices
4. Includes extensive error handling for invalid mu values and NULL nodes


predict_tree(tree, x)

Purpose: Generates predictions for all data points using a single tree.

Inputs:

- tree (tree object): The tree to use for prediction.
- x (matrix): The feature matrix.

Outputs: A vector of predictions for all data points.

Mathematical Expression: For a tree \(T\) and data points \(\{x_i\}_{i=1}^n\): \[\hat{y}_i = g(x_i; T, M)\]

where \(g(x_i; T, M)\) is the prediction from tree \(T\) with parameters \(M\).

Implementation Details:

1. Handles NULL trees or empty roots by returning zeros
2. For a stump (a tree with only a root node), returns the root's mu value for all data points
3. For a more complex tree, calls predict_node() with all data indices
4. Ensures all predictions are valid numbers (replaces NAs with zeros)
5. Returns a vector of predictions with one value per data point


predict_bart(trees, x)

Purpose: Generates predictions for all data points using an ensemble of trees.

Inputs:

- trees (list of tree objects): The ensemble of trees.
- x (matrix): The feature matrix.

Outputs: A vector of predictions for all data points (the sum of all tree predictions).

Mathematical Expression: For an ensemble of trees \(\{T_j\}_{j=1}^m\) and data points \(\{x_i\}_{i=1}^n\): \[\hat{y}_i = \sum_{j=1}^{m} g(x_i; T_j, M_j)\]

Implementation Details:

1. Initializes a vector of zeros for the predictions
2. For each tree in the ensemble:
   a. Calls predict_tree() to get its predictions
   b. Adds these predictions to the running sum
3. Returns the sum of all tree predictions
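A sketch of the sum-of-trees accumulation:

```r
# Sketch: the ensemble prediction is the sum of the individual tree predictions.
predict_bart <- function(trees, x) {
  preds <- rep(0, nrow(x))
  for (tree in trees) {
    preds <- preds + predict_tree(tree, x)  # add each tree's contribution
  }
  preds
}
```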


predict.bart(model, x_new, return_distribution = FALSE, num_samples = 30)

Purpose: Generates predictions using a trained BART model, with an option to return the full posterior predictive distribution.

Inputs:

- model (BART model object): The trained model.
- x_new (matrix): New feature matrix for prediction.
- return_distribution (logical): Whether to return the full distribution (default: FALSE).
- num_samples (integer): Number of posterior samples to use (default: 30).

Outputs:

- If return_distribution = FALSE: a vector of mean predictions.
- If return_distribution = TRUE: a matrix of predictions, one row per posterior sample.

Mathematical Expression: For a BART model with posterior samples \(\{(T_1^{(s)}, \ldots, T_m^{(s)}, M_1^{(s)}, \ldots, M_m^{(s)})\}_{s=1}^S\):

1. For each sample \(s\) and data point \(i\): \[\hat{y}_i^{(s)} = \sum_{j=1}^{m} g(x_i; T_j^{(s)}, M_j^{(s)})\]
2. Transform back to the original scale: \[\hat{y}_i^{(s)} \leftarrow \hat{y}_i^{(s)} \cdot \sigma_y + \mu_y\]
3. Mean prediction: \[\hat{y}_i = \frac{1}{S}\sum_{s=1}^{S}\hat{y}_i^{(s)}\]

Implementation Details:

1. Selects a subset of posterior samples for efficiency
2. For each selected posterior sample:
   a. Extracts the trees for that sample
   b. Calls predict_bart() to get predictions from those trees
   c. Transforms the predictions back to the original scale
   d. Stores the predictions in a matrix
3. If return_distribution = TRUE, returns the full matrix
4. If return_distribution = FALSE, returns the column means (the average across samples)
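Example usage, continuing the fit from the bart() example above:

```r
# Mean predictions, then the full posterior predictive matrix.
y_hat <- predict.bart(fit, dat$x)                              # vector of means
draws <- predict.bart(fit, dat$x, return_distribution = TRUE)  # one row per sample
```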


predict_sample(model, x_new, sample_idx = 1)

Purpose: Generates predictions using a specific posterior sample from a BART model.

Inputs:

- model (BART model object): The trained model.
- x_new (matrix): New feature matrix for prediction.
- sample_idx (integer): Index of the posterior sample to use (default: 1).

Outputs: A vector of predictions using the specified posterior sample.

Mathematical Expression: For posterior sample \(s = \text{sample\_idx}\): \[\hat{y}_i = \sum_{j=1}^{m} g(x_i; T_j^{(s)}, M_j^{(s)})\] transformed to the original scale: \[\hat{y}_i \leftarrow \hat{y}_i \cdot \sigma_y + \mu_y\]

Implementation Details:

1. Validates that the sample index is within range
2. Extracts the trees for the specified posterior sample
3. Calls predict_bart() to get predictions from those trees
4. Transforms the predictions back to the original scale
5. Returns a vector of predictions


predict_single_tree(model, x_new, sample_idx = 1, tree_idx = 1)

Purpose: Generates predictions using a single tree from a specific posterior sample.

Inputs:

- model (BART model object): The trained model.
- x_new (matrix): New feature matrix for prediction.
- sample_idx (integer): Index of the posterior sample (default: 1).
- tree_idx (integer): Index of the tree within that sample (default: 1).

Outputs: A vector of predictions from the specified tree.

Mathematical Expression: For posterior sample \(s = \text{sample\_idx}\) and tree \(j = \text{tree\_idx}\): \[\hat{y}_i = g(x_i; T_j^{(s)}, M_j^{(s)})\] transformed to the original scale, with the mean offset divided equally among the \(m\) trees: \[\hat{y}_i \leftarrow \hat{y}_i \cdot \sigma_y + \frac{\mu_y}{m}\]

Implementation Details:

1. Extracts the specific tree using extract_tree()
2. Calls predict_tree() to get its predictions
3. Transforms the predictions back to the original scale, adjusting for the tree's share of the ensemble
4. Returns a vector of predictions


predict_with_intervals(model, x_new, alpha = 0.05)

Purpose: Generates predictions with credible intervals quantifying uncertainty.

Inputs:

- model (BART model object): The trained model.
- x_new (matrix): New feature matrix for prediction.
- alpha (numeric): Significance level for the intervals (default: 0.05 for 95% intervals).

Outputs: A list with four components:

- mean (vector): Mean predictions.
- lower (vector): Lower bounds of the credible intervals.
- upper (vector): Upper bounds of the credible intervals.
- samples (matrix): Matrix of predictions from all posterior samples.

Mathematical Expression: For each data point \(i\):

1. Collect the predictions from all posterior samples: \(\{\hat{y}_i^{(s)}\}_{s=1}^S\)
2. Mean prediction: \[\bar{y}_i = \frac{1}{S}\sum_{s=1}^{S}\hat{y}_i^{(s)}\]
3. Lower bound: \[L_i = \text{quantile}(\{\hat{y}_i^{(s)}\}_{s=1}^S, \alpha/2)\]
4. Upper bound: \[U_i = \text{quantile}(\{\hat{y}_i^{(s)}\}_{s=1}^S, 1-\alpha/2)\]

Implementation Details:

1. Determines the number of posterior samples in the model
2. Initializes a matrix to store the predictions from all samples
3. For each posterior sample:
   a. Calls predict_sample() to get its predictions
   b. Stores these predictions in the matrix
4. Calculates the mean prediction for each data point
5. Calculates the lower and upper quantiles for each data point
6. Returns the means, bounds, and the full matrix of predictions
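Example usage, again continuing the bart() example:

```r
# 95% credible intervals (alpha = 0.05).
ci <- predict_with_intervals(fit, dat$x, alpha = 0.05)
mean(ci$lower <= dat$y & dat$y <= ci$upper)  # empirical coverage on training data
```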


Model Evaluation Functions

calculate_metrics(actual, predicted)

Purpose: Calculates various performance metrics for regression models.

Inputs:

- actual (vector): True/observed values.
- predicted (vector): Model predictions.

Outputs: A list with four metrics:

- rmse (numeric): Root Mean Squared Error.
- mae (numeric): Mean Absolute Error.
- r_squared (numeric): R-squared (coefficient of determination).
- mape (numeric): Mean Absolute Percentage Error.

Mathematical Expression: For actual values \(y\) and predictions \(\hat{y}\):

1. RMSE: \[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\]
2. MAE: \[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\]
3. R-squared: \[R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\]
4. MAPE: \[\text{MAPE} = \frac{100\%}{n'}\sum_{i:y_i \neq 0}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\] where \(n'\) is the number of non-zero actual values.

Implementation Details:

1. Removes any NA values from both vectors
2. Calculates RMSE as the square root of the mean squared error
3. Calculates MAE as the mean of the absolute errors
4. Calculates R-squared as one minus the ratio of the residual sum of squares to the total sum of squares
5. Calculates MAPE only over non-zero actual values to avoid division by zero
6. Returns all metrics in a list
7. Includes handling for edge cases (e.g., all-zero actual values)
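A sketch implementing the four formulas (full edge-case handling omitted):

```r
# Sketch of the regression metrics defined above.
calculate_metrics <- function(actual, predicted) {
  ok <- !is.na(actual) & !is.na(predicted)  # drop NA pairs
  a <- actual[ok]; p <- predicted[ok]
  nz <- a != 0                              # MAPE only over non-zero actual values
  list(
    rmse      = sqrt(mean((a - p)^2)),
    mae       = mean(abs(a - p)),
    r_squared = 1 - sum((a - p)^2) / sum((a - mean(a))^2),
    mape      = 100 * mean(abs((a[nz] - p[nz]) / a[nz]))
  )
}
```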


calculate_var_importance(model)

Purpose: Determines the relative importance of each variable in the model.

Inputs:

- model (BART model object): The trained model.

Outputs: A data frame with one row per variable and three columns:

- variable (character): Variable names.
- avg_usage (numeric): Average number of times the variable is used for splitting.
- inclusion_prop (numeric): Proportion of posterior samples in which the variable is used.

Mathematical Expression: For each variable \(v\) and posterior sample \(s\):

1. Count occurrences: \[C_v^{(s)} = \sum_{j=1}^{m}\sum_{\eta \in \text{internal}(T_j^{(s)})} \mathbb{1}(v_\eta = v)\]
2. Average usage: \[\bar{C}_v = \frac{1}{S}\sum_{s=1}^{S}C_v^{(s)}\]
3. Inclusion proportion: \[I_v = \frac{1}{S}\sum_{s=1}^{S}\mathbb{1}(C_v^{(s)} > 0)\]

Implementation Details:

1. Initializes a matrix to count variable usage across all posterior samples
2. For each posterior sample:
   a. For each tree, recursively counts how many times each variable is used for splitting
   b. Updates the count matrix
3. Calculates the average usage across samples
4. Calculates the inclusion proportion (how often each variable is used in any tree)
5. Creates a data frame with the results
6. Sorts the data frame by average usage (descending)
7. Returns the sorted data frame


Visualization Functions

partial_dependence_plot(model, x, var_idx, num_points = 50, sample_idx = NULL)

Purpose: Calculates data for creating partial dependence plots that show the marginal effect of a variable.

Inputs:

- model (BART model object): The trained model.
- x (matrix): The feature matrix (usually the training data).
- var_idx (integer): Index of the variable to analyze.
- num_points (integer): Number of grid points to use (default: 50).
- sample_idx (vector or NULL): Indices of posterior samples to use (default: all).

Outputs: A data frame with four columns:

- x (numeric): Grid points spanning the variable's range.
- mean (numeric): Mean prediction at each grid point.
- lower (numeric): Lower bound of the credible interval.
- upper (numeric): Upper bound of the credible interval.

Mathematical Expression: For variable \(v = \text{var\_idx}\) with grid points \(\{z_1, \ldots, z_K\}\):

1. For each grid point \(z_k\):
   a. Create the modified dataset \(\mathcal{X}^{(k)} = \{x_i^{(k)}\}\), where \(x_{iv}^{(k)} = z_k\) and \(x_{ij}^{(k)} = x_{ij}\) for \(j \neq v\)
   b. For each posterior sample \(s\): \[\hat{y}_i^{(s,k)} = f^{(s)}(x_i^{(k)})\]
   c. Average across observations: \[\bar{f}_v^{(s)}(z_k) = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i^{(s,k)}\]
2. Average across samples: \[\bar{f}_v(z_k) = \frac{1}{S}\sum_{s=1}^{S}\bar{f}_v^{(s)}(z_k)\]
3. Compute credible intervals from the quantiles of \(\{\bar{f}_v^{(s)}(z_k)\}_{s=1}^{S}\)

Implementation Details:

1. Determines the range of the variable (min to max)
2. Creates a grid of points spanning this range
3. Initializes a matrix to store the predictions for each grid point and sample
4. For each grid point:
   a. Creates a modified copy of the data with that value substituted for the target variable
   b. For each posterior sample:
      i. Makes predictions using the modified data
      ii. Averages these predictions across all observations
      iii. Stores the average in the matrix
5. Calculates the mean prediction at each grid point
6. Calculates the credible intervals at each grid point (2.5% and 97.5% quantiles)
7. Returns a data frame with the grid points, means, and interval bounds


plot_tree(tree, max_depth = 3)

Purpose: Creates a graphical visualization of a tree’s structure.

Inputs:

- tree (tree object): The tree to visualize.
- max_depth (integer): Maximum depth to display (default: 3).

Outputs: A graphical plot showing the tree structure.

Mathematical Expression: N/A (visualization function)

Implementation Details:

1. Checks that the igraph package is available
2. Initializes empty lists for nodes and edges
3. Defines a recursive function to traverse the tree and collect node/edge information
4. For each node:
   a. For leaf nodes: stores the mu value and node size
   b. For internal nodes: stores the split rule and node size
   c. Creates edges connecting parent and child nodes
5. Converts the collected information to an igraph object
6. Sets the node attributes (labels, types, sizes, colors)
7. Creates a plot using a tree layout
8. Limits the traversal to max_depth to prevent overly complex visualizations


Utility Functions

extract_tree(model, sample_idx = 1, tree_idx = 1)

Purpose: Extracts a specific tree from a BART model for inspection.

Inputs:

- model (BART model object): The trained model.
- sample_idx (integer): Index of the posterior sample (default: 1).
- tree_idx (integer): Index of the tree within that sample (default: 1).

Outputs: A tree object representing the specified tree.

Mathematical Expression: Extracts tree \(T_j^{(s)}\), where \(j = \text{tree\_idx}\) and \(s = \text{sample\_idx}\).

Implementation Details:

1. Validates that the sample index is within range
2. Validates that the tree index is within range
3. Extracts and returns the specified tree from the model's posterior samples


get_leaf_nodes(node)

Purpose: Collects all leaf nodes in a tree or subtree.

Inputs:

- node (node object): Root node of the tree or subtree.

Outputs: A list of leaf nodes.

Mathematical Expression: \[\text{leaves}(T) = \{\eta \in T : \text{is\_leaf}(\eta) = \text{TRUE}\}\]

Implementation Details:

1. Handles NULL nodes by returning an empty list
2. For a leaf node, returns a list containing that node
3. For an internal node, recursively collects the leaf nodes of the left and right children
4. Combines and returns all leaf nodes found
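The recursion is short enough to show in full; a sketch consistent with the description:

```r
# Sketch: collect all leaves by recursing into both children.
get_leaf_nodes <- function(node) {
  if (is.null(node)) return(list())
  if (node$is_leaf) return(list(node))
  c(get_leaf_nodes(node$left), get_leaf_nodes(node$right))
}
```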


get_internal_nodes(node)

Purpose: Collects all internal (decision) nodes in a tree or subtree.

Inputs:

- node (node object): Root node of the tree or subtree.

Outputs: A list of internal nodes.

Mathematical Expression: \[\text{internal}(T) = \{\eta \in T : \text{is\_leaf}(\eta) = \text{FALSE}\}\]

Implementation Details:

1. Handles NULL nodes or leaf nodes by returning an empty list
2. For an internal node, creates a list containing that node
3. Recursively collects the internal nodes of the left and right children
4. Combines and returns all internal nodes found


generate_complex_data(n = 500, p = 10)

Purpose: Creates synthetic data with complex non-linear relationships for testing.

Inputs:

- n (integer): Number of observations (default: 500).
- p (integer): Number of variables (default: 10).

Outputs: A list with two components:

- x (matrix): Feature matrix with dimensions n × p.
- y (vector): Response vector of length n.

Mathematical Expression: \[y_i = 10 \sin(\pi x_{i1} x_{i2}) + 20(x_{i3} - 0.5)^2 + 10x_{i4} + 5|x_{i5} - 0.5| + \varepsilon_i\] \[\varepsilon_i \sim \mathcal{N}(0, 1), \quad x_{ij} \sim \text{Uniform}(0, 1) \text{ for all } i, j\]

Implementation Details:

1. Generates a random feature matrix with values in [0, 1]
2. Creates a response variable with complex non-linear relationships:
   a. Interaction effect: \(10 \sin(\pi x_1 x_2)\)
   b. Quadratic effect: \(20(x_3 - 0.5)^2\)
   c. Linear effect: \(10x_4\)
   d. Absolute-value effect: \(5|x_5 - 0.5|\)
   e. Random noise: \(\varepsilon \sim \mathcal{N}(0, 1)\)
3. Only variables 1-5 affect the response; variables 6-10 are noise
4. Returns both the feature matrix and the response vector
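A sketch of the generator matching the formula above:

```r
# Sketch: Friedman-style test function; assumes p >= 5, variables 6..p are pure noise.
generate_complex_data <- function(n = 500, p = 10) {
  x <- matrix(runif(n * p), nrow = n, ncol = p)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +   # interaction effect
       20 * (x[, 3] - 0.5)^2 +            # quadratic effect
       10 * x[, 4] +                      # linear effect
        5 * abs(x[, 5] - 0.5) +           # absolute-value effect
       rnorm(n)                           # N(0, 1) noise
  list(x = x, y = y)
}
```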


This comprehensive technical reference provides detailed information about every function in the BART implementation, including inputs, outputs, mathematical expressions, and implementation details.