Bayesian Regression Tree Ensembles That Adapt to Smoothness and Sparsity

Antonio R. Linero and Yun Yang
04/19/2019

Model

Non-parametric regression: \[ Y = f_0(X)+\epsilon, \] where:

  • \( Y\in R \)
  • \( X \in [0,1]^p \)
  • \( \epsilon \sim N(0,\sigma^2) \)
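
A minimal sketch of this sampling model (the stand-in \( f_0 \), sample size, and noise level below are arbitrary illustrative choices, not the paper's):

    # Simulate from Y = f_0(X) + eps with X uniform on [0,1]^p and Gaussian noise.
    # The true function f0 here is a made-up stand-in for illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 200, 5, 0.5

    X = rng.uniform(0.0, 1.0, size=(n, p))                     # X in [0,1]^p
    f0 = lambda x: np.sin(2 * np.pi * x[:, 0]) + x[:, 1] ** 2  # stand-in truth
    Y = f0(X) + rng.normal(0.0, sigma, size=n)                 # Y = f_0(X) + N(0, sigma^2)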

'Hard' decision tree prior

Model \( f_0(x) \) as the realization of a random function

\[ f(x) = \sum_{t=1}^T g(x;\mathcal{T}_t,\mathcal{M}_t), \qquad x \in R^p, \]

where

  • \( \mathcal{T}_t \): the topology and splitting rules of tree \( t \)
  • \( \mathcal{M}_t = (u_{t1},\cdots,u_{tL_t}) \): a collection of parameters for the leaf nodes
  • \( g(x;\mathcal{T}_t,\mathcal{M}_t) = \sum_{l=1}^{L_t} u_{tl}\phi(x;\mathcal{T}_t,l) \): where \( \phi \) is the indicator that \( x \) is associated with leaf node \( l \) in \( \mathcal{T}_t \).
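
A hypothetical sketch of evaluating one such hard tree; the nested-dictionary encoding, the tree, and the leaf values are made up for illustration:

    # One "hard" tree g(x; T, M): internal nodes hold a rule [x_j <= C],
    # leaves hold parameters u_l. Since phi(x; T, l) is an indicator,
    # g(x) simply returns the u of the leaf that x is routed to.
    import numpy as np

    def g_hard(x, node):
        """Evaluate a hard decision tree at a single point x (1-d array)."""
        if "u" in node:                      # leaf: return its parameter u_l
            return node["u"]
        if x[node["var"]] <= node["cut"]:    # rule [x_j <= C_b]: yes -> left
            return g_hard(x, node["left"])
        return g_hard(x, node["right"])      # no -> right

    # Example tree: split on x_0 at 0.5; the left child splits on x_1 at 0.3.
    tree = {"var": 0, "cut": 0.5,
            "left":  {"var": 1, "cut": 0.3, "left": {"u": -1.0}, "right": {"u": 0.2}},
            "right": {"u": 1.5}}

    print(g_hard(np.array([0.4, 0.7]), tree))   # routed left, then right -> 0.2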

Example of one tree

Bayesian Additive Regression Tree (BART)



  • Following Chipman et al. (2010),

    \[ \mathcal{T}_t \sim \text{ a branching process prior}. \]

  • The branching process begins with a root node of depth \( k=0 \). For \( k=0,1,2,\cdots \),

    \[ \Pr(\text{node at depth } k \text{ is non-terminal}) = q(k) = \gamma (1+k)^{-\beta}, \]

    where \( \gamma\in(0,1) \) and \( \beta\ge 0 \).
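
A small sketch of drawing a topology from this branching process, using the default values \( \gamma=0.95 \), \( \beta=2 \) suggested by Chipman et al. (2010):

    # Branching-process prior on the tree topology: a node at depth k
    # splits with probability q(k) = gamma * (1 + k)^(-beta).
    import numpy as np

    rng = np.random.default_rng(1)
    gamma, beta = 0.95, 2.0

    def grow(depth=0):
        """Recursively grow a topology; internal nodes are dicts, leaves are {}."""
        if rng.uniform() < gamma * (1 + depth) ** (-beta):   # non-terminal with prob q(depth)
            return {"left": grow(depth + 1), "right": grow(depth + 1)}
        return {}                                            # terminal node (leaf)

    def n_leaves(node):
        return 1 if not node else n_leaves(node["left"]) + n_leaves(node["right"])

    print([n_leaves(grow()) for _ in range(5)])   # prior draws favour shallow trees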

BART



  • Given the tree topology, each internal node \( b \) is assigned a decision rule \( [x_j\le C_b] \): yes → left, no → right, where the cutpoint \( C_b \) is drawn uniformly over the range of values of \( x_j \) available at that node.

  • The splitting variable \( x_j \) is selected with probability \( s_j \).

  • The leaf parameters \( u_{tl} \sim N(0,\sigma^2_u/T) \).
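
A hedged sketch of completing the prior for a fixed topology: split variables drawn from \( s \), cutpoints drawn uniformly (here simply over \( [0,1] \), since \( X\in[0,1]^p \)), and leaf values drawn from \( N(0,\sigma^2_u/T) \); the topology and all constants are illustrative:

    # Complete the BART prior for a given topology: each internal node b gets
    # a split variable drawn with probabilities s = (s_1, ..., s_p) and a
    # cutpoint C_b ~ Uniform(0, 1) (ignoring, for simplicity, the restriction
    # to the node's available range); each leaf gets u ~ N(0, sigma_u^2 / T).
    import numpy as np

    rng = np.random.default_rng(2)
    p, T, sigma_u = 5, 50, 1.0
    s = np.full(p, 1.0 / p)                      # split-variable probabilities s_j

    # A fixed topology: internal nodes are dicts with children, leaves are {}.
    tree = {"left": {"left": {}, "right": {}}, "right": {}}

    def decorate(node):
        """Attach a decision rule to each internal node and a u to each leaf."""
        if node:                                 # internal node
            node["var"] = int(rng.choice(p, p=s))
            node["cut"] = float(rng.uniform(0.0, 1.0))
            decorate(node["left"])
            decorate(node["right"])
        else:                                    # leaf node
            node["u"] = float(rng.normal(0.0, sigma_u / np.sqrt(T)))

    decorate(tree)
    print(tree)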

Soft BART (SBART)



  • \( x \) goes left at node \( b \) with probability

    \[ \psi(x;\mathcal{T},b) = \psi\!\left(\frac{x_j-C_b}{\tau_b}\right), \] where \( \psi \) is a smooth gating function (e.g. the logistic CDF) and \( \tau_b>0 \) is a bandwidth parameter.


  • Multiplying the gating probabilities along the path from the root, the probability that \( x \) reaches leaf \( l \) is

\[ \phi(x;\mathcal{T},l) = \prod_{b\in A(l)} \psi(x;\mathcal{T},b)^{1-R_b}\{1-\psi(x;\mathcal{T},b)\}^{R_b}, \] where \( A(l) \) is the set of ancestor nodes of leaf \( l \), and \( R_b=1 \) if the path to \( l \) goes right at \( b \) (\( R_b=0 \) otherwise).
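
A sketch of the resulting soft-tree evaluation, taking \( \psi \) to be the logistic function (an assumption made here for illustration) and using a made-up two-leaf tree:

    # Soft routing: x goes left at node b with probability psi((x_j - C_b)/tau_b).
    # phi(x; T, l) multiplies these probabilities along the path to leaf l, and
    # g(x) is the resulting weighted average of the leaf values u_l.
    import numpy as np

    def psi(z):
        """Logistic gating function (assumed form of psi)."""
        return 1.0 / (1.0 + np.exp(-z))

    def g_soft(x, node, prob=1.0):
        """Return the sum over leaves of phi(x; T, l) * u_l for one soft tree."""
        if "u" in node:                                 # leaf l: weight u_l by phi
            return prob * node["u"]
        p_left = psi((x[node["var"]] - node["cut"]) / node["tau"])
        return (g_soft(x, node["left"], prob * p_left)
                + g_soft(x, node["right"], prob * (1.0 - p_left)))

    # Made-up soft tree: one split on x_0 at C = 0.5 with bandwidth tau = 0.1.
    tree = {"var": 0, "cut": 0.5, "tau": 0.1,
            "left": {"u": -1.0}, "right": {"u": 1.5}}

    print(g_soft(np.array([0.6]), tree))   # a smooth blend of the two leaf values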

Relation between BART and SBART

Gating functions for \( \tau^{-1} \in \{10,40,160,2560\} \): as \( \tau\to 0 \), the soft split approaches the hard indicator split used by BART.
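
A quick numerical illustration of this relation: as \( \tau^{-1} \) runs through the values above, the logistic gate \( \psi((x-C)/\tau) \) approaches a hard indicator (the cutpoint \( C=0.5 \) and evaluation points are arbitrary):

    # As tau shrinks (tau^{-1} grows), psi((x - C)/tau) tends to a 0/1 step
    # at the cutpoint C, recovering a BART-style hard split in the limit.
    import numpy as np

    psi = lambda z: 1.0 / (1.0 + np.exp(-z))
    C, x = 0.5, np.array([0.40, 0.49, 0.50, 0.51, 0.60])

    for inv_tau in [10, 40, 160, 2560]:
        print(inv_tau, np.round(psi((x - C) * inv_tau), 3))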

Smoothness adaptation


SBART model

\[ f(x) = \sum_{t=1}^T g(x;\mathcal{T}_t,\mathcal{M}_t), \qquad x \in R^p, \]

where

  • \( \mathcal{T}_t \): a branching process, \( \Pr(\text{node at depth } k \text{ is non-terminal}) = q(k) = \gamma (1+k)^{-\beta} \)
  • \( \mathcal{M}_t = (u_{t1},\cdots,u_{tL_t}) \): \( u_{tl} \sim N(0,\sigma^2_u/T) \)
  • \( g(x;\mathcal{T}_t,\mathcal{M}_t) = \sum_{l=1}^{L_t} u_{tl}\phi(x;\mathcal{T}_t,l) \): \( \phi(x;\mathcal{T},l) = \prod_{b\in A(l)} \psi(x;\mathcal{T},b)^{1-R_b}\{1-\psi(x;\mathcal{T},b)\}^{R_b} \), where \( \psi(x;\mathcal{T},b) = \psi\!\left(\frac{x_j-C_b}{\tau_b}\right) \)
  • \( x_j \) is selected with probability \( s_j \), \( s=(s_1,\cdots,s_p) \).

Prior specification
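
The details of this slide are not reproduced here; as one hedged illustration of the sparsity-inducing idea highlighted in the conclusion, the splitting probabilities \( s=(s_1,\cdots,s_p) \) can be given a Dirichlet prior with a small common concentration \( a/p \), which concentrates the splits on a few predictors (the values of \( p \) and \( a \) below are arbitrary):

    # Sparsity-inducing prior on the splitting probabilities s: a Dirichlet
    # with small common concentration a/p puts most of its mass on a few
    # coordinates, so only a few predictors are used for splits.
    import numpy as np

    rng = np.random.default_rng(3)
    p, a = 20, 1.0

    s = rng.dirichlet(np.full(p, a / p))          # s ~ Dirichlet(a/p, ..., a/p)
    print(np.round(np.sort(s)[::-1][:5], 3))      # most mass on a handful of s_j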

Algorithm

Results (Friedman's example)

\[ f_0(x) = 10 \sin(\pi x_1 x_2)+20(x_3-0.5)^2+10x_4+5x_5 \]
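
A small sketch of simulating from Friedman's example; the sample size, number of noise predictors, and noise level are illustrative choices, not the paper's exact settings:

    # Friedman's example: only x_1, ..., x_5 enter f_0; the remaining
    # predictors are pure noise.
    import numpy as np

    rng = np.random.default_rng(4)
    n, p, sigma = 250, 10, 1.0

    X = rng.uniform(0.0, 1.0, size=(n, p))
    f0 = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
          + 20 * (X[:, 2] - 0.5) ** 2
          + 10 * X[:, 3] + 5 * X[:, 4])
    Y = f0 + rng.normal(0.0, sigma, size=n)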

Results (Variable selection)

\[ f_0(x) = 10 \sin(\pi x_1 x_2)+20(x_3-0.5)^2+\lambda(10x_4+5x_5), \]

where \( \lambda \) scales the signal carried by \( x_4 \) and \( x_5 \) (smaller \( \lambda \) makes them harder to detect).

Results (Non-smooth function)

Results (Benchmark data sets)

Theorems



  • Prior concentration for sparse function

  • Posterior convergence rate for sparse truth

  • Posterior convergence rate for additive sparse truth
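
Stated schematically, and up to logarithmic factors (see the paper for the precise conditions and constants): if \( f_0 \) depends on only \( d \) of the \( p \) predictors and is \( \alpha \)-Hölder smooth, the posterior contracts around \( f_0 \) at roughly the rate

\[ \epsilon_n \asymp n^{-\alpha/(2\alpha+d)} + \sqrt{\frac{d \log p}{n}}, \]

adapting to the unknown smoothness \( \alpha \) and sparsity level \( d \); for an additive sparse truth, analogous component-wise rates hold.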

Conclusion



  • A novel Bayesian sum-of-trees framework combining soft decision trees with sparsity-inducing priors

  • Meaningful improvements over existing methods on simulated and benchmark data sets

  • Software available on GitHub: https://github.com/theodds/SoftBART





Thank you and questions!

References

Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266-298.

Linero, A. R., & Yang, Y. (2018). Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1087-1110.