Bayesian Regression Tree Ensembles That Adapt to Smoothness and Sparsity

Antonio R. Linero and Yun Yang
04/19/2019

Model

Non-parametric regression: \[ Y = f_0(X)+\epsilon, \] where:

  • \( Y\in R \)
  • \( X \in [0,1]^p \)
  • \( \epsilon \sim N(0,\sigma^2) \)
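
A minimal sketch of this sampling model (the stand-in \( f_0 \), sample size, and noise level below are arbitrary illustrative choices, not the paper's):

    # Simulate from Y = f_0(X) + eps with X uniform on [0,1]^p and Gaussian noise.
    # The true function f0 here is a made-up stand-in for illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 200, 5, 0.5

    X = rng.uniform(0.0, 1.0, size=(n, p))                     # X in [0,1]^p
    f0 = lambda x: np.sin(2 * np.pi * x[:, 0]) + x[:, 1] ** 2  # stand-in truth
    Y = f0(X) + rng.normal(0.0, sigma, size=n)                 # Y = f_0(X) + N(0, sigma^2)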

'Hard' decision tree prior

Model \( f_0(x) \) as the realization of a random function

\[ f(x) = \sum_{t=1}^T g(x;\mathcal{T}_t,\mathcal{M}_t), \qquad x \in R^p, \]

where

  • \( \mathcal{T}_t \): the topology and splitting rules of tree \( t \)
  • \( \mathcal{M}_t = (u_{t1},\cdots,u_{tL_t}) \): a collection of parameters for the leaf nodes
  • \( g(x;\mathcal{T}_t,\mathcal{M}_t) = \sum_{l=1}^{L_t} u_{tl}\phi(x;\mathcal{T}_t,l) \): where \( \phi \) is the indicator that \( x \) is associated with leaf node \( l \) in \( \mathcal{T}_t \).
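
A hypothetical sketch of evaluating one such hard tree; the nested-dictionary encoding, the tree, and the leaf values are made up for illustration:

    # One "hard" tree g(x; T, M): internal nodes hold a rule [x_j <= C],
    # leaves hold parameters u_l. Since phi(x; T, l) is an indicator,
    # g(x) simply returns the u of the leaf that x is routed to.
    import numpy as np

    def g_hard(x, node):
        """Evaluate a hard decision tree at a single point x (1-d array)."""
        if "u" in node:                      # leaf: return its parameter u_l
            return node["u"]
        if x[node["var"]] <= node["cut"]:    # rule [x_j <= C_b]: yes -> left
            return g_hard(x, node["left"])
        return g_hard(x, node["right"])      # no -> right

    # Example tree: split on x_0 at 0.5; the left child splits on x_1 at 0.3.
    tree = {"var": 0, "cut": 0.5,
            "left":  {"var": 1, "cut": 0.3, "left": {"u": -1.0}, "right": {"u": 0.2}},
            "right": {"u": 1.5}}

    print(g_hard(np.array([0.4, 0.7]), tree))   # routed left, then right -> 0.2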

Example of one tree

Bayesian Additive Regression Tree (BART)



  • Following Chipman et al. (2010),

    \[ \mathcal{T}_t \sim \text{ a branching process prior}. \]

  • The branching process begins with a root node of depth \( k=0 \). For \( k=0,1,2,\cdots \),

    \[ \Pr(\text{node at depth } k \text{ is non-terminal}) = q(k) = \gamma (1+k)^{-\beta}, \]

    where \( \gamma\in(0,1) \) and \( \beta\ge 0 \).
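
A small sketch of drawing a topology from this branching process, using the default values \( \gamma=0.95 \), \( \beta=2 \) suggested by Chipman et al. (2010):

    # Branching-process prior on the tree topology: a node at depth k
    # splits with probability q(k) = gamma * (1 + k)^(-beta).
    import numpy as np

    rng = np.random.default_rng(1)
    gamma, beta = 0.95, 2.0

    def grow(depth=0):
        """Recursively grow a topology; internal nodes are dicts, leaves are {}."""
        if rng.uniform() < gamma * (1 + depth) ** (-beta):   # non-terminal with prob q(depth)
            return {"left": grow(depth + 1), "right": grow(depth + 1)}
        return {}                                            # terminal node (leaf)

    def n_leaves(node):
        return 1 if not node else n_leaves(node["left"]) + n_leaves(node["right"])

    print([n_leaves(grow()) for _ in range(5)])   # prior draws favour shallow trees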

BART



  • Given the tree topology, each internal node \( b \) is assigned a decision rule \( [x_j\le C_b] \): yes → left, no → right, where the cutpoint \( C_b \) is drawn uniformly over the range of values of \( x_j \) available at that node.

  • The splitting variable \( x_j \) is selected with probability \( s_j \).

  • The leaf parameters \( u_{tl} \sim N(0,\sigma^2_u/T) \).
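
A hedged sketch of completing the prior for a fixed topology: split variables drawn from \( s \), cutpoints drawn uniformly (here simply over \( [0,1] \), since \( X\in[0,1]^p \)), and leaf values drawn from \( N(0,\sigma^2_u/T) \); the topology and all constants are illustrative:

    # Complete the BART prior for a given topology: each internal node b gets
    # a split variable drawn with probabilities s = (s_1, ..., s_p) and a
    # cutpoint C_b ~ Uniform(0, 1) (ignoring, for simplicity, the restriction
    # to the node's available range); each leaf gets u ~ N(0, sigma_u^2 / T).
    import numpy as np

    rng = np.random.default_rng(2)
    p, T, sigma_u = 5, 50, 1.0
    s = np.full(p, 1.0 / p)                      # split-variable probabilities s_j

    # A fixed topology: internal nodes are dicts with children, leaves are {}.
    tree = {"left": {"left": {}, "right": {}}, "right": {}}

    def decorate(node):
        """Attach a decision rule to each internal node and a u to each leaf."""
        if node:                                 # internal node
            node["var"] = int(rng.choice(p, p=s))
            node["cut"] = float(rng.uniform(0.0, 1.0))
            decorate(node["left"])
            decorate(node["right"])
        else:                                    # leaf node
            node["u"] = float(rng.normal(0.0, sigma_u / np.sqrt(T)))

    decorate(tree)
    print(tree)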

Soft BART (SBART)



  • \( x \) goes left at node \( b \) with probability

    \[ \psi(x;\mathcal{T},b) = \psi\!\left(\frac{x_j-C_b}{\tau_b}\right), \] where \( \psi \) is a smooth gating function (e.g. the logistic CDF) and \( \tau_b>0 \) is a bandwidth parameter.


  • Multiplying the gating probabilities along the path from the root, the probability that \( x \) reaches leaf \( l \) is

\[ \phi(x;\mathcal{T},l) = \prod_{b\in A(l)} \psi(x;\mathcal{T},b)^{1-R_b}\{1-\psi(x;\mathcal{T},b)\}^{R_b}, \] where \( A(l) \) is the set of ancestor nodes of leaf \( l \), and \( R_b=1 \) if the path to \( l \) goes right at \( b \) (\( R_b=0 \) otherwise).
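
A sketch of the resulting soft-tree evaluation, taking \( \psi \) to be the logistic function (an assumption made here for illustration) and using a made-up two-leaf tree:

    # Soft routing: x goes left at node b with probability psi((x_j - C_b)/tau_b).
    # phi(x; T, l) multiplies these probabilities along the path to leaf l, and
    # g(x) is the resulting weighted average of the leaf values u_l.
    import numpy as np

    def psi(z):
        """Logistic gating function (assumed form of psi)."""
        return 1.0 / (1.0 + np.exp(-z))

    def g_soft(x, node, prob=1.0):
        """Return the sum over leaves of phi(x; T, l) * u_l for one soft tree."""
        if "u" in node:                                 # leaf l: weight u_l by phi
            return prob * node["u"]
        p_left = psi((x[node["var"]] - node["cut"]) / node["tau"])
        return (g_soft(x, node["left"], prob * p_left)
                + g_soft(x, node["right"], prob * (1.0 - p_left)))

    # Made-up soft tree: one split on x_0 at C = 0.5 with bandwidth tau = 0.1.
    tree = {"var": 0, "cut": 0.5, "tau": 0.1,
            "left": {"u": -1.0}, "right": {"u": 1.5}}

    print(g_soft(np.array([0.6]), tree))   # a smooth blend of the two leaf values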

Relation between BART and SBART

Gating functions for \( \tau^{-1} \in \{10,40,160,2560\} \): as \( \tau\to 0 \), the soft split approaches the hard indicator split used by BART.
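
A quick numerical illustration of this relation: as \( \tau^{-1} \) runs through the values above, the logistic gate \( \psi((x-C)/\tau) \) approaches a hard indicator (the cutpoint \( C=0.5 \) and evaluation points are arbitrary):

    # As tau shrinks (tau^{-1} grows), psi((x - C)/tau) tends to a 0/1 step
    # at the cutpoint C, recovering a BART-style hard split in the limit.
    import numpy as np

    psi = lambda z: 1.0 / (1.0 + np.exp(-z))
    C, x = 0.5, np.array([0.40, 0.49, 0.50, 0.51, 0.60])

    for inv_tau in [10, 40, 160, 2560]:
        print(inv_tau, np.round(psi((x - C) * inv_tau), 3))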

Smoothness adaptation


SBART model

\[ f(x) = \sum_{t=1}^T g(x;\mathcal{T}_t,\mathcal{M}_t), \qquad x \in R^p, \]

where

  • \( \mathcal{T}_t \): a branching process, \( \Pr(\text{node at depth } k \text{ is non-terminal}) = q(k) = \gamma (1+k)^{-\beta} \)
  • \( \mathcal{M}_t = (u_{t1},\cdots,u_{tL_t}) \): \( u_{tl} \sim N(0,\sigma^2_u/T) \)
  • \( g(x;\mathcal{T}_t,\mathcal{M}_t) = \sum_{l=1}^{L_t} u_{tl}\phi(x;\mathcal{T}_t,l) \): \( \phi(x;\mathcal{T},l) = \prod_{b\in A(l)} \psi(x;\mathcal{T},b)^{1-R_b}\{1-\psi(x;\mathcal{T},b)\}^{R_b} \), where \( \psi(x;\mathcal{T},b) = \psi\!\left(\frac{x_j-C_b}{\tau_b}\right) \)
  • \( x_j \) is selected with probability \( s_j \), \( s=(s_1,\cdots,s_p) \).

Prior specification
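
The details of this slide are not reproduced here; as one hedged illustration of the sparsity-inducing idea highlighted in the conclusion, the splitting probabilities \( s=(s_1,\cdots,s_p) \) can be given a Dirichlet prior with a small common concentration \( a/p \), which concentrates the splits on a few predictors (the values of \( p \) and \( a \) below are arbitrary):

    # Sparsity-inducing prior on the splitting probabilities s: a Dirichlet
    # with small common concentration a/p puts most of its mass on a few
    # coordinates, so only a few predictors are used for splits.
    import numpy as np

    rng = np.random.default_rng(3)
    p, a = 20, 1.0

    s = rng.dirichlet(np.full(p, a / p))          # s ~ Dirichlet(a/p, ..., a/p)
    print(np.round(np.sort(s)[::-1][:5], 3))      # most mass on a handful of s_j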

Algorithm

Results (Friedman's example)

\[ f_0(x) = 10 \sin(\pi x_1 x_2)+20(x_3-0.5)^2+10x_4+5x_5 \]
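
A small sketch of simulating from Friedman's example; the sample size, number of noise predictors, and noise level are illustrative choices, not the paper's exact settings:

    # Friedman's example: only x_1, ..., x_5 enter f_0; the remaining
    # predictors are pure noise.
    import numpy as np

    rng = np.random.default_rng(4)
    n, p, sigma = 250, 10, 1.0

    X = rng.uniform(0.0, 1.0, size=(n, p))
    f0 = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
          + 20 * (X[:, 2] - 0.5) ** 2
          + 10 * X[:, 3] + 5 * X[:, 4])
    Y = f0 + rng.normal(0.0, sigma, size=n)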

Results (Variable selection)

\[ f_0(x) = 10 \sin(\pi x_1 x_2)+20(x_3-0.5)^2+\lambda(10x_4+5x_5), \]

where \( \lambda \) scales the signal carried by \( x_4 \) and \( x_5 \) (smaller \( \lambda \) makes them harder to detect).

Results (Non-smooth function)

Results (Benchmark data sets)

Theorems



  • Prior concentration for sparse function

  • Posterior convergence rate for sparse truth

  • Posterior convergence rate for additive sparse truth
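
Stated schematically, and up to logarithmic factors (see the paper for the precise conditions and constants): if \( f_0 \) depends on only \( d \) of the \( p \) predictors and is \( \alpha \)-Hölder smooth, the posterior contracts around \( f_0 \) at roughly the rate

\[ \epsilon_n \asymp n^{-\alpha/(2\alpha+d)} + \sqrt{\frac{d \log p}{n}}, \]

adapting to the unknown smoothness \( \alpha \) and sparsity level \( d \); for an additive sparse truth, analogous component-wise rates hold.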

Conclusion



  • A novel Bayesian sum-of-trees framework combining soft decision trees with sparsity-inducing priors

  • Meaningful improvements over existing methods on simulated and benchmark data sets

  • Software available on GitHub: https://github.com/theodds/SoftBART





Thank you and questions!

References

Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266-298.

Linero, A. R., & Yang, Y. (2018). Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1087-1110.