May 12, 2017

Introduction

  • Goal of Master's project:
    • Regression methods for unbiased or consistent estimators and valid confidence intervals in a high-dimensional setting (\(p \gg N\))
    • There can be confounding
    • Response variable is discrete or right-censored
  • Goal of project with Dr. Hanna:
    • Efficiently analyze large ICES medical data set using ML techniques
    • Predict the impact of a specific treatment on patients' hazard rates or five-year survival rates
    • Time-dependent covariates

Several approaches for high-dimensional inference (HDI)

  1. Double machine learning (DML)
  2. Pooled Logistic Regression (PLR)
  3. Post-selection inference (PSI) with LASSO

Notation (variables)

  • For DML:
    • There are \(N\) independently drawn observations: \(\{(y_i,d_i,{\mathbf{x}_i})\}^N_{i=1}\)
    • \(y_i\) is the response variable (can be continuous, discrete, or right-censored survival time)
    • \({\mathbf{x}_i}\) is a \(p\)-dimensional feature vector (may include an intercept)
    • \(d_i\) is the "treatment" variable of interest (may be included in \({\mathbf{x}_i}\))
    • \({\mathbf{X}}\) is the \(N \times p\) design matrix

Notation (model)

  • (Generalized) linear model of the form shown in \(\eqref{eq:dml}\)
  • The treatment effect on (a function of) the response is \(\alpha_0\)
  • The confounding relationship for a continuous or binary treatment \(d_i\) is given by \(\eqref{eq:confound}\)
  • Simulations will use \(N=100\) and \(p=100\) (a data-generating sketch follows the display below).
  • The coefficient vector must be "sparse" \(\eqref{eq:sparse}\) or "approximately sparse" \(\eqref{eq:asparse}\)

\[ \begin{align} g_1(E[y_i|d_i,{\mathbf{x}_i}] ) &= \alpha_0 d_i + {\boldsymbol\beta}^T {\mathbf{x}_i}+ e_i \label{eq:dml} \\ g_2(E[d_i|{\mathbf{x}_i}]) &= {\boldsymbol\phi}^T {\mathbf{x}_i}+ u_i \label{eq:confound} \\ \| {\boldsymbol\beta}_0 \|_0 &= s < N \label{eq:sparse} \\ | {\boldsymbol\beta}_0 |_{(j)} &\leq A j^{-\gamma}, \hspace{2mm} A>0,\hspace{2mm} \gamma>1/2 , \hspace{2mm} j=1,\dots,p \label{eq:asparse} \end{align} \]
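
  • As a concrete illustration (not on the original slides), a minimal R sketch of a data-generating design consistent with \(\eqref{eq:dml}\)-\(\eqref{eq:sparse}\), assuming identity links, a Gaussian response, and a continuous treatment; all names (alpha0, s, etc.) are illustrative

set.seed(1)
N <- 100; p <- 100; s <- 5                     # s = number of non-zero coefficients
alpha0 <- 1                                    # treatment effect of interest
X     <- matrix(rnorm(N * p), N, p)            # high-dimensional controls
phi0  <- c(rep(0.5, s), rep(0, p - s))         # sparse confounding of the treatment
beta0 <- c(rep(1.0, s), rep(0, p - s))         # sparse effect on the response
d <- as.numeric(X %*% phi0 + rnorm(N))         # eq. (confound) with g2 = identity
y <- as.numeric(alpha0 * d + X %*% beta0 + rnorm(N))   # eq. (dml) with g1 = identity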

Double machine learning

  • Inference is possible with DML on a treatment parameter due to construction of "orthogonal estimating equations" [1]
    • This counteracts bias from both regularization and selection
    • \(w_i\): measurements, \(\eta\): nuisance parameters
  • For the classical Gaussian case, equations \(\eqref{eq:ee}\) and \(\eqref{eq:oc}\) are the Frisch-Waugh-Lovell (FWL) conditions

\[ \begin{align} \mathbb{E}[\psi(w_i,\alpha_0,\eta)] &= 0 &\textbf{Estimating equation} \label{eq:ee} \\ \frac{\partial \, \mathbb{E}[\psi(w_i,\alpha_0,\eta)]}{\partial \eta} &= 0 &\textbf{Orthogonality condition} \label{eq:oc} \end{align} \]

Double machine learning

  • Relationship of FWL to orthogonal estimating equations
  • Changes in the values of the nuisance parameters (\(\eta\)), evaluated at the true \(\alpha_0\), do not affect the estimating equation (a quick check follows the display below)
  • Hence, the regressions in \(\eqref{eq:fwl1}\) and \(\eqref{eq:fwl2}\) can be estimated with the LASSO

\[ \begin{align} d_i &= {\mathbf{x}_i}^T {\boldsymbol\phi}_0^d + u_i^d \hspace{2.2cm} E(u_i^d{\mathbf{x}_i})=\boldsymbol 0_p \label{eq:fwl1} \\ y_i &= {\mathbf{x}_i}^T {\boldsymbol\phi}_0^y + u_i^y \hspace{2.2cm} E(u_i^y{\mathbf{x}_i})=\boldsymbol 0_p \label{eq:fwl2} \\ u_i^y &= \alpha_0 u_i^d + v_i \hspace{2.5cm} E(u_i^d v_i )=0 \label{eq:fwl3} \\ E[v_iu_i^d] &= E[\underbrace{(u_i^y-\alpha_0 u_i^d)u_i^d}_{\psi(w_i,\alpha_0,\eta)}] = 0 \nonumber \\ w_i &= [y_i,d_i,{\mathbf{x}_i}], \hspace{3mm} \eta = [{\boldsymbol\phi}^d_0,{\boldsymbol\phi}^y_0] \nonumber \end{align} \]
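
  • As a quick check (not spelled out on the slide), differentiating \(\psi\) with respect to each block of \(\eta\) and evaluating at the truth gives zero, which is exactly the orthogonality condition \(\eqref{eq:oc}\); both expectations below vanish by \(\eqref{eq:fwl1}\) and \(\eqref{eq:fwl2}\)

\[ \begin{align*} \frac{\partial}{\partial {\boldsymbol\phi}^y} E\big[(y_i - {\mathbf{x}_i}^T{\boldsymbol\phi}^y - \alpha_0 u_i^d)\,u_i^d\big]\Big|_{\eta=\eta_0} &= -E[{\mathbf{x}_i}u_i^d] = \boldsymbol 0_p \\ \frac{\partial}{\partial {\boldsymbol\phi}^d} E\big[\big(u_i^y - \alpha_0 (d_i - {\mathbf{x}_i}^T{\boldsymbol\phi}^d)\big)\big(d_i - {\mathbf{x}_i}^T{\boldsymbol\phi}^d\big)\big]\Big|_{\eta=\eta_0} &= \alpha_0 E[{\mathbf{x}_i}u_i^d] - E[{\mathbf{x}_i}v_i] = \boldsymbol 0_p \end{align*} \]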

Double machine learning

  • How does DML select \(\lambda\) for the LASSO estimator \(\eqref{eq:lasso}\) [2]?
    • "Oracle-rate" convergence of the estimator requires \(\lambda=\sigma \cdot 2 \sqrt{2\log(pn)/n}\) [3]
      • Hard to approximate \(\sigma\)
  • Solution: the square-root LASSO \(\eqref{eq:srlasso}\) [4] [5]
    • Rate-optimal \(\lambda\) independent of the error variance: \(\lambda=\sqrt{2\log(pn)/n}\) (an estimation sketch follows the display below)

\[ \begin{align} \text{LASSO}: \hspace{1cm} &\min_{\mathbf{w}} \hspace{3mm} \frac{1}{n} \sum_{i=1}^n (y_i - \mathbf{w}^T {\mathbf{x}_i})^2 + \lambda \| \mathbf{w} \|_1 \label{eq:lasso} \\ \sqrt{\text{LASSO}}: \hspace{1cm} &\min_{\mathbf{w}} \hspace{3mm} \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \mathbf{w}^T {\mathbf{x}_i})^2} + \lambda \| \mathbf{w} \|_1 \label{eq:srlasso} \end{align} \]
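
  • A minimal R sketch (not on the original slides) of the partialling-out estimate of \(\alpha_0\) implied by \(\eqref{eq:fwl1}\)-\(\eqref{eq:fwl3}\), continuing the simulated data above; for brevity the nuisance regressions use cross-validated glmnet rather than the square-root LASSO with the plug-in \(\lambda\), and refinements such as post-LASSO or sample splitting are ignored

library(glmnet)

# Residualize the treatment and the response on the high-dimensional controls
fit_d <- cv.glmnet(X, d)                                   # eq. (fwl1)
fit_y <- cv.glmnet(X, y)                                   # eq. (fwl2)
u_d <- as.numeric(d - predict(fit_d, newx = X, s = "lambda.min"))
u_y <- as.numeric(y - predict(fit_y, newx = X, s = "lambda.min"))

# Residual-on-residual regression recovers alpha0, eq. (fwl3)
dml_fit <- lm(u_y ~ u_d)
summary(dml_fit)$coefficients["u_d", ]                     # estimate and SE for alpha0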

Double machine learning

  • If \(y_i\) is Gaussian, my simulations suggest DML works well (arguably the best of the three approaches)

Double machine learning

  • For the GLM case, there are two approaches [6]
    • An "instrument" \(z_{0i}=z_0(d_i,{\mathbf{x}_i})\) with stimating equations: \(\eqref{eq:glm1}\) \(\eqref{eq:glm2}\) and orthogonality condition: \(\eqref{eq:glm3}\)
    • A "double selection procedure" which uses two "Post-LASSOs" and "honest confidence intervals"

\[ \begin{align} E[\{y_i - g^{-1}(\alpha_0 d_i + {\boldsymbol\beta}^T {\mathbf{x}_i})\} z_{0i} ] &= 0 \label{eq:glm1} \\ \frac{\partial}{\partial \alpha}E[\{y_i - g^{-1}(\alpha d_i + {\boldsymbol\beta}^T {\mathbf{x}_i})\} z_{0i} ] \Big|_{\alpha=\alpha_0} &\neq 0 \label{eq:glm2} \\ \frac{\partial}{\partial {\boldsymbol\beta}}E[\{y_i - g^{-1}(\alpha d_i + {\boldsymbol\beta}^T {\mathbf{x}_i})\} z_{0i} ] \Big|_{{\boldsymbol\beta}={\boldsymbol\beta}_0} &= \boldsymbol 0 \label{eq:glm3} \end{align} \]
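
  • A minimal R sketch (not on the original slides) of the double-selection idea for a binary response, continuing the simulated design above; cross-validated glmnet and ordinary glm standard errors stand in for the theoretically justified penalty and the honest intervals of [6]

library(glmnet)

# Binary treatment and response generated from the same sparse design
d01 <- rbinom(N, 1, plogis(as.numeric(X %*% phi0)))
y01 <- rbinom(N, 1, plogis(as.numeric(alpha0 * d01 + X %*% beta0)))

# Step 1: LASSO-select controls that predict the response
b_y <- as.numeric(coef(cv.glmnet(X, y01, family = "binomial"), s = "lambda.min"))[-1]
# Step 2: LASSO-select controls that predict the treatment (potential confounders)
b_d <- as.numeric(coef(cv.glmnet(X, d01, family = "binomial"), s = "lambda.min"))[-1]

# Step 3: refit an unpenalized logistic regression on d and the union of selected
#         controls; inference on the treatment coefficient comes from this final fit
ctrl <- which(b_y != 0 | b_d != 0)
ds_fit <- glm(y01 ~ d01 + X[, ctrl, drop = FALSE], family = binomial())
summary(ds_fit)$coefficients["d01", ]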

Double machine learning

  • My simulations suggest this approach works well too

Double machine learning

  • Strengths of DML:
    • Specifically focuses on a treatment variable
    • More likely to "work", i.e., to deliver valid inference on the treatment effect
  • Issues with DML:
    • The estimation procedure for GLMs is less theoretically grounded (choice of \(\lambda\))
    • Simulations still need to demonstrate how effective it is, combined with PLR, for censored and time-dependent survival data

Pooled Logistic Regression (PLR)

  • Logistic regression can be used to approximate the time-dependent Cox PH model [7] [8]
  • Interval data are "pooled" into a long format, with each interval given a factor (risk set); a pooling sketch follows the printed example below
## [1] "Survival format"
##   id start stop event x
## 1  a     0    2     1 2
## 2  b     0    3     1 1
## 3  c     0    4     0 3
## 4  c     4    6     0 4
## 5  c     6   10     1 6
## 6  d     0    5     1 3
## [1] "PLR format"
##    id t0 t1 x event interval
## 1   a  0  2 2     1        1
## 2   b  0  2 1     0        1
## 3   d  0  2 3     0        1
## 4   c  0  2 3     0        1
## 5   b  2  3 1     1        2
## 6   d  2  3 3     0        2
## 7   c  2  3 3     0        2
## 8   d  3  5 3     1        3
## 9   c  3  5 4     0        3
## 10  c  5 10 6     1        4
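
  • A minimal sketch (not from the original slides) of how this pooling can be produced with survival::survSplit, assuming a version of the survival package whose survSplit accepts counting-process (start, stop] input; the exact row layout and column names may differ slightly from the printout above

library(survival)

# Toy data in counting-process ("Survival") format, as printed above
surv_dat <- data.frame(id    = c("a", "b", "c", "c", "c", "d"),
                       start = c(0, 0, 0, 4, 6, 0),
                       stop  = c(2, 3, 4, 6, 10, 5),
                       event = c(1, 1, 0, 0, 1, 1),
                       x     = c(2, 1, 3, 4, 6, 3))

# Cut every record at the observed event times so each subject at risk
# contributes one row per risk set; 'interval' indexes the risk-set interval
cuts    <- sort(unique(surv_dat$stop[surv_dat$event == 1]))   # 2, 3, 5, 10
plr_dat <- survSplit(Surv(start, stop, event) ~ ., data = surv_dat,
                     cut = cuts, episode = "interval")

# The pooled rows can then be analyzed by ordinary logistic regression, e.g.
# glm(event ~ x + factor(interval), family = binomial(), data = plr_dat)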

Pooled Logistic Regression (PLR)

  • For example, using the Stanford heart transplant data (survival::heart), where Transplant is a time-varying covariate
  • Results from the two approaches are quite similar (a fitting sketch follows the table)
                   Cox PH     PLR-All    PLR-Fixed  PLR-Log
                     (1)        (2)        (3)        (4)
Intercept                     -4.028***  -3.389***  -3.470***
                              (1.024)    (0.314)    (0.281)
Age                0.027**     0.028**    0.027**    0.026*
                  (0.014)     (0.014)    (0.014)    (0.013)
Year              -0.146**    -0.158**   -0.157**   -0.174**
                  (0.070)     (0.071)    (0.071)    (0.070)
Surgery           -0.636*     -0.659*    -0.647*    -0.591
                  (0.367)     (0.372)    (0.371)    (0.369)
Transplant        -0.012      -0.017      0.043      0.041
                  (0.314)     (0.318)    (0.271)    (0.263)
Interval dummies   No          Yes        Yes        No
p                  4           66         28         6
N                  172         3568       3658       3568
Note: *p<0.1; **p<0.05; ***p<0.01
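
  • A minimal fitting sketch (not from the original slides) for columns (1) and (2), with the same survSplit caveat as above; the interval cut points and PLR variants used for the table may differ in detail

library(survival)

# (1) Cox PH with the time-varying transplant indicator
cox_fit <- coxph(Surv(start, stop, event) ~ age + year + surgery + transplant,
                 data = heart)

# (2) Pooled logistic regression with interval dummies ("PLR-All")
cuts    <- sort(unique(heart$stop[heart$event == 1]))
plr_dat <- survSplit(Surv(start, stop, event) ~ ., data = heart,
                     cut = cuts, episode = "interval")
plr_fit <- glm(event ~ age + year + surgery + transplant + factor(interval),
               family = binomial(), data = plr_dat)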


Pooled Logistic Regression (PLR)

  • Or the Chronic Granulomatous Disease dataset: survival::cgd
                   Cox PH     PLR-All    PLR-Fixed  PLR-Log
                     (1)        (2)        (3)        (4)
Intercept                     -4.326***  -4.051***  -4.077***
                              (1.412)    (1.018)    (1.003)
Treatment         -1.138***   -1.150***  -1.090***  -1.083***
                  (0.273)     (0.276)    (0.274)    (0.273)
Sex               -0.611      -0.647     -0.679*    -0.682*
                  (0.391)     (0.396)    (0.394)    (0.394)
Age               -0.097***   -0.100***  -0.099***  -0.099***
                  (0.036)     (0.036)    (0.035)    (0.035)
Height             0.004       0.005      0.005      0.005
                  (0.010)     (0.011)    (0.010)    (0.010)
Weight             0.018       0.018      0.019      0.019
                  (0.016)     (0.016)    (0.016)    (0.016)
Inheritance        0.588**     0.615**    0.617**    0.612**
                  (0.277)     (0.280)    (0.277)    (0.276)
Antibiotics       -0.439      -0.435     -0.432     -0.432
                  (0.312)     (0.316)    (0.314)    (0.314)
Interval dummies   No          Yes        Yes        No
p                  8           78         24         10
N                  203         7211       7211       7211
Note: *p<0.1; **p<0.05; ***p<0.01

Post-selection inference (LASSO)

  • Inference procedures for models that have already been selected:
    • LASSO, forward-stepwise regression, etc
  • Without adjustment or additional assumptions, inference on selected models yields optimistic p-values
  • Furthermore, the LASSO itself returns no p-values, SEs, or CIs

Post-selection inference (LASSO)

  • Post-selection inference (PSI) is a recent and rapidly growing field
    • Lockhart, Taylor, Tibs & Tibs (2014) A significance test for the lasso. Annals of Statistics
    • Lee, Sun, Sun, Taylor (2016) Exact post-selection inference, with application to the lasso. Annals of Statistics
    • Fithian, Sun, Taylor (2015) Optimal inference after model selection. arXiv. Submitted
    • Tibshirani (Ryan), Taylor, Lockhart, Tibs (2016) Exact Post-selection Inference for Sequential Regression Procedures. To appear, JASA
    • Tian, X. and Taylor, J. (2015) Selective inference with a randomized response. arXiv
    • Fithian, Taylor, Tibs, Tibs (2015) Selective Sequential Model Selection. arXiv Dec 2015

Post-selection inference (LASSO)

  • Principal idea: "any procedure for which the selection events can be characterized by a set of linear inequalities in \({\mathbf{y}}\): \(A{\mathbf{y}}\leq b\)" [9]
    • Known as "polyhedral constraints on \(y\)"
    • Includes algorithms like LASSO (fixed-\(\lambda\)), LAR, or forward-stepwise (FS)
  • In FS, the variable with the largest (absolute) inner product with \({\mathbf{y}}\) (relative to the previous step's fit) is selected
    • Suppose \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i\), with \(x_{i0}=1\) \(\forall i\)
    • If \({\mathbf{x}_2}\) is selected in the first step and its inner product with \({\mathbf{y}}\) is positive, then:

\[ \begin{align*} \langle {\mathbf{x}_2}, {\mathbf{y}}\rangle / \langle {\mathbf{x}_2}, {\mathbf{x}_2}\rangle \geq \pm \langle {\mathbf{x}_0}, {\mathbf{y}}\rangle / \langle {\mathbf{x}_0}, {\mathbf{x}_0}\rangle \\ \langle {\mathbf{x}_2}, {\mathbf{y}}\rangle / \langle {\mathbf{x}_2}, {\mathbf{x}_2}\rangle \geq \pm \langle {\mathbf{x}_1}, {\mathbf{y}}\rangle / \langle {\mathbf{x}_1}, {\mathbf{x}_1}\rangle \end{align*} \]

Post-selection inference (LASSO)

  • Which will give the following set of constraints:

\[ \begin{align*} \Bigg( {\mathbf{x}_2}^T-{\mathbf{x}_0}^T\frac{\langle {\mathbf{x}_2}, {\mathbf{x}_2}\rangle}{\langle {\mathbf{x}_0}, {\mathbf{x}_0}\rangle} \Bigg){\mathbf{y}}\geq 0 \\ \Bigg( {\mathbf{x}_2}^T+{\mathbf{x}_0}^T\frac{\langle {\mathbf{x}_2}, {\mathbf{x}_2}\rangle}{\langle {\mathbf{x}_0}, {\mathbf{x}_0}\rangle} \Bigg){\mathbf{y}}\geq 0 \\ \Bigg( {\mathbf{x}_2}^T-{\mathbf{x}_1}^T\frac{\langle {\mathbf{x}_2}, {\mathbf{x}_2}\rangle}{\langle {\mathbf{x}_1}, {\mathbf{x}_1}\rangle} \Bigg){\mathbf{y}}\geq 0 \\ \Bigg( {\mathbf{x}_2}^T+{\mathbf{x}_1}^T\frac{\langle {\mathbf{x}_2}, {\mathbf{x}_2}\rangle}{\langle {\mathbf{x}_1}, {\mathbf{x}_1}\rangle} \Bigg){\mathbf{y}}\geq 0 \\ \underbrace{A}_{4 \times N} {\mathbf{y}}\geq 0 \end{align*} \]

Post-selection inference (LASSO)

  • For the Gaussian FS-case, the polyhedral lemma would be:
    • Before selection, \(\langle {\mathbf{x}_2}, {\mathbf{y}}\rangle \in \mathbb{R}\)
    • After selection, its inner product with \({\mathbf{y}}\) is at least as large as that of the second-place finisher
    • Conditioning on the outcome of this competition means inference can be carried out with a truncated Gaussian distribution (see the sketch below)
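
  • To make the truncated-Gaussian step concrete, a minimal R sketch (not from the slides) of the truncation limits implied by constraints of the form \(A{\mathbf{y}}\geq 0\), following the polyhedral lemma of Lee et al. [12]; \(\Sigma=\sigma^2 I\) is assumed and all names are illustrative

# Truncation limits for a linear target eta'y under the selection event {A y >= 0},
# assuming y ~ N(mu, sigma^2 I) (polyhedral lemma of Lee et al. [12])
truncation_limits <- function(A, y, eta) {
  A2 <- -A                              # rewrite A y >= 0 as A2 y <= 0
  c_vec <- eta / sum(eta^2)             # Sigma eta / (eta' Sigma eta) when Sigma = sigma^2 I
  z  <- y - c_vec * sum(eta * y)        # component of y uncorrelated with eta'y
  Ac <- as.numeric(A2 %*% c_vec)
  r  <- -as.numeric(A2 %*% z)           # b - A2 z, with b = 0 here
  vlo <- if (any(Ac < 0)) max(r[Ac < 0] / Ac[Ac < 0]) else -Inf
  vup <- if (any(Ac > 0)) min(r[Ac > 0] / Ac[Ac > 0]) else  Inf
  c(vlo = vlo, vup = vup)
}

# A selective p-value for H0: eta'mu = 0 is a tail probability of a
# N(0, sigma^2 * sum(eta^2)) variable truncated to [vlo, vup], computable with pnorm()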

Post-selection inference (LASSO)

  • For the LASSO case, we observe \(M\), the index set of the non-zero coefficients
    • Perform inference on \({\boldsymbol\beta}_M \in \mathbb{R}^{|M|}\), assuming \({\boldsymbol\beta}_{M^C} \simeq 0\)
  • The approach has been adapted for logistic and Cox models too [10] (usage sketched after the display below)
  • For GLMs it works by using a "one-step estimator" \(\eqref{eq:ose}\)
    • \(s_M\): the signs of the selected coefficients
    • \(I_M\): Fisher Information matrix \(\Big( {\mathbf{X}}^T_M W(\hat{{\boldsymbol\beta}}_M) {\mathbf{X}}_M \Big)\)
    • Polyhedral constraints \(\eqref{eq:glm_poly}\)

\[ \begin{align} \bar{{\boldsymbol\beta}}_M &= \hat{{\boldsymbol\beta}}_M + \lambda \cdot I_M(\hat{{\boldsymbol\beta}}_M)^{-1}s_M \label{eq:ose} \\ \Big\{\text{diag}(s_M)&\Big[ \bar{{\boldsymbol\beta}}_M - I_M(\hat{{\boldsymbol\beta}}_M)^{-1} \lambda s_M \Big] \geq 0 \Big\} \label{eq:glm_poly} \end{align} \]
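
  • In practice this is available through the selectiveInference package; below is a minimal sketch (not from the slides) for the Gaussian case along the lines of the package documentation, reusing the simulated X and y from earlier (family = "binomial" and family = "cox" are handled analogously); the fixed \(\lambda\) is purely illustrative

library(glmnet)
library(selectiveInference)

# LASSO at a fixed lambda, then exact post-selection inference on the
# selected coefficients (Gaussian case shown; binomial/Cox work analogously)
n      <- nrow(X)
gfit   <- glmnet(X, y, standardize = FALSE)
lambda <- 20                                   # illustrative fixed value, not a recommendation
beta   <- as.numeric(coef(gfit, x = X, y = y, s = lambda / n, exact = TRUE))[-1]
out    <- fixedLassoInf(X, y, beta, lambda)
out                                            # selection-adjusted p-values and intervals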

Post-selection inference (LASSO)

  • My simulation studies for Cox models show higher variance but similar coverage

Post-selection inference (LASSO)

    • Strengths of PSI:
    • Developed tools and packages (selectiveInference)
    • Top-notch researchers contributing to this field
    • Highly scalable to large data sets (\(N\) or \(p\))
    • (Appears) to work with CV-selected \(\lambda\) too
  • Issues with PSI:
    • Not designed for analysis of a treatment variable: what if \(d_i\) isn't selected?
    • Time-varying covariates? (No conceptual problem)

Going forward

  • One week for time-dependent covariates and Cox-LASSO
    • Coordinate descent for Cox IRLS algorithm
  • Two weeks for DML approach
    • One week for theory (\(c(\alpha)\) statistics, score equations, etc)
    • One week for simulating time-dependent survival data and testing DML

References

1. Chernozhukov, V.; Hansen, C.; Spindler, M. Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics 2015, 7, 649–688.

2. Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 1996, 58, 267–288.

3. Bickel, P. J.; Ritov, Y.; Tsybakov, A. B. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics 2009, 37, 1705–1732.

4. Belloni, A.; Chernozhukov, V.; Wang, L. Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 2011, 98, 791–806.

5. Belloni, A.; Chernozhukov, V. Least squares after model selection in high-dimensional sparse models. Bernoulli 2013, 19, 521–547.

6. Belloni, A.; Chernozhukov, V.; Wei, Y. Post-selection inference for generalized linear models with many controls. Journal of Business and Economic Statistics 2016, 34, 606–619.

7. Ngwa; Cabral; Cheng. A comparison of time dependent Cox regression, pooled logistic regression and cross sectional pooling with simulations and an application to the Framingham Heart Study. BMC Medical Research Methodology 2016, 16.

8. D’Agostino; Lee; Belanger; Cupples; Anderson; Kannel. Relation of pooled logistic regression to time dependent Cox regression analysis: The Framingham Heart Study. Statistics in Medicine 1990.

9. Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical learning with sparsity: The lasso and generalizations; Chapman & Hall/CRC, 2015.

10. Taylor, J.; Tibshirani, R. Post-selection inference for l1-penalized likelihood models. Canadian Journal of Statistics 2017.

11. Lockhart, R.; Taylor, J.; Tibshirani, R. J.; Tibshirani, R. A significance test for the lasso. Annals of Statistics 2014, 42, 413–468.

12. Lee, J. D.; Sun, D. L.; Sun, Y.; Taylor, J. E. Exact post-selection inference, with application to the lasso. Annals of Statistics 2016, 44, 907–927.

13. Chernozhukov, V.; Hansen, C.; Spindler, M. High-Dimensional Metrics in R. arXiv 2016.

14. Dezeure, R.; Bühlmann, P.; Meier, L.; Meinshausen, N. High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science 2015, 30.