The method above, however, requires tuning two hyperparameters to obtain an appropriate \(m\), even though they are usually set to recommended values. In practice, we can instead use a double bootstrap and tune the resample size \(m\) directly and automatically.
Here we use a double bootstrap algorithm to choose the tuning parameter \(m\) in a data-driven way. Suppose we are interested in \(c^T \theta\), i.e., a stage 1 covariate effect, and that its estimate from the original data is \(c^T \hat{\theta}_{n}\). Consider a grid of candidate values for \(m\) (we used \(\{100, 200, 300, 500, 1000, 2000\}\) in our simulation study, where \(n = 500\)). The algorithm is as follows; a code sketch is given after the steps.
(1) Draw \(B_1\) \(n\)-out-of-\(n\) first-stage bootstrap samples from the data and calculate the bootstrap estimates \(c^T \hat{\theta}^{b_1}_{n}\), \(b_1 = 1, 2, \ldots, B_1\). Fix \(m\) at the smallest value in the grid.
(2) Conditional on each first-stage bootstrap sample, draw \(B_2\) \(m\)-out-of-\(n\) second-stage (nested) bootstrap samples and calculate the double bootstrap versions of the estimate, \(c^T \hat{\theta}^{b_1 b_2}_{n,m}\), \(b_1 = 1, 2, \ldots, B_1\), \(b_2 = 1, 2, \ldots, B_2\).
(3) For \(b_1 = 1, 2, \ldots, B_1\), compute the \((\alpha/2) \times 100\) and \((1-\alpha/2) \times 100\) percentiles of \(\{ \sqrt{m}\, c^T (\hat{\theta}^{b_1 b_2}_{n,m} - \hat{\theta}^{b_1}_{n}),\ b_2 = 1, \ldots, B_2\}\), say \(\hat{l}_{DB}^{(b_1)}\) and \(\hat{u}_{DB}^{(b_1)}\) respectively. Construct the double centered percentile bootstrap CI from the \(b_1\)th first-stage bootstrap sample as \((c^T \hat{\theta}^{b_1}_{n} - \hat{u}_{DB}^{(b_1)} / \sqrt{n},\ c^T \hat{\theta}^{b_1}_{n} - \hat{l}_{DB}^{(b_1)} / \sqrt{n})\), \(b_1 = 1, \ldots, B_1\).
(4) Estimate the coverage rate of the double bootstrap CI over all the first-stage bootstrap data sets as \[ \frac{1}{B_1} \sum_{b_1 = 1}^{B_1} I\{ c^T \hat{\theta}^{b_1}_{n} - \hat{u}_{DB}^{(b_1)} / \sqrt{n} \le c^T \hat{\theta}_{n} \le c^T \hat{\theta}^{b_1}_{n} - \hat{l}_{DB}^{(b_1)} / \sqrt{n} \} \]
(5) If the current coverage rate lies in \([(1-\alpha) - 1.96 \sqrt{\alpha(1-\alpha)/B_1},\ (1-\alpha) + 1.96 \sqrt{\alpha(1-\alpha)/B_1}]\), it is not significantly different from \(1-\alpha\), and we pick the current value of \(m\) as the final value. Otherwise, increase \(m\) to the next value in the grid.
Repeat steps (2)-(5) until the criterion in step (5) is satisfied or the grid is exhausted.
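For concreteness, here is a minimal Python sketch of the procedure above. It is only a sketch: it assumes the data are stored as a NumPy array of rows and that a user-supplied function `ctheta_hat` returns \(c^T\hat{\theta}\) for a given data set; the function names, the default grid, and the default values of \(B_1\) and \(B_2\) are illustrative choices rather than part of the method. The first-stage bootstrap samples are drawn once and reused for every candidate \(m\).

```python
import numpy as np

def choose_m_double_bootstrap(data, ctheta_hat,
                              m_grid=(100, 200, 300, 500, 1000, 2000),
                              B1=500, B2=500, alpha=0.05, seed=None):
    """Pick the m-out-of-n resample size whose double bootstrap CI for
    c^T theta attains (approximately) the nominal coverage 1 - alpha."""
    rng = np.random.default_rng(seed)
    n = len(data)
    target = ctheta_hat(data)                     # c^T theta_hat_n from the original data

    # Step (1): B1 first-stage n-out-of-n bootstrap samples, drawn once and reused.
    first_stage = [data[rng.integers(0, n, size=n)] for _ in range(B1)]
    theta_b1 = np.array([ctheta_hat(s) for s in first_stage])

    # Monte Carlo margin used in the coverage check of step (5).
    margin = 1.96 * np.sqrt(alpha * (1 - alpha) / B1)

    for m in m_grid:                              # smallest candidate first
        covered = 0
        for b1, sample in enumerate(first_stage):
            # Step (2): B2 nested m-out-of-n bootstrap samples from this first-stage sample.
            theta_b1b2 = np.array([ctheta_hat(sample[rng.integers(0, n, size=m)])
                                   for _ in range(B2)])
            # Step (3): percentiles of sqrt(m) * (theta_b1b2 - theta_b1).
            root = np.sqrt(m) * (theta_b1b2 - theta_b1[b1])
            l_db, u_db = np.percentile(root, [100 * alpha / 2, 100 * (1 - alpha / 2)])
            # Double-centered percentile CI from this first-stage sample.
            lo = theta_b1[b1] - u_db / np.sqrt(n)
            hi = theta_b1[b1] - l_db / np.sqrt(n)
            # Step (4): does the CI cover the original-data estimate?
            covered += int(lo <= target <= hi)
        coverage = covered / B1
        # Step (5): stop at the first m whose coverage is within the margin of 1 - alpha.
        if abs(coverage - (1 - alpha)) <= margin:
            return m, coverage
    return m_grid[-1], coverage                   # grid exhausted: fall back to the largest m
```

Because `ctheta_hat` is called \(B_1 \times B_2\) times for each candidate \(m\), in practice the loop over \(b_1\) is worth parallelising.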
The adaptive method of choosing \(m\) relies on the estimated degree of non-regularity \(\hat{p}\). Returning to the definition of \(\hat{p}\), however, we see that it involves only stage 2 quantities and is in fact independent of the stage 1 design matrix \(X_1\). The estimate of the stage 1 covariate effects, \(\hat{\beta}_1\), does depend on \(X_1\): under a linear model it is a weighted sum of the pseudo outcomes. Thus the estimates of different covariate effects may have different optimal values of \(m\) because of their different degrees of non-regularity.
In the adaptive method, the empirical non-regularity \(\hat{p} = \mathbb{P}_n I\{ n[ \mathbf{Z}_{21}^T\hat{\psi}_{2,n} ]^2 \le \tau_n(\mathbf{Z}_{21}) \}\) therefore does not contain enough information to yield an appropriate \(m\) for every bootstrap target.
The double bootstrap, being data driven, can handle this issue: for each parameter of interest \(c^T \hat{\theta}_n\) we compute the corresponding coverage rate to find an optimal, target-specific \(m\), rather than relying only on the empirical non-regularity. Thus different covariate effects (or linear combinations of them) may call for different values of \(m\). Concretely, under a linear stage 1 model the pseudo outcome and the resulting stage 1 estimate are
\[Y_{2,pseudo} = \hat{\beta}_2^T Z_{20} + |\hat{\psi}_2^T Z_{21}|\]
\[\widetilde{PO} = Y_1 + Y_{2,pseudo}\]
\[\hat{\beta}_1 = (X_1^TX_1)^{-1}X_1^T \widetilde{PO}\]
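To make this dependence structure explicit, the sketch below computes the stage 1 OLS estimate from the pseudo outcome and, separately, the empirical non-regularity \(\hat{p}\). It assumes the covariates are stored as NumPy design matrices and that the threshold \(\tau_n(\mathbf{Z}_{21})\) has already been evaluated for each subject; the function names are illustrative.

```python
import numpy as np

def stage1_ols(Y1, X1, Z20, Z21, beta2_hat, psi2_hat):
    """Stage 1 OLS estimate built from the pseudo outcome
    Y1 + beta2_hat' Z20 + |psi2_hat' Z21| (notation as in the text)."""
    Y2_pseudo = Z20 @ beta2_hat + np.abs(Z21 @ psi2_hat)  # maximised stage 2 contribution
    PO = Y1 + Y2_pseudo                                   # pseudo outcome
    beta1_hat, *_ = np.linalg.lstsq(X1, PO, rcond=None)   # (X1'X1)^{-1} X1' PO
    return beta1_hat

def empirical_nonregularity(Z21, psi2_hat, tau):
    """p_hat: proportion of subjects with n * (Z21' psi2_hat)^2 <= tau_n(Z21).
    `tau` is the per-subject threshold tau_n(Z21), assumed precomputed."""
    n = Z21.shape[0]
    return np.mean(n * (Z21 @ psi2_hat) ** 2 <= tau)
```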
Since the double bootstrap is nested, its computational cost is much higher than that of the adaptive method. Moreover, as argued above, different covariate effects may require different resample sizes \(m\), so we cannot obtain all the estimates in a single run.
When the stage 2 model is misspecified, \(\hat{Y}_{2,pseudo}\) is inconsistent and hence biased, especially when the non-regularity is severe. The stage 1 estimate \(c^T \hat{\theta}_{n}\) used in the double bootstrap, being a function of \(\hat{Y}_{2,pseudo}\), is then biased away from the true value, so the double bootstrap cannot choose an accurate \(m\), since its coverage criterion is evaluated against a biased target.
For the pure two-stage bootstrap, we may choose an extremely small \(m\) as the bootstrap sample size (e.g., \(m = 30\) for \(n = 500\)) and it still works. But when we combine it with censored data and inverse probability of censoring weighting (IPCW), \(m\) needs a lower bound (e.g., \(m > 100\) for \(n = 500\)); otherwise we may occasionally fail to fit the stage 2 model on a bootstrap sample, leading to biased and less accurate estimates. The reason is that some bootstrap samples are prone to fitting errors in real applications; these failed fits are hard to combine with the other results, and excluding them introduces selection bias into the final estimates.
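One way to choose that lower bound empirically is to estimate, for each candidate \(m\), how often the stage 2 fit would fail on an \(m\)-out-of-\(n\) resample. The sketch below is only illustrative: it assumes the data are a NumPy structured array with a stage 2 treatment field `A2` and a censoring indicator `delta` (1 = uncensored), and the feasibility rule and its threshold are placeholders rather than the criterion used in our implementation.

```python
import numpy as np

def stage2_fit_feasible(sample, min_events_per_arm=10):
    """Crude feasibility check: the stage 2 model can only be fit if each
    treatment arm retains enough uncensored subjects (threshold is illustrative)."""
    for a in np.unique(sample["A2"]):
        arm = sample[sample["A2"] == a]
        if np.sum(arm["delta"] == 1) < min_events_per_arm:
            return False
    return True

def stage2_failure_rate(data, m, B=1000, seed=None):
    """Fraction of m-out-of-n bootstrap samples on which the stage 2 fit would fail;
    m can then be chosen large enough to keep this rate negligible."""
    rng = np.random.default_rng(seed)
    n = len(data)
    fails = sum(not stage2_fit_feasible(data[rng.integers(0, n, size=m)])
                for _ in range(B))
    return fails / B
```

Screening candidate values of \(m\) in this way avoids discarding failed resamples after the fact, which is exactly the selection bias described above.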