```r
library(dplyr)
library(ggplot2)
library(robustbase)
library(knitr)

#' @param B number of resamples.
#' @param n number of observations in the generated data.
#' @param k number of regressors in the generated data.
#' @param sigma standard deviation of the errors in the generated data.
#' @param level probability for the confidence intervals, with the lower bound at confidence `conf = (1 - level)/2` and the upper bound at `1 - conf`.
#' @param sample_perc percentage of resampled observations with respect to the total sample.
#'   The number of observations in the subsample is computed as `n_bar = trunc(n*sample_perc)`.
#' @param ci type of confidence interval. Can be `empiric`, in which bootstrapped quantiles are used,
#'   or `theoric`, in which normal quantiles are used.
#' @param d threshold parameter for the Tukey function.
#' @param eps threshold for outliers. We generate a random sequence `tau` in `[0,1]`.
#'   When `tau < eps` the observation becomes an outlier.
#' @param z value of the outlier.
#' @param type type of contamination. When `type = "R"` the outlying observation becomes equal to `z`;
#'   when `type = "M"` it becomes equal to `z*Y[tau < eps]`.

# Data generation parameters
n = 1000
k = 8
sigma = 4
seed = 0

# Bootstrap parameters
B = 500
level = 0.95
sample_perc = 0.5

# Outliers parameters
d = 4.56
eps = 0.05
z = 1.9
type = "M"
```
We review a bootstrap method for robust estimators presented by Salibián-Barrera, Van Aelst, and Willems (2008), which is computationally faster and more resistant to outliers than the classical bootstrap. Under reasonable regularity conditions, this fast and robust bootstrap method is asymptotically consistent.
Robustness to outliers
In most practical cases the bootstrap is based on simulating the distribution of the estimator of interest by generating a large number of new samples randomly drawn from the original data. The estimator is recalculated in each of these bootstrap samples, and the empirical distribution of these bootstrapped estimates yields an approximation to the estimator’s true sampling distribution.
However, two main problems arise when we want to use the bootstrap to estimate the sampling distribution of robust estimators: numerical instability and computational cost.
Intuitively, the reason for the numerical instability mentioned above is as follows. Since outlying and non-outlying observations have the same chance of belonging to any bootstrap sample, with a certain positive probability, the proportion of outliers in a bootstrap sample may be larger than the fraction of contamination that the robust estimator can tolerate. In other words, a certain proportion of the re-calculated values of the robust estimate may be heavily influenced by the outliers in the data.
Salibian-Barrera and Zamar (2002) introduced a bootstrap method for robust estimators which has the advantage of being at the same time easy to compute and resistant to outliers in the sample (in other words, it can overcome the two problems mentioned above). The method applies to robust estimators that can be written as the solution of a system of smooth fixed-point (weighted least squares) equations; examples of well-known robust estimators that fulfill this condition are S-estimators, MM-estimators and \tau-estimators.
Notations
Let us denote by n the number of observations and by k the number of regressors. To simplify the computations, we will use matrix notation. All bold symbols are interpreted as vectors or matrices.
The functions \rho_0 and \rho_1 are the Tukey (bisquare) functions used in the S-estimate and in the MM-estimate, respectively. For simplicity we fix a threshold d and use the same Tukey function for both the S and MM estimates: \rho_d(x) = \begin{cases} \dfrac{3x^2}{d^2} - \dfrac{3x^4}{d^4} + \dfrac{x^6}{d^6} & |x| \le d \\ 1 & |x| > d \end{cases} \tag{1}
For the derivative functions of Tukey's \rho we refer to the linked page; a small R implementation is also sketched below.
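The weights and correction factors below use \rho, \rho^{\prime} and \rho^{\prime\prime}; a direct R implementation of Equation 1 and of its first two derivatives could look as follows (the names `rho_d`, `rho_d1` and `rho_d2` are illustrative, not part of any package):

```r
# Tukey's function (Equation 1) and its first two derivatives;
# rho_d, rho_d1 and rho_d2 are illustrative names used only in this post.
rho_d <- function(x, d = 4.56){
  ifelse(abs(x) <= d, 3*x^2/d^2 - 3*x^4/d^4 + x^6/d^6, 1)
}
rho_d1 <- function(x, d = 4.56){ # first derivative
  ifelse(abs(x) <= d, 6*x/d^2 - 12*x^3/d^4 + 6*x^5/d^6, 0)
}
rho_d2 <- function(x, d = 4.56){ # second derivative
  ifelse(abs(x) <= d, 6/d^2 - 36*x^2/d^4 + 30*x^4/d^6, 0)
}
```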
The estimate \hat{\boldsymbol{\beta}} is obtained from an MM-regression, while \tilde{\boldsymbol{\beta}} and \hat{\sigma} are obtained from an S-regression.
Robust Regression
Consider a regression model in matrix form
\underset{\small n \times 1}{\boldsymbol{Y}} = \underset{\small n \times k}{\boldsymbol{X}} \; \underset{\small k \times 1}{\boldsymbol{\beta}} + \sigma \underset{\small n \times 1}{\boldsymbol{\epsilon}} where \boldsymbol{\epsilon} \sim \mathbb{F}^{c}, and \boldsymbol{\beta} and \sigma are the true population parameters. For simplicity we consider normally distributed errors, namely \boldsymbol{\epsilon} \sim \mathbb{F}^{c} = N(0,1).
Initialization
Given initial estimates \hat{\boldsymbol{\beta}}, \tilde{\boldsymbol{\beta}} and \hat{\sigma}, compute the **residuals** for the MM and S regressions respectively:

\underset{\small n \times 1}{\boldsymbol{r}} = \underset{\small n \times 1}{\boldsymbol{Y}} - \underset{\small n \times k}{\boldsymbol{X}} \; \underset{\small k \times 1}{\hat{\boldsymbol{\beta}}} \quad \quad \underset{\small n \times 1}{\boldsymbol{\tilde{r}}} = \underset{\small n \times 1}{\boldsymbol{Y}} - \underset{\small n \times k}{\boldsymbol{X}} \; \underset{\small k \times 1}{\tilde{\boldsymbol{\beta}}} \tag{2}

Compute the **weights** for i = 1, 2, \dots, n:

w_i = \frac{1}{r_i}\rho_1^{\prime} \biggl(\frac{r_i}{\hat{\sigma}}\biggr) \tag{3}

\tilde{w}_i = \frac{1}{\tilde{r}_i}\rho_0^{\prime} \biggl(\frac{\tilde{r}_i}{\hat{\sigma}}\biggr) \tag{4}

v_i = \frac{\hat{\sigma}}{n \, b \, \tilde{r}_i}\rho_0 \biggl(\frac{\tilde{r}_i}{\hat{\sigma}}\biggr) \tag{5}

where b = \mathbb{E}^{\mathbb{F}^{c}}[\rho_0(\epsilon)]. Then the MM-parameters are computed as:

\underset{k \times 1}{\hat{\boldsymbol{\beta}}} = (\underset{k \times n}{\boldsymbol{X}^{T}} \underset{n \times n}{\text{diag}(\boldsymbol{w})} \underset{n \times k}{\boldsymbol{X}})^{-1} \underset{k \times n}{\boldsymbol{X}^T} \underset{n \times n}{\text{diag}(\boldsymbol{w})} \underset{n \times 1}{\boldsymbol{Y}} \tag{6}

while for the S-parameters,

\underset{k \times 1}{\boldsymbol{\tilde{\beta}}} = (\underset{k \times n}{\boldsymbol{X}^{T}} \underset{n \times n}{\text{diag}(\boldsymbol{\tilde{w}})} \underset{n \times k}{\boldsymbol{X}})^{-1} \underset{k \times n}{\boldsymbol{X}^T} \underset{n \times n}{\text{diag}(\boldsymbol{\tilde{w}})} \underset{n \times 1}{\boldsymbol{Y}} \tag{7}

and the S-scale is obtained as

\underset{1 \times 1}{\hat{\sigma}} = \underset{1 \times n}{\tilde{\boldsymbol{r}}^T} \, \underset{n \times 1}{\boldsymbol{v}} \iff \hat{\sigma} = \sum_{i = 1}^{n} v_i \tilde{r}_i \tag{8}

Finally, the correction factors, computed once on the complete sample, are defined as:

\underset{k \times k}{\boldsymbol{M}} = \hat{\sigma} \biggl(\boldsymbol{X}^T \text{diag}\Bigl[\rho_1^{\prime\prime}\bigl(\tfrac{\boldsymbol{r}}{\hat{\sigma}}\bigr)\Bigr] \boldsymbol{X}\biggr)^{-1} \boldsymbol{X}^T \text{diag}(\boldsymbol{w}) \, \boldsymbol{X} \quad\quad \underset{k \times k}{\tilde{\boldsymbol{M}}} = \boldsymbol{0} \tag{9}

\underset{k \times 1}{\boldsymbol{d}} = \frac{1}{a_n} \biggl(\boldsymbol{X}^T \text{diag}\Bigl[\rho_1^{\prime\prime}\bigl(\tfrac{\boldsymbol{r}}{\hat{\sigma}}\bigr)\Bigr] \boldsymbol{X}\biggr)^{-1} \boldsymbol{X}^T \Bigl[\rho_1^{\prime\prime}\bigl(\tfrac{\boldsymbol{r}}{\hat{\sigma}}\bigr) \odot \boldsymbol{r}\Bigr] \tag{10}

\underset{1 \times 1}{a_n} = \frac{1}{n \, b} \sum_{i=1}^{n} \biggl[\rho_0^{\prime} \biggl(\frac{\tilde{r}_i}{\hat{\sigma}}\biggr) \, \frac{\tilde{r}_i}{\hat{\sigma}} \biggr] \tag{11}

where \odot denotes the Hadamard (element-wise) product.
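As an illustration of these formulas (not the actual `fastRobBoot` code), the initialization step could be sketched in R as follows, assuming the design matrix `X`, the response `Y` and initial estimates `beta_hat` (MM), `beta_tilde` and `sigma_hat` (S) are available, for instance from `robustbase::lmrob`, and reusing the Tukey helpers sketched above:

```r
# Minimal sketch of the initialization step (illustrative names only).
# Assumes X (n x k), Y (n x 1) and initial estimates beta_hat (MM),
# beta_tilde and sigma_hat (S), e.g. obtained with robustbase::lmrob().
n <- nrow(X)
b <- integrate(function(e) rho_d(e, d = 4.56) * dnorm(e), -Inf, Inf)$value  # b = E[rho_0(eps)]

r  <- as.numeric(Y - X %*% beta_hat)     # MM residuals (Equation 2)
rt <- as.numeric(Y - X %*% beta_tilde)   # S residuals (Equation 2)

w  <- rho_d1(r  / sigma_hat) / r                        # Equation 3
wt <- rho_d1(rt / sigma_hat) / rt                       # Equation 4 (same rho for S and MM here)
v  <- sigma_hat * rho_d(rt / sigma_hat) / (n * b * rt)  # Equation 5

# Weighted least-squares representations (Equations 6-8)
beta_hat_wls   <- solve(t(X) %*% (w  * X), t(X) %*% (w  * Y))
beta_tilde_wls <- solve(t(X) %*% (wt * X), t(X) %*% (wt * Y))
sigma_hat_wls  <- sum(v * rt)

# Correction factors (Equations 9-11), computed once on the complete sample
A   <- solve(t(X) %*% (rho_d2(r / sigma_hat) * X))
M   <- sigma_hat * A %*% (t(X) %*% (w * X))
a_n <- sum(rho_d1(rt / sigma_hat) * rt / sigma_hat) / (n * b)
d_n <- A %*% (t(X) %*% (rho_d2(r / sigma_hat) * r)) / a_n
```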
Bootstrapping
Once the initial parameters are estimated, consider a sample of n^{\ast} observations from which we extract \boldsymbol{X}^{\ast}, \boldsymbol{Y}^{\ast}. The residuals are then updated on the sub-sample as:

\underset{\small n^{\ast} \times 1}{\boldsymbol{r}^{\ast}} = \underset{\small n^{\ast} \times 1}{\boldsymbol{Y}^{\ast}} - \underset{\small n^{\ast} \times k}{\boldsymbol{X}^{\ast}} \; \underset{\small k \times 1}{\hat{\boldsymbol{\beta}}} \quad \quad \underset{\small n^{\ast} \times 1}{\boldsymbol{\tilde{r}}^{\ast}} = \underset{\small n^{\ast} \times 1}{\boldsymbol{Y}^{\ast}} - \underset{\small n^{\ast} \times k}{\boldsymbol{X}^{\ast}} \; \underset{\small k \times 1}{\tilde{\boldsymbol{\beta}}} \tag{12}

Then, for i = 1, 2, \dots, n^{\ast}, the **updated weights** become:

w_i^{\ast} = \frac{1}{r_i^{\ast}}\rho_1^{\prime} \biggl(\frac{r_i^{\ast}}{\hat{\sigma}}\biggr) \tag{13}

\tilde{w}_i^{\ast} = \frac{1}{\tilde{r}_i^{\ast}}\rho_0^{\prime} \biggl(\frac{\tilde{r}_i^{\ast}}{\hat{\sigma}}\biggr) \tag{14}

v_i^{\ast} = \frac{\hat{\sigma}}{n^{\ast} \, b \, \tilde{r}_i^{\ast}}\rho_0 \biggl(\frac{\tilde{r}_i^{\ast}}{\hat{\sigma}}\biggr) \tag{15}

We recalculate the one-step estimates \boldsymbol{\hat{\beta}}^{1\ast}, \boldsymbol{\tilde{\beta}}^{1\ast} and \hat{\sigma}^{1\ast} as in Equation 6, Equation 7 and Equation 8 respectively, but with the sub-sample elements. Then, the linearly corrected estimator used for building the confidence intervals is obtained as:

\underset{k \times 1}{\hat{\boldsymbol{\beta}}^{R\ast}} = \hat{\boldsymbol{\beta}} + \boldsymbol{M} (\boldsymbol{\hat{\beta}}^{1\ast} - \boldsymbol{\hat{\beta}}) + \boldsymbol{d} (\hat{\sigma}^{1\ast} - \hat{\sigma}) + \boldsymbol{\tilde{M}} (\boldsymbol{\tilde{\beta}}^{1\ast} - \boldsymbol{\tilde{\beta}}) \tag{16}
Since \tilde{\boldsymbol{M}} is assumed to be zero, it is possible to avoid the calculation of \boldsymbol{\tilde{\beta}}^{1\ast} and \tilde{w}_i^{\ast} on the sub-sample and consider a simplified version: \underset{k \times 1}{\hat{\boldsymbol{\beta}}^{R\ast}} = \hat{\boldsymbol{\beta}} + \boldsymbol{M} (\boldsymbol{\hat{\beta}}^{1\ast} - \boldsymbol{\hat{\beta}}) + \boldsymbol{d} (\hat{\sigma}^{1\ast} - \hat{\sigma}) \tag{17}
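Putting Equation 12 to Equation 17 together, a single bootstrap replicate can be sketched as follows (illustrative names again, reusing the quantities from the initialization sketch; here the subsample is drawn with replacement, with `sample_perc = 0.5`):

```r
# Minimal sketch of a single fast robust bootstrap replicate (Equations 12-17);
# reuses X, Y, beta_hat, beta_tilde, sigma_hat, b, M and d_n from the sketch above.
n_star <- trunc(nrow(X) * 0.5)                        # subsample size (sample_perc = 0.5)
idx    <- sample(nrow(X), size = n_star, replace = TRUE)
X_star <- X[idx, , drop = FALSE]
Y_star <- Y[idx, , drop = FALSE]

r_star  <- as.numeric(Y_star - X_star %*% beta_hat)   # Equation 12
rt_star <- as.numeric(Y_star - X_star %*% beta_tilde) # Equation 12

w_star <- rho_d1(r_star / sigma_hat) / r_star                              # Equation 13
v_star <- sigma_hat * rho_d(rt_star / sigma_hat) / (n_star * b * rt_star)  # Equation 15

# One-step estimates on the subsample (Equations 6 and 8 with the starred quantities)
beta_1star  <- solve(t(X_star) %*% (w_star * X_star), t(X_star) %*% (w_star * Y_star))
sigma_1star <- sum(v_star * rt_star)

# Linearly corrected bootstrap estimate (Equation 17)
beta_Rstar <- beta_hat + M %*% (beta_1star - beta_hat) + d_n * (sigma_1star - sigma_hat)
```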
Routine
1. **Initialization**
   - Compute \hat{\boldsymbol{\beta}} (MM-estimate) and \tilde{\boldsymbol{\beta}}, \hat{\sigma} (S-estimate).
   - Compute the correction factors \boldsymbol{M}, a_n and \boldsymbol{d} on the complete sample.
2. **Bootstrapping.** For b = 1, 2, \dots, B:
   - Extract a sample \boldsymbol{X}^{\ast}, \boldsymbol{Y}^{\ast} with n^{\ast} observations.
   - Update the weights \boldsymbol{w}^{\ast}, \boldsymbol{v}^{\ast}.
   - Compute the one-step estimates \hat{\boldsymbol{\beta}}^{1\ast} and \hat{\sigma}^{1\ast}, and the corrected estimate \hat{\boldsymbol{\beta}}^{R\ast}.
3. **Confidence intervals** for \hat{\boldsymbol{\beta}}_j, j = 1,\dots,k, with two methods (see the sketch after this list):
   - Theoretical (`ci = "theoric"`): with normal quantiles, \hat{\boldsymbol{\beta}}_{j} \pm z_{\alpha/2} \, \hat{\sigma}_j, where z_{\alpha/2} is the standard normal quantile of order 1 - \alpha/2 and \hat{\sigma}_j is the estimated asymptotic standard deviation of \hat{\boldsymbol{\beta}}_{j}.
   - Empirical (`ci = "empiric"`): with the empirical quantiles of the bootstrapped distribution of \hat{\boldsymbol{\beta}}_{j}.
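The two confidence-interval rules of the last step can be sketched as follows, where `beta_boot` is an assumed B x k matrix of corrected bootstrap estimates and `beta_hat` the vector of initial estimates (both names are illustrative):

```r
# Minimal sketch of the two confidence-interval rules; beta_boot (B x k matrix of
# corrected bootstrap estimates) and beta_hat (initial estimates) are assumed available.
alpha <- 1 - level   # level = 0.95, as set in the parameters above

# Empirical: quantiles of the bootstrapped distribution
ci_empiric <- apply(beta_boot, 2, quantile, probs = c(alpha / 2, 1 - alpha / 2))

# Theoretical: normal quantiles around the estimate, using the bootstrap standard deviation
beta_sd    <- apply(beta_boot, 2, sd)
ci_theoric <- rbind(lower = beta_hat - qnorm(1 - alpha / 2) * beta_sd,
                    upper = beta_hat + qnorm(1 - alpha / 2) * beta_sd)
```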
Implementation Example
Data Generator Function
```r
#' Generate regression data
#'
#' @param n number of observations in the generated data.
#' @param k number of regressors in the generated data.
#' @param sigma standard deviation of the errors in the generated data.
#' @param intercept set to `TRUE` to include the intercept parameter.
#' @param seed random seed.
#' @param eps threshold for outliers. We generate a random sequence `tau` in `[0,1]`.
#'   When `tau < eps` the observation becomes an outlier.
#' @param z value of the outlier.
#' @param type type of contamination. When `type = "R"` the outlying observation becomes equal to `z`;
#'   when `type = "M"` it becomes equal to `z*Y[tau < eps]`.
#' @rdname generate_regression_data
#' @name generate_regression_data
generate_regression_data <- function(n = 5000, k = 8, sigma = 2, intercept = TRUE,
                                     eps = 0, z = 100, type = "R", seed = 1){
  set.seed(seed)
  # Simulate the true beta
  true_beta <- rnorm(k, mean = 0, sd = 1)
  names(true_beta) <- paste0("beta_", 1:k)
  # Vector of true parameters
  theta_true <- c(true_beta, sigma = sigma)
  # Simulate the regressors
  X <- matrix(rnorm(n*k), nrow = n, ncol = k, byrow = TRUE)
  colnames(X) <- paste0("X", 1:k)
  # Include an intercept
  if (intercept) {
    X[, 1] <- rep(1, n)
  }
  # Regression errors
  eps_sim <- rnorm(n, mean = 0, sd = sigma)
  # Response variable
  Y <- X %*% true_beta + eps_sim
  # Contamination with outliers
  if (eps != 0) {
    tau <- runif(n)
    type <- match.arg(type, choices = c("R", "M"))
    if (type == "R") {
      Y[tau < eps, ] <- z
    } else if (type == "M") {
      Y[tau < eps, ] <- z*Y[tau < eps, ]
    }
  }
  colnames(Y) <- "Y"
  structure(
    list(
      X = X,
      Y = Y,
      e = eps,          # contamination threshold
      z = z,
      type = type,
      eps = eps_sim,    # simulated errors
      k = k,
      n = n,
      theta = theta_true
    )
  )
}
```
```r
#' @param object object coming from the fastRobBoot function.
#' @param k index of the parameter to plot; if `NULL` all the parameters will be plotted.
#' @param legend position of the legend, can be `none`, `top`, `left`, `right`, `bottom`.
#'   In case of multiple plotted parameters, it will be included only on the first plot.
#' @param title when `TRUE` a subtitle is added, only on the first plot.
plot.fastRobBoot <- function(object, k = NULL, legend = "none", title = TRUE){
  plot_fastRobBoot <- function(object, k = 1, legend = "top", title = TRUE){
    # Bootstrapped series
    theta_boot <- c(object$bootstrap[, k])[[1]]
    # Parameter name
    theta_name <- names(object$bootstrap[, k])
    theta_min  <- min(theta_boot)             # minimum value
    theta_max  <- max(theta_boot)             # maximum value
    theta_lo   <- object$results$lower[k]     # lower quantile
    theta_up   <- object$results$upper[k]     # upper quantile
    theta_true <- object$results$true[k]      # true parameter
    theta_init <- object$results$init[k]      # initial estimate
    e_theta    <- object$results$mean[k]      # expected value
    sd_theta   <- object$results$sigma[k]     # standard deviation
    # Theoretical distribution
    xy_e_theta  <- list(x = e_theta,  y = dnorm(e_theta,  mean = e_theta, sd = sd_theta))
    xy_theta_lo <- list(x = theta_lo, y = dnorm(theta_lo, mean = e_theta, sd = sd_theta))
    xy_theta_up <- list(x = theta_up, y = dnorm(theta_up, mean = e_theta, sd = sd_theta))
    # Grid for the theoretical distribution
    grid <- list(x = 0, y = 0)
    grid$x <- seq(theta_min, theta_max, length.out = 100)
    grid$y <- dnorm(grid$x, mean = e_theta, sd = sd_theta)
    plt <- ggplot() +
      geom_histogram(aes(theta_boot, ..density..), color = "black", fill = "lightgray",
                     alpha = 0.5, bins = 30) +
      geom_segment(aes(x = xy_e_theta$x, xend = xy_e_theta$x, y = 0, yend = xy_e_theta$y,
                       color = "mean"), linewidth = 1.1) +
      geom_segment(aes(x = xy_theta_lo$x, xend = xy_theta_lo$x, y = 0, yend = xy_theta_lo$y,
                       color = "lower"), linewidth = 1.1) +
      geom_segment(aes(x = xy_theta_up$x, xend = xy_theta_up$x, y = 0, yend = xy_theta_up$y,
                       color = "upper"), linewidth = 1.1) +
      geom_line(aes(grid$x, grid$y, color = "theoric"), linewidth = 1.3) +
      geom_point(aes(x = theta_init, y = 0, color = "init"), size = 3)
    if (!is.na(theta_true[1])){
      plt <- plt +
        geom_point(aes(x = theta_true, y = 0, color = "true"), size = 2,
                   fill = "purple", shape = 17)
    }
    plt <- plt +
      scale_color_manual(
        values = c(mean = "green", lower = "red", upper = "blue", theoric = "darkorange",
                   true = "purple", init = "magenta"),
        labels = c(mean = "Mean", lower = "Lower", upper = "Upper", theoric = "Theoretical",
                   true = "True", init = "Initial")
      ) +
      theme_bw() +
      theme(legend.position = legend) +
      labs(x = theta_name, y = NULL, color = NULL,
           subtitle = ifelse(title, paste0(nrow(object$bootstrap), " bootstrap"), ""))
    return(plt)
  }
  max_k <- nrow(object$results)
  if (is.null(k)){
    k <- 1:max_k
  }
  plt <- list()
  plt[[1]] <- plot_fastRobBoot(object, k = k[1], legend = legend, title = title)
  if (length(k) > 1){
    k[k > max_k | k == k[1]] <- NA
    k <- na.omit(k)
    for (i in k){
      plt[[i]] <- plot_fastRobBoot(object, k = i, legend = "none", title = FALSE)
    }
  }
  gridExtra::grid.arrange(grobs = plt)
}
```
Test: is it fast?
We compare the proposed fast robust bootstrap procedure with another approach that is still robust but slower: in the slow robust approach we simply re-estimate the parameters on each sub-sample with an MM-estimator (sketched below). We can see that the fast robust bootstrap is almost 7x faster.
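For reference, the slow robust comparator boils down to a full MM refit on every resample, along the lines of this sketch (using `robustbase::lmrob`; `slow_robust_boot` is an illustrative name, not the `slowRobBoot` function used below):

```r
# Minimal sketch of the slow robust bootstrap: a full MM refit (robustbase::lmrob)
# on every resample; this refit is the costly step that the fast robust bootstrap avoids.
slow_robust_boot <- function(X, Y, B = 500, sample_perc = 0.5){
  n_bar <- trunc(nrow(X) * sample_perc)
  replicate(B, {
    idx <- sample(nrow(X), size = n_bar, replace = TRUE)
    coef(robustbase::lmrob(Y[idx] ~ X[idx, ] - 1))  # X already contains the intercept column
  })
}
```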
```r
init_time <- Sys.time()
object <- fastRobBoot(B = B, n = n, k = k, sigma = sigma,
                      level = level, sample_perc = sample_perc, d = d,
                      ci = "empiric", seed = seed, quiet = TRUE)
end_time <- Sys.time()

init_time2 <- Sys.time()
object2 <- slowRobBoot(B = B, n = n, k = k, sigma = sigma,
                       level = level, sample_perc = sample_perc,
                       ci = "empiric", seed = seed, quiet = TRUE, robust = TRUE)
end_time2 <- Sys.time()
```
[1] "Execution time (Fast Robust): 3.54428815841675 seconds!"
[1] "Execution time (Slow Robust): 26.7345499992371 seconds!"
Figure (Fast Robust Bootstrap)
plot(object)
Fast robust bootstrap on subsamples without outliers. In green the bootstrapped mean parameter, in magenta the initial estimate parameter, in purple the true parameter. The histogram is obtained from the bootstrapped distribution while the theoretical normal distribution is in orange. The lower and upper confidence levels are respectively in red and blue.
Figure (Slow Robust Bootstrap)
plot(object2)
Slow robust bootstrap on subsamples without outliers. In green the bootstrapped mean parameter, in magenta the initial estimate parameter, in purple the true parameter. The histogram is obtained from the bootstrapped distribution while the theoretical normal distribution is in orange. The lower and upper confidence levels are respectively in red and blue.
(Fast Robust) Confidence intervals without outliers.
| theta | init | lower | mean | upper | true | bias | sigma |
|---|---|---|---|---|---|---|---|
| \hat{\beta_{1}} | 1.0589404 | 0.7889691 | 1.0548896 | 1.3078614 | 1.2629543 | -0.2080647 | 0.1294458 |
| \hat{\beta_{2}} | -0.3988854 | -0.6793477 | -0.4046007 | -0.1390852 | -0.3262334 | -0.0783674 | 0.1334516 |
| \hat{\beta_{3}} | 1.3919871 | 1.1320694 | 1.3882910 | 1.6424921 | 1.3297993 | 0.0584917 | 0.1284604 |
| \hat{\beta_{4}} | 1.3962083 | 1.1343451 | 1.3945891 | 1.6418834 | 1.2724293 | 0.1221598 | 0.1260848 |
| \hat{\beta_{5}} | 0.5257455 | 0.2716822 | 0.5210130 | 0.7807042 | 0.4146414 | 0.1063715 | 0.1301632 |
| \hat{\beta_{6}} | -1.4569341 | -1.6987144 | -1.4554621 | -1.2101879 | -1.5399500 | 0.0844879 | 0.1295230 |
| \hat{\beta_{7}} | -0.8917018 | -1.1550028 | -0.8972858 | -0.6389953 | -0.9285670 | 0.0312813 | 0.1332243 |
| \hat{\beta_{8}} | -0.1095115 | -0.3542053 | -0.1062331 | 0.1613692 | -0.2947204 | 0.1884873 | 0.1348299 |
| \hat{\sigma} | 4.0257902 | 3.6790257 | 3.9813193 | 4.2993828 | 4.0000000 | -0.0186807 | 0.1525036 |
(Slow Robust) Confidence intervals without outliers.
| theta | init | lower | mean | upper | true | bias | sigma |
|---|---|---|---|---|---|---|---|
| \hat{\beta_{1}} | 1.0511706 | 0.7878794 | 1.0674608 | 1.3628851 | 1.2629543 | -0.1954935 | 0.1480856 |
| \hat{\beta_{2}} | -0.3981599 | -0.6498027 | -0.3865137 | -0.1022582 | -0.3262334 | -0.0602804 | 0.1517938 |
| \hat{\beta_{3}} | 1.3961376 | 1.1330371 | 1.3866320 | 1.6821311 | 1.3297993 | 0.0568327 | 0.1409718 |
| \hat{\beta_{4}} | 1.3954471 | 1.1021259 | 1.3930762 | 1.6847461 | 1.2724293 | 0.1206469 | 0.1418879 |
| \hat{\beta_{5}} | 0.5261062 | 0.2400689 | 0.5275744 | 0.7928579 | 0.4146414 | 0.1129330 | 0.1577865 |
| \hat{\beta_{6}} | -1.4521350 | -1.7256519 | -1.4659156 | -1.2128659 | -1.5399500 | 0.0740344 | 0.1444875 |
| \hat{\beta_{7}} | -0.8949850 | -1.1179134 | -0.8891667 | -0.6380188 | -0.9285670 | 0.0394003 | 0.1317179 |
| \hat{\beta_{8}} | -0.1095828 | -0.3904141 | -0.0922017 | 0.1573025 | -0.2947204 | 0.2025187 | 0.1453933 |
| \hat{\sigma} | 4.0112498 | 3.8397435 | 4.0192902 | 4.2228865 | 4.0000000 | 0.0192902 | 0.1002278 |
Test: is it robust?
We compare the proposed fast robust bootstrap procedure with another approach that is faster but not robust: in the non-robust approach we simply re-estimate the parameters on each sub-sample with an OLS estimator (sketched below).
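The non-robust comparator simply replaces the MM refit with an OLS fit on each resample, roughly as follows (`ols_boot` is an illustrative name, not the `slowRobBoot(robust = FALSE)` call used below):

```r
# Minimal sketch of the non-robust bootstrap: plain OLS on every resample.
# Fast, but the refits are driven by whatever outliers end up in each subsample.
ols_boot <- function(X, Y, B = 500, sample_perc = 0.5){
  n_bar <- trunc(nrow(X) * sample_perc)
  replicate(B, {
    idx <- sample(nrow(X), size = n_bar, replace = TRUE)
    coef(lm(Y[idx] ~ X[idx, ] - 1))  # X already contains the intercept column
  })
}
```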
```r
init_time <- Sys.time()
object_pert <- fastRobBoot(B = B, n = n, k = k, sigma = sigma,
                           level = level, sample_perc = sample_perc, d = d,
                           eps = eps, z = z, type = type,
                           ci = "empiric", seed = seed, quiet = TRUE, consistency = FALSE)
end_time <- Sys.time()

init_time2 <- Sys.time()
object2_pert <- slowRobBoot(B = B, n = n, k = k, sigma = sigma,
                            level = level, sample_perc = sample_perc,
                            eps = eps, z = z, type = type,
                            ci = "empiric", seed = seed, quiet = TRUE, robust = FALSE)
end_time2 <- Sys.time()
```
[1] "Execution time (Fast Robust): 3.40817880630493 seconds!"
[1] "Execution time (non-Robust): 0.718003988265991 seconds!"
Figure (Fast Robust Bootstrap)
plot(object_pert)
Fast robust bootstrap on subsamples with outliers. In green the bootstrapped mean parameter, in magenta the initial estimate parameter, in purple the true parameter. The histogram is obtained from the bootstrapped distribution while the theoretical normal distribution is in orange. The lower and upper confidence levels are respectively in red and blue.
Figure (non-Robust Bootstrap)
plot(object2_pert)
Slow non-robust bootstrap on subsamples with outliers. In green the bootstrapped mean parameter, in magenta the initial estimate parameter, in purple the true parameter. The histogram is obtained from the bootstrapped distribution while the theoretical normal distribution is in orange. The lower and upper confidence levels are respectively in red and blue.
(Fast Robust) Confidence intervals with outliers.
| theta | init | lower | mean | upper | true | bias | sigma |
|---|---|---|---|---|---|---|---|
| \hat{\beta_{1}} | 1.0589404 | 0.7889691 | 1.0553627 | 1.3078614 | 1.2629543 | -0.2075915 | 0.1292296 |
| \hat{\beta_{2}} | -0.3988854 | -0.6793477 | -0.4042600 | -0.1390852 | -0.3262334 | -0.0780266 | 0.1336887 |
| \hat{\beta_{3}} | 1.3919871 | 1.1235711 | 1.3875019 | 1.6424921 | 1.3297993 | 0.0577026 | 0.1290206 |
| \hat{\beta_{4}} | 1.3962083 | 1.1343451 | 1.3948776 | 1.6418834 | 1.2724293 | 0.1224483 | 0.1263326 |
| \hat{\beta_{5}} | 0.5257455 | 0.2716822 | 0.5219242 | 0.7844976 | 0.4146414 | 0.1072828 | 0.1311146 |
| \hat{\beta_{6}} | -1.4569341 | -1.6987144 | -1.4550449 | -1.2044396 | -1.5399500 | 0.0849052 | 0.1300482 |
| \hat{\beta_{7}} | -0.8917018 | -1.1550028 | -0.8973723 | -0.6389953 | -0.9285670 | 0.0311947 | 0.1332723 |
| \hat{\beta_{8}} | -0.1095115 | -0.3542053 | -0.1051146 | 0.1637056 | -0.2947204 | 0.1896058 | 0.1358755 |
| \hat{\sigma} | 4.0257902 | 3.6790257 | 3.9815491 | 4.2993828 | 4.0000000 | -0.0184509 | 0.1530552 |
(non-Robust) Confidence intervals with outliers.
| theta | init | lower | mean | upper | true | bias | sigma |
|---|---|---|---|---|---|---|---|
| \hat{\beta_{1}} | 1.0511706 | 0.8005770 | 1.0449652 | 1.2820813 | 1.2629543 | -0.2179891 | 0.1206155 |
| \hat{\beta_{2}} | -0.3981599 | -0.6669506 | -0.4086902 | -0.1466543 | -0.3262334 | -0.0824569 | 0.1302746 |
| \hat{\beta_{3}} | 1.3961376 | 1.1036999 | 1.3592628 | 1.6133904 | 1.3297993 | 0.0294636 | 0.1312717 |
| \hat{\beta_{4}} | 1.3954471 | 1.2360310 | 1.4701240 | 1.7206092 | 1.2724293 | 0.1976947 | 0.1168997 |
| \hat{\beta_{5}} | 0.5261062 | 0.2960187 | 0.5253167 | 0.7519219 | 0.4146414 | 0.1106753 | 0.1168807 |
| \hat{\beta_{6}} | -1.4521350 | -1.6617490 | -1.4503151 | -1.2281579 | -1.5399500 | 0.0896350 | 0.1153854 |
| \hat{\beta_{7}} | -0.8949850 | -1.1344081 | -0.8735736 | -0.5855254 | -0.9285670 | 0.0549934 | 0.1359038 |
| \hat{\beta_{8}} | -0.1095828 | -0.4125427 | -0.1510493 | 0.1253507 | -0.2947204 | 0.1436711 | 0.1358338 |
| \hat{\sigma} | 4.0048564 | 3.8003654 | 3.9915528 | 4.1610923 | 4.0000000 | -0.0084472 | 0.0916854 |
Test: is it consistent?
We generate a new random error at each step and re-estimate the linearly corrected beta parameter to show that its expected value converges to the true population value.
```r
object_consistency <- fastRobBoot(B = B, n = n, k = k, sigma = sigma,
                                  level = level, sample_perc = sample_perc, d = d,
                                  ci = "empiric", seed = seed, quiet = TRUE, consistency = TRUE)
```
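The `consistency = TRUE` routine applies the linear correction of Equation 17 to freshly generated errors at every replication. The following simplified sketch only illustrates the underlying "regenerate the errors and re-estimate" idea, using a plain robust refit instead of the corrected estimator (all names are illustrative):

```r
# Simplified illustration of the consistency check: regenerate the errors at each step,
# re-estimate, and look at the average deviation from the true beta. The actual
# fastRobBoot(consistency = TRUE) uses the linearly corrected estimator of Equation 17;
# here a plain robust refit keeps the sketch short.
consistency_check <- function(X, true_beta, sigma, B = 100){
  est <- replicate(B, {
    Y_new <- as.numeric(X %*% true_beta) + rnorm(nrow(X), sd = sigma)  # fresh errors
    coef(robustbase::lmrob(Y_new ~ X - 1))                             # robust refit
  })
  rowMeans(est) - true_beta   # average bias, expected to shrink towards zero
}
```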
Figure (Consistency Fast Robust Bootstrap)
plot(object_consistency)
Bootstrapped confidence intervals on the full sample for consistency. In green the bootstrapped mean parameter, in magenta the initial estimate parameter, in purple the true parameter. The histogram is obtained from the bootstrapped distribution while the theoretical normal distribution is in orange. The lower and upper confidence levels are respectively in red and blue.
(Fast Robust) Consistency of bootstrapped expectation.
Finally, we apply the procedure to real data. For illustration purposes we use the iris dataset, which contains only 150 observations.
```r
object_emp <- lm(Sepal.Width ~ ., data = iris)
boot <- fastRobBoot.lm(object_emp, B = 20000, level = 0.95,
                       sample_perc = 0.5, d = 4.56, quiet = TRUE)
```
Fast robust bootstrap on subsamples of real data. In green the bootstrapped mean parameter, in magenta the initial estimate parameter. The histogram is obtained from the bootstrapped distribution while the theoretical normal distribution is in orange. The lower and upper confidence levels are respectively in red and blue.
(Fast Robust) Confidence intervals on real data.
| theta | init | lower | mean | upper | sigma |
|---|---|---|---|---|---|
| \hat{\beta_{1}} | 1.8149570 | 1.1593555 | 1.7685507 | 2.3911049 | 0.3163409 |
| \hat{\beta_{2}} | 0.3289448 | 0.1920690 | 0.3442840 | 0.4940748 | 0.0777053 |
| \hat{\beta_{3}} | -0.1336001 | -0.3570081 | -0.1519162 | 0.0536134 | 0.1037093 |
| \hat{\beta_{4}} | 0.6508073 | 0.3749381 | 0.6488262 | 0.9445551 | 0.1442441 |
| \hat{\beta_{5}} | -1.2747748 | -1.6833223 | -1.2440808 | -0.7838698 | 0.2282095 |
| \hat{\beta_{6}} | -1.5776975 | -2.1674423 | -1.5282813 | -0.8621432 | 0.3308906 |
| \hat{\sigma} | 0.2822656 | 0.1829479 | 0.2398150 | 0.2965793 | 0.0292481 |
References
Salibian-Barrera, M., and R. H. Zamar. 2002. "Bootstrapping Robust Estimates of Regression." Annals of Statistics 30 (2): 556–82. https://doi.org/10.1214/aos/1021379865.
Salibián-Barrera, Matías, Stefan Van Aelst, and Gert Willems. 2008. "Fast and Robust Bootstrap." Statistical Methods & Applications 17 (1): 41–71. https://doi.org/10.1007/s10260-007-0048-6.