R 語言與 Bootstrap

Pair, Residual, Wild and Bolck Bootstrap

Sun-Ting Lee

Introduction

於此節，我們將介紹如何使用R內建語法語言執行各類 Bootstrap。此外，於 R 中亦有套件 boot 可執行 Bootstrap。大綱如下：

Basic Bootstrap
Pair Bootstrap
Residual Bootstrap
Wild Bootstrap
Bolck Bootstrap

Basic Bootstrap - Introduction

n <- 100
b <- 1000
rep <- 5000
...
x <- rnorm(n,0,2)
sample_mean <- mean(x)

STEP1: 假設母體分配為常態(0,2)，並於母體分配中抽取 \(n=100\) 個樣本。我們想要以樣本平均 \(\hat{\mu}_{n}\) 推估母體平均 \(\mu_{0}\) 的區間與檢定。

STEP2: 使用R內建函數 sample() ，執行 \(b = 1000\) 次重複抽樣。計算 \(\sqrt{n}(\hat{\mu}_{n,b}^{*}-\hat{\mu}_{n})\) ，並使用 sort() 將其排序:

\[ \sqrt{n}(\hat{\mu}_{n,(1)}^{*} - \hat{\mu}_{n}) < \sqrt{n}(\hat{\mu}_{n,(2)}^{*} - \hat{\mu}_{n}) < ... < \sqrt{n}(\hat{\mu}_{n,(B)}^{*} - \hat{\mu}_{n}) \]

Basic Bootstrap - Introduction

STEP3: 給定 \(\alpha = 0.05\) 下，使用函數 quantile() 尋找重抽後統計量分配中第 \(2.5\) 與 \(97.5\) 百分位數，即 \(q^{*}_{0.025}\) 與 \(q^{*}_{0.975}\) ，同時估計信賴區間。

STEP4: 計算 test statistic \(= \sqrt{n}(\hat{\mu}_{n} - \mu_{H0})\) ，拒絕規則為： \[ \sqrt{n}(\hat{\mu}_{n} - \mu_{H0}) \not\in [q^{*}_{0.025}, q^{*}_{0.975} ] \]

Basic Bootstrap - Introduction

n <- 100
b <- 1000
rep <- 5000

於此模擬中，須使用兩組迴圈 for() ，第一組迴圈( k )控制模擬次數 rep 、第二組一組( l )則用以執行重複抽樣。同時於此範例中，我們要模擬test size與power。

Basic Bootstrap - Introduction

reject1_num <- 0
reject2_num <- 0
## 控制rep (控制模擬次數)
  for(k in 1:rep){
    x <- rnorm(n,0,3)
    sample_mean <- mean(x)
    resample_distribution <- rep(0,n)
## 執行重複抽樣
      for(l in 1:b){
        ...

for 迴圈部分可以參考r講義「6.流程控制與自定函數」。

Basic Bootstrap - Construct Data

## 設定size與power之拒絕次數初始值

    reject1_num <- 0
    reject2_num <- 0
## 控制rep (控制模擬次數)

    for(k in 1:rep){

      x <- rnorm(n,0,3) # 建造樣本分配 
      sample_mean <- mean(x) # 計算樣本統計量 
      re_dis <- rep(0,b) # 製造空向量，以放入重抽之統計量分配

Basic Bootstrap - Resampled distribution

於第二個迴圈 l 中，我們使用函數 sample 執行重複抽樣的動作，並將重複抽樣之統計量分配填入 re_dis 中。

## Construct resample statistic distribution
      for(l in 1:b){
        resample_mean <- mean(sample(x, n, replace = T))
        re_dis[l] <- sqrt(n)*(resample_mean - sample_mean)  
      }

   re_dis <- sort(re_dis) # 將向量由小到大重新排序

sample(x, i, replace = T) 輸出結果為一向量。其中 x 為被抽樣的資料、 i 為抽樣次數、 replace = T 代表抽後放回。

Basic Bootstrap - Resampled distribution

    sampleMatrix <- replicate(b, sample(x, i, replace = T))
    resample_statisitc <- colMeans(sampleMatrix)
    re_dis <- sqrt(i)*(resample_statisitc - sample_mean)

其中，指令 replicate 代表重複 sample(x, i, replace = T) b次，並將由行填入 sampleMatrix 中。
colMeans 用以計算每一行的平均數，並輸出一向量。

Basic Bootstrap - Resample Statistic Distribution

resample_distribution 向量於重排後，依序對應講義中 \[ \begin{align*} & \sqrt{n}(\hat{\mu}_{n,(1)}^{*} - \hat{\mu}_{n})\\ & \sqrt{n}(\hat{\mu}_{n,(2)}^{*} - \hat{\mu}_{n})\\ &\sqrt{n}(\hat{\mu}_{n,(3)}^{*} - \hat{\mu}_{n})\\ & \;\;\;\;\;\;\;\;\;\;\;\vdots\\ & \sqrt{n}(\hat{\mu}_{n,(B-1)}^{*} - \hat{\mu}_{n})\\ & \sqrt{n}(\hat{\mu}_{n,(B)}^{*} - \hat{\mu}_{n}) \end{align*} \]

Basic Bootstrap - 計算 Test Size 與 Power

接著，我們使用 quantile(re_dis,0.025) 尋找檢定區間上下屆。並執行兩種檢定，分別計算 Test Size 與 Power:

## Hypothesis Testing H_{0} : mean = 0 (Caculate Test Size)
t_s1 <- sqrt(i)*(sample_mean - 0)

reject_rule <- test1_statistic < quantile(re_dis,0.025) | t_s1 > quantile(re_dis,0.975)

reject1_num <- reject1_num + reject_rule

## Hypothesis Testing H_{0} : mean = 1 (Caculate Power)
t_s2 <- sqrt(i)*(sample_mean - 1)

reject_rule <- t_s2 < quantile(re_dis,0.025) | t_2 > quantile(re_dis,0.975)

reject2_num <- reject2_num + reject_rule

邏輯運算部分指令可參考r講義「2.資料屬性與常見運算」。

Basic Bootstrap - 輸入模擬結果

最後，我們將結果存於資料表 result 中相對位置：資料選取指令可參考R講義「5.資料輸入整理與繪圖」。

 ...
 }
    result1 <- reject1_num / rep
    result2 <- reject2_num / rep

Basic Bootstrap - Result

我們使用 ggplot2 繪圖，並列出此模擬結果如下：

Pair bootstrap - Introduction

\[ \begin{align*} &\mbox{Sample}\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\mbox{Resample}\\ & (y_{1}, x_{1})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{1}, x^{*}_{1}) = (y_{2},x_{2})\\ & (y_{2}, x_{2})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{2}, x^{*}_{2}) = (y_{4},x_{4})\\ & (y_{3}, x_{3})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{3}, x^{*}_{3}) = (y_{7},x_{7})\\ & (y_{4}, x_{4})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{4}, x^{*}_{4}) = (y_{1},x_{1})\\ & \;\;\vdots\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\Rightarrow\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\vdots\\ & (y_{n}, x_{n})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{n}, x^{*}_{n}) = (y_{7}, x_{7}) \end{align*} \]

於 Pair bootstrap中，其模擬流程大致同於前敘。其關鍵差別如下：

於重抽過程中，我們使用函數 sample() 在 c(1:n) 重複抽取位置變數
並將位置變數帶入 x 與 y 矩陣中。抓取相對應位置的資料。矩陣位置提取可參考r講義「3.矩陣與因子」。

Pair bootstrap - Introduction

於此模擬中，我們僅關心當樣本數變大將如何影響檢定效果。故我們可以給定重抽數量b不變。詳細過程如下：

n <- seq(25,500,by=25)
b <- 1000
rep <- 5000
test_size1 <- rep(0,length(n))

result <-data.frame(sample.size = n,
                    Methods = rep(c("Pair_Bootstrap"),each=length(n)),
                    test.size = rep(0,length(n)),
                    power = rep(0,length(n)))

Pair bootstrap - Data Construction

我們假設母體迴歸如下: \[ y_{i} = -2 + 0.1x_{1,i} + 2x_{2,i} + \varepsilon_{i} \]

其中 \(\{ x_{1},x_{2},\varepsilon \}_{i=1}^{n}\) 為\(i.i.d\) 且服從常態分配。

for(i in n){
  reject_num1 <- 0 #錯誤拒絕次數
  reject_num2 <- 0 #正確拒絕次數
  for(j in 1:rep){
##### Data Description:
    x <- cbind(1,matrix(rnorm(2*i,0,5), i, 2))
    y <- x %*% c(-2,0.1,2) + rnorm(i,0,3)
    b_hat <- solve(t(x) %*% x) %*% t(x) %*% y
    residuals <- y - x %*% b_hat
    re_beta <- matrix(0,b,3) # 製造空矩陣，以便填入重抽之beta_hat
    re_dis <- matrix(0,b,3) # 製造空矩陣，以便填入重抽之檢定統計量

Pair bootstrap - Construct Resample Statistic Distribution

for(k in 1:b){
  index <- sample(1:i, i, replace = T) # 重抽位置變數

### Pair Bootstrap 
  resample_y <- y[index] #選取相對位置之y資料
  resample_x <- x[index,] #選取相對位置之x資料

### 計算重抽的beta_hat    
  re_beta[k,] <- solve(t(resample_x) %*% resample_x) %*% t(resample_x) %*% resample_y

### 計算重抽之檢定統計量  
  re_dis[k,] <- sqrt(i)*(re_beta[k,] - b_hat)

提取相應位置之資料相關指令可參考r講義「2.資料屬性與常見運算」與「3.矩陣與因子」。

Pair bootstrap - Simulate Test Size

於此我們模擬Test size，設定檢定規則如下( \(\alpha = 0.05\) )： \[ H_{0}: \beta_{2} = 2,\;\;\;H_{1}: \beta_{2} \neq 2 \]

### Hypothesis Testing H_{0} : beta_2 = 2
t_s <- sqrt(i)*(b_hat[3] - 2)
reject_rule1 <- (t_s < quantile(re_dis[,3],0.025) | t_s > quantile(re_dis[,3],0.975))   
reject_num1 <- reject_num1 + reject_rule1

Pair bootstrap - Simulate Power

於此我們模擬Power，設定檢定規則如下( \(\alpha = 0.05\) )： \[ H_{0}: \beta_{1} = 0,\;\;\;H_{1}: \beta_{1} \neq 0 \]

### Hypothesis Testing H_{0} : beta_2 = 2
t_s <- sqrt(i)*(b_hat[2] - 0)
reject_rule2 <- (t_s < quantile(re_dis[,2],0.025) | t_s > quantile(re_dis[,3],0.975))   
reject_num2 <- reject_num2 + reject_rule2

Residual Bootstrap - Introduction

1.Sample \[ \begin{align*} & (y_{1}, x_{1})\\ & (y_{2}, x_{2})\\ & \;\;\;\;\vdots\;\;\;\;\;\;\;\;\;\;\;\;\; \Rightarrow \;\; y_{i} = x_{i}\hat{\beta}_{n} + \hat{e}_{i}\\ & (y_{n}, x_{n}) \end{align*} \] 2.Resample \[ \begin{align*} & \hat{e}_{1}^{*} = \hat{e}_{3}\;\;\;\;\;\Rightarrow\;\;\;\;\;\;\; y^{*}_{1} = x_{1}\hat{\beta}_{n} + \hat{e}_{1}^{*} \;\;\;\;\;\Rightarrow (y_{1}^{*}, x_{1})\\ & \hat{e}_{2}^{*} = \hat{e}_{3}\;\;\;\;\;\Rightarrow\;\;\;\;\;\;\; y^{*}_{2} = x_{2}\hat{\beta}_{n} + \hat{e}_{2}^{*} \;\;\;\;\;\Rightarrow (y_{2}^{*}, x_{2})\\ & \;\;\;\;\vdots \\ & \hat{e}_{n}^{*} = \hat{e}_{3}\;\;\;\;\;\Rightarrow\;\;\;\;\;\;\; y^{*}_{n} = x_{n}\hat{\beta}_{n} + \hat{e}_{n}^{*} \;\;\;\;\;\Rightarrow (y_{n}^{*}, x_{n}) \end{align*} \]

Residual Bootstrap Introduction

將重抽的位置變數代入sample-residuals中。
給定樣本資料 \(x\) 與樣本統計量 \(\hat{\beta}_{n}\) ，計算 \(y_{i}^{*}\) 。
由 \(x\) 與 \(y_{i}^{*}\) 計算重抽樣本統計量 \(\hat{\beta}_{n}^{*}\) 。

Residual Bootstrap - Resample

  for(j in 1:rep){
##### Data Description:
    x <- cbind(1,matrix(rnorm(2*i,0,5), i, 2))
    y <- x %*% c(-2,0.1,2) + rnorm(i,0,3)
    b_hat <- solve(t(x) %*% x) %*% t(x) %*% y
    residuals <- y - x %*% b_hat
##### Resample
    res_beta <- matrix(0,b,3)
    re_distribution <- matrix(0,b,3)
    for(k in 1:b){
      index <- sample(1:i, i, replace = T)
      resample_y <- x %*%  b_hat + residuals[index]
      re_beta[k,] <-  solve(t(x) %*% x) %*% t(x) %*% resample_y
      re_dis[k,] <- sqrt(i)*(re_beta[k,] - b_hat)

其餘部分皆同於 Pair bootstrap。

Residual Bootstrap - Result

由課堂上可得，Pair Bootstrap 所得之 \(\hat{\beta}_{n,b}^{*}\) 非為 \(\hat{\beta}_{n}\) 一致估計量。而給定 \(\varepsilon_{i} |x\) 為 \(i.i.d\) 下， Residual Bootstrap 可得 \(\hat{\beta}_{n}\) 之一致估計量。故在此我們比較兩種方法之 Test Size 與 Power 。

Wild bootstrap - Introduction

let \(\{u_{i}\}_{i=1}^{n} \in \{ \frac{1-\sqrt{5}}{2}, \frac{1 + \sqrt{5}}{2} \}\) with probability \(\{ \frac{1+\sqrt{5}}{2\sqrt{5}}, \frac{1 + \sqrt{5}}{2\sqrt{5}} \}\), which is \(i.i.d\).
\(\hat{e}_{i,b}^{*} = \hat{e}_{i} u_{i,b}\)
With \(y^{*}_{i,b} = x_{i}\hat{\beta}_{n} + \hat{e}_{i,b}^{*} = x_{i}\hat{\beta}_{n} + \hat{e}_{i} u_{i,b}\), the bootstrap sample are \(\{ (y_{i,b}^{*}, x_{i}) \}_{i=1}^{n}\)

由上可知，執行Wild boostrap時，我們僅需重抽 \(u_{i}\) ，並將其依序乘入 \(\hat{e}\) 即可得 \(\{ (y_{i,b}^{*}, x_{i}) \}_{i=1}^{n}\) 。

Wild bootstrap - Construct heteroskedasticity Data

n <- 100
b <- 1000
rep <- 1000
reject_num <- 0

for(i in 1:rep){
## Construct heteroskedasticity Data:
    x <- cbind(1,matrix(rnorm(2*n,0,5), n, 2))
    eps <- rep(0,n)
    for(k in 1:n){
    eps[k] <- rnorm(1,0,sqrt(x[3,i]))
    }
    y <- x %*% c(-2,2,10) + eps
    b_hat <- solve(t(x) %*% x) %*% t(x) %*% y
    residuals <- y - x %*% b_hat
...

Wild bootstrap - Resample

re_beta <- rep(0,b)
for(l in 1:b){
### Wild Bootstrap      
  u <- c( (1-sqrt(5))/2, (1+sqrt(5))/2)
  resample_u <- sample(u,n,replace = T)
  resample_eps <- residuals * resample_u
  resample_y <- x %*% b_hat + resample_eps
  re_beta[l] <- solve(t(x) %*% x) %*% t(x) %*% resample_y     
}

re_beta <- sort(re_beta)

由此，我們可建構 \(\{ \hat{\beta}_{n,b} \}_{b=1}^{B}\) 之 empritical distribution 。其檢定與估計同上。

Some idea about Bolck Bootstrap - Data Costruction

於此節，我們將簡述如何在一組樣本中，一次抽取一整個 bolck 。

## Data Costruction

n <- 100
b <- 1000
rho <- 0.8
y <- rep(0,n+1)

for(i in 1:n){
  y[i+1] <- rho*y[i] + rnorm(1,0,1)
}

y <- y[-1]

Some idea about Bolck Bootstrap - Resample

若我們假設每一個 bolck 固定為 \(10\) 個元素，則我們每一次能重抽的次數為 \(10\) 次，資料位置的範圍為 c(1:91) 。

index <- sample(c(1:91),10,replace = T)
resample_y <- rep(0,n)
for(j in 0:9){
  resample_y[(10*j + 1) : (10 * (j + 1))] <- y[index[j+1] : (index[j+1] + 9)]
}