於此節,我們將介紹如何使用R內建語法語言執行各類 Bootstrap。此外,於 R 中亦有套件 boot 可執行 Bootstrap。大綱如下:
- Basic Bootstrap
- Pair Bootstrap
- Residual Bootstrap
- Wild Bootstrap
- Bolck Bootstrap
Sun-Ting Lee
於此節,我們將介紹如何使用R內建語法語言執行各類 Bootstrap。此外,於 R 中亦有套件 boot 可執行 Bootstrap。大綱如下:
n <- 100
b <- 1000
rep <- 5000
...
x <- rnorm(n,0,2)
sample_mean <- mean(x)
STEP1: 假設母體分配為常態(0,2),並於母體分配中抽取 \(n=100\) 個樣本。我們想要以樣本平均 \(\hat{\mu}_{n}\) 推估母體平均 \(\mu_{0}\) 的區間與檢定。
STEP2: 使用R內建函數 sample() ,執行 \(b = 1000\) 次重複抽樣。計算 \(\sqrt{n}(\hat{\mu}_{n,b}^{*}-\hat{\mu}_{n})\) ,並使用 sort() 將其排序:
\[ \sqrt{n}(\hat{\mu}_{n,(1)}^{*} - \hat{\mu}_{n}) < \sqrt{n}(\hat{\mu}_{n,(2)}^{*} - \hat{\mu}_{n}) < ... < \sqrt{n}(\hat{\mu}_{n,(B)}^{*} - \hat{\mu}_{n}) \]
STEP3: 給定 \(\alpha = 0.05\) 下,使用函數 quantile() 尋找重抽後統計量分配中第 \(2.5\) 與 \(97.5\) 百分位數,即 \(q^{*}_{0.025}\) 與 \(q^{*}_{0.975}\) ,同時估計信賴區間。
STEP4: 計算 test statistic \(= \sqrt{n}(\hat{\mu}_{n} - \mu_{H0})\) ,拒絕規則為: \[ \sqrt{n}(\hat{\mu}_{n} - \mu_{H0}) \not\in [q^{*}_{0.025}, q^{*}_{0.975} ] \]
n <- 100
b <- 1000
rep <- 5000
於此模擬中,須使用兩組迴圈 for() ,第一組迴圈( k )控制模擬次數 rep 、第二組一組( l )則用以執行重複抽樣。同時於此範例中,我們要模擬test size與power。
reject1_num <- 0
reject2_num <- 0
## 控制rep (控制模擬次數)
for(k in 1:rep){
x <- rnorm(n,0,3)
sample_mean <- mean(x)
resample_distribution <- rep(0,n)
## 執行重複抽樣
for(l in 1:b){
...
for 迴圈部分可以參考r講義「6.流程控制與自定函數」。## 設定size與power之拒絕次數初始值
reject1_num <- 0
reject2_num <- 0
## 控制rep (控制模擬次數)
for(k in 1:rep){
x <- rnorm(n,0,3) # 建造樣本分配
sample_mean <- mean(x) # 計算樣本統計量
re_dis <- rep(0,b) # 製造空向量,以放入重抽之統計量分配
於第二個迴圈 l 中,我們使用函數 sample 執行重複抽樣的動作,並將重複抽樣之統計量分配填入 re_dis 中。
## Construct resample statistic distribution
for(l in 1:b){
resample_mean <- mean(sample(x, n, replace = T))
re_dis[l] <- sqrt(n)*(resample_mean - sample_mean)
}
re_dis <- sort(re_dis) # 將向量由小到大重新排序
sample(x, i, replace = T) 輸出結果為一向量。其中 x 為被抽樣的資料、 i 為抽樣次數、 replace = T 代表抽後放回。 sampleMatrix <- replicate(b, sample(x, i, replace = T))
resample_statisitc <- colMeans(sampleMatrix)
re_dis <- sqrt(i)*(resample_statisitc - sample_mean)
replicate 代表重複 sample(x, i, replace = T) b次,並將由行填入 sampleMatrix 中。colMeans 用以計算每一行的平均數,並輸出一向量。resample_distribution 向量於重排後,依序對應講義中
\[
\begin{align*}
& \sqrt{n}(\hat{\mu}_{n,(1)}^{*} - \hat{\mu}_{n})\\
& \sqrt{n}(\hat{\mu}_{n,(2)}^{*} - \hat{\mu}_{n})\\
&\sqrt{n}(\hat{\mu}_{n,(3)}^{*} - \hat{\mu}_{n})\\
& \;\;\;\;\;\;\;\;\;\;\;\vdots\\
& \sqrt{n}(\hat{\mu}_{n,(B-1)}^{*} - \hat{\mu}_{n})\\
& \sqrt{n}(\hat{\mu}_{n,(B)}^{*} - \hat{\mu}_{n})
\end{align*}
\]
接著,我們使用 quantile(re_dis,0.025) 尋找檢定區間上下屆。並執行兩種檢定,分別計算 Test Size 與 Power:
## Hypothesis Testing H_{0} : mean = 0 (Caculate Test Size)
t_s1 <- sqrt(i)*(sample_mean - 0)
reject_rule <- test1_statistic < quantile(re_dis,0.025) | t_s1 > quantile(re_dis,0.975)
reject1_num <- reject1_num + reject_rule
## Hypothesis Testing H_{0} : mean = 1 (Caculate Power)
t_s2 <- sqrt(i)*(sample_mean - 1)
reject_rule <- t_s2 < quantile(re_dis,0.025) | t_2 > quantile(re_dis,0.975)
reject2_num <- reject2_num + reject_rule
最後,我們將結果存於資料表 result 中相對位置:資料選取指令可參考R講義「5.資料輸入整理與繪圖」。
...
}
result1 <- reject1_num / rep
result2 <- reject2_num / rep
我們使用 ggplot2 繪圖,並列出此模擬結果如下:

\[ \begin{align*} &\mbox{Sample}\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\mbox{Resample}\\ & (y_{1}, x_{1})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{1}, x^{*}_{1}) = (y_{2},x_{2})\\ & (y_{2}, x_{2})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{2}, x^{*}_{2}) = (y_{4},x_{4})\\ & (y_{3}, x_{3})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{3}, x^{*}_{3}) = (y_{7},x_{7})\\ & (y_{4}, x_{4})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{4}, x^{*}_{4}) = (y_{1},x_{1})\\ & \;\;\vdots\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\Rightarrow\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\vdots\\ & (y_{n}, x_{n})\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(y^{*}_{n}, x^{*}_{n}) = (y_{7}, x_{7}) \end{align*} \]
於 Pair bootstrap中,其模擬流程大致同於前敘。其關鍵差別如下:
於重抽過程中,我們使用函數 sample() 在 c(1:n) 重複抽取位置變數
並將位置變數帶入 x 與 y 矩陣中。抓取相對應位置的資料。矩陣位置提取可參考r講義「3.矩陣與因子」。
於此模擬中,我們僅關心當樣本數變大將如何影響檢定效果。故我們可以給定重抽數量b不變。詳細過程如下:
n <- seq(25,500,by=25)
b <- 1000
rep <- 5000
test_size1 <- rep(0,length(n))
result <-data.frame(sample.size = n,
Methods = rep(c("Pair_Bootstrap"),each=length(n)),
test.size = rep(0,length(n)),
power = rep(0,length(n)))
我們假設母體迴歸如下: \[ y_{i} = -2 + 0.1x_{1,i} + 2x_{2,i} + \varepsilon_{i} \]
其中 \(\{ x_{1},x_{2},\varepsilon \}_{i=1}^{n}\) 為\(i.i.d\) 且服從常態分配。
for(i in n){
reject_num1 <- 0 #錯誤拒絕次數
reject_num2 <- 0 #正確拒絕次數
for(j in 1:rep){
##### Data Description:
x <- cbind(1,matrix(rnorm(2*i,0,5), i, 2))
y <- x %*% c(-2,0.1,2) + rnorm(i,0,3)
b_hat <- solve(t(x) %*% x) %*% t(x) %*% y
residuals <- y - x %*% b_hat
re_beta <- matrix(0,b,3) # 製造空矩陣,以便填入重抽之beta_hat
re_dis <- matrix(0,b,3) # 製造空矩陣,以便填入重抽之檢定統計量
for(k in 1:b){
index <- sample(1:i, i, replace = T) # 重抽位置變數
### Pair Bootstrap
resample_y <- y[index] #選取相對位置之y資料
resample_x <- x[index,] #選取相對位置之x資料
### 計算重抽的beta_hat
re_beta[k,] <- solve(t(resample_x) %*% resample_x) %*% t(resample_x) %*% resample_y
### 計算重抽之檢定統計量
re_dis[k,] <- sqrt(i)*(re_beta[k,] - b_hat)
於此我們模擬Test size,設定檢定規則如下( \(\alpha = 0.05\) ): \[ H_{0}: \beta_{2} = 2,\;\;\;H_{1}: \beta_{2} \neq 2 \]
### Hypothesis Testing H_{0} : beta_2 = 2
t_s <- sqrt(i)*(b_hat[3] - 2)
reject_rule1 <- (t_s < quantile(re_dis[,3],0.025) | t_s > quantile(re_dis[,3],0.975))
reject_num1 <- reject_num1 + reject_rule1
於此我們模擬Power,設定檢定規則如下( \(\alpha = 0.05\) ): \[ H_{0}: \beta_{1} = 0,\;\;\;H_{1}: \beta_{1} \neq 0 \]
### Hypothesis Testing H_{0} : beta_2 = 2
t_s <- sqrt(i)*(b_hat[2] - 0)
reject_rule2 <- (t_s < quantile(re_dis[,2],0.025) | t_s > quantile(re_dis[,3],0.975))
reject_num2 <- reject_num2 + reject_rule2
1.Sample \[ \begin{align*} & (y_{1}, x_{1})\\ & (y_{2}, x_{2})\\ & \;\;\;\;\vdots\;\;\;\;\;\;\;\;\;\;\;\;\; \Rightarrow \;\; y_{i} = x_{i}\hat{\beta}_{n} + \hat{e}_{i}\\ & (y_{n}, x_{n}) \end{align*} \] 2.Resample \[ \begin{align*} & \hat{e}_{1}^{*} = \hat{e}_{3}\;\;\;\;\;\Rightarrow\;\;\;\;\;\;\; y^{*}_{1} = x_{1}\hat{\beta}_{n} + \hat{e}_{1}^{*} \;\;\;\;\;\Rightarrow (y_{1}^{*}, x_{1})\\ & \hat{e}_{2}^{*} = \hat{e}_{3}\;\;\;\;\;\Rightarrow\;\;\;\;\;\;\; y^{*}_{2} = x_{2}\hat{\beta}_{n} + \hat{e}_{2}^{*} \;\;\;\;\;\Rightarrow (y_{2}^{*}, x_{2})\\ & \;\;\;\;\vdots \\ & \hat{e}_{n}^{*} = \hat{e}_{3}\;\;\;\;\;\Rightarrow\;\;\;\;\;\;\; y^{*}_{n} = x_{n}\hat{\beta}_{n} + \hat{e}_{n}^{*} \;\;\;\;\;\Rightarrow (y_{n}^{*}, x_{n}) \end{align*} \]
將重抽的位置變數代入sample-residuals中。
給定樣本資料 \(x\) 與樣本統計量 \(\hat{\beta}_{n}\) ,計算 \(y_{i}^{*}\) 。
由 \(x\) 與 \(y_{i}^{*}\) 計算 重抽樣本統計量 \(\hat{\beta}_{n}^{*}\) 。
for(j in 1:rep){
##### Data Description:
x <- cbind(1,matrix(rnorm(2*i,0,5), i, 2))
y <- x %*% c(-2,0.1,2) + rnorm(i,0,3)
b_hat <- solve(t(x) %*% x) %*% t(x) %*% y
residuals <- y - x %*% b_hat
##### Resample
res_beta <- matrix(0,b,3)
re_distribution <- matrix(0,b,3)
for(k in 1:b){
index <- sample(1:i, i, replace = T)
resample_y <- x %*% b_hat + residuals[index]
re_beta[k,] <- solve(t(x) %*% x) %*% t(x) %*% resample_y
re_dis[k,] <- sqrt(i)*(re_beta[k,] - b_hat)
其餘部分皆同於 Pair bootstrap。
由課堂上可得,Pair Bootstrap 所得之 \(\hat{\beta}_{n,b}^{*}\) 非為 \(\hat{\beta}_{n}\) 一致估計量。而給定 \(\varepsilon_{i} |x\) 為 \(i.i.d\) 下, Residual Bootstrap 可得 \(\hat{\beta}_{n}\) 之一致估計量。故在此我們比較兩種方法之 Test Size 與 Power 。

let \(\{u_{i}\}_{i=1}^{n} \in \{ \frac{1-\sqrt{5}}{2}, \frac{1 + \sqrt{5}}{2} \}\) with probability \(\{ \frac{1+\sqrt{5}}{2\sqrt{5}}, \frac{1 + \sqrt{5}}{2\sqrt{5}} \}\), which is \(i.i.d\).
\(\hat{e}_{i,b}^{*} = \hat{e}_{i} u_{i,b}\)
With \(y^{*}_{i,b} = x_{i}\hat{\beta}_{n} + \hat{e}_{i,b}^{*} = x_{i}\hat{\beta}_{n} + \hat{e}_{i} u_{i,b}\), the bootstrap sample are \(\{ (y_{i,b}^{*}, x_{i}) \}_{i=1}^{n}\)
由上可知,執行Wild boostrap時,我們僅需重抽 \(u_{i}\) ,並將其依序乘入 \(\hat{e}\) 即可得 \(\{ (y_{i,b}^{*}, x_{i}) \}_{i=1}^{n}\) 。
n <- 100
b <- 1000
rep <- 1000
reject_num <- 0
for(i in 1:rep){
## Construct heteroskedasticity Data:
x <- cbind(1,matrix(rnorm(2*n,0,5), n, 2))
eps <- rep(0,n)
for(k in 1:n){
eps[k] <- rnorm(1,0,sqrt(x[3,i]))
}
y <- x %*% c(-2,2,10) + eps
b_hat <- solve(t(x) %*% x) %*% t(x) %*% y
residuals <- y - x %*% b_hat
...
re_beta <- rep(0,b)
for(l in 1:b){
### Wild Bootstrap
u <- c( (1-sqrt(5))/2, (1+sqrt(5))/2)
resample_u <- sample(u,n,replace = T)
resample_eps <- residuals * resample_u
resample_y <- x %*% b_hat + resample_eps
re_beta[l] <- solve(t(x) %*% x) %*% t(x) %*% resample_y
}
re_beta <- sort(re_beta)
由此,我們可建構 \(\{ \hat{\beta}_{n,b} \}_{b=1}^{B}\) 之 empritical distribution 。其檢定與估計同上。
於此節,我們將簡述如何在一組樣本中,一次抽取一整個 bolck 。
## Data Costruction
n <- 100
b <- 1000
rho <- 0.8
y <- rep(0,n+1)
for(i in 1:n){
y[i+1] <- rho*y[i] + rnorm(1,0,1)
}
y <- y[-1]
若我們假設每一個 bolck 固定為 \(10\) 個元素,則我們每一次能重抽的次數為 \(10\) 次,資料位置的範圍為 c(1:91) 。
index <- sample(c(1:91),10,replace = T)
resample_y <- rep(0,n)
for(j in 0:9){
resample_y[(10*j + 1) : (10 * (j + 1))] <- y[index[j+1] : (index[j+1] + 9)]
}