1 Question (a): Bootstrap Sampling Distribution of \(\widehat{E[X]}\)

To estimate the sampling distribution of \(\widehat{E[X]}\), we apply the bootstrap procedure. The key idea is to treat the observed sample as an approximation of the population and repeatedly resample from it with replacement.

# Input data (Pb concentrations)
data <- c(
  0.85, 1.23, 0.92, 3.45, 2.11, 1.56, 4.89, 2.34, 1.78, 6.72, 0.95, 1.34, 8.91,
  2.67, 1.89, 5.43, 1.12, 3.78, 2.45, 7.65, 1.05, 1.45, 12.34, 2.89, 2.01, 4.56,
  1.23, 4.32, 2.67, 9.87, 0.99, 1.56, 15.23, 3.12, 2.34, 3.89, 1.34, 5.67, 2.89,
  11.45, 1.12, 1.67, 18.90, 3.45, 2.56, 3.45, 1.45, 6.78, 3.12, 14.56, 1.23, 1.78,
  22.34, 3.78, 2.78
)

# Number of bootstrap samples
B <- 5000
n <- length(data)

# Function to compute E[X] using log-normal MLE
estimate_mean_lognormal <- function(x) {
  log_x <- log(x)
  mu_hat <- mean(log_x)
  sigma2_hat <- mean((log_x - mu_hat)^2)
  
  # Plug-in estimator for E[X]
  exp(mu_hat + sigma2_hat / 2)
}

# Store bootstrap estimates
boot_estimates <- numeric(B)

set.seed(123)  # for reproducibility

# Bootstrap loop
for (b in 1:B) {
  sample_b <- sample(data, size = n, replace = TRUE)  # resample with replacement
  boot_estimates[b] <- estimate_mean_lognormal(sample_b)
}

# Plot histogram with density scaling
hist(boot_estimates,
     breaks = 30,
     probability = TRUE,   # ensures histogram is on density scale
     main = "Bootstrap Distribution of E[X]",
     xlab = "Bootstrap Estimates",
     border = "black")

# Add smooth density curve on top
lines(density(boot_estimates),
      lwd = 2)  # thicker line for visibility

The resulting bootstrap distribution appears approximately unimodal and roughly symmetric, with a slight right skew. The peak of the distribution is centered around values between 4 and 5, indicating that most bootstrap estimates of \(\widehat{E[X]}\) fall within this range.

In terms of variability, the distribution shows moderate spread, suggesting a reasonable level of uncertainty in estimating \(E[X]\). While there is a mild right tail extending toward larger values, extreme values are relatively infrequent, indicating that the estimator is fairly stable across resamples.

Overall, the bootstrap sampling distribution is well-behaved, with no severe skewness or irregular patterns, supporting the reliability of the bootstrap procedure in this setting.

2 Question (b) 95% Percentile Bootstrap Confidence Interval for \(\mu_{LN} = E[X]\)

The percentile bootstrap confidence interval is constructed directly from the empirical distribution of the bootstrap estimates. This method does not impose any assumptions about symmetry or normality and instead uses the observed distribution of \(\widehat{E[X]}\).

After generating \(B = 5000\) bootstrap estimates, we sort these values in ascending order. The 95% confidence interval is then obtained by selecting the empirical quantiles corresponding to the lower and upper tails of the distribution.

Specifically, for a \((1 - \alpha) = 0.95\) confidence level, we take:

  • the \(2.5%\) percentile as the lower bound
  • the \(97.5%\) percentile as the upper bound

This yields the interval: [ ]

Conceptually, this interval represents the central 95% of the bootstrap distribution, meaning that 95% of the resampled estimates fall within this range. Since this method relies entirely on the empirical distribution, it naturally adapts to any skewness or asymmetry present in the data.

Thus, the percentile bootstrap confidence interval provides a simple and intuitive estimate of uncertainty for \(\mu_{LN} = E[X]\) based on the observed data.

# --- Manual Percentile CI ---
alpha <- 0.05

lower_perc <- quantile(boot_estimates, probs = alpha / 2)
upper_perc <- quantile(boot_estimates, probs = 1 - alpha / 2)

ci_percentile <- c(lower = lower_perc, upper = upper_perc)
ci_percentile
 lower.2.5% upper.97.5% 
   3.093880    5.643249 
# --- Using boot package for verification ---
library(boot)

# Define statistic function for boot()
boot_stat <- function(data, indices) {
  sample_data <- data[indices]
  
  log_x <- log(sample_data)
  mu_hat <- mean(log_x)
  sigma2_hat <- mean((log_x - mu_hat)^2)
  
  # Return E[X] estimator
  exp(mu_hat + sigma2_hat / 2)
}

set.seed(123)

boot_result <- boot(data, statistic = boot_stat, R = 5000)

# Percentile CI using boot package
boot_ci <- boot.ci(boot_result, type = "perc")

boot_ci
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot_result, type = "perc")

Intervals : 
Level     Percentile     
95%   ( 3.097,  5.598 )  
Calculations and Intervals on Original Scale

3 Question (c) 95% BCa Bootstrap Confidence Interval for \(\mu_{LN} = E[X]\)

The bias-corrected and accelerated (BCa) bootstrap confidence interval improves upon the percentile method by adjusting for both bias and skewness in the bootstrap distribution.

Conceptually, the BCa method modifies the percentile cutoffs based on two factors:

  • A bias correction term, which accounts for whether the bootstrap distribution is centered around the original estimate
  • An acceleration term, which adjusts for skewness by measuring how sensitive the estimator is to individual observations

These adjustments lead to modified percentile levels, which are then used to determine the confidence interval from the bootstrap distribution.

Compared to the percentile interval, the BCa interval may exhibit slight asymmetry, reflecting adjustments for bias and skewness in the bootstrap distribution. This makes the BCa method more reliable, particularly when the distribution is not perfectly symmetric.

# --- Manual BCa CI ---

alpha <- 0.05
B <- length(boot_estimates)

# Original estimate
theta_hat <- estimate_mean_lognormal(data)

# ---- Bias correction (z0) ----
prop_less <- mean(boot_estimates < theta_hat)
z0 <- qnorm(prop_less)

# ---- Jackknife for acceleration ----
n <- length(data)
jack_vals <- numeric(n)

for (i in 1:n) {
  jack_sample <- data[-i]  # leave-one-out
  jack_vals[i] <- estimate_mean_lognormal(jack_sample)
}

theta_jack_mean <- mean(jack_vals)

numerator <- sum((theta_jack_mean - jack_vals)^3)
denominator <- sum((theta_jack_mean - jack_vals)^2)

a_hat <- numerator / (6 * denominator^(3/2))

# ---- Adjusted percentiles ----
z_alpha <- qnorm(alpha / 2)
z_1_alpha <- qnorm(1 - alpha / 2)

alpha1 <- pnorm(z0 + (z0 + z_alpha) / (1 - a_hat * (z0 + z_alpha)))
alpha2 <- pnorm(z0 + (z0 + z_1_alpha) / (1 - a_hat * (z0 + z_1_alpha)))

# ---- BCa interval ----
lower_bca <- quantile(boot_estimates, probs = alpha1)
upper_bca <- quantile(boot_estimates, probs = alpha2)

ci_bca <- c(lower = lower_bca, upper = upper_bca)
ci_bca
lower.4.130304% upper.98.67843% 
       3.222668        5.850080 
# BCa CI using boot package
boot_ci_bca <- boot.ci(boot_result, type = "bca")

boot_ci_bca
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot_result, type = "bca")

Intervals : 
Level       BCa          
95%   ( 3.225,  5.865 )  
Calculations and Intervals on Original Scale

4 Question (d) Asymptotic (CLT-Based) Confidence Interval for \(\mu_{LN} = E[X]\)

An alternative to bootstrap-based confidence intervals is the asymptotic confidence interval based on the Central Limit Theorem (CLT). This approach assumes that the estimator \(\widehat{E[X]}\) is approximately normally distributed for sufficiently large sample sizes.

Under this assumption, a 95% confidence interval is constructed as: [ \(\widehat{E[X]} ;\pm\); z_{0.975} \(\cdot \widehat{\mathrm{SE}}(\widehat{E[X]})\),] where \(z_{0.975}\) is the standard normal critical value and \(\widehat{\mathrm{SE}}(\widehat{E[X]})\) is the estimated standard error of the estimator.

In this problem, the standard error is approximated using the bootstrap distribution. Specifically, the standard deviation of the bootstrap estimates is used as an estimate of \(\widehat{\mathrm{SE}}(\widehat{E[X]})\).

Conceptually, this method constructs a symmetric interval around the estimate, unlike the percentile and BCa methods which adapt to the shape of the bootstrap distribution. As a result, if the bootstrap distribution exhibits slight skewness, the CLT-based interval may not fully capture that asymmetry.

Unlike parts (b) and (c), there is no direct implementation of the CLT-based interval in the \(\texttt{boot}\) package. Instead, the bootstrap is used here only to estimate the standard error, and the interval is constructed using the normal approximation.

# --- CLT / Normal-based CI ---

# Original estimate
theta_hat <- estimate_mean_lognormal(data)

# Bootstrap standard error (std dev of bootstrap estimates)
se_boot <- sd(boot_estimates)

# Critical value for 95% CI
z <- qnorm(0.975)

# Construct CI
lower_clt <- theta_hat - z * se_boot
upper_clt <- theta_hat + z * se_boot

ci_clt <- c(lower = lower_clt, upper = upper_clt)
ci_clt
   lower    upper 
2.959686 5.476512 

5 Question (e) Comparison of Confidence Intervals

We compare the three confidence intervals constructed for \(\mu_{LN} = E[X]\): the percentile bootstrap interval, the BCa bootstrap interval, and the asymptotic (CLT-based) interval.

The percentile bootstrap interval is: [3.094, 5.643], while the BCa interval is: [3.223, 5.850]. These two intervals are similar in width, but the BCa interval is slightly shifted upward and extends further on the upper end. This reflects the bias and skewness corrections applied by the BCa method.

The CLT-based interval is: [2.960, 5.477], which is symmetric around \(\widehat{E[X]}\). Compared to the bootstrap intervals, it has a lower lower-bound and a slightly smaller upper-bound, indicating that it does not fully capture the asymmetry present in the bootstrap distribution.

In terms of performance:

  • The percentile and BCa intervals are data-driven and reflect the empirical distribution.
  • The BCa interval provides additional correction for bias and skewness, making it more robust.
  • The CLT interval relies on a normality assumption and may be less accurate when the distribution is not perfectly symmetric.

Overall, the BCa interval is the most appropriate choice in this setting due to its ability to account for both bias and skewness. The percentile interval provides a reasonable approximation, while the CLT interval serves as a useful but more assumption-dependent benchmark.

---
title: 'STA 506 HOMEWORK 7'
author: 'Gerard Ike'
date: "2026-03-24"
output:
  html_document:              # output document format
    toc: yes                  # add table contents
    toc_float: yes            # toc_property: floating
    toc_depth: 4              # depth of TOC headings
    fig_width: 6              # global figure width
    fig_height: 4             # global figure height
    fig_caption: yes          # add figure caption
    number_sections: yes      # numbering section headings
    toc_collapsed: yes        # TOC subheading clapsing
    code_folding: hide        # folding/showing code
    code_download: yes        # allow to download complete RMarkdown source code
    smooth_scroll: yes        # scrolling text of the document
    theme: lumen              # visual theme for HTML document only
    highlight: tango          # code syntax highlighting styles
  pdf_document:
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
  word_document:
    toc: yes
    toc_depth: '4'
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.
#
knitr::opts_chunk$set(echo = TRUE,            # include code chunk in the output file
                      warning = FALSE,        # sometimes, you code may produce warning messages,
                                              # you can choose to include the warning messages in
                                              # the output file. 
                      results = TRUE,         # you can also decide whether to include the output
                                              # in the output file.
                      message = FALSE,        # suppress messages 
                      comment = NA            # remove the default leading hash tags in the output
                      )   
```

# Question (a): Bootstrap Sampling Distribution of $\widehat{E[X]}$

To estimate the sampling distribution of $\widehat{E[X]}$, we apply the bootstrap procedure. The key idea is to treat the observed sample as an approximation of the population and repeatedly resample from it with replacement.


```{r}
# Input data (Pb concentrations)
data <- c(
  0.85, 1.23, 0.92, 3.45, 2.11, 1.56, 4.89, 2.34, 1.78, 6.72, 0.95, 1.34, 8.91,
  2.67, 1.89, 5.43, 1.12, 3.78, 2.45, 7.65, 1.05, 1.45, 12.34, 2.89, 2.01, 4.56,
  1.23, 4.32, 2.67, 9.87, 0.99, 1.56, 15.23, 3.12, 2.34, 3.89, 1.34, 5.67, 2.89,
  11.45, 1.12, 1.67, 18.90, 3.45, 2.56, 3.45, 1.45, 6.78, 3.12, 14.56, 1.23, 1.78,
  22.34, 3.78, 2.78
)

# Number of bootstrap samples
B <- 5000
n <- length(data)

# Function to compute E[X] using log-normal MLE
estimate_mean_lognormal <- function(x) {
  log_x <- log(x)
  mu_hat <- mean(log_x)
  sigma2_hat <- mean((log_x - mu_hat)^2)
  
  # Plug-in estimator for E[X]
  exp(mu_hat + sigma2_hat / 2)
}

# Store bootstrap estimates
boot_estimates <- numeric(B)

set.seed(123)  # for reproducibility

# Bootstrap loop
for (b in 1:B) {
  sample_b <- sample(data, size = n, replace = TRUE)  # resample with replacement
  boot_estimates[b] <- estimate_mean_lognormal(sample_b)
}

# Plot histogram with density scaling
hist(boot_estimates,
     breaks = 30,
     probability = TRUE,   # ensures histogram is on density scale
     main = "Bootstrap Distribution of E[X]",
     xlab = "Bootstrap Estimates",
     border = "black")

# Add smooth density curve on top
lines(density(boot_estimates),
      lwd = 2)  # thicker line for visibility
```


The resulting bootstrap distribution appears approximately unimodal and roughly symmetric, with a slight right skew. The peak of the distribution is centered around values between 4 and 5, indicating that most bootstrap estimates of $\widehat{E[X]}$ fall within this range.

In terms of variability, the distribution shows moderate spread, suggesting a reasonable level of uncertainty in estimating $E[X]$. While there is a mild right tail extending toward larger values, extreme values are relatively infrequent, indicating that the estimator is fairly stable across resamples.

Overall, the bootstrap sampling distribution is well-behaved, with no severe skewness or irregular patterns, supporting the reliability of the bootstrap procedure in this setting.


# Question (b) 95% Percentile Bootstrap Confidence Interval for $\mu_{LN} = E[X]$

The percentile bootstrap confidence interval is constructed directly from the empirical distribution of the bootstrap estimates. This method does not impose any assumptions about symmetry or normality and instead uses the observed distribution of $\widehat{E[X]}$.

After generating $B = 5000$ bootstrap estimates, we sort these values in ascending order. The 95% confidence interval is then obtained by selecting the empirical quantiles corresponding to the lower and upper tails of the distribution.

Specifically, for a $(1 - \alpha) = 0.95$ confidence level, we take:

* the $2.5%$ percentile as the lower bound
* the $97.5%$ percentile as the upper bound

This yields the interval:
[
\left[ \widehat{E[X]}*{(0.025)}, ; \widehat{E[X]}*{(0.975)} \right]
]

Conceptually, this interval represents the central 95% of the bootstrap distribution, meaning that 95% of the resampled estimates fall within this range. Since this method relies entirely on the empirical distribution, it naturally adapts to any skewness or asymmetry present in the data.

Thus, the percentile bootstrap confidence interval provides a simple and intuitive estimate of uncertainty for $\mu_{LN} = E[X]$ based on the observed data.

```{r}
# --- Manual Percentile CI ---
alpha <- 0.05

lower_perc <- quantile(boot_estimates, probs = alpha / 2)
upper_perc <- quantile(boot_estimates, probs = 1 - alpha / 2)

ci_percentile <- c(lower = lower_perc, upper = upper_perc)
ci_percentile


# --- Using boot package for verification ---
library(boot)

# Define statistic function for boot()
boot_stat <- function(data, indices) {
  sample_data <- data[indices]
  
  log_x <- log(sample_data)
  mu_hat <- mean(log_x)
  sigma2_hat <- mean((log_x - mu_hat)^2)
  
  # Return E[X] estimator
  exp(mu_hat + sigma2_hat / 2)
}

set.seed(123)

boot_result <- boot(data, statistic = boot_stat, R = 5000)

# Percentile CI using boot package
boot_ci <- boot.ci(boot_result, type = "perc")

boot_ci
```

# Question (c) 95% BCa Bootstrap Confidence Interval for $\mu_{LN} = E[X]$

The bias-corrected and accelerated (BCa) bootstrap confidence interval improves upon the percentile method by adjusting for both bias and skewness in the bootstrap distribution.

Conceptually, the BCa method modifies the percentile cutoffs based on two factors:

* A bias correction term, which accounts for whether the bootstrap distribution is centered around the original estimate
* An acceleration term, which adjusts for skewness by measuring how sensitive the estimator is to individual observations

These adjustments lead to modified percentile levels, which are then used to determine the confidence interval from the bootstrap distribution.

Compared to the percentile interval, the BCa interval may exhibit slight asymmetry, reflecting adjustments for bias and skewness in the bootstrap distribution. This makes the BCa method more reliable, particularly when the distribution is not perfectly symmetric.

```{r}
# --- Manual BCa CI ---

alpha <- 0.05
B <- length(boot_estimates)

# Original estimate
theta_hat <- estimate_mean_lognormal(data)

# ---- Bias correction (z0) ----
prop_less <- mean(boot_estimates < theta_hat)
z0 <- qnorm(prop_less)

# ---- Jackknife for acceleration ----
n <- length(data)
jack_vals <- numeric(n)

for (i in 1:n) {
  jack_sample <- data[-i]  # leave-one-out
  jack_vals[i] <- estimate_mean_lognormal(jack_sample)
}

theta_jack_mean <- mean(jack_vals)

numerator <- sum((theta_jack_mean - jack_vals)^3)
denominator <- sum((theta_jack_mean - jack_vals)^2)

a_hat <- numerator / (6 * denominator^(3/2))

# ---- Adjusted percentiles ----
z_alpha <- qnorm(alpha / 2)
z_1_alpha <- qnorm(1 - alpha / 2)

alpha1 <- pnorm(z0 + (z0 + z_alpha) / (1 - a_hat * (z0 + z_alpha)))
alpha2 <- pnorm(z0 + (z0 + z_1_alpha) / (1 - a_hat * (z0 + z_1_alpha)))

# ---- BCa interval ----
lower_bca <- quantile(boot_estimates, probs = alpha1)
upper_bca <- quantile(boot_estimates, probs = alpha2)

ci_bca <- c(lower = lower_bca, upper = upper_bca)
ci_bca
```

```{r}
# BCa CI using boot package
boot_ci_bca <- boot.ci(boot_result, type = "bca")

boot_ci_bca
```


# Question (d) Asymptotic (CLT-Based) Confidence Interval for $\mu_{LN} = E[X]$

An alternative to bootstrap-based confidence intervals is the asymptotic confidence interval based on the Central Limit Theorem (CLT). This approach assumes that the estimator $\widehat{E[X]}$ is approximately normally distributed for sufficiently large sample sizes.

Under this assumption, a 95% confidence interval is constructed as:
[
$\widehat{E[X]} ;\pm$; z_{0.975} $\cdot \widehat{\mathrm{SE}}(\widehat{E[X]})$,
]
where $z_{0.975}$ is the standard normal critical value and $\widehat{\mathrm{SE}}(\widehat{E[X]})$ is the estimated standard error of the estimator.

In this problem, the standard error is approximated using the bootstrap distribution. Specifically, the standard deviation of the bootstrap estimates is used as an estimate of $\widehat{\mathrm{SE}}(\widehat{E[X]})$.

Conceptually, this method constructs a symmetric interval around the estimate, unlike the percentile and BCa methods which adapt to the shape of the bootstrap distribution. As a result, if the bootstrap distribution exhibits slight skewness, the CLT-based interval may not fully capture that asymmetry.

Unlike parts (b) and (c), there is no direct implementation of the CLT-based interval in the $\texttt{boot}$ package. Instead, the bootstrap is used here only to estimate the standard error, and the interval is constructed using the normal approximation.

```{r}
# --- CLT / Normal-based CI ---

# Original estimate
theta_hat <- estimate_mean_lognormal(data)

# Bootstrap standard error (std dev of bootstrap estimates)
se_boot <- sd(boot_estimates)

# Critical value for 95% CI
z <- qnorm(0.975)

# Construct CI
lower_clt <- theta_hat - z * se_boot
upper_clt <- theta_hat + z * se_boot

ci_clt <- c(lower = lower_clt, upper = upper_clt)
ci_clt
```


# Question (e) Comparison of Confidence Intervals

We compare the three confidence intervals constructed for $\mu_{LN} = E[X]$: the percentile bootstrap interval, the BCa bootstrap interval, and the asymptotic (CLT-based) interval.

The percentile bootstrap interval is:
[3.094, 5.643],
while the BCa interval is:
[3.223, 5.850].
These two intervals are similar in width, but the BCa interval is slightly shifted upward and extends further on the upper end. This reflects the bias and skewness corrections applied by the BCa method.

The CLT-based interval is:
[2.960, 5.477],
which is symmetric around $\widehat{E[X]}$. Compared to the bootstrap intervals, it has a lower lower-bound and a slightly smaller upper-bound, indicating that it does not fully capture the asymmetry present in the bootstrap distribution.

In terms of performance:

* The percentile and BCa intervals are data-driven and reflect the empirical distribution.
* The BCa interval provides additional correction for bias and skewness, making it more robust.
* The CLT interval relies on a normality assumption and may be less accurate when the distribution is not perfectly symmetric.

Overall, the BCa interval is the most appropriate choice in this setting due to its ability to account for both bias and skewness. The percentile interval provides a reasonable approximation, while the CLT interval serves as a useful but more assumption-dependent benchmark.

