Data and Setup

data = rbind.data.frame(0.85, 1.23, 0.92, 3.45, 2.11, 1.56, 4.89, 2.34, 1.78, 6.72, 0.95, 1.34, 8.91, 
2.67, 1.89, 5.43, 1.12, 3.78, 2.45, 7.65, 1.05, 1.45, 12.34, 2.89, 2.01, 4.56, 
1.23, 4.32, 2.67, 9.87, 0.99, 1.56, 15.23, 3.12, 2.34, 3.89, 1.34, 5.67, 2.89, 
11.45, 1.12, 1.67, 18.90, 3.45, 2.56, 3.45, 1.45, 6.78, 3.12, 14.56, 1.23, 1.78, 
22.34, 3.78, 2.78)
colnames(data) = "soil"

For this analysis, we will be examining a dataset of 55 urban garden sites’ soil lead concentrations. These concentrations of soil lead (chemical prefix “Pb”) are measured in mg/kg. Moving forward in this analysis, we will assume the data, and broader population of lead concentrations in urban sites follow a log-normal distribution. This assumption is predicated on the fact that samples such as what we are working with are typically notably right skewed and by nature must be non-negative.

Summary statistics and a histogram of the dataset at hand are below.

sum_stats_table = function(variable){
  table = cbind(
    round(min(variable),2),
    round(quantile(variable, 0.25),2),
    round(median(variable),2),
    round(mean(variable),2),
    round(quantile(variable, 0.75),2),
    round(max(variable),2)
  )
  colnames(table) = c("Min", "Q1", "Med", "Mean", "Q3", "Max")
  rownames(table) = NULL
  return(table)
}

Summary_Table = sum_stats_table(variable = data$soil)
kable(Summary_Table,align = "c",
      caption = "<span style='color:##000000;'>
      Summary Stats of Soil Times </span>") %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")
Summary Stats of Soil Times
Min Q1 Med Mean Q3 Max
0.85 1.5 2.67 4.4 4.72 22.34
hist(data$soil, 
     main = "Distribution of Concentrations",
     xlab = "Concentrations (mg/kg)",
     prob = TRUE,
     col = "lightblue"
)
box(lwd=3)

Using our available sample data, and our presumption of log-normality, the following sections will consist of attempts to estimate the average lead concentrations (mg/kg) across the broad population of urban sites similar to those that were observed for our sample. For mathematical notation purposes, this mean lead concentration will be denoted as E[X]. I will perform both point estimation and interval estimation.

To find an initial point estimate for E[X], I utilized the MLE estimation formulas for the log normal distribution as well as standard algebraic substitution. The relevant formulas are visible below.

knitr::include_graphics("/Users/chrisbahm/Library/Mobile Documents/com~apple~CloudDocs/West Chester/Spring 2026/STA 506 - Grad Math Stat 2 /STA 506 R Folder/Week 9 - Bootstrap Confidence Intervals/MLE.png")

knitr::include_graphics("/Users/chrisbahm/Library/Mobile Documents/com~apple~CloudDocs/West Chester/Spring 2026/STA 506 - Grad Math Stat 2 /STA 506 R Folder/Week 9 - Bootstrap Confidence Intervals/EX.png")

sample_mean = mean(data$soil)

  # based on GIVEN formula in assignment instructions for MLE estimator of mu
mu_hat = sum(log(data$soil))/nrow(data)

 # based on GIVEN formula in assignment instructions for MLE estimator of theta squared
theta_squared_hat = sum((log(data$soil) - mu_hat)^2)/nrow(data)

Ex_hat = exp(mu_hat + (theta_squared_hat/2))

# E[x] = lognormal population mean

Per this mathematical setup, our starting estimate of population average lead concentration is approximately 4.218.

a.) Bootstrap Sampling Distribution

Next, I used bootstrapping to acquire a broader sampling distribution of the average lead concentration point estimate. After performing 5,000 iterations of bootstrap sampling, and calculating E[X] for each sample, the sampling distribution of population mean estimates is visible below.

library(boot)
set.seed(100) # setting seed now for entire program

bootstrap_Ex = function(data, b) {
  n = length(data)
  bootstrap_ex_estimates = numeric(b)
  
    # START FOR LOOP
   for (i in 1:b){
     boot_sample = sample(data, size = n, replace= TRUE)
     mu_hat[i] = sum(log(boot_sample))/n
        # make sure to use boot_sample in this line, previously I was using data inside the log function and my output was the same value for every element of the vector
     
     theta_squared_hat[i] = sum((log(boot_sample) - mu_hat[i])^2)/n
     
     bootstrap_ex_estimates[i] = exp(mu_hat[i] +
                                       (theta_squared_hat[i]/2))
   }
    # END FOR LOOP
  
  return(bootstrap_ex_estimates) # final output
}

  # always remember to set the seed for consistent repetition
Ex_sampling_distribution = bootstrap_Ex(data = data$soil, b = 5000)

hist(Ex_sampling_distribution, 
     main = "Sampling Distribution of Population Mean Estimates",
     xlab = "Concentrations (mg/kg)",
     prob = TRUE,
     col = "lightblue"
)

box(lwd=3)

As we can see from our histogram, the sampling distribution of E[X] estimates is roughly log-normal distributed, with a rightward skew. The mean of our sampling distribution is roughly 4.248, which is extraordinarily close to the population mean estimate acquired strictly from our available sample data of 4.218.

b.) Bootstrap Percentile CI

After finding a sample distribution, I once again employ bootstrapping, this time to construct a percentile confidence interval for E[X]. Percentile intervals are a solid foundation for parameter estimation. They are the least mathematically complex of all confidence intervals, relying on the left and right quantile cutoffs of our distribution, as per the chosen level of \(\alpha\). Since the confidence level for this interval will be 95%, \(\alpha\) = 0.05 in this instance.

custom_ci_percentile = function(data, b, alpha){
  n = length(data)
  bootstrap_ex_estimates = numeric(b)
  mu_hat = numeric(b)
  theta_squared_hat = numeric(b)
  
  # for this function can use the same bootstrapping procedure from above to obtain the bootstrap_ex estimates vector
    # START FOR LOOP
    for(i in 1:b){
     boot_sample = sample(data, size = n, replace= TRUE)
     mu_hat[i] = sum(log(boot_sample))/n
     theta_squared_hat[i] = sum((log(boot_sample) - mu_hat[i])^2)/n
     bootstrap_ex_estimates[i] = exp(mu_hat[i] +
                                       (theta_squared_hat[i]/2))
    } # END FOR LOOP
  
  
  sort_bootstrap_ex_estimates = sort(bootstrap_ex_estimates)
  lower_quant = (alpha/2)
  higher_quant = (1-(alpha/2))
  lower_estimate = quantile(sort_bootstrap_ex_estimates, lower_quant)
  higher_estimate = quantile(sort_bootstrap_ex_estimates, higher_quant)
  results = rbind(lower_estimate,
                  higher_estimate)
  colnames(results) = "Custom Percent"
  rownames(results) = c("Lower Estimate", "Higher Estimate")
 
  return (results)
}

ci_percent_custom = custom_ci_percentile(data$soil, b = 5000, alpha = 0.05)

#ci_percent_custom
### built-in R function

######
# reference section 3.2.3 of notes

## create stat function that finds E[X] from a singular bootstrap sample of our data

find_ex = function(data, indices){
  n = length(data)
  sample_data = data[indices]
  
  mu_hat = sum(log(sample_data))/n
  theta_squared_hat = sum((log(data) - mu_hat)^2)/n
  
  ex_estimate = exp(mu_hat + 
                      (theta_squared_hat/2))
}
  # now that this function is made, plug it into the boot function

boot_result = boot(data$soil, find_ex, R = 5000)

ci_percent_r_stats = boot.ci(boot_result, type = "perc")
  # Note: boot.ci function defaults to 95% confidence level, can change this using the conf = argument

#ci_percent_r_stats

#ci_percent_r_stats$percent

#ci_percent_r_stats$percent[4] # = lower interval end
#ci_percent_r_stats$percent[5] # = higher interval end

ci_percent_r = rbind(ci_percent_r_stats$percent[4],
                     ci_percent_r_stats$percent[5])

rownames(ci_percent_r) = c("Lower Estimate", "Higher Estimate")
colnames(ci_percent_r) = "Percentile"

#ci_percent_r

To construct the percentile confidence interval for the mean soil concentration amongst the broader population of urban sites, I resorted to both custom function creation as well as R’s boot() function. Per these results, we can say with approximately 95% confidence that the population mean is either between 3.154 and 5.565 mg/kg or 3.456 and 5.413 mg/kg.

c.) Bootstrap BCa CI

After creating the percentile intervals, I then moved forward with bootstrap BCa (bias-corrected and accelerated) intervals. BCa intervals are a more sophisticated option to their standard percentile counterparts. They are especially valued when dealing with skewed distributions/data or biased estimators, something that the percentile option does not account for nearly as much.

### custom function
# reference section 3.3.2

# START FUNCTION
custom_ci_bca = function(data, b, alpha, original_estimate){ 
  n = length(data)
  bootstrap_ex_estimates = numeric(b)
  
# bias correction "measures median bias by comparing bootstrap estimates to the original estimate."
  
  prop_less = 0 # initialize counting variable at = 0
  
  for (i in 1:b){ # START BOOTSTRAP LOOP
     boot_sample = sample(data, size = n, replace= TRUE)
     mu_hat[i] = sum(log(boot_sample))/n
     theta_squared_hat[i] = sum((log(boot_sample) - mu_hat[i])^2)/n
     bootstrap_ex_estimates[i] = exp(mu_hat[i] +
                                       (theta_squared_hat[i]/2))
     
     # bias correction factor below
     ## can use normal if statement syntax, just incorporate the [i] to indicate that the statement is evaluating element by element
         if (bootstrap_ex_estimates[i] < original_estimate){
          prop_less = prop_less + 1
         }
     } # END BOOTSTRAP LOOP
     
      #### START CORRECTION CALCULATION  
        prop_less = prop_less/b
        z0 = qnorm(prop_less)
        # now have bias correction
      #### END BIAS CORRECTION CALCULATION
  
        
      #### START ACCELERATION CALCULATION
          # good practice to use different "counting variable than was earlier used in function, I'll use j instead of i
     jack_vals = numeric(n)
     
     for (j in 1:n){ # START JACK LOOP
       jack_sample = data[-j]
       
       mu_jack = sum(log(jack_sample)) / (n-1)
        # the jackknife estimate is computed by leaving OUT the ith observation
       theta_squared_jack = sum((log(jack_sample)-mu_jack)^2) / (n-1)
       jack_vals[j] = exp(mu_jack + (theta_squared_jack/2))
     } # END JACK LOOP
     
       theta_jack_mean = mean(jack_vals)
             # we want the difference between the total average and each jackknife estimate that left one out
       numerator = sum((theta_jack_mean - jack_vals)^3)
       denominator = sum((theta_jack_mean - jack_vals)^2)
       
       a_hat = numerator/ (6 * denominator^(3/2))
     #### END ACCELERATION CALCULATION
       
       
    ### FINAL OUTPUT/INTERVAL
       z_score1 = qnorm(alpha/2)
       z_score2 = qnorm(1-alpha/2)
       alpha1 = pnorm(z0 + (z0 + z_score1)/(1-a_hat*(z0 + z_score1)))
       alpha2 = pnorm(z0 + (z0 + z_score2)/(1-a_hat*(z0 + z_score2)))
       
       lower = quantile(bootstrap_ex_estimates, alpha1)
       upper = quantile(bootstrap_ex_estimates, alpha2)
       results = rbind(lower,upper)
       rownames(results) = c("Lower Estimate", "Higher Estimate")
       colnames(results) = "BCA"
       
       return(results)
            
} # END FUNCTION


#test
ci_bca_custom = custom_ci_bca(data = data$soil, b= 5000, alpha = 0.05, original_estimate = Ex_hat)

#ci_bca_custom
### built-in R function

# can use boot_result object from section b
# once again using boot.ci function, however argument changes to "bca"
ci_bca_r_stats = boot.ci(boot_result, type = "bca")

#ci_bca_r_stats
  # interval between 3.515 and 5.515

#ci_bca_r_stats$bca

#ci_bca_r_stats$bca[4] # = lower interval end
#ci_bca_r_stats$bca[5] # = higher interval end

ci_bca_r = rbind(ci_bca_r_stats$bca[4],
                     ci_bca_r_stats$bca[5])

rownames(ci_bca_r) = c("Lower Estimate", "Higher Estimate")
colnames(ci_bca_r) = "BCA"

#ci_bca_r

Once again utilizing two different coding methods, my BCa results were the following. With approximately 95% confidence, we can claim that the population mean of lead concentration in urban sites is somewhere between 3.244 and 5.85 mg/kg or 3.473 and 5.443 mg/kg.

d.) Asymptotic CI

Lastly, I attempted to estimate the population mean lead concentration, E[X], via an asymptotic or “normal” approach. Asymptotic sampling is of course predicated on the idea of the central limit theorem, meaning that an increase in sample size is correlated with an increase in normalization.

A fundamental difference in asymptotic and percentile interval estimation is that the critical value term for asymptotic intervals, which the standard error is multiplied by, is dependent on the qualities of the normal distribution. On the other hand, the critical value term for percentile intervals is dependent on the qualities of the bootstrap samples’ distribution. A downside of normal bootstrap intervals is the assumption of normally distributed data, which is not quite met to a large degree in this instance.

### custom function
  # reference section 3.1 from notes
  # can use bootstrap_Ex function from part a

custom_ci_asym = function(data, b, alpha) {
  n = length(data)
  bootstrap_ex_estimates = numeric(b)
  
  # call to previous function
  bootstrap_ex_estimates = bootstrap_Ex(data = data, b = b)
  
  # now estimate SE
  sum_part = sum((bootstrap_ex_estimates - 
                    mean(bootstrap_ex_estimates))^2)
  se = sqrt((1/(b-1))*sum_part)
  
  # now find and output interval
  crit_value = abs(qnorm(alpha/2))
  lower = mean(bootstrap_ex_estimates) - (crit_value*se)
  upper = mean(bootstrap_ex_estimates) + (crit_value*se)
  results = rbind(lower,upper)
  colnames(results) = "Custom Asym"
  rownames(results) = c("Lower Estimate", "Higher Estimate")
  
  return(results)
}

ci_asym_custom = custom_ci_asym(data = data$soil, b = 5000, alpha = 0.05)
### built-in R function

# once again use boot_result object and boot.ci function

ci_asym_r_stats = boot.ci(boot_result, type = "norm")
  # using argument "norm"; as sample size increases so does resemblance to normality as per CLT and asymptotic theory

#ci_asym_r_stats
  # about 3.162 to 5.148

#ci_asym_r_stats$normal[2] = lower interval
#ci_asym_r_stats$normal[3] = higher interval

ci_asym_r = rbind(ci_asym_r_stats$normal[2],
                  ci_asym_r_stats$normal[3])

rownames(ci_asym_r) = c("Lower Estimate", "Higher Estimate")
colnames(ci_asym_r) = "Asymptotic"

#ci_asym_r

Per the asymptotic method, we can say with roughly 95% certainty that the mean lead concentration amongst urban sites is either between 2.974 and 5.493 mg/kg or 3.184 and 5.135 mg/kg

e.) Interval Comparison

Below are two comparison tables displaying the results of each of the three interval methods that have been discussed. The top table compares results that were computed via R’s boot() function, while the latter table shows results calculated via my custom-built functions. The code for each custom function can be found underneath the respective sections.

## comparing the intervals created via R's boot function
r_intervals = cbind(round(ci_percent_r,3),
                    round(ci_bca_r,3),
                    round(ci_asym_r,3))
rownames(r_intervals) = c("Lower Estimate", "Higher Estimate")

kable(r_intervals,align = "c",
      caption = "<span style='color:##000000;'>
      Intervals via R's Boot Function </span>") %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")
Intervals via R’s Boot Function
Percentile BCA Asymptotic
Lower Estimate 3.456 3.473 3.184
Higher Estimate 5.413 5.443 5.135
## comparing the intervals created via my custom functions
custom_intervals = cbind(round(ci_percent_custom,3),
                    round(ci_bca_custom,3),
                    round(ci_asym_custom,3))
rownames(custom_intervals) = c("Lower Estimate", "Higher Estimate")
colnames(custom_intervals) = c("Percentile", "BCA", "Asymptotic")

kable(custom_intervals,align = "c",
      caption = "<span style='color:##000000;'>
      Intervals via Custom-Built Function </span>") %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")
Intervals via Custom-Built Function
Percentile BCA Asymptotic
Lower Estimate 3.154 3.244 2.974
Higher Estimate 5.565 5.850 5.493

Final Recommendation

Taking everything into consideration, I would suggest moving forward under the assumption that the BCa interval is the most accurate.

As I alluded to in a previous section, use of the the asymptotic interval generally assumes a relatively normal distribution of the population of interest. In this scenario, we have not assumed normality for the population of urban sites’ lead concentrations, but rather log-normality. Beyond that, the BCa interval method is the most robust in the face of skewness and bias. Its bias correction component utilizes the median statistic to counterbalance outliers, something that is certainly relevant to both our sample data and bootstrap distribution of E[X] estimates.

Below are the histograms for both the sample data, and population mean bootstrap estimates, this time with the 90th percentile of the dataset marked. This allows us to see how great of a rightward skew is applicable to this particular analysis.

hist(data$soil, 
     main = "Distribution of Concentrations",
     xlab = "Concentrations (mg/kg)",
     prob = TRUE,
     col = "lightblue"
)

 # Data point 90th percentile
    segments(
      x0 = quantile(data$soil,0.9),
      y0 = 0,
      x1 = quantile(data$soil,0.9),
      y1 = 1,
      col = "black",
      lwd = 4,
      lty = 3
    )

legend("topright",
       legend = "90th Percentile",
       col = "black",
       lwd = 4,
       lty = 3,
       box.lwd=3)

box(lwd=3)

hist(Ex_sampling_distribution, 
     main = "Sampling Distribution of Population Mean Estimates",
     xlab = "Concentrations (mg/kg)",
     prob = TRUE,
     col = "lightblue"
)

 # Data point 90th percentile
    segments(
      x0 = quantile(Ex_sampling_distribution,0.9),
      y0 = 0,
      x1 = quantile(Ex_sampling_distribution,0.9),
      y1 = 1,
      col = "black",
      lwd = 4,
      lty = 3
    )

legend("topright",
       legend = "90th Percentile",
       col = "black",
       lwd = 4,
       lty = 3,
       box.lwd=3)    
  
box(lwd=3)

---
title: "Bootstrap Confidence Intervals"
author: "Chris Bahm"
date: "March 29, 2026"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: true
    number_sections: false
    code_folding: hide
    code_download: true
    theme: lumen
    highlight: tango
  pdf_document:
    toc: true
    toc_depth: 4
    fig_caption: true
    number_sections: true
  word_document:
    toc: true
    toc_depth: 4
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.

if (!require(tidyverse)) {library(tidyvserse)} 

if (!require(GGally)) {library(GGally)} 

if (!require(kableExtra)) {library(kableExtra)} 

if (!require(ggplot2)) {library(ggplot2)} 

if (!require(car)) {library(car)} 

if (!require(dplyr)) {library(dplyr)} 

if (!require(pander)) {library(pander)} 

if (!require(forecast)) {library(forecast)} 

if (!require(lubridate)) {library(lubridate)} 

if (!require("scales")) {
install.packages("scales")                                        
library("scales") 
}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE,
	comment = NA,
	results = TRUE,
	fig.align = "center"
)
```


# Data and Setup

```{r}
data = rbind.data.frame(0.85, 1.23, 0.92, 3.45, 2.11, 1.56, 4.89, 2.34, 1.78, 6.72, 0.95, 1.34, 8.91, 
2.67, 1.89, 5.43, 1.12, 3.78, 2.45, 7.65, 1.05, 1.45, 12.34, 2.89, 2.01, 4.56, 
1.23, 4.32, 2.67, 9.87, 0.99, 1.56, 15.23, 3.12, 2.34, 3.89, 1.34, 5.67, 2.89, 
11.45, 1.12, 1.67, 18.90, 3.45, 2.56, 3.45, 1.45, 6.78, 3.12, 14.56, 1.23, 1.78, 
22.34, 3.78, 2.78)
colnames(data) = "soil"
```
For this analysis, we will be examining a dataset of `r nrow(data)` urban garden sites' soil lead concentrations. These concentrations of soil lead (chemical prefix "Pb") are measured in mg/kg. Moving forward in this analysis, we will assume the data, and broader population of lead concentrations in urban sites follow a **log-normal distribution**. This assumption is predicated on the fact that samples such as what we are working with are typically notably right skewed and by nature must be non-negative.

Summary statistics and a histogram of the dataset at hand are below.

```{r, fig.height=5.5, fig.width=6.5}
sum_stats_table = function(variable){
  table = cbind(
    round(min(variable),2),
    round(quantile(variable, 0.25),2),
    round(median(variable),2),
    round(mean(variable),2),
    round(quantile(variable, 0.75),2),
    round(max(variable),2)
  )
  colnames(table) = c("Min", "Q1", "Med", "Mean", "Q3", "Max")
  rownames(table) = NULL
  return(table)
}

Summary_Table = sum_stats_table(variable = data$soil)
kable(Summary_Table,align = "c",
      caption = "<span style='color:##000000;'>
      Summary Stats of Soil Times </span>") %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")


hist(data$soil, 
     main = "Distribution of Concentrations",
     xlab = "Concentrations (mg/kg)",
     prob = TRUE,
     col = "lightblue"
)
box(lwd=3)

```

Using our available sample data, and our presumption of log-normality, the following sections will consist of attempts to estimate the **average lead concentrations** (mg/kg) across the **broad population** of urban sites similar to those that were observed for our sample. For mathematical notation purposes, this mean lead concentration will be denoted as **E[X]**. I will perform both point estimation and interval estimation.

To find an initial point estimate for E[X], I utilized the MLE estimation formulas for the log normal distribution as well as standard algebraic substitution. The relevant formulas are visible below.

```{r}
knitr::include_graphics("/Users/chrisbahm/Library/Mobile Documents/com~apple~CloudDocs/West Chester/Spring 2026/STA 506 - Grad Math Stat 2 /STA 506 R Folder/Week 9 - Bootstrap Confidence Intervals/MLE.png")

knitr::include_graphics("/Users/chrisbahm/Library/Mobile Documents/com~apple~CloudDocs/West Chester/Spring 2026/STA 506 - Grad Math Stat 2 /STA 506 R Folder/Week 9 - Bootstrap Confidence Intervals/EX.png")
```


```{r}
sample_mean = mean(data$soil)

  # based on GIVEN formula in assignment instructions for MLE estimator of mu
mu_hat = sum(log(data$soil))/nrow(data)

 # based on GIVEN formula in assignment instructions for MLE estimator of theta squared
theta_squared_hat = sum((log(data$soil) - mu_hat)^2)/nrow(data)

Ex_hat = exp(mu_hat + (theta_squared_hat/2))

# E[x] = lognormal population mean
```

Per this mathematical setup, our starting estimate of population average lead concentration is approximately `r round(Ex_hat,3)`. 


# a.) Bootstrap Sampling Distribution
Next, I used bootstrapping to acquire a broader sampling distribution of the average lead concentration point estimate. After performing 5,000 iterations of bootstrap sampling, and calculating E[X] for each sample, the sampling distribution of population mean estimates is visible below.

```{r, fig.height=5.5, fig.width=6.5}
library(boot)
set.seed(100) # setting seed now for entire program

bootstrap_Ex = function(data, b) {
  n = length(data)
  bootstrap_ex_estimates = numeric(b)
  
    # START FOR LOOP
   for (i in 1:b){
     boot_sample = sample(data, size = n, replace= TRUE)
     mu_hat[i] = sum(log(boot_sample))/n
        # make sure to use boot_sample in this line, previously I was using data inside the log function and my output was the same value for every element of the vector
     
     theta_squared_hat[i] = sum((log(boot_sample) - mu_hat[i])^2)/n
     
     bootstrap_ex_estimates[i] = exp(mu_hat[i] +
                                       (theta_squared_hat[i]/2))
   }
    # END FOR LOOP
  
  return(bootstrap_ex_estimates) # final output
}

  # always remember to set the seed for consistent repetition
Ex_sampling_distribution = bootstrap_Ex(data = data$soil, b = 5000)

hist(Ex_sampling_distribution, 
     main = "Sampling Distribution of Population Mean Estimates",
     xlab = "Concentrations (mg/kg)",
     prob = TRUE,
     col = "lightblue"
)

box(lwd=3)
```

As we can see from our histogram, the sampling distribution of E[X] estimates is roughly log-normal distributed, with a rightward skew. The mean of our sampling distribution is roughly `r round(mean(Ex_sampling_distribution),3)`, which is extraordinarily close to the population mean estimate acquired strictly from our available sample data of `r round(Ex_hat,3)`.


# b.) Bootstrap Percentile CI
After finding a sample distribution, I once again employ bootstrapping, this time to construct a percentile confidence interval for E[X]. Percentile intervals are a solid foundation for parameter estimation. They are the least mathematically complex of all confidence intervals, relying on the left and right quantile cutoffs of our distribution, as per the chosen level of $\alpha$. Since the confidence level for this interval will be 95%, $\alpha$ = 0.05 in this instance.

```{r}
custom_ci_percentile = function(data, b, alpha){
  n = length(data)
  bootstrap_ex_estimates = numeric(b)
  mu_hat = numeric(b)
  theta_squared_hat = numeric(b)
  
  # for this function can use the same bootstrapping procedure from above to obtain the bootstrap_ex estimates vector
    # START FOR LOOP
    for(i in 1:b){
     boot_sample = sample(data, size = n, replace= TRUE)
     mu_hat[i] = sum(log(boot_sample))/n
     theta_squared_hat[i] = sum((log(boot_sample) - mu_hat[i])^2)/n
     bootstrap_ex_estimates[i] = exp(mu_hat[i] +
                                       (theta_squared_hat[i]/2))
    } # END FOR LOOP
  
  
  sort_bootstrap_ex_estimates = sort(bootstrap_ex_estimates)
  lower_quant = (alpha/2)
  higher_quant = (1-(alpha/2))
  lower_estimate = quantile(sort_bootstrap_ex_estimates, lower_quant)
  higher_estimate = quantile(sort_bootstrap_ex_estimates, higher_quant)
  results = rbind(lower_estimate,
                  higher_estimate)
  colnames(results) = "Custom Percent"
  rownames(results) = c("Lower Estimate", "Higher Estimate")
 
  return (results)
}

ci_percent_custom = custom_ci_percentile(data$soil, b = 5000, alpha = 0.05)

#ci_percent_custom
```

```{r}
### built-in R function

######
# reference section 3.2.3 of notes

## create stat function that finds E[X] from a singular bootstrap sample of our data

find_ex = function(data, indices){
  n = length(data)
  sample_data = data[indices]
  
  mu_hat = sum(log(sample_data))/n
  theta_squared_hat = sum((log(data) - mu_hat)^2)/n
  
  ex_estimate = exp(mu_hat + 
                      (theta_squared_hat/2))
}
  # now that this function is made, plug it into the boot function

boot_result = boot(data$soil, find_ex, R = 5000)

ci_percent_r_stats = boot.ci(boot_result, type = "perc")
  # Note: boot.ci function defaults to 95% confidence level, can change this using the conf = argument

#ci_percent_r_stats

#ci_percent_r_stats$percent

#ci_percent_r_stats$percent[4] # = lower interval end
#ci_percent_r_stats$percent[5] # = higher interval end

ci_percent_r = rbind(ci_percent_r_stats$percent[4],
                     ci_percent_r_stats$percent[5])

rownames(ci_percent_r) = c("Lower Estimate", "Higher Estimate")
colnames(ci_percent_r) = "Percentile"

#ci_percent_r
```

To construct the percentile confidence interval for the mean soil concentration amongst the broader population of urban sites, I resorted to both custom function creation as well as R's <span style="color:blue;">boot()</span> function. Per these results, we can say with approximately 95% confidence that the population mean is either between `r round(ci_percent_custom[1],3)` and `r round(ci_percent_custom[2],3)` mg/kg *or* `r round(ci_percent_r[1],3)` and `r round(ci_percent_r[2],3)` mg/kg.

# c.) Bootstrap BCa CI 
After creating the percentile intervals, I then moved forward with bootstrap BCa (bias-corrected and accelerated) intervals. BCa intervals are a more sophisticated option to their standard percentile counterparts. They are especially valued when dealing with skewed distributions/data or biased estimators, something that the percentile option does *not* account for nearly as much.

```{r}
### custom function
# reference section 3.3.2

# START FUNCTION
custom_ci_bca = function(data, b, alpha, original_estimate){ 
  n = length(data)
  bootstrap_ex_estimates = numeric(b)
  
# bias correction "measures median bias by comparing bootstrap estimates to the original estimate."
  
  prop_less = 0 # initialize counting variable at = 0
  
  for (i in 1:b){ # START BOOTSTRAP LOOP
     boot_sample = sample(data, size = n, replace= TRUE)
     mu_hat[i] = sum(log(boot_sample))/n
     theta_squared_hat[i] = sum((log(boot_sample) - mu_hat[i])^2)/n
     bootstrap_ex_estimates[i] = exp(mu_hat[i] +
                                       (theta_squared_hat[i]/2))
     
     # bias correction factor below
     ## can use normal if statement syntax, just incorporate the [i] to indicate that the statement is evaluating element by element
         if (bootstrap_ex_estimates[i] < original_estimate){
          prop_less = prop_less + 1
         }
     } # END BOOTSTRAP LOOP
     
      #### START CORRECTION CALCULATION  
        prop_less = prop_less/b
        z0 = qnorm(prop_less)
        # now have bias correction
      #### END BIAS CORRECTION CALCULATION
  
        
      #### START ACCELERATION CALCULATION
          # good practice to use different "counting variable than was earlier used in function, I'll use j instead of i
     jack_vals = numeric(n)
     
     for (j in 1:n){ # START JACK LOOP
       jack_sample = data[-j]
       
       mu_jack = sum(log(jack_sample)) / (n-1)
        # the jackknife estimate is computed by leaving OUT the ith observation
       theta_squared_jack = sum((log(jack_sample)-mu_jack)^2) / (n-1)
       jack_vals[j] = exp(mu_jack + (theta_squared_jack/2))
     } # END JACK LOOP
     
       theta_jack_mean = mean(jack_vals)
             # we want the difference between the total average and each jackknife estimate that left one out
       numerator = sum((theta_jack_mean - jack_vals)^3)
       denominator = sum((theta_jack_mean - jack_vals)^2)
       
       a_hat = numerator/ (6 * denominator^(3/2))
     #### END ACCELERATION CALCULATION
       
       
    ### FINAL OUTPUT/INTERVAL
       z_score1 = qnorm(alpha/2)
       z_score2 = qnorm(1-alpha/2)
       alpha1 = pnorm(z0 + (z0 + z_score1)/(1-a_hat*(z0 + z_score1)))
       alpha2 = pnorm(z0 + (z0 + z_score2)/(1-a_hat*(z0 + z_score2)))
       
       lower = quantile(bootstrap_ex_estimates, alpha1)
       upper = quantile(bootstrap_ex_estimates, alpha2)
       results = rbind(lower,upper)
       rownames(results) = c("Lower Estimate", "Higher Estimate")
       colnames(results) = "BCA"
       
       return(results)
            
} # END FUNCTION


#test
ci_bca_custom = custom_ci_bca(data = data$soil, b= 5000, alpha = 0.05, original_estimate = Ex_hat)

#ci_bca_custom
```

```{r}
### built-in R function

# can use boot_result object from section b
# once again using boot.ci function, however argument changes to "bca"
ci_bca_r_stats = boot.ci(boot_result, type = "bca")

#ci_bca_r_stats
  # interval between 3.515 and 5.515

#ci_bca_r_stats$bca

#ci_bca_r_stats$bca[4] # = lower interval end
#ci_bca_r_stats$bca[5] # = higher interval end

ci_bca_r = rbind(ci_bca_r_stats$bca[4],
                     ci_bca_r_stats$bca[5])

rownames(ci_bca_r) = c("Lower Estimate", "Higher Estimate")
colnames(ci_bca_r) = "BCA"

#ci_bca_r
```

Once again utilizing two different coding methods, my BCa results were the following. With approximately 95% confidence, we can claim that the population mean of lead concentration in urban sites is somewhere between `r round(ci_bca_custom[1],3)` and `r round(ci_bca_custom[2],3)` mg/kg *or* `r round(ci_bca_r[1],3)` and `r round(ci_bca_r[2],3)` mg/kg.


# d.) Asymptotic CI 
Lastly, I attempted to estimate the population mean lead concentration, E[X], via an asymptotic or "normal" approach. Asymptotic sampling is of course predicated on the idea of the central limit theorem, meaning that an increase in sample size is correlated with an increase in normalization. 

A fundamental difference in asymptotic and percentile interval estimation is that the critical value term for asymptotic intervals, which the standard error is multiplied by, is dependent on the qualities of the normal distribution. On the other hand, the critical value term for percentile intervals is dependent on the qualities of the bootstrap samples' distribution. A downside of normal bootstrap intervals is the assumption of normally distributed data, which is not quite met to a large degree in this instance.


```{r, include=FALSE}
### testing chunk
  # ending up not going this route

## below is "formula" for creating larger population of log norm distributed data similar to our sample data
large_pop = rlnorm(50, mean = log(Ex_hat), sd = theta_squared_hat)
large_pop = sort(large_pop)
sort = sort(data$soil)

large_pop
sort
mean(large_pop)
mean(data$soil)

## make large population with characteristics of our sample / population distribution
large_pop = rlnorm(10000, mean = log(Ex_hat), sd = theta_squared_hat)

#mean(large_pop)
#mean(data$soil)
  # means are of similar value

## now take bootstrap samples of that large population

#ci_asym_custom = custom_ci_asym(data = large_pop, b = 5000, alpha = 0.05)


```

```{r}
### custom function
  # reference section 3.1 from notes
  # can use bootstrap_Ex function from part a

custom_ci_asym = function(data, b, alpha) {
  n = length(data)
  bootstrap_ex_estimates = numeric(b)
  
  # call to previous function
  bootstrap_ex_estimates = bootstrap_Ex(data = data, b = b)
  
  # now estimate SE
  sum_part = sum((bootstrap_ex_estimates - 
                    mean(bootstrap_ex_estimates))^2)
  se = sqrt((1/(b-1))*sum_part)
  
  # now find and output interval
  crit_value = abs(qnorm(alpha/2))
  lower = mean(bootstrap_ex_estimates) - (crit_value*se)
  upper = mean(bootstrap_ex_estimates) + (crit_value*se)
  results = rbind(lower,upper)
  colnames(results) = "Custom Asym"
  rownames(results) = c("Lower Estimate", "Higher Estimate")
  
  return(results)
}

ci_asym_custom = custom_ci_asym(data = data$soil, b = 5000, alpha = 0.05)

```

```{r}
### built-in R function

# once again use boot_result object and boot.ci function

ci_asym_r_stats = boot.ci(boot_result, type = "norm")
  # using argument "norm"; as sample size increases so does resemblance to normality as per CLT and asymptotic theory

#ci_asym_r_stats
  # about 3.162 to 5.148

#ci_asym_r_stats$normal[2] = lower interval
#ci_asym_r_stats$normal[3] = higher interval

ci_asym_r = rbind(ci_asym_r_stats$normal[2],
                  ci_asym_r_stats$normal[3])

rownames(ci_asym_r) = c("Lower Estimate", "Higher Estimate")
colnames(ci_asym_r) = "Asymptotic"

#ci_asym_r
```

Per the asymptotic method, we can say with roughly 95% certainty that the mean lead concentration amongst urban sites is either between `r round(ci_asym_custom[1], 3)` and `r round(ci_asym_custom[2], 3)` mg/kg *or* `r round(ci_asym_r[1], 3)` and `r round(ci_asym_r[2], 3)` mg/kg

# e.) Interval Comparison
Below are two comparison tables displaying the results of each of the three interval methods that have been discussed. The top table compares results that were computed via R's <span style="color:blue;">boot()</span> function, while the latter table shows results calculated via my custom-built functions. The code for each custom function can be found underneath the respective sections.

```{r}
## comparing the intervals created via R's boot function
r_intervals = cbind(round(ci_percent_r,3),
                    round(ci_bca_r,3),
                    round(ci_asym_r,3))
rownames(r_intervals) = c("Lower Estimate", "Higher Estimate")

kable(r_intervals,align = "c",
      caption = "<span style='color:##000000;'>
      Intervals via R's Boot Function </span>") %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")

## comparing the intervals created via my custom functions
custom_intervals = cbind(round(ci_percent_custom,3),
                    round(ci_bca_custom,3),
                    round(ci_asym_custom,3))
rownames(custom_intervals) = c("Lower Estimate", "Higher Estimate")
colnames(custom_intervals) = c("Percentile", "BCA", "Asymptotic")

kable(custom_intervals,align = "c",
      caption = "<span style='color:##000000;'>
      Intervals via Custom-Built Function </span>") %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")

```

## Final Recommendation
Taking everything into consideration, I would suggest moving forward under the assumption that the **BCa interval** is the most accurate. 

As I alluded to in a previous section, use of the the asymptotic interval generally assumes a relatively normal distribution of the population of interest. In this scenario, we have not assumed normality for the population of urban sites' lead concentrations, but rather log-normality. Beyond that, the BCa interval method is the most robust in the face of skewness and bias. Its bias correction component utilizes the median statistic to counterbalance outliers, something that is certainly relevant to both our sample data and bootstrap distribution of E[X] estimates. 

Below are the histograms for both the sample data, and population mean bootstrap estimates, this time with the 90th percentile of the dataset marked. This allows us to see how great of a rightward skew is applicable to this particular analysis.

```{r, fig.height=5.5, fig.width=6.5}
hist(data$soil, 
     main = "Distribution of Concentrations",
     xlab = "Concentrations (mg/kg)",
     prob = TRUE,
     col = "lightblue"
)

 # Data point 90th percentile
    segments(
      x0 = quantile(data$soil,0.9),
      y0 = 0,
      x1 = quantile(data$soil,0.9),
      y1 = 1,
      col = "black",
      lwd = 4,
      lty = 3
    )

legend("topright",
       legend = "90th Percentile",
       col = "black",
       lwd = 4,
       lty = 3,
       box.lwd=3)

box(lwd=3)
```


```{r, fig.height=5.5, fig.width=6.5}
hist(Ex_sampling_distribution, 
     main = "Sampling Distribution of Population Mean Estimates",
     xlab = "Concentrations (mg/kg)",
     prob = TRUE,
     col = "lightblue"
)

 # Data point 90th percentile
    segments(
      x0 = quantile(Ex_sampling_distribution,0.9),
      y0 = 0,
      x1 = quantile(Ex_sampling_distribution,0.9),
      y1 = 1,
      col = "black",
      lwd = 4,
      lty = 3
    )

legend("topright",
       legend = "90th Percentile",
       col = "black",
       lwd = 4,
       lty = 3,
       box.lwd=3)    
  
box(lwd=3)
```
