Assignment Objectives

  • Understand the theoretical basis of Bootstrap sampling methods for approximating sampling distributions.

  • Assess the performance of Bootstrap sampling distributions against exact and asymptotic sampling distributions.

  • Implement Bootstrap sampling algorithm and construct sampling distributions using R.


Use of AI Tools

Policy on AI Tool Use: Students must adhere to the AI tool policy specified in the course syllabus. The direct copying of AI-generated content is strictly prohibited. All submitted work must reflect your own understanding; where external tools are consulted, content must be thoroughly rephrased and synthesized in your own words.

Code Inclusion Requirement: Any code included in your essay must be properly commented to explain the purpose and/or expected output of key code lines. Submitting AI-generated code without meaningful, student-added comments will not be accepted.


Asymptotic Distribution of Sample Variance

Assume that \(\{ x_1, x_2, \cdots, x_n \} \sim F(x)\) with \(\mu = E[X]\) and \(\sigma^2 = \text{var}(X)\). Denote

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 \]

If \(n\) is large,

\[ s^2 \overset{\text{approx.}}{\sim} N\left(\sigma^2, \frac{\mu_4-\sigma^4}{n} \right) \]

where \(\mu_4 = E[(X_i - \mu)^4]\) is the 4th central moment, which can be estimated by

\[ \hat{\mu}_4 = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^4. \]

Note: This describes the asymptotic convergence of the sample variance, following from the central limit theorem (CLT). The sample size required for this approximation to hold is situation-dependent.


Question 1: Asymptotic vs Bootstrap Sampling Distributions

Write an essay summarizing the concepts of Asymptotic and Bootstrap Sampling Distributions, along with their key applications. Your discussion should be grounded in your personal understanding of the material. Any external sources including AI tools consulted must be clearly cited.

Essay Prompt: Discuss the concepts of the bootstrap sampling plan, the bootstrap sampling distribution, and the asymptotic sampling distribution in the context of statistics (e.g., sample mean and variance) computed from an independent and identically distributed (i.i.d.) sample. Your discussion should:

  • Clearly outline the key assumptions required for each method.

  • Explain the practical application of each distribution.

  • Provide guidance on when and why one should be preferred over the other in statistical inference.

In statistical inference, understanding the sampling distribution of an estimator such as the sample mean or sample variance is fundamental. Two approaches to approximating sampling distributions are the asymptotic approach and the bootstrap approach. Although both describe how an estimator varies across repeated samples, they rest on different principles and assumptions.

The asymptotic sampling distribution is derived from probability theory, most notably the Central Limit Theorem (CLT). Suppose we observe an independent and identically distributed sample \(X_1, X_2, \dots, X_n\) from a population with finite mean and variance. The CLT states that as the sample size \(n\) becomes large, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the underlying population distribution. Specifically, the sample mean is approximately normally distributed with mean equal to the population mean and variance equal to the population variance divided by \(n\). The key assumptions for this result are independence, identical distribution, and finite variance; the analogous result for the sample variance also requires a finite fourth moment. The practical advantage of the asymptotic approach is its simplicity: once the standard error is estimated, inference can proceed using normal approximations. These approximations rely on large sample sizes and can perform poorly when the sample is relatively small or the data are highly skewed.

In contrast, the bootstrap is a computational method that approximates the sampling distribution directly from the observed data. Instead of relying on theoretical limit results, the bootstrap treats the empirical distribution of the sample as an estimate of the population distribution. From the original i.i.d. sample of size \(n\), many new samples of size \(n\) are drawn with replacement, and the statistic of interest (for example, the sample mean or variance) is computed for each resample. The distribution of these bootstrap replicates is the bootstrap sampling distribution. The primary assumptions underlying the bootstrap are that the observations are independent and identically distributed and that the original sample is representative of the population. No assumption of normality is required, and the bootstrap is particularly useful when the sampling distribution of a statistic is complicated, unknown, or difficult to derive. It is often advantageous for moderate sample sizes or skewed populations, although a very small sample may not represent the population well enough for resampling to help.

The choice between asymptotic and bootstrap methods depends on the context. When the sample size is large and the theoretical conditions are satisfied, asymptotic approximations are computationally efficient, interpretable, and simple to apply. They are often preferred for standard statistics such as the sample mean when the data are well behaved. However, when sample sizes are limited, distributions are non-normal, or the statistic is complex, the bootstrap often provides a more accurate approximation of the sampling distribution.

In summary, asymptotic sampling distributions rely on theoretical large-sample results and provide formula-based inference under clear assumptions. Bootstrap sampling distributions rely on resampling from the observed data to approximate the estimator’s variability with fewer distributional assumptions. Both approaches are grounded in the idea of repeated sampling, but they differ in whether that repetition is carried out by mathematical analysis or by computational simulation.

Reference List: Efron, B. “Bootstrap Methods: Another Look at the Jackknife.” The Annals of Statistics, vol. 7, no. 1, 1979, pp. 1-26, sites.stat.washington.edu/courses/stat527/s14/readings/ann_stat1979.pdf.


Question 2: Daily Coffee Sales (in mL) at Two Different Cafe Locations

This data set represents the volume of regular brewed coffee sold per day (in milliliters) at two different cafe locations over a period of 50 days.

2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 
3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 
8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 
8400, 3300, 4200, 4500, 4800, 4300, 8500

We are interested in finding the sampling distribution of sample means that will be used for various inferences about the underlying population mean.

  a) Based on the given data, can the Central Limit Theorem be used to derive the asymptotic sampling distribution of the sample mean? Justify your answer.
# Data
coffee <- c(
  2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050,
  4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200,
  4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600,
  4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100,
  3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300,
  4200, 4500, 4800, 4300, 8500
)

n <- length(coffee)

# Sample mean and its asymptotic (CLT-based) standard error
sample_mean <- mean(coffee)
sample_sd   <- sd(coffee)
se_mean     <- sample_sd / sqrt(n)   # s / sqrt(n)
se_asymp    <- se_mean               # same quantity; this name is used in the overlay plot

sample_mean
[1] 5250
se_mean
[1] 302.4396

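Yes. The sample contains n = 55 observations, which is generally large enough for the Central Limit Theorem to apply, provided the daily values can be treated as i.i.d. with finite variance. The data are clearly bimodal (one cluster of roughly 2,850 to 4,800 mL and another of roughly 7,700 to 8,900 mL), but the CLT does not require the population itself to be normal, so the sampling distribution of the mean is approximately Normal with mean about 5250 and standard error about 302.4.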

  b) Apply the bootstrap method to estimate the sampling distribution (often called the bootstrap sampling distribution) of the sample mean. Generate a kernel density estimate from the bootstrap sample means and plot it. Then, use this bootstrap distribution to validate your conclusion from part (a). Make sure your visuals are effective in enhancing the presentation of these results.
# Bootstrap Procedure


set.seed(123)
B <- 10000   # number of bootstrap samples

boot_means <- replicate(B, mean(sample(coffee, n, replace = TRUE)))

boot_se_mean <- sd(boot_means)
boot_se_mean
[1] 297.3615
# Plot Bootstrap Sampling Distribution


plot(density(boot_means),
     main = "Bootstrap Sampling Distribution of the Sample Mean",
     xlab = "Bootstrap Sample Means",
     lwd = 2)

# Add vertical line for observed sample mean
abline(v = sample_mean, col = "red", lwd = 2, lty = 2)

# Overlay asymptotic normal approximation
curve(dnorm(x, mean = sample_mean, sd = se_asymp),
      lwd = 2,
      lty = 3,
      add = TRUE)

legend("topright",
       legend = c("Bootstrap KDE",
                  "Observed Sample Mean",
                  "Asymptotic Normal Approximation"),
       col = c("black", "red", "black"),   # match the colors of the drawn lines
       lwd = c(2, 2, 2),
       lty = c(1, 2, 3))

  c) Repeat the analysis in parts (a) and (b) for the sample variance.
# Sample Variance


sample_var <- var(coffee)
sample_var
[1] 5030833
# Bootstrap for Variance


boot_vars <- replicate(B, var(sample(coffee, n, replace = TRUE)))

boot_se_var <- sd(boot_vars)
boot_se_var
[1] 529733.8
plot(density(boot_vars),
     main = "Bootstrap Sampling Distribution of the Variance",
     xlab = "Bootstrap Sample Variances",
     lwd = 2)

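# Add vertical line for the observed sample variance (dashed, to match the legend)
abline(v = sample_var, col = "red", lwd = 2, lty = 2)

# Asymptotic normal approximation: Var(s^2) is about (mu_4 - sigma^4) / n,
# with mu_4 estimated by the 4th central sample moment. The normal-case
# shortcut 2 * sigma^4 / (n - 1) is avoided because these data are not normal.
mu4_hat      <- mean((coffee - mean(coffee))^4)
asymp_sd_var <- sqrt((mu4_hat - sample_var^2) / n)

curve(dnorm(x, mean = sample_var, sd = asymp_sd_var),
      lwd = 2,
      lty = 3,
      add = TRUE)

legend("topright",
       legend = c("Bootstrap KDE",
                  "Observed Sample Variance",
                  "Asymptotic Normal"),
       col = c("black", "red", "black"),   # match the colors of the drawn lines
       lwd = c(2, 2, 2),
       lty = c(1, 2, 3))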

---
title: "Assignment 3: ECDF and Bootstrap Sampling and Applications"
author: "Kieran Hefferan "
date: " Due: 2/17/26 "
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:no-repeat;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}
####
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, your code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
```
 
 \
 
## **Assignment Objectives** 

* Understand the theoretical basis of Bootstrap sampling methods for approximating sampling distributions.

* Assess the performance of Bootstrap sampling distributions against exact and asymptotic sampling distributions.

* Implement Bootstrap sampling algorithm and construct sampling distributions using R.

\

**Use of AI Tools**

**Policy on AI Tool Use**: Students must adhere to the AI tool policy specified in the course syllabus. The direct copying of AI-generated content is strictly prohibited. All submitted work must reflect your own understanding; where external tools are consulted, content must be thoroughly rephrased and synthesized in your own words.

**Code Inclusion Requirement**: Any code included in your essay must be properly commented to explain the purpose and/or expected output of key code lines. Submitting AI-generated code without meaningful, student-added comments will not be accepted.

\

**Asymptotic Distribution of Sample Variance**

Assume that $\{ x_1, x_2, \cdots, x_n \} \sim F(x)$ with $\mu = E[X]$ and $\sigma^2 = \text{var}(X)$. Denote 

$$
s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2
$$

If $n$ is large, 

$$
s^2 \overset{\text{approx.}}{\sim} N\left(\sigma^2,  \frac{\mu_4-\sigma^4}{n} \right)
$$

where $\mu_4 = E[(X_i - \mu)^4]$ is the 4th central moment, which can be estimated by

$$
\hat{\mu}_4 = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^4.
$$

**Note**: This describes the asymptotic convergence of the sample variance, following from the central limit theorem (CLT). The sample size required for this approximation to hold is situation-dependent.
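
The formula above can be checked by simulation. The chunk below is a minimal sketch, assuming an Exponential(1) population chosen only for illustration (for that distribution $\sigma^2 = 1$ and $\mu_4 = 9$); the names `n_sim`, `reps_sim`, and `s2_sim` are hypothetical and not part of the assignment.

```{r}
# Monte Carlo check of the asymptotic variance formula for s^2.
# Assumption for illustration: X ~ Exponential(1), for which
# sigma^2 = 1 and mu_4 = 9, so Var(s^2) is roughly (9 - 1) / n.
set.seed(2026)
n_sim    <- 200    # size of each simulated sample (hypothetical choice)
reps_sim <- 4000   # number of simulated samples (hypothetical choice)

# Sample variance of each simulated sample
s2_sim <- replicate(reps_sim, var(rexp(n_sim, rate = 1)))

# Standard deviation predicted by the asymptotic formula
sd_theory <- sqrt((9 - 1^2) / n_sim)

# Empirical standard deviation of the simulated sample variances
sd_empirical <- sd(s2_sim)

c(theory = sd_theory, simulation = sd_empirical)
```

If the two numbers are close, the normal approximation is describing the spread of $s^2$ well at this sample size; heavier-tailed populations would typically need a larger $n$.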


\

## **Question 1: Asymptotic vs Bootstrap Sampling Distributions**

Write an essay summarizing the concepts of Asymptotic and Bootstrap Sampling Distributions, along with their key applications. Your discussion should be grounded in your personal understanding of the material. Any external sources including AI tools consulted must be clearly cited. 


**Essay Prompt**: Discuss the concepts of the bootstrap sampling plan, the bootstrap sampling distribution, and the asymptotic sampling distribution in the context of statistics (e.g., sample mean and variance) computed from an independent and identically distributed (i.i.d.) sample. Your discussion should:

* Clearly outline the key assumptions required for each method.

* Explain the practical application of each distribution.

* Provide guidance on when and why one should be preferred over the other in statistical inference.


**In statistical inference, understanding the sampling distribution of an estimator such as the sample mean or sample variance is fundamental. Two approaches to approximating sampling distributions are the asymptotic approach and the bootstrap approach. Although both describe how an estimator varies across repeated samples, they rest on different principles and assumptions.**

**The asymptotic sampling distribution is derived from probability theory, most notably the Central Limit Theorem (CLT). Suppose we observe an independent and identically distributed sample $X_1, X_2, \dots, X_n$ from a population with finite mean and variance. The CLT states that as the sample size $n$ becomes large, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the underlying population distribution. Specifically, the sample mean is approximately normally distributed with mean equal to the population mean and variance equal to the population variance divided by $n$. The key assumptions for this result are independence, identical distribution, and finite variance; the analogous result for the sample variance also requires a finite fourth moment. The practical advantage of the asymptotic approach is its simplicity: once the standard error is estimated, inference can proceed using normal approximations. These approximations rely on large sample sizes and can perform poorly when the sample is relatively small or the data are highly skewed.**
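
The CLT statement in this paragraph can be illustrated with a short simulation. This is only a sketch: the Exponential(1) population (mean 1, variance 1) and the names `n_clt`, `reps_clt`, and `xbar_sim` are assumptions made for the example.

```{r}
# Simulation sketch of the CLT for the sample mean.
# Assumption for illustration: X ~ Exponential(1), a right-skewed
# population with mean 1 and variance 1.
set.seed(101)
n_clt    <- 50     # sample size (hypothetical choice)
reps_clt <- 5000   # number of simulated samples (hypothetical choice)

# Sample mean of each simulated sample
xbar_sim <- replicate(reps_clt, mean(rexp(n_clt, rate = 1)))

# CLT prediction: mean 1 and standard error sqrt(variance / n)
se_clt <- sqrt(1 / n_clt)
c(mean_sim = mean(xbar_sim), se_sim = sd(xbar_sim), se_clt = se_clt)

# Overlay the simulated density (solid) and the CLT normal curve (dotted)
plot(density(xbar_sim),
     main = "Simulated Sample Means vs. CLT Normal Curve",
     xlab = "Sample mean", lwd = 2)
curve(dnorm(x, mean = 1, sd = se_clt), add = TRUE, lty = 3, lwd = 2)
```

Any remaining right skew in the solid curve shows how far the finite-sample distribution still is from its normal limit at this $n$.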

**In contrast, the bootstrap is a computational method that approximates the sampling distribution directly from the observed data. Instead of relying on theoretical limit results, the bootstrap treats the empirical distribution of the sample as an estimate of the population distribution. From the original i.i.d. sample of size $n$, many new samples of size $n$ are drawn with replacement, and the statistic of interest (for example, the sample mean or variance) is computed for each resample. The distribution of these bootstrap replicates is the bootstrap sampling distribution. The primary assumptions underlying the bootstrap are that the observations are independent and identically distributed and that the original sample is representative of the population. No assumption of normality is required, and the bootstrap is particularly useful when the sampling distribution of a statistic is complicated, unknown, or difficult to derive. It is often advantageous for moderate sample sizes or skewed populations, although a very small sample may not represent the population well enough for resampling to help.**
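
The resampling plan described in this paragraph can be sketched as a small helper. The function `bootstrap_dist()` and the simulated sample `x_demo` are hypothetical names introduced for illustration; only base R functions are used.

```{r}
# Generic nonparametric bootstrap (hypothetical helper).
# x    : numeric vector holding the observed sample
# stat : function that maps a sample to a single number
# B    : number of bootstrap resamples
bootstrap_dist <- function(x, stat, B = 2000) {
  n_x <- length(x)
  # Each replicate: draw n_x values with replacement, then apply stat
  replicate(B, stat(sample(x, size = n_x, replace = TRUE)))
}

set.seed(202)
x_demo <- rgamma(40, shape = 2, rate = 1)   # simulated, right-skewed sample

boot_mean_demo <- bootstrap_dist(x_demo, mean)
boot_var_demo  <- bootstrap_dist(x_demo, var)

# Bootstrap standard errors are the standard deviations of the replicates
c(se_mean = sd(boot_mean_demo), se_var = sd(boot_var_demo))
```

Passing the statistic as a function keeps the resampling step identical for the mean, the variance, or any other scalar summary.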

**The choice between asymptotic and bootstrap methods depends on the context. When the sample size is large and the theoretical conditions are satisfied, asymptotic approximations are computationally efficient, interpretable, and simple to apply. They are often preferred for standard statistics such as the sample mean when the data are well behaved. However, when sample sizes are limited, distributions are non-normal, or the statistic is complex, the bootstrap often provides a more accurate approximation of the sampling distribution.**
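
One way to see the practical difference is to build a 95% interval for the mean both ways on the same sample. The sketch below assumes a simulated Exponential(1) sample of size 30 and uses the percentile method for the bootstrap interval; the names `x_ci`, `ci_normal`, and `ci_boot` are hypothetical.

```{r}
set.seed(303)
x_ci <- rexp(30, rate = 1)   # simulated, right-skewed sample (illustration only)
n_ci <- length(x_ci)

# Normal-theory (asymptotic) 95% interval: mean +/- z * s / sqrt(n)
ci_normal <- mean(x_ci) + c(-1, 1) * qnorm(0.975) * sd(x_ci) / sqrt(n_ci)

# Percentile bootstrap 95% interval: 2.5% and 97.5% quantiles of the replicates
boot_ci_means <- replicate(5000, mean(sample(x_ci, n_ci, replace = TRUE)))
ci_boot <- unname(quantile(boot_ci_means, c(0.025, 0.975)))

# Rows: method; columns: lower and upper limits
rbind(normal = ci_normal, bootstrap = ci_boot)
```

The normal interval is symmetric around the sample mean by construction, while the percentile interval can be asymmetric when the replicates are skewed.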

**In summary, asymptotic sampling distributions rely on theoretical large-sample results and provide formula-based inference under clear assumptions. Bootstrap sampling distributions rely on resampling from the observed data to approximate the estimator’s variability with fewer distributional assumptions. Both approaches are grounded in the idea of repeated sampling, but they differ in whether that repetition is carried out by mathematical analysis or by computational simulation.**

**Reference List:**
Efron, B. “Bootstrap Methods: Another Look at the Jackknife.” *The Annals of Statistics*, vol. 7, no. 1, 1979, pp. 1-26, sites.stat.washington.edu/courses/stat527/s14/readings/ann_stat1979.pdf.

\

## **Question 2: Daily Coffee Sales (in mL) at Two Different Cafe Locations**

This data set represents the volume of regular brewed coffee sold per day (in milliliters) at two different cafe locations over a period of 50 days.

```
2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 
3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 
8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 
8400, 3300, 4200, 4500, 4800, 4300, 8500
```
We are interested in finding the sampling distribution of sample means that will be used for various inferences about the underlying population mean.

a) Based on the given data, can the Central Limit Theorem be used to derive the asymptotic sampling distribution of the sample mean? Justify your answer.

```{r}
# Data
coffee <- c(
  2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050,
  4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200,
  4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600,
  4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100,
  3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300,
  4200, 4500, 4800, 4300, 8500
)

n <- length(coffee)

# Sample Mean and Standard Error
sample_mean <- mean(coffee)
sample_sd   <- sd(coffee)
se_mean     <- sample_sd / sqrt(n)   # CLT-based standard error: s / sqrt(n)
se_asymp    <- se_mean               # same quantity; this name is used in the overlay plot below

sample_mean
se_mean
```
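**Yes. The sample contains n = 55 observations, which is generally large enough for the Central Limit Theorem to apply, provided the daily values can be treated as i.i.d. with finite variance. The data are clearly bimodal (one cluster of roughly 2,850 to 4,800 mL and another of roughly 7,700 to 8,900 mL), but the CLT does not require the population itself to be normal, so the sampling distribution of the mean is approximately Normal with mean about 5250 and standard error about 302.4.**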

b) Apply the bootstrap method to estimate the sampling distribution (often called the bootstrap sampling distribution) of the sample mean. Generate a kernel density estimate from the bootstrap sample means and plot it. Then, use this bootstrap distribution to validate your conclusion from part (a). Make sure your visuals are effective in enhancing the presentation of these results.

```{r}

# Bootstrap Procedure


set.seed(123)
B <- 10000   # number of bootstrap samples

boot_means <- replicate(B, mean(sample(coffee, n, replace = TRUE)))

boot_se_mean <- sd(boot_means)
boot_se_mean


# Plot Bootstrap Sampling Distribution


plot(density(boot_means),
     main = "Bootstrap Sampling Distribution of the Sample Mean",
     xlab = "Bootstrap Sample Means",
     lwd = 2)

# Add vertical line for observed sample mean
abline(v = sample_mean, col = "red", lwd = 2, lty = 2)

# Overlay asymptotic normal approximation
curve(dnorm(x, mean = sample_mean, sd = se_asymp),
      lwd = 2,
      lty = 3,
      add = TRUE)

legend("topright",
       legend = c("Bootstrap KDE",
                  "Observed Sample Mean",
                  "Asymptotic Normal Approximation"),
       col = c("black", "red", "black"),   # match the colors of the drawn lines
       lwd = c(2, 2, 2),
       lty = c(1, 2, 3))

```
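
Beyond the density overlay, normality of the bootstrap replicates can be checked by comparing quantiles directly. The chunk below is a sketch: it uses `boot_means` when that object already exists and otherwise falls back to a deterministic, mildly skewed stand-in vector, so that it runs on its own. The names `reps_chk`, `probs_chk`, and `quantile_table` are hypothetical. No random numbers are drawn, so later results are unaffected.

```{r}
# Use the bootstrap means from the chunk above when available; otherwise use
# a deterministic stand-in (quantiles of a gamma distribution) so this runs alone.
reps_chk <- if (exists("boot_means")) boot_means else qgamma(ppoints(1000), shape = 20)

# Compare empirical quantiles with those of a normal distribution
# that has the same mean and standard deviation
probs_chk <- c(0.025, 0.25, 0.5, 0.75, 0.975)
quantile_table <- data.frame(
  prob      = probs_chk,
  empirical = unname(quantile(reps_chk, probs_chk)),
  normal    = qnorm(probs_chk, mean = mean(reps_chk), sd = sd(reps_chk))
)
quantile_table

# Normal Q-Q plot: points close to the red line indicate approximate normality
qqnorm(reps_chk, main = "Normal Q-Q Plot of Replicates")
qqline(reps_chk, col = "red", lwd = 2)
```

Close agreement between the `empirical` and `normal` columns, together with a nearly straight Q-Q plot, would support the CLT-based conclusion from part (a).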

c) Repeat the analysis in parts (a) and (b) for the sample variance.

```{r}

# Sample Variance


sample_var <- var(coffee)
sample_var


# Bootstrap for Variance


boot_vars <- replicate(B, var(sample(coffee, n, replace = TRUE)))

boot_se_var <- sd(boot_vars)
boot_se_var


plot(density(boot_vars),
     main = "Bootstrap Sampling Distribution of the Variance",
     xlab = "Bootstrap Sample Variances",
     lwd = 2)

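# Add vertical line for the observed sample variance (dashed, to match the legend)
abline(v = sample_var, col = "red", lwd = 2, lty = 2)

# Asymptotic normal approximation: Var(s^2) is about (mu_4 - sigma^4) / n,
# with mu_4 estimated by the 4th central sample moment. The normal-case
# shortcut 2 * sigma^4 / (n - 1) is avoided because these data are not normal.
mu4_hat      <- mean((coffee - mean(coffee))^4)
asymp_sd_var <- sqrt((mu4_hat - sample_var^2) / n)

curve(dnorm(x, mean = sample_var, sd = asymp_sd_var),
      lwd = 2,
      lty = 3,
      add = TRUE)

legend("topright",
       legend = c("Bootstrap KDE",
                  "Observed Sample Variance",
                  "Asymptotic Normal"),
       col = c("black", "red", "black"),   # match the colors of the drawn lines
       lwd = c(2, 2, 2),
       lty = c(1, 2, 3))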

```
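
The asymptotic standard error of the sample variance depends on the fourth central moment, so the normal-theory shortcut $\sqrt{2s^4/(n-1)}$ is only appropriate when the data are roughly normal. The sketch below compares three standard errors on a simulated two-cluster sample (not the coffee data); `var_se_compare()` and `x_two` are hypothetical names introduced for illustration.

```{r}
# Compare three standard errors for the sample variance (hypothetical helper).
# x : numeric sample; B : number of bootstrap resamples
var_se_compare <- function(x, B = 5000) {
  n_x <- length(x)
  s2  <- var(x)
  mu4 <- mean((x - mean(x))^4)   # estimated 4th central moment
  c(
    # sqrt((mu4 - sigma^4) / n); truncated at 0 in case the estimate is negative
    moment_based  = sqrt(max(mu4 - s2^2, 0) / n_x),
    # shortcut that is valid only for normally distributed data
    normal_theory = sqrt(2 * s2^2 / (n_x - 1)),
    # standard deviation of bootstrap replicates of the variance
    bootstrap     = sd(replicate(B, var(sample(x, n_x, replace = TRUE))))
  )
}

set.seed(404)
# Simulated two-cluster sample, loosely mimicking two locations (illustration only)
x_two  <- c(rnorm(38, mean = 3700, sd = 550), rnorm(17, mean = 8300, sd = 350))
se_two <- var_se_compare(x_two)
se_two
```

For strongly bimodal data the kurtosis is well below 3, so the normal-theory value tends to overstate the spread, while the moment-based and bootstrap values should be of similar size. In this document the same helper could be applied as `var_se_compare(coffee)`.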
