Assignment Objectives

  • Understand the theoretical basis of Bootstrap sampling methods for approximating sampling distributions.

  • Assess the performance of Bootstrap sampling distributions against exact and asymptotic sampling distributions.

  • Implement Bootstrap sampling algorithm and construct sampling distributions using R.


Use of AI Tools

Policy on AI Tool Use: Students must adhere to the AI tool policy specified in the course syllabus. The direct copying of AI-generated content is strictly prohibited. All submitted work must reflect your own understanding; where external tools are consulted, content must be thoroughly rephrased and synthesized in your own words.

Code Inclusion Requirement: Any code included in your essay must be properly commented to explain the purpose and/or expected output of key code lines. Submitting AI-generated code without meaningful, student-added comments will not be accepted.


Asymptotic Distribution of Sample Variance

Assume that \(\{ x_1, x_2, \cdots, x_n \}\) is an i.i.d. sample from a distribution \(F(x)\) with \(\mu = E[X]\) and \(\sigma^2 = \text{var}(X)\). Denote

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i. \]

If \(n\) is large,

\[ s^2 \to N\left(\sigma^2, \frac{\mu_4-\sigma^4}{n} \right) \]

where \(\mu_4 = E[(X_i - \mu)^4]\) is the 4th central moment, which can be estimated by

\[ \hat{\mu}_4 = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^4. \]

Note: This describes the asymptotic convergence of the sample variance, following from the central limit theorem (CLT). The sample size required for this approximation to hold is situation-dependent.
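As a minimal sketch of how the plug-in version of this result could be computed in R, the function below estimates the asymptotic standard error of \(s^2\) from a single sample. The helper name `asym_var_se`, the seed, and the simulated exponential sample are illustrative assumptions and are not part of the assignment.

```r
# Hypothetical helper: plug-in estimate of the asymptotic standard error of s^2
asym_var_se <- function(x) {
  n <- length(x)
  s2 <- var(x)                      # sample variance (divisor n - 1)
  mu4_hat <- mean((x - mean(x))^4)  # estimated 4th central moment (divisor n)
  sqrt((mu4_hat - s2^2) / n)        # square root of (mu4 - sigma^4) / n
}

set.seed(1)                         # illustrative seed
x <- rexp(200, rate = 1)            # illustrative i.i.d. sample
se_hat <- asym_var_se(x)
c(estimate = var(x), std.error = se_hat)
```

For very small samples from light-tailed populations the plug-in quantity under the square root can be negative, which is one sign that the approximation should not be trusted at that sample size.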


Question 1: Asymptotic vs Bootstrap Sampling Distributions

Write an essay summarizing the concepts of Asymptotic and Bootstrap Sampling Distributions, along with their key applications. Your discussion should be grounded in your personal understanding of the material. Any external sources including AI tools consulted must be clearly cited.

Essay Prompt: Discuss the concepts of the bootstrap sampling plan, the bootstrap sampling distribution, and the asymptotic sampling distribution in the context of statistics (e.g., sample mean and variance) computed from an independent and identically distributed (i.i.d.) sample. Your discussion should:

  • Clearly outline the key assumptions required for each method.

  • Explain the practical application of each distribution.

  • Provide guidance on when and why one should be preferred over the other in statistical inference.

Essay on Bootstrap vs. Asymptotic Sampling Distribution

The bootstrap sampling distribution and the asymptotic sampling distribution are involved in two contrasting methods of approximating the sampling distribution for a specific statistic (mean, variance, etc.). For ease of understanding, let’s consider an example to walk through the assumptions underpinning each method, the practical application of these distributions, and finally conclude with a discussion of when these distributions should be used.

Suppose we have a six-sided die, and we are hoping to approximate the sampling distribution of the sample means of this six-sided die. Now, we may know intuitively that the distribution of a die would follow a Discrete Uniform Distribution between the values of 1 and 6. That is to say, if we were to roll a six-sided die we would expect to see some value between 1 and 6 with an equal probability of the die landing on each of these values. However, suppose we wanted to take a sample of these values and use that sample to find a sample mean. And then suppose that we took this sample again and again, finding the sample means for multiple different samples. Would we expect this distribution of sample means to follow a uniform distribution? Below is a histogram of 1000 sample means, each taken from a simulated sample of size 30:

set.seed(123)  # Set seed so that the result can be replicated
n <- 30        # Number of rolls in each sample
samples <- 1000  # Number of simulated samples

# Draw 'samples' samples of 'n' fair die rolls (discrete uniform on 1..6) and store each sample mean
die <- replicate(samples, mean(sample(1:6, size=n, replace=TRUE)))

# Histogram of the simulated sample means
hist(die, breaks=20, prob=TRUE, main="Histogram of Sample Means of 6-sided Dice", xlab="Sample Means", xlim=c(2,5))

This result doesn’t appear uniform at all! This leads us to the question of how we should go about approximating the distribution of sample means (or any statistic, for that matter). To answer that question we will consider two methods, one involving the asymptotic sampling distribution and the other involving the bootstrap sampling distribution.

The asymptotic sampling distribution relies on the idea that, as the sample size (\(n\)) approaches infinity, the distribution of a statistic converges to a known limiting distribution. Most commonly, the asymptotic sampling distribution is discussed in relation to the Central Limit Theorem (CLT), under which the limiting distribution of many statistics, once suitably centered and scaled, is normal. However, before using an asymptotic sampling distribution we must first ensure that our observations are independent and identically distributed and that our \(n\) is sufficiently large. In order to use the CLT in particular, we must also know that our samples come from a population with a finite mean \(\mu\) and variance \(\sigma^2\). In our example, we know that our observations are independent (one roll of the die does not affect the other) and identically distributed (our results are coming from the same population). Furthermore, we know that there is some finite \(\mu\) and \(\sigma^2\) since our observations can only be between 1 and 6. The only remaining question as to whether we can use the Central Limit Theorem and the associated normal asymptotic sampling distribution is the sample size.

In our previous example we used a sample size of 30. In most cases this is considered an appropriate sample size for using the CLT. However, to understand the importance of a large sample size, let’s instead imagine that we only used a sample size of 2:

set.seed(2) # Set seed so that the result can be replicated
n <- 2
mu <- 3.5        # Mean of a fair six-sided die
sigma <- 1.7078  # Standard deviation of a fair six-sided die, sqrt(35/12)
samples <- 100

# Draw 'samples' samples of 'n' fair die rolls (discrete uniform on 1..6) and store each sample mean
die <- replicate(samples, mean(sample(1:6, size=n, replace=TRUE)))

# Grid and density values for the CLT normal curve with mean 'mu' and standard deviation 'sigma'/sqrt(n)
x <- seq(mu - 3*sigma/sqrt(n), mu + 3*sigma/sqrt(n), length.out = 100)
nc <- dnorm(x, mu, sd=sigma/sqrt(n))

# Histogram of the sample means with the normal curve overlaid in red
hist(die, breaks=seq(min(die), max(die), length.out=20), prob=TRUE, main="Histogram of Sample Means of 6-sided Dice (n = 2)", xlab="Sample Means", xlim=c(1,6))
lines(x, nc, col="red", lwd=2)

The corresponding normal curve is overlaid on our histogram. However, we can see that our resulting sample means don’t appear to closely follow the normal curve. On the other hand, what if we were to choose a significantly higher \(n\) of 100:

set.seed(123) # Set seed so that the result can be replicated
n <- 100
mu <- 3.5        # Mean of a fair six-sided die
sigma <- 1.7078  # Standard deviation of a fair six-sided die, sqrt(35/12)
samples <- 100

# Draw 'samples' samples of 'n' fair die rolls (discrete uniform on 1..6) and store each sample mean
die <- replicate(samples, mean(sample(1:6, size=n, replace=TRUE)))

# Grid and density values for the CLT normal curve with mean 'mu' and standard deviation 'sigma'/sqrt(n)
x <- seq(mu - 3*sigma/sqrt(n), mu + 3*sigma/sqrt(n), length.out = 100)
nc <- dnorm(x, mu, sd=sigma/sqrt(n))

# Histogram of the sample means with the normal curve overlaid in red
hist(die, breaks=seq(min(die), max(die), length.out=7), prob=TRUE, main="Histogram of Sample Means of 6-sided Dice (n = 100)", xlab="Sample Means", xlim=c(2,5))
lines(x, nc, col="red", lwd=2)

Here we can see our histogram much more closely follows the corresponding normal curve. This is a visual example of why our assumption of a large \(n\) is fundamental to using an asymptotic sampling distribution. This also serves to demonstrate how an asymptotic sampling distribution could be used, specifically with regard to the CLT. In practice, we wouldn’t necessarily have the ability to repeatedly sample our population. However, by using the CLT (as long as the associated assumptions are met) we can reasonably assume that our sample means approximately follow the distribution below:

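\[ \bar{X} \to N\left(\mu, \frac{\sigma^2}{n} \right) \]

where \(\bar{X}\) is the sample mean and the second parameter is the variance, so the standard deviation of \(\bar{X}\) is \(\sigma/\sqrt{n}\). A similar assumption can be made regarding the distribution of the sample variance using the CLT: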

\[ s^2 \to N\left(\sigma^2, \frac{\mu_4-\sigma^4}{n} \right) \]

Where \(s^2\) is the sample variance and \(\mu_4\) is the 4th central moment.
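To make the two approximations concrete for the die example, the sketch below computes the exact population moments of a fair six-sided die and plugs them into both formulas. The sample size `n <- 100` and the variable names are illustrative choices.

```r
faces <- 1:6                          # equally likely outcomes of a fair die
mu <- mean(faces)                     # population mean, 3.5
sigma2 <- mean((faces - mu)^2)        # population variance, 35/12
mu4 <- mean((faces - mu)^4)           # population 4th central moment

n <- 100                              # illustrative sample size
se_mean <- sqrt(sigma2 / n)           # asymptotic standard deviation of the sample mean
se_var <- sqrt((mu4 - sigma2^2) / n)  # asymptotic standard deviation of the sample variance

round(c(mu = mu, sigma = sqrt(sigma2), se_mean = se_mean, se_var = se_var), 4)
```

The value of `sqrt(sigma2)` is the 1.7078 used as `sigma` in the simulations above.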

In contrast to the asymptotic sampling distribution, the bootstrap sampling distribution relies on a fundamentally different method of approximating the sampling distribution of a statistic. Instead of relying on the idea that the true distribution of a sample statistic approaches a known form as the sample size grows, the bootstrap sampling distribution is found by resampling the existing sample in order to estimate the true distribution of a sample statistic.

Much like the asymptotic sampling distribution, there are several assumptions that must be met in order for our bootstrap sampling distribution to accurately approximate the true sample statistic distribution. It assumes that our observations are independent and identically distributed. Furthermore, it also has some requirements regarding the size of the sample, given that it is harder to treat the sample as an accurate substitute for the population with a smaller sample size. However, these requirements are less stringent (especially when the population is heavily skewed) than with the Central Limit Theorem. Once again going back to our six-sided die example, we already know our observations are independent and identically distributed. In the following example we will also use a sample size of 50 in order to ensure our bootstrap sample distribution is an accurate approximation.

In order to perform bootstrap sampling and obtain our bootstrap distribution we will repeatedly randomly sample, with replacement, from an already existing sample. Below we will simulate a random sample of 50 rolls of a six-sided die and then draw 1000 resamples of size 50 from this sample. We will then use the resulting sample means to form the bootstrap distribution. The results of our analysis can be seen below:

set.seed(123) # Set seed so that the result can be replicated
n <- 50
die2 <- sample(1:6, size=n, replace=TRUE) # One observed sample of 'n' fair die rolls (discrete uniform on 1..6)
B <- 1000 # Number of bootstrap resamples

bootstrap.means <- numeric(B) # Preallocates storage for the bootstrap sample means

# Resample 'die2' with replacement 'B' times and store the mean of each resample
for (i in 1:B){
  boot.sample <- sample(die2, size=n, replace=TRUE)
  bootstrap.means[i] <- mean(boot.sample)
}

kde.die <- density(bootstrap.means) # Kernel density estimate of the bootstrap sample means

# Histogram of the bootstrap sampling distribution with the KDE overlaid in red
hist(bootstrap.means, prob=TRUE, main="Bootstrap Sampling Distribution of 6-sided Die", xlab="Sample Means", xlim=c(2,5))
lines(kde.die, col="red", lwd=2)

As stated, we randomly sampled our initial simulated sample of 50 observations, 1000 times. This allows us to form the Bootstrap Sampling Distribution of our 6-sided die based on our one sample.
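One practical use of a bootstrap sampling distribution is reading a standard error and a percentile confidence interval directly from the resampled statistics. The following is a minimal sketch under the same setup (50 simulated die rolls, 1000 resamples). The names `die_rolls`, `boot_means`, `boot_se`, and `boot_ci` are illustrative, and the percentile interval is only one of several bootstrap interval methods.

```r
set.seed(123)                                       # illustrative seed
n <- 50
die_rolls <- sample(1:6, size = n, replace = TRUE)  # one observed sample of fair die rolls
B <- 1000                                           # number of bootstrap resamples

# Resample the observed rolls with replacement and record the mean of each resample
boot_means <- replicate(B, mean(sample(die_rolls, size = n, replace = TRUE)))

boot_se <- sd(boot_means)                           # bootstrap standard error of the sample mean
boot_ci <- quantile(boot_means, c(0.025, 0.975))    # 95% percentile interval for the mean

boot_se
boot_ci
```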

Finally, it’s worth considering, given the similarities between the assumptions of these two distributions, why we would choose to use one over the other. The asymptotic sampling distribution most accurately models symmetric, non-skewed data. While it can be used to model skewed data, it often requires a larger sample size to achieve an accurate result. However, it is not as computationally intensive to use as the bootstrap sampling distribution, since it only requires plugging estimates into a formula. Conversely, the bootstrap sampling distribution can be more accurate than the asymptotic sampling distribution when applied to heavily skewed data and complex or unknown distributions. However, it is more computationally demanding, since the statistic must be recomputed on each of many resamples. Ultimately, it is important to know the limitations and applications of both in order to most accurately model the sampling distributions of sample statistics.

Sources:

https://pengdsci.github.io/STA506/w03/03-SamplingDistributions.html

https://pengdsci.github.io/STA506/w04/04-ECDandBootstrapSampling.html

Question 2: Daily Coffee Sales (in mL) at Two Different Cafe Locations

This data set represents the volume of regular brewed coffee sold per day (in milliliters) at two different cafe locations over a period of 50 days.

2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 
3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 
8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 
8400, 3300, 4200, 4500, 4800, 4300, 8500

We are interested in finding the sampling distribution of sample means that will be used for various inferences about the underlying population mean.

  (a) Based on the given data, can the Central Limit Theorem be used to derive the asymptotic sampling distribution of the sample mean? Justify your answer.

Answer to Part A

In order to use the Central Limit Theorem to derive the asymptotic sampling distribution of the sample mean we need independent, identically distributed observations from a population with a finite \(\mu\) and \(\sigma^2\) and enough observations that we can justify using the Central Limit Theorem. The observations are from a single population of cafes so we can assume the observations are identically distributed. There is a finite amount of coffee that can be brewed per day so we can assume that there is a finite \(\mu\) and \(\sigma^2\). Finally, there appear to be enough observations to use the Central Limit Theorem, given that the 55 values listed exceed the typical CLT rule-of-thumb cutoff of 30. However, the question does not provide information on whether or not the observations were part of a random sample so we do not have enough evidence to state that the observations are independent of one another. That being said, if the observations are from a random sample then we should meet the conditions to use the CLT.
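If the CLT conditions are accepted, the resulting normal approximation for the sample mean could be computed as in the sketch below. The names `se_clt` and `ci_clt` are illustrative, and the interval is only as trustworthy as the independence assumption discussed above.

```r
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200,
            3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400,
            8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600,
            8400, 3300, 4200, 4500, 4800, 4300, 8500)

n <- length(coffee)                  # number of daily observations listed
xbar <- mean(coffee)                 # sample mean, the center of the normal approximation
se_clt <- sd(coffee) / sqrt(n)       # estimated standard deviation of the sample mean

# Approximate 95% confidence interval for the population mean based on the CLT
ci_clt <- xbar + c(-1, 1) * qnorm(0.975) * se_clt

c(n = n, mean = xbar, se = se_clt)
ci_clt
```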

  (b) Apply the bootstrap method to estimate the sampling distribution (often called the bootstrap sampling distribution) of the sample mean. Generate a kernel density estimate from the bootstrap sample means and plot it. Then, use this bootstrap distribution to validate your conclusion from part (a). Make sure your visuals are effective in enhancing the presentation of these results.

Answer to Part B

set.seed(123) #Used to set seed so that result can be replicated
coffee <-  c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 
3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 
8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 
8400, 3300, 4200, 4500, 4800, 4300, 8500)

n <- length(coffee)
B <- 1000

bootstrap.means <- numeric(B) # Preallocates storage for the bootstrap sample means

# Resample 'coffee' with replacement 'B' times and store the mean of each resample
for (i in 1:B){
  boot.sample <- sample(coffee, size=n, replace=TRUE)
  bootstrap.means[i] <- mean(boot.sample)
}

kde.coffee <- density(bootstrap.means) # Makes KDE of bootstrap sample means

# Grid and density values for a normal curve matched to the mean and standard deviation of the bootstrap means
x <- seq(mean(bootstrap.means) - 3*sd(bootstrap.means), mean(bootstrap.means) + 3*sd(bootstrap.means), length.out = 100)
nc <- dnorm(x, mean(bootstrap.means), sd=sd(bootstrap.means))

# Plot the KDE (blue) and overlay the matched normal curve (red)
plot(kde.coffee, main = "KDE from Bootstrap Sample Means", xlab = "Sample Means", col="blue")
lines(x, nc, col="red")

Above, in blue, is the KDE from the bootstrap sampling distribution of sample means obtained by using the bootstrap method. As we can see, it fairly closely follows the normal curve (in red), indicating that there is evidence that it would be appropriate to use the Central Limit Theorem.
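The visual comparison can be supplemented with a few numbers. The sketch below compares the bootstrap standard error with the CLT standard error \(s/\sqrt{n}\) and reports the skewness and excess kurtosis of the bootstrap means, both of which are near zero for a normal distribution. The choice of `B <- 2000` and all variable names are illustrative.

```r
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200,
            3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400,
            8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600,
            8400, 3300, 4200, 4500, 4800, 4300, 8500)

set.seed(123)                        # illustrative seed
n <- length(coffee)
B <- 2000                            # illustrative number of resamples

# Bootstrap sampling distribution of the sample mean
boot_means <- replicate(B, mean(sample(coffee, size = n, replace = TRUE)))

se_boot <- sd(boot_means)            # bootstrap standard error
se_clt <- sd(coffee) / sqrt(n)       # CLT standard error

# Standardize the bootstrap means, then compute sample skewness and excess kurtosis
z <- (boot_means - mean(boot_means)) / sd(boot_means)
skew <- mean(z^3)
excess_kurt <- mean(z^4) - 3

round(c(se_boot = se_boot, se_clt = se_clt, skewness = skew, excess_kurtosis = excess_kurt), 3)
```

A normal Q-Q plot, for example `qqnorm(boot_means); qqline(boot_means)`, gives a complementary visual check.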

  (c) Repeat the analysis in parts (a) and (b) for the sample variance.

Answer to Part C

As stated previously, in order to use the Central Limit Theorem we need independent, identically distributed observations and enough of them to justify the approximation. To derive the asymptotic sampling distribution of the sample variance, the population must also have a finite \(\sigma^2\) and a finite fourth central moment \(\mu_4\). Since this is the same sample as used previously, we can assume the observations are identically distributed, and because daily sales are bounded these moments are finite. There are also enough observations to use the CLT. However, we are still unsure of independence.

set.seed(123) #Used to set seed so that result can be replicated
coffee <-  c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 
3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 
8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 
8400, 3300, 4200, 4500, 4800, 4300, 8500)

n <- length(coffee)
B <- 1000

bootstrap.variances <- numeric(B) # Preallocates storage for the bootstrap sample variances

# Resample 'coffee' with replacement 'B' times and store the variance of each resample
for (i in 1:B){
  boot.sample <- sample(coffee, size=n, replace=TRUE)
  bootstrap.variances[i] <- var(boot.sample)
}

# Grid and density values for a normal curve matched to the mean and standard deviation of the bootstrap variances
x <- seq(mean(bootstrap.variances) - 3*sd(bootstrap.variances), mean(bootstrap.variances) + 3*sd(bootstrap.variances), length.out = 100)
nc <- dnorm(x, mean(bootstrap.variances), sd=sd(bootstrap.variances))

kde.coffee <- density(bootstrap.variances) # Makes KDE of bootstrap sample variances

# Plot the KDE (blue) and overlay the matched normal curve (red)
plot(kde.coffee, main = "KDE from Bootstrap Sample Variances", xlab="Sample Variances", col="blue")
lines(x, nc, col="red")

Above, in blue, is the KDE from the bootstrap sampling distribution of sample variances obtained by using the bootstrap method. Similar to the KDE from the bootstrap sampling distribution of sample means, it fairly closely follows the normal curve (in red), indicating that there is evidence that it would be appropriate to use the Central Limit Theorem.
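As a rough numerical check, the bootstrap standard error of the sample variance can be set beside the plug-in asymptotic standard error \(\sqrt{(\hat{\mu}_4 - s^4)/n}\) from the formula stated at the start of the assignment. This is a sketch only. The names `se_boot` and `se_asym` and the choice `B <- 2000` are illustrative, and the two values are not expected to agree exactly at this sample size.

```r
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200,
            3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400,
            8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600,
            8400, 3300, 4200, 4500, 4800, 4300, 8500)

set.seed(123)                                  # illustrative seed
n <- length(coffee)
B <- 2000                                      # illustrative number of resamples

# Bootstrap sampling distribution of the sample variance
boot_vars <- replicate(B, var(sample(coffee, size = n, replace = TRUE)))
se_boot <- sd(boot_vars)                       # bootstrap standard error of s^2

# Plug-in asymptotic standard error of s^2
s2 <- var(coffee)                              # sample variance
mu4_hat <- mean((coffee - mean(coffee))^4)     # estimated 4th central moment
se_asym <- sqrt((mu4_hat - s2^2) / n)

c(se_boot = se_boot, se_asym = se_asym, ratio = se_boot / se_asym)
```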

