Assignment Objectives

  • Understand the theoretical basis of Bootstrap sampling methods for approximating sampling distributions.

  • Assess the performance of Bootstrap sampling distributions against exact and asymptotic sampling distributions.

  • Implement the Bootstrap sampling algorithm and construct sampling distributions using R.


Use of AI Tools

Policy on AI Tool Use: Students must adhere to the AI tool policy specified in the course syllabus. The direct copying of AI-generated content is strictly prohibited. All submitted work must reflect your own understanding; where external tools are consulted, content must be thoroughly rephrased and synthesized in your own words.

Code Inclusion Requirement: Any code included in your essay must be properly commented to explain the purpose and/or expected output of key code lines. Submitting AI-generated code without meaningful, student-added comments will not be accepted.


Asymptotic Distribution of Sample Variance

Assume that \(\{ x_1, x_2, \cdots, x_n \}\) is an i.i.d. sample from \(F(x)\) with \(\mu = E[X]\) and \(\sigma^2 = \text{var}(X)\). Denote

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i. \]

If \(n\) is large,

\[ s^2 \to N\left(\sigma^2, \frac{\mu_4-\sigma^4}{n} \right) \]

where \(\mu_4 = E[(X_i - \mu)^4]\) is the 4th central moment, which can be estimated by

\[ \hat{\mu}_4 = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^4. \]

Note: This describes the asymptotic convergence of the sample variance, following from the central limit theorem (CLT). The sample size required for this approximation to hold is situation-dependent.
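
As a minimal sketch of how the plug-in version of this approximation could be computed in R, the block below estimates the standard error of \(s^2\) from \(\hat{\mu}_4\) and forms an approximate 95% interval for \(\sigma^2\). The helper name `asym_var_se`, the simulated Exp(1) sample, and the seed are illustrative assumptions and are not part of the assignment.

```r
# Hypothetical helper (not base R): plug-in standard error of s^2
# based on the normal limit N(sigma^2, (mu_4 - sigma^4) / n).
asym_var_se <- function(x) {
  n <- length(x)
  s2 <- var(x)                      # sample variance (divisor n - 1)
  mu4_hat <- mean((x - mean(x))^4)  # estimated 4th central moment (divisor n)
  sqrt((mu4_hat - s2^2) / n)        # replace mu_4 and sigma^4 by their estimates
}

set.seed(1)                         # arbitrary seed, for reproducibility only
x <- rexp(200, rate = 1)            # illustrative data: Exp(1) sample of size 200
s2 <- var(x)
se <- asym_var_se(x)

# Approximate 95% interval for sigma^2 from the normal approximation
ci <- s2 + c(-1, 1) * qnorm(0.975) * se
print(round(c(s2 = s2, se = se, lower = ci[1], upper = ci[2]), 3))
```

For an Exp(1) population, \(\sigma^2 = 1\) and \(\mu_4 = 9\), so the theoretical standard error at \(n = 200\) is \(\sqrt{8/200} = 0.2\), which gives a reference value for judging the estimate.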


Question 1: Asymptotic vs Bootstrap Sampling Distributions

Write an essay summarizing the concepts of Asymptotic and Bootstrap Sampling Distributions, along with their key applications. Your discussion should be grounded in your personal understanding of the material. Any external sources including AI tools consulted must be clearly cited.

Essay Prompt: Discuss the concepts of the bootstrap sampling plan, the bootstrap sampling distribution, and the asymptotic sampling distribution in the context of statistics (e.g., sample mean and variance) computed from an independent and identically distributed (i.i.d.) sample. Your discussion should:

  • Clearly outline the key assumptions required for each method.

  • Explain the practical application of each distribution.

  • Provide guidance on when and why one should be preferred over the other in statistical inference.

Statistical sampling is carried out to learn about a population of interest. Because the population distribution is usually unknown, inference rests on the sampling distribution of a statistic, that is, the distribution the statistic (for example, the sample mean or sample variance) would follow across repeated samples of the same size. Knowing or approximating this distribution makes it possible to attach standard errors and confidence intervals to estimates. Two common ways to approximate a sampling distribution are the asymptotic sampling distribution and the bootstrap sampling distribution.

The asymptotic sampling distribution approximates the sampling distribution of a statistic by its limiting distribution as the sample size n grows without bound. For the sample variance, as n grows large \(s^2 \to N\left(\sigma^2, \frac{\mu_4-\sigma^4}{n} \right)\). This describes the approximate distribution of the sample variance across repeated samples: it is centered at the population variance, with a spread that shrinks as n increases. Because the population parameters are usually unknown, they are replaced by sample estimates, and the resulting normal approximation is used to compute standard errors, confidence intervals, and tests for the population parameter.

The key assumptions of the asymptotic sampling distribution are that the observations are independent and identically distributed, that the sample size is sufficiently large, and that the population has a finite mean and variance (and, for the sample variance result above, a finite fourth moment). When these assumptions hold, the asymptotic distribution of the sample mean is normal. This follows from the Central Limit Theorem, which states that as the sample size grows, the sampling distribution of the sample mean becomes approximately normal regardless of the shape of the population distribution.
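
The effect of the sample size on this normal approximation can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the Exp(1) population, the sample sizes, the number of replications, and the helper `skewness` are my own choices, not part of the assignment.

```r
set.seed(10)                      # arbitrary seed, for reproducibility only
reps <- 10000                     # number of simulated samples per sample size

# Hypothetical helper (not base R): sample skewness, the third standardized moment
skewness <- function(z) mean((z - mean(z))^3) / sd(z)^3

# For each n, draw `reps` samples from a right-skewed Exp(1) population,
# compute the mean of each sample, and measure the skewness of those means
sizes <- c(5, 30, 200)
skew_by_n <- sapply(sizes, function(n) {
  means <- rowMeans(matrix(rexp(reps * n, rate = 1), nrow = reps))
  skewness(means)
})
names(skew_by_n) <- paste0("n=", sizes)
print(round(skew_by_n, 3))
```

In theory the skewness of the mean of n Exp(1) observations is \(2/\sqrt{n}\), so the simulated values should shrink toward 0 (the skewness of a normal distribution) as n increases.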

The bootstrap sampling distribution is another valuable way to approximate a sampling distribution. The bootstrap uses the empirical distribution of the observed data, \(\hat{F}_n\), as a stand-in for the unknown population distribution \(F\). As the sample size grows, the empirical distribution gets closer to the population distribution, which is what justifies the substitution. The bootstrap sampling plan then draws many samples of size n, with replacement, from the observed data to mimic what repeated data collection from the population would produce. The statistic of interest is recomputed on each resample, and the collection of these values forms the bootstrap sampling distribution.
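
The resampling plan described above can be sketched as a short R function. This is a minimal illustration, not a definitive implementation: the helper name `boot_stat`, the default of 2000 resamples, and the simulated gamma sample are assumptions made for the example.

```r
# Hypothetical helper (not base R): returns B bootstrap replicates of a statistic
boot_stat <- function(x, stat, B = 2000) {
  n <- length(x)
  replicate(B, {
    x_star <- sample(x, size = n, replace = TRUE)  # resample n values from the empirical distribution
    stat(x_star)                                   # recompute the statistic on the resample
  })
}

set.seed(2)                            # arbitrary seed, for reproducibility only
x <- rgamma(40, shape = 2, rate = 1)   # illustrative skewed sample of size 40

boot_means <- boot_stat(x, mean)       # bootstrap distribution of the sample mean
boot_vars  <- boot_stat(x, var)        # bootstrap distribution of the sample variance

# Center and spread of the bootstrap distribution of the mean
print(c(center = mean(boot_means), spread = sd(boot_means)))
```

Passing the statistic as a function keeps the same resampling code usable for the mean, the variance, or any other statistic computed from the sample.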

The key assumptions of the bootstrap sampling method are that the data are independent and identically distributed and that the original sample is representative of the overall population. The bootstrap is a nonparametric method, so it does not require the population or the statistic to follow a normal distribution.

The bootstrap method can be used to create bootstrap confidence intervals, which are interpreted like standard confidence intervals: a 95% bootstrap interval gives a range of plausible values for the true parameter, constructed by a procedure that captures the true value in approximately 95% of repeated samples. Additionally, the bootstrap can be used to estimate the standard error of a statistic and to approximate its sampling distribution when no convenient formula is available.
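
As a hedged illustration of these two uses, the sketch below computes a bootstrap standard error and a percentile confidence interval for a mean. The simulated exponential sample, the seed, and the choice of the percentile method (one of several ways to build a bootstrap interval) are assumptions made for the example.

```r
set.seed(3)                      # arbitrary seed, for reproducibility only
x <- rexp(30, rate = 0.5)        # illustrative sample of size 30
B <- 5000                        # number of bootstrap resamples

# Bootstrap distribution of the sample mean
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

# Bootstrap standard error: standard deviation of the bootstrap replicates
se_boot <- sd(boot_means)

# 95% percentile interval: 2.5th and 97.5th percentiles of the replicates
ci_pct <- quantile(boot_means, probs = c(0.025, 0.975))

print(se_boot)
print(ci_pct)
```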

Both the asymptotic and bootstrap sampling distributions are useful for approximating how a statistic varies from sample to sample, but there are cases where one is preferable. The asymptotic sampling distribution relies on a large sample; if the sample size is not sufficiently large, the normal approximation may be poor. In that case the bootstrap method is usually the better choice, provided the sample is still representative of the population. If the sample size is sufficiently large and the other assumptions are met, the asymptotic sampling distribution is the natural choice because it is simple to compute and has a closed form. The bootstrap is also preferable when the asymptotic distribution of a statistic is difficult to derive or converges slowly. Overall, both methods support statistical inference about the population from a single observed sample.


Question 2: Daily Coffee Sales (in mL) at Two Different Cafe Locations

This data set represents the volume of regular brewed coffee sold per day (in milliliters) at two different cafe locations over a period of 50 days.

2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 
3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 
8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 
8400, 3300, 4200, 4500, 4800, 4300, 8500

We are interested in finding the sampling distribution of sample means that will be used for various inferences about the underlying population mean.

  (a) Based on the given data, can the Central Limit Theorem be used to derive the asymptotic sampling distribution of the sample mean? Justify your answer.

Before doing anything else, I will create a vector called ‘coffee’ containing the values given above and then create a histogram of these values.

# Store the daily coffee sales (in mL) as a numeric vector
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)
# Histogram of the raw data to check the shape of the distribution
hist(coffee)

The histogram shows that the distribution of the coffee observations appears to be bimodal, which is consistent with the data coming from two different cafe locations. The Central Limit Theorem does not require the data to be normal; it requires independent and identically distributed observations with finite mean and variance, and a sufficiently large sample. Treating the daily sales as an i.i.d. sample, the sample size is reasonably large, so the Central Limit Theorem can be used to derive the asymptotic sampling distribution of the sample mean, which is approximately \(N(\mu, \sigma^2/n)\). The data themselves clearly do not follow a normal distribution, but this does not prevent the use of the Central Limit Theorem.
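
As a rough sketch of what this asymptotic distribution looks like for these data, the block below plugs the sample mean and sample standard deviation into \(N(\mu, \sigma^2/n)\) and forms an approximate 95% interval for the population mean. The variable names `xbar`, `se_clt`, and `ci_clt` are my own, and the vector is repeated only so the block runs on its own.

```r
# Daily coffee sales (in mL), repeated here so the block is self-contained
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000,
            4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800,
            4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800,
            3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200,
            4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)

n <- length(coffee)               # sample size taken from the data, not hard-coded
xbar <- mean(coffee)              # estimate of the population mean
se_clt <- sd(coffee) / sqrt(n)    # CLT standard error of the sample mean

# Approximate 95% interval for the population mean from the normal approximation
ci_clt <- xbar + c(-1, 1) * qnorm(0.975) * se_clt
print(c(n = n, mean = xbar, se = se_clt, lower = ci_clt[1], upper = ci_clt[2]))
```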

  (b) Apply the bootstrap method to estimate the sampling distribution (often called the bootstrap sampling distribution) of the sample mean. Generate a kernel density estimate from the bootstrap sample means and plot it. Then, use this bootstrap distribution to validate your conclusion from part (a). Make sure your visuals are effective in enhancing the presentation of these results.

# Draw B bootstrap resamples (size n, with replacement) and store the mean of each
set.seed(123)  # arbitrary seed so the results are reproducible
B <- 10000
boot_means <- replicate(B, mean(sample(coffee, replace = TRUE)))

# Kernel Density Plot
plot(density(boot_means),
     main="Bootstrap Sampling Distribution of Sample Mean",
     xlab="Sample Mean")

# Mark the observed sample mean with a red vertical line
abline(v=mean(coffee), col="red", lwd=2)

The kernel density estimate above shows that the bootstrap distribution of the sample mean is approximately normal and centered at the observed sample mean (red line). This supports the conclusion in part (a) that the Central Limit Theorem applies: the sample size is large enough that the sampling distribution of the mean is approximately normal even though the data themselves are bimodal. The bootstrap method works well in this case for approximating the sampling distribution of the sample mean.
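
One way to make this validation more concrete is to compare quantiles and standard errors of the bootstrap distribution with those of the CLT normal approximation. The block below is a sketch under my own assumptions: the seed, the chosen quantile levels, and the names `comparison` and `se_clt` are illustrative.

```r
# Daily coffee sales (in mL), repeated here so the block is self-contained
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000,
            4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800,
            4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800,
            3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200,
            4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)

set.seed(2024)                    # arbitrary seed, for reproducibility only
B <- 10000
boot_means <- replicate(B, mean(sample(coffee, replace = TRUE)))

# CLT approximation: normal with mean xbar and standard error s / sqrt(n)
se_clt <- sd(coffee) / sqrt(length(coffee))

# Compare selected quantiles of the two distributions side by side
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
comparison <- rbind(
  bootstrap  = quantile(boot_means, probs),
  clt_normal = qnorm(probs, mean = mean(coffee), sd = se_clt)
)
print(round(comparison))

# Compare the two standard errors
print(c(se_boot = sd(boot_means), se_clt = se_clt))
```

If the two rows of quantiles and the two standard errors are close, the bootstrap and the CLT are telling the same story about the sample mean.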

  (c) Repeat the analysis in parts (a) and (b) for the sample variance.

# Bootstrap the sample variance: recompute var() on each resample of the coffee data
boot_vars <- replicate(B, var(sample(coffee, replace = TRUE)))

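# Kernel density plot of the bootstrap sample variances
plot(density(boot_vars),
     main="Bootstrap Sampling Distribution of Sample Variance",
     xlab="Sample Variance")

# Mark the observed sample variance with a red vertical line
abline(v=var(coffee), col="red", lwd=2)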

The bootstrap distribution of the sample variance is noticeably skewed to the left, although the skew is not severe. This is not surprising: the sample variance is built from squared deviations, so its sampling distribution is more sensitive to the shape of the data and approaches normality more slowly than that of the sample mean. The asymptotic result \(s^2 \to N\left(\sigma^2, \frac{\mu_4-\sigma^4}{n} \right)\) still applies in principle, assuming the population has a finite fourth moment, but the visible skew suggests that the normal approximation is less accurate for the variance than for the mean at this sample size. The bootstrap is therefore a good choice for approximating the sampling distribution of the sample variance here: it is nonparametric, does not assume normality, and reflects the skewness directly.
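
To put the asymptotic and bootstrap answers for the variance side by side, the sketch below computes the plug-in asymptotic standard error from \(\hat{\mu}_4\) and compares it with the bootstrap standard error, along with a skewness measure for the bootstrap replicates. The seed and the names `se_asym`, `se_boot`, and `skew` are my own assumptions for illustration.

```r
# Daily coffee sales (in mL), repeated here so the block is self-contained
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000,
            4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800,
            4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800,
            3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200,
            4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)

n <- length(coffee)
s2 <- var(coffee)                            # sample variance
mu4_hat <- mean((coffee - mean(coffee))^4)   # estimated 4th central moment
se_asym <- sqrt((mu4_hat - s2^2) / n)        # plug-in asymptotic SE of s^2

set.seed(2024)                               # arbitrary seed, for reproducibility only
boot_vars <- replicate(10000, var(sample(coffee, replace = TRUE)))
se_boot <- sd(boot_vars)                     # bootstrap SE of s^2

# Sample skewness of the bootstrap variances (0 for a symmetric distribution)
skew <- mean((boot_vars - mean(boot_vars))^3) / sd(boot_vars)^3

print(c(s2 = s2, se_asym = se_asym, se_boot = se_boot, skewness = skew))
```

A negative skewness value would agree with the left skew seen in the density plot, and similar standard errors would indicate that the two methods mainly differ in the shape they assign to the distribution rather than in its spread.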

