In statistical inference we rarely have access to the complete population. Instead, we use a sample to draw conclusions about population parameters. The concept of a sampling distribution is central to this process: it describes the distribution of a statistic (such as the sample mean or sample variance) across all possible samples of a given size. The asymptotic and bootstrap approaches are two important frameworks for approximating sampling distributions. Each rests on distinct assumptions and suits different real-world situations.
Probability theory serves as the foundation for the asymptotic approach. It studies the behavior of a statistic as the sample size grows without bound. The key premise is that, under suitable conditions, many statistics exhibit predictable limiting behavior once the sample size is large enough.
For instance, the Central Limit Theorem asserts that, under mild conditions, the sample mean from a large sample approximately follows a normal distribution, regardless of the shape of the underlying population. Comparable large-sample results exist for the sample variance and many other statistics.
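As a quick illustration (a minimal sketch, not part of the formal argument), one can simulate sample means from a heavily skewed population and watch approximate normality emerge; the exponential population, sample size, and seed below are arbitrary choices:
# Simulate 5000 sample means, each computed from n = 50 draws of a
# skewed exponential population (rate = 1), and inspect their shape
set.seed(1)                                   # arbitrary seed
means <- replicate(5000, mean(rexp(50)))
hist(means, probability = TRUE,
     main = "Means of skewed exponential samples (n = 50)",
     xlab = "Sample mean")
Even though each individual observation is strongly right-skewed, the histogram of means is close to bell-shaped.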
These large-sample results rest on several key assumptions:
Independence: The observations must not influence one another. If data points are dependent, the theoretical results may no longer hold.
Identical Distribution: All observations should come from the same underlying population. If the data come from different populations, the large-sample approximation may not reflect the true variability.
Finite Moments: The population must have a well-defined mean and variance. In practical terms, the data should not come from an extremely heavy-tailed distribution in which the variance is effectively infinite. When the variance is undefined, normal approximations can fail and the statistic may never “settle down” as the sample size increases.
Large Sample Size: Asymptotic results are justified when the sample is large. With small samples, the approximation may be inaccurate.
The asymptotic distribution is useful because it provides an explicit mathematical expression that can be worked with directly, making statistical inference straightforward and efficient.
For example, if we compute a sample mean from a large dataset, asymptotic theory tells us how to estimate its standard error and construct a normal-based confidence interval. The same principle applies to sample variance and a variety of other statistics.
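As a sketch of what this looks like in practice (the data vector here is hypothetical), a normal-based 95% confidence interval for the mean needs only the sample mean, the sample standard deviation, and the sample size:
# Hypothetical data; in practice x would be the observed sample
x  <- rnorm(200, mean = 10, sd = 3)
n  <- length(x)
se <- sd(x) / sqrt(n)                         # estimated standard error
ci <- mean(x) + c(-1, 1) * qnorm(0.975) * se
ci                                            # asymptotic 95% CI for the mean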
The advantage of this method is efficiency. Once the theoretical result is established, the computation is simple and fast. Its reliability, however, depends on the sample size and on whether the assumptions above actually hold.
The bootstrap takes a fundamentally different data-driven approach. Rather than relying on analytical approximations that require large samples and known distributional forms, the bootstrap treats the observed sample as a proxy for the population and resamples from it repeatedly to empirically construct a sampling distribution.
The idea is simple:
Start with the original sample.
Treat that sample as if it represents the population.
Draw many new samples from it, with replacement.
Recalculate the statistic for each resampled dataset.
Examine how those recalculated values vary.
The distribution of these recalculated statistics is called the bootstrap sampling distribution. It serves as an approximation of the true sampling distribution.
The bootstrap carries its own assumptions:
The Sample Is Representative: The bootstrap assumes the observed sample reflects the structure of the population. If the sample is biased or unrepresentative, the bootstrap will reproduce that bias.
Independence: Standard bootstrap methods assume observations are independent. If the data are dependent, specialized bootstrap techniques are required.
Sufficient Sample Size: While bootstrap can work with moderate samples, it still needs enough data to capture meaningful variability. Very small samples may not provide enough information for reliable resampling.
The bootstrap is especially valuable when a statistic’s theoretical sampling distribution is complicated, unavailable, or rests on questionable assumptions. In practice, the bootstrap is often used to estimate standard errors and build confidence intervals without relying heavily on theoretical formulas. Instead of assuming a normal shape, we let the data reveal the shape of the distribution through repeated resampling.
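For instance, the sampling distribution of the median has no convenient closed form, yet a bootstrap approximation takes only a few lines. The data below are hypothetical, and the code is a sketch of the general recipe rather than a definitive implementation:
# Bootstrap the median of a hypothetical skewed sample
set.seed(2)                                   # arbitrary seed
x <- rexp(80, rate = 0.5)                     # stand-in for observed data
B <- 5000
boot_medians <- numeric(B)
for (b in 1:B) {
  resample <- sample(x, size = length(x), replace = TRUE)
  boot_medians[b] <- median(resample)
}
sd(boot_medians)                              # bootstrap standard error
quantile(boot_medians, c(0.025, 0.975))       # percentile 95% CI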
Both methods aim to approximate the same underlying object: the sampling distribution of a statistic.
The asymptotic method is grounded in theory. It answers the question, “What would probability theory predict if we repeatedly sampled from the population with a large sample size?”
The bootstrap method is grounded in data. It answers the question, “Given the data I have, what would happen if I resampled this dataset repeatedly?”
The asymptotic and bootstrap methods offer two distinct ways to approximate a statistic’s sampling distribution. The asymptotic method, grounded in large-sample probability theory, is effective when its assumptions are met. When theoretical formulas are difficult or unreliable, the bootstrap provides flexibility through computation and resampling.
A successful statistical analysis requires understanding both strategies, recognizing their underlying assumptions, and choosing the one that offers the most reliable approximation for the data at hand.
The dataset consists of 55 daily coffee sales observations collected from two different cafe locations. To justify the use of the Central Limit Theorem (CLT), we must consider whether the required conditions are reasonably satisfied.
The CLT states that the sampling distribution of the sample mean is approximately normal when the observations are independent, come from the same distribution, and have finite variance, provided the sample size is sufficiently large.
The sample size here is n = 55, which is generally large enough for the CLT to provide a reliable approximation. Assuming that daily sales are independent across days and that the variability in sales is finite, the normal approximation for the sample mean is reasonable.
However, since the data come from two different cafe locations, the identical distribution assumption requires careful interpretation. If the two cafes have similar sales behavior, treating the data as one combined sample is appropriate. If the locations differ systematically, then the sample mean represents the average of a mixture of two distributions rather than a single homogeneous population.
Despite this consideration, with a sample size of 55 and no indication of extreme skewness or heavy tails, the CLT provides a reasonable justification for approximating the sampling distribution of the sample mean by a normal distribution.
To complement the asymptotic justification in part (a), we now construct a bootstrap sampling distribution for the sample mean. The goal is to approximate how the sample mean would vary under repeated sampling, using only the observed data.
The bootstrap procedure proceeds as follows:
1. Treat the observed 55 values as an estimate of the population distribution.
2. Draw a new sample of size 55 with replacement from the observed data.
3. Compute the sample mean of this resampled dataset.
4. Repeat steps 2–3 a large number of times.
5. Collect all resampled means to form the bootstrap sampling distribution.
# Store the 55 observed daily sales values
sales <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050,
4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200,
4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600,
4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100,
3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300,
4200, 4500, 4800, 4300, 8500)
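# Fix the random seed so the resampling below is reproducible
# (seed value is arbitrary and not part of the original analysis)
set.seed(123)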
# Set number of bootstrap replications
B <- 5000
# Store the bootstrap sample means
boot_means <- numeric(B)
# Perform bootstrap resampling
for (b in 1:B) {
  # Draw a resample of size 55 with replacement
  resample <- sample(sales, size = length(sales), replace = TRUE)
  # Compute the sample mean of the resample
  boot_means[b] <- mean(resample)
}
# Plot histogram of bootstrap means
hist(boot_means,
     probability = TRUE,
     main = "Bootstrap Sampling Distribution of the Sample Mean",
     xlab = "Bootstrap Sample Means")
# Add kernel density estimate for smooth visualization
lines(density(boot_means), lwd = 2)

The resulting histogram and corresponding kernel density estimate provide a smooth visualization of the bootstrap distribution. The distribution appears approximately symmetric and bell-shaped, with no evidence of substantial skewness or irregularities. The bootstrap means are centered near the original sample mean, and the spread appears stable and well-behaved.
This empirical result supports the conclusion drawn in part (a). Specifically, the shape of the bootstrap distribution closely resembles a normal distribution, which is consistent with the prediction of the Central Limit Theorem for a sample size of 55. The absence of pronounced skewness or heavy tails suggests that the large-sample normal approximation is appropriate for inference about the population mean.
Overall, the bootstrap analysis validates the use of asymptotic methods for this dataset. Both approaches indicate that the sampling distribution of the sample mean can be reasonably approximated by a normal distribution.
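A quick numerical check makes the agreement concrete. The sketch below reuses the boot_means and sales objects from the code above and places the bootstrap summaries beside their normal-theory counterparts:
# Bootstrap summaries
sd(boot_means)                                 # bootstrap standard error
quantile(boot_means, c(0.025, 0.975))          # percentile 95% CI
# Asymptotic counterparts
se <- sd(sales) / sqrt(length(sales))          # normal-theory standard error
mean(sales) + c(-1, 1) * qnorm(0.975) * se     # normal-based 95% CI
Close agreement between the two sets of numbers is exactly the validation described above.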
In this dataset, we are working with 55 observations. Assuming independence and finite variability in daily sales, the conditions required for the CLT appear reasonably satisfied. However, because the variance is built from squared deviations, it is more sensitive to extreme observations than the mean. As a result, the normal approximation for the sampling distribution of the variance may be less accurate than it was for the mean.
To assess the validity of the asymptotic approximation, we again apply the bootstrap method.
Bootstrap Procedure:
Resample 55 observations with replacement from the original dataset.
Compute the sample variance for each resample.
Repeat this process many times.
Examine the distribution of the bootstrap sample variances.
# Store the 55 observed daily sales values
sales <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050,
4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200,
4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600,
4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100,
3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300,
4200, 4500, 4800, 4300, 8500)
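# Fix the random seed so the resampling below is reproducible
# (seed value is arbitrary and not part of the original analysis)
set.seed(456)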
# Set number of bootstrap replications
B <- 5000
# Store bootstrap sample variances
boot_vars <- numeric(B)
# Perform bootstrap resampling
for (b in 1:B) {
  # Draw a resample of size 55 with replacement
  resample <- sample(sales, size = length(sales), replace = TRUE)
  # Compute and store the sample variance of the resample
  boot_vars[b] <- var(resample)
}
# Compute kernel density estimate of bootstrap variances
dens <- density(boot_vars)
# Plot histogram of bootstrap variances
hist(boot_vars,
     probability = TRUE,
     ylim = c(0, max(dens$y)),  # expand y-axis to fit full KDE
     main = "Bootstrap Sampling Distribution of the Sample Variance",
     xlab = "Bootstrap Sample Variances")
# Add kernel density curve
lines(dens, lwd = 2)

The bootstrap sampling distribution of the sample variance appears smooth and unimodal, indicating consistent behavior across repeated resamples. The histogram and kernel density estimate show a generally bell-shaped form, though the distribution exhibits mild right-skewness. This slight asymmetry is expected because the variance depends on squared deviations, making it more sensitive to larger observations. When higher sales values are sampled more frequently in a bootstrap resample, the resulting variance increases, which contributes to a longer right tail.
Despite this mild skewness, the overall shape of the distribution is well-behaved and does not display irregular patterns or multiple peaks. The distribution remains fairly concentrated and structured, suggesting that the sampling variability of the sample variance is stable.
Overall, the bootstrap results indicate that a normal approximation for the sampling distribution of the variance is reasonable in this setting, although the bootstrap distribution provides a more direct, data-driven representation of its shape.
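As a final data-driven check (a sketch reusing the boot_vars object from the code above), the mild right-skewness can be quantified directly, and a percentile interval sidesteps the normal approximation altogether:
sd(boot_vars)                                  # bootstrap standard error of the variance
quantile(boot_vars, c(0.025, 0.975))           # percentile 95% CI for the variance
# Hand-rolled sample skewness of the bootstrap variances
z <- (boot_vars - mean(boot_vars)) / sd(boot_vars)
mean(z^3)                                      # near 0 = symmetric; positive = right-skewed
A skewness value modestly above zero would match the longer right tail visible in the histogram.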