Asymptotic and bootstrap sampling concepts provide a solid foundation for inferential statistics when the sample data come from a population that does not resemble an ideal, well-defined probability distribution (such as the normal distribution), or when the costs (in money, time, and resources) of obtaining a large, desired sample size are too high. Both approaches are non-parametric, meaning we do not need to assume a particular distribution for the broader population from which the sample came in order to deploy them.
Asymptotic sampling relies on the central limit theorem (CLT). The theorem states that, regardless of a population's true distribution (provided its mean and variance are finite), the sampling distribution of the sample mean approaches a normal distribution as the sample size (n) increases; roughly, the mean of a random sample behaves like N(mu, sigma^2 / n), where mu and sigma^2 are the population mean and variance. This holds as long as the principles of random sampling are adhered to and the observations within the sample are independent of one another. A simulation to demonstrate this is below.
In this theoretical example, say we wish to sample a population which is gamma distributed, with a shape parameter of 3 (alpha) and a rate parameter of 2 (beta). For reference, the distribution curve for such a population would look like the following.
curve(dgamma(x, shape = 3, rate = 2),
from = 0,
to = 4,
lwd = 3,
col = "blue",
main = "Gamma Distribution (3, 2)",
xlab = "Critical Value",
ylab = "Probability Density")
box(lwd = 3)

Due to financial or time constraints, we may not be able to gather a particularly large sample for our survey or experiment. If we take a random sample of size 100, the distribution of our sample data may look very much like what is seen below: a right-skewed, non-normal distribution that largely resembles the population distribution. In this particular simulation, the mean of our sample data was roughly 1.46 and the standard deviation was about 0.71.
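For comparison (a calculation added here for context, not part of the original code), the true values for the Gamma(3, 2) population are:

# Reference values for the Gamma(3, 2) population: mean = shape / rate, sd = sqrt(shape) / rate
3 / 2       # population mean = 1.5
sqrt(3) / 2 # population standard deviation, approximately 0.866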
Layered over the histogram of this size-100 sample are both the normal distribution curve, representing what normally distributed data with the same mean and standard deviation would look like, and the gamma distribution curve depicted previously.
##### Single sample
set.seed(1) # Set the seed for consistency when re-running program
SampleG = rgamma(n = 100, shape = 3, rate = 2)
# Will use Gamma distribution for example
# alpha = shape parameter = 3
# beta = rate parameter = 2
SampleG_Data = data.frame(SampleG = sort(SampleG))
# Put the sample values in a data frame, sorted for easier inspection
invisible(
mean(SampleG) # = 1.457996
)
invisible(
sd(SampleG) # = 0.7102148
)
## Single Sample Histogram
hist(SampleG,
main = "Single Sample Distribution",
xlab = "Individual Values",
ylim = c(0,0.6),
prob = TRUE)
box(lwd = 3)
curve(dnorm(x, mean = 1.457996, sd = 0.7102148),
lwd = 3,
col = "red",
add = TRUE)
curve(dgamma(x, shape = 3, rate = 2),
from = 0,
to = 4,
lwd = 3,
col = "blue",
add = TRUE)
legend("topright",
legend = c("N (1.46, 0.71)", "Gamma"),
col = c("red","blue"),
lwd = 3,
lty = 1,
box.lwd = 3)

As seen above, without a particularly large sample size, a sample's distribution will likely resemble the distribution of the population it came from. The normality promised by the CLT shows up not in the raw data but in the distribution of sample means, and the approximation improves as the sample size grows. Using R's replicate function, below I simulated 10,000 repeats of the exact same type of sample performed previously (n = 100, alpha = 3, beta = 2) and recorded each sample's mean. The distribution of these sample means is below.
Layered over the histogram of the sample means is once again a normal distribution curve, this time with the parameters of the 10,000 replicated sample means: a mean of about 1.5 and a standard deviation of about 0.08. Additionally, the original Gamma(3, 2) population curve is overlaid again for comparison.
##### Sample Means
## Sample Means Histogram
set.seed(1) # Set the seed for consistency when re-running program
SamplesG = replicate(10000, mean(rgamma(n = 100, shape = 3, rate = 2)))
SamplesG_Data = data.frame(SampleMean = sort(SamplesG))
# Put the 10,000 sample means in a data frame, sorted for easier inspection
# All 10,000 means are kept (no outlier trimming)
invisible(
mean(SamplesG) # = 1.499979
)
invisible(
sd(SamplesG) # = 0.08320512
)
hist(SamplesG,
main = "Sample Means Distribution",
xlab = "Sample Mean",
prob = TRUE,
ylim = c(0,5))
box(lwd = 3)
curve(dnorm(x, mean = 1.499979, sd = 0.08320512),
lwd = 3,
col = "red",
add = TRUE)
curve(dgamma(x, shape = 3, rate = 2), # original Gamma(3, 2) population curve, for comparison
from = 0,
to = 3,
lwd = 3,
col = "blue",
add = TRUE)
legend("topright",
legend = c("N (1.5, 0.08)", "Gamma"),
col = c("red","blue"),
lwd = 3,
lty = 1,
box.lwd = 3)

What is overwhelmingly visible in this second plot is the striking resemblance between the normal distribution curve and the distribution of our sample means, along with the equally striking disparity between the sample means' distribution and the gamma curve of the population. Nothing changed about the population we sampled from; rather, averaging over 100 observations shrank the standard deviation by roughly a factor of sqrt(100) = 10 and produced a distribution that is very close to normal.
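As a rough check (a calculation added here for context), the CLT predicts that the standard deviation of these sample means should be about the population standard deviation divided by the square root of the sample size:

# CLT prediction for the spread of the sample means (added check)
(sqrt(3) / 2) / sqrt(100) # approximately 0.087, in line with the simulated value of about 0.083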
To summarize, asymptotic sampling theory is non-parametric (it works regardless of the population's distribution, so long as the mean and variance are finite) and is largely predicated on the central limit theorem. As the sample size increases, the sampling distribution of the sample mean becomes increasingly normal. Knowing this, we can use the framework of the normal distribution to estimate population parameters such as means and proportions.
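As one illustration of that framework (a sketch added here, not part of the original analysis), an approximate 95% confidence interval for the population mean can be built from the single size-100 Gamma sample above using the normal quantile:

# Sketch (added): asymptotic (CLT-based) 95% confidence interval for the mean,
# using the single size-100 Gamma sample (SampleG) from earlier
mean(SampleG) + c(-1, 1) * qnorm(0.975) * sd(SampleG) / sqrt(length(SampleG))
# roughly 1.46 +/- 0.14, given the sample mean and sd reported above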
An alternative to asymptotic analysis is bootstrapping, which approaches the problem of a limited sample size from a different angle. Rather than relying on a large sample from the population, we repeatedly resample from our original sample.
With bootstrap sampling, we take repeated resamples of the original sample (which came from our population of interest) with replacement. In other words, the initial sample serves as a "proxy" for the population.
As with the asymptotic approach, we would likely resort to a method such as bootstrapping if a sufficiently large sample could not be obtained for our survey or study. In alignment with the central limit theorem that backs asymptotic theory, a crucial element of bootstrapping is that, after a large number of repeated resamples, the distribution of our statistic of interest (here, the mean) will come to resemble a normal distribution.
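As a minimal illustration of that core operation (the values here are made up purely for demonstration), a single bootstrap resample is just a with-replacement draw of the same size as the original sample:

# Toy illustration (hypothetical values): one bootstrap resample
Toy_Sample = c(2.1, 3.4, 1.8, 2.9, 3.0)
sample(Toy_Sample, size = length(Toy_Sample), replace = TRUE)
# some original values appear more than once, others not at all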
For demonstration purposes, say we want to sample a population which is exponentially distributed, with a rate parameter of 5. For reference, the distribution curve for such a population would look like this.
curve(dexp(x, rate = 5),
lwd = 3,
col = "blue",
main = "Exponential Distribution (5)",
xlab = "Critical Value",
ylab = "Probability Density")
box(lwd=3)

In this scenario, we are only able to achieve a sample size of 60 without sacrificing the principles of random sampling. Once again using R's random number generation - this time rexp for the exponential distribution - a simulation of this scenario was performed. The resulting sample data yielded a mean of about 0.20 and a standard deviation of about 0.18.
Below, a normal distribution curve with the same mean and standard deviation as our 60-subject sample, along with the exponential distribution curve depicted earlier, is layered over a histogram of the sample data.
set.seed(1)
Exp_Sample = rexp(n = 60, rate = 5)
# This time using the exponential distribution for the example, with rate parameter (lambda) = 5
invisible(
mean(Exp_Sample) # = 0.1985006
)
invisible(
sd(Exp_Sample) # = 0.1787757
)
hist(Exp_Sample,
main = "Single Sample Distribution",
xlab = "Individual Values",
prob = TRUE)
box(lwd=3)
curve(dnorm(x, mean = 0.1985006, sd = 0.1787757),
lwd = 3,
col = "red",
add = TRUE)
curve(dexp(x, rate = 5),
lwd = 3,
col = "blue",
add = TRUE)
legend("topright",
legend = c("N (0.20, 0.18)", "Exp (5)"),
col = c("red","blue"),
lwd = 3,
lty = 1,
box.lwd = 3)

Unsurprisingly, our sample data's distribution bears a strong resemblance to the population it came from. However, if we repeatedly resample the subjects within this data, the distribution of the resulting resample means will much more closely resemble a normal distribution.
Below I performed bootstrap sampling on our sample data: for our sample of 60 subjects from the exponentially distributed population, I performed 1,000 resamples with replacement on the initial sample. Each resample is also of size 60, which is why the replacement element of the sampling is so crucial; without replacement, every size-60 resample would simply be a reshuffling of the original 60 values and every resample mean would be identical.
The distribution of these bootstrap sample means can be seen below, once again in comparison to a normal distribution with the same mean (~ 0.20) and standard deviation (~ 0.02) of the 1,000 bootstrap samples. As we can clearly see, there is significant overlap between the distribution of the bootstrap sample means and the normal distribution curve.
# For bootstrap sampling, referenced code from Section 4 of Week 4 notes as well as additional explanation through ChatGPT
# START BOOTSTRAP FUNCTION
# Let B denote number of sub samples we want to take
Bootstrap_Mean = function(Data, B) {
n = length(Data)
Bootstrap_Means = numeric(B)
# numeric(B) creates a vector with B elements; each element is then filled in with a bootstrap mean as the for loop below runs.
# START FOR LOOP
for (i in 1:B) {
Boot_Sample = sample(Data, size = n, replace = TRUE)
# The argument "size = n" means that each bootstrap sample will be the same size as our original sample data that we are inputting (n = 60 in this case)
# This is why argument "replace = TRUE" is so important. Without it, we'd just be repeating the same sample endlessly
Bootstrap_Means[i] = mean(Boot_Sample)
# The [i] index stores each iteration's bootstrap mean in the i-th element of the Bootstrap_Means vector, rather than overwriting a single object on every pass through the loop.
} # END FOR LOOP
return(Bootstrap_Means)
} # END BOOTSTRAP FUNCTION
set.seed(1)
Boot_Sample_Data = Bootstrap_Mean(Data = Exp_Sample, B = 1000)
invisible(
mean(Boot_Sample_Data) # = 0.1976613
)
invisible(
sd(Boot_Sample_Data) # = 0.0228317
)
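Though not part of the original analysis, a natural follow-up is a 95% bootstrap percentile interval for the mean, read directly off the resampled means:

# Sketch (added): 95% bootstrap percentile interval for the mean,
# i.e. the middle 95% of the 1,000 bootstrap means
quantile(Boot_Sample_Data, probs = c(0.025, 0.975))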
##### Histogram plot below
hist(Boot_Sample_Data,
main = "Bootstrap Means Distribution",
xlab = "Sample Mean",
prob = TRUE)
box(lwd=3)
curve(dnorm(x, mean = 0.1976613, sd = 0.0228317),
lwd = 3,
col = "red",
add = TRUE)
legend("topright",
legend = "N (0.20, 0.02)",
col = "red",
lwd = 3,
lty = 1,
box.lwd = 3)

To conclude, the principles of asymptotic sampling and the bootstrap sampling method are useful tools when we, as statisticians or researchers, wish to use properties of the normal distribution to draw inferential conclusions from sample data, even though the population we are sampling from does not resemble normality.
Both asymptotic and bootstrap ideas are applicable regardless of the underlying population distribution of a sample, so long as the mean and variance of said population are finite.
Bootstrapping is particularly useful when we have small to moderate sample sizes and wish to assess a population parameter (whether mean, variance, proportion, etc.) despite not having the resources to obtain a larger sample. A key component of a bootstrapping model's validity, though, is the quality of the initial sample. If the initial sample was tainted in any way and does not serve as a truly accurate representation of the population, that bias will carry through to our bootstrap results.
For this analysis we are examining the volume of coffee sold per day (mL) at two different cafes over a period of 50 days. Our sample size is 55, and a summary of basic descriptive statistics as well as a histogram to visualize the data is below.
Sales = c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200,3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)
# n = 55
invisible(
summary(Sales)
)
# Min = 2850
# Q1 = 3250
# Median = 4400
# Mean = 5250
# Q3 = 8100
# Max = 8900
invisible(
sd(Sales) # = 2242.952
)
Sum_Stats = cbind(2850, 3250, 4400, 5250, 8100, 8900)
colnames(Sum_Stats) = c("Min", "Q1", "Median", "Mean", "Q3", "Max")
library(knitr)      # for kable()
library(kableExtra) # for kable_styling()
library(magrittr)   # for the %>% pipe
kable(Sum_Stats, align = "c",
caption = "<span style='color:#000000;'>Coffee Sales Summary Stats (mL)</span>") %>%
kable_styling(
bootstrap_options = c("striped", "bordered"),
full_width = FALSE,
position = "center")| Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|
| 2850 | 3250 | 4400 | 5250 | 8100 | 8900 |
Looking at the plot above, we can say the data is roughly right-skewed. About a third of the recorded daily sales (18 of the 55) were greater than 5,000 mL. From initial appearances, this sample does not resemble any well-known distribution.
library(dplyr) # for filter()
Sales_Data = data.frame(Sales) # putting the data in a data frame to explore more easily
invisible(
nrow(filter(Sales_Data, Sales > 5000))
)
# 18 observations above 5,000, which is about 1/3 of the data

In this instance, the central limit theorem can be applied to estimate the sampling distribution of the sample mean. An increasing sample size not only normalizes the distribution of our statistic of interest, but also improves the accuracy of the estimate. Generally speaking, if the sample size is greater than 30 and the survey was conducted with adherence to random sampling principles, then the sample mean serves as a reliable estimate of both the population mean and the center of the sampling distribution. In this instance, that means the average amount of coffee sold per day between the two cafes is roughly 5,250 mL.
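As a quick benchmark (added here, not part of the original analysis), the CLT also provides an approximate standard error for that sample mean, which the bootstrap standard deviation computed below can be compared against:

# CLT-based standard error of the sample mean (added benchmark)
sd(Sales) / sqrt(length(Sales)) # = 2242.952 / sqrt(55), approximately 302 mL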
Next I will use bootstrapping to validate this approximation.
Using the custom-built bootstrapping function from before, I performed 1,000 resample iterations (with replacement) of our coffee sales sample. A histogram display of the sampling distribution is below. Along with the bootstrap sampling data, I overlaid both a normal distribution curve and a Gaussian kernel density estimator with a bandwidth of 68.
# Will use Bootstrap_Mean function from earlier
set.seed(1)
Sales_Boot_Means = Bootstrap_Mean(Data = Sales, B = 1000)
invisible(
mean(Sales_Boot_Means) # = 5266.007
)
invisible(
sd(Sales_Boot_Means) # = 299.0682
)

# Manual kernel density estimate function, created to serve as a complement to the graph made with R's built-in density function
# going to create a generic non-parametric KDE function
# going to use the Gaussian kernel
## Gaussian kernel function from HW 1
G_KU = function(u) {
Exp = exp(-(u^2)/2)
Answer = Exp / sqrt(2*pi)
return(Answer)
}
KDE = function(Data, h, x){
n = length(Data)
total = 0 # initialize outside loop
# begin for loop
for (i in Data){
u = (x - i)/h
k_i = G_KU(u)
total = total + k_i
} # end for loop
Answer = (1/(n*h)) * total
return(Answer)
}

hist(Sales_Boot_Means,
main = "Coffee Sales Bootstrap Sample Means",
xlab = "Sample Means (mL)",
xaxt = "n",
prob = TRUE)
axis(side = 1, at = seq(from = 4500, to = 6000, by = 250))
box(lwd = 3)
Gaussian_KDE_Mean = density(Sales_Boot_Means, bw = 68, kernel = "gaussian")
lines(Gaussian_KDE_Mean,
lwd = 3,
col = "blue"
)
curve(dnorm(x, mean = 5266.007, sd = 299.0682),
lwd = 3,
col = "red",
add = TRUE)
legend("topright",
legend = c("N (5266, 299)", "Gaussian KDE(68)", "Initial Sample Mean"),
col = c("red","blue", "black"),
lwd = 3,
lty = c(1,1,3),
box.lwd = 3)
segments(
x0 = 5250,
y0 = 0,
x1 = 5250,
y1 = 0.0014,
col = "black",
lwd = 4,
lty = 3
)

The results of our bootstrapping simulation strongly support the CLT-based estimate of the population mean for this dataset. The mean of our sample data is 5,250 mL, while the mean of our bootstrap sampling distribution is approximately 5,266 mL.
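As an optional side check (not part of the original figure), the manual KDE function defined above can be evaluated on a grid and compared with R's built-in density(); the grid is an arbitrary choice added here, and the bandwidth of 68 mirrors the plot above:

# Side check (added): manual KDE vs built-in density(), same bandwidth of 68
grid = seq(from = min(Sales_Boot_Means), to = max(Sales_Boot_Means), length.out = 200)
Manual_KDE = sapply(grid, function(x) KDE(Data = Sales_Boot_Means, h = 68, x = x))
plot(grid, Manual_KDE,
type = "l",
lwd = 3,
col = "blue",
main = "Manual vs Built-in Gaussian KDE",
xlab = "Sample Means (mL)",
ylab = "Density")
lines(density(Sales_Boot_Means, bw = 68, kernel = "gaussian"),
lwd = 2,
lty = 2,
col = "red")
box(lwd = 3)

The two curves should essentially coincide, since both implement a Gaussian kernel with the same bandwidth.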
The principles of asymptotic theory and bootstrapping apply not only to the mean but to the variance as well. In other words, as the sample size increases, the sampling distribution of sample variances also moves closer to a normal distribution. Additionally, the average of the bootstrap sample variances should closely resemble the initial sample variance.
Below is a bootstrap simulation demonstrating such.
## Below is bootstrap function for sample variance (based off bootstrap mean sample from before)
# START BOOTSTRAP FUNCTION
# Let B denote number of sub samples we want to take
Bootstrap_Var = function(Data, B) {
n = length(Data)
Bootstrap_Vars = numeric(B)
# START FOR LOOP
for (i in 1:B) {
Boot_Sample = sample(Data, size = n, replace = TRUE)
Bootstrap_Vars[i] = var(Boot_Sample)
} # END FOR LOOP
return(Bootstrap_Vars)
} # END BOOTSTRAP FUNCTION
set.seed(1)
Sales_Boot_Vars = Bootstrap_Var(Data = Sales, B = 1000)
invisible(
mean(Sales_Boot_Vars) # = 4953990
)
invisible(
sd(Sales_Boot_Vars) # = 511182.4
)

hist(Sales_Boot_Vars,
main = "Coffee Sales Bootstrap Sample Variances",
xlab = "Sample Variances",
xaxt = "n",
prob = TRUE,
breaks = 12)
box(lwd=3)
# Doing some additional formatting for ease of visual understanding
ticks = axTicks(1)
# axTicks(1) returns the default tick positions for the given axis; the argument 1 means the bottom (x) axis
axis(1,
at = ticks,
labels = format(ticks, big.mark = ","))
# the format function by default assumes the 'big mark' will occur every three digits
Gaussian_KDE_Vars = density(Sales_Boot_Vars, bw = 110938, kernel = "gaussian")
# IQR/1.34 = 490723.1
# bandwidth from the rule-of-thumb formula h = 0.9 * min(sd, IQR/1.34) * n^(-1/5), giving h = 110937.7
lines(Gaussian_KDE_Vars,
lwd = 3,
col = "blue"
)
curve(dnorm(x, mean = 4953990, sd = 511182.4),
lwd = 3,
col = "red",
add = TRUE)
segments(
x0 = 5030833,
y0 = 0,
x1 = 5030833,
y1 = 0.0014,
col = "black",
lwd = 4,
lty = 3
)
legend("topleft",
legend = c("N (4,953,990, 511,182)", "Gaussian KDE (110,938)", "Initial Sample Variance"),
col = c("red","blue", "black"),
lwd = 3,
lty = c(1,1,3),
box.lwd = 3)

The variance of our initial sample is 5,030,833, which sits within the modal bin of the sampling distribution of sample variances. The mean of our bootstrap sampling distribution of variances is about 4,953,990, which is very close to the initial sample's variance given the scale of this data.
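A possible explanation for the small downward gap, offered as a side note rather than part of the original analysis: bootstrap resamples are drawn from the empirical distribution of the sample, whose variance is the "divide by n" version of var(Sales), so the bootstrap variances tend to center near ((n - 1) / n) * var(Sales) rather than var(Sales) itself.

# Side note (added): the bootstrap variances target the plug-in variance
# of the sample, which is slightly smaller than var(Sales)
var(Sales) * (length(Sales) - 1) / length(Sales)
# = 5030833 * 54 / 55, approximately 4,939,000 -- close to the bootstrap average of about 4,954,000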