Problem 1
For this problem, we are asked to summarize and compare Asymptotic
and Bootstrap Sampling distributions. We are asked to outline the key
assumptions, practical applications, and discuss when and why to use one
over the other.
Asymptotic and bootstrap sampling distributions are both approximations
to the sampling distribution of a statistic, such as a sample mean or
standard deviation. Both are used when the population distribution is
unknown and the exact sampling distribution cannot be derived, or when it
is not possible to gather more data from the population.
Asymptotic Distributions
Asymptotic distributions rely on large-sample results such as the central
limit theorem. The central limit theorem says that if we take a random
sample from a population with finite variance, then as we increase the
sample size (gather more data points), the distribution of the
standardized sample mean approaches a normal distribution, regardless of
the shape of the population. Related results extend this to many other
statistics that behave like smooth functions of sample averages.
Asymptotic distributions are generally used when the sample size is
“large”, or when the population is assumed to follow a distribution we
are familiar with (normal, chi-square, etc.) for which the sampling
distribution of the statistic is known.
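As an illustrative aside (not part of the original assignment), the convergence described by the central limit theorem can be sketched with a small simulation. The exponential population, the sample sizes, the number of repetitions, and the seed are all assumptions chosen only for illustration:

```r
# Illustrative simulation; every constant below is an assumed value.
set.seed(1)                      # arbitrary seed, only for reproducibility

n_reps <- 5000                   # number of simulated samples per sample size
sizes  <- c(5, 30, 200)          # sample sizes to compare

# Exponential(rate = 1) has mean 1 and standard deviation 1,
# and is strongly right-skewed.
standardized_means <- lapply(sizes, function(n) {
  means <- replicate(n_reps, mean(rexp(n, rate = 1)))
  sqrt(n) * (means - 1) / 1      # standardize: sqrt(n) * (mean - mu) / sigma
})
names(standardized_means) <- paste0("n=", sizes)

# Moment-based skewness of each set of standardized means.
# A normal distribution has skewness 0, so values should shrink as n grows.
skewness  <- function(z) mean((z - mean(z))^3) / sd(z)^3
skew_by_n <- sapply(standardized_means, skewness)
print(round(skew_by_n, 2))
```

If the theorem applies, the printed skewness should move toward 0 as the sample size increases, even though the underlying population is far from normal.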
Bootstrapped Distributions
Bootstrapping is a nonparametric resampling method in which samples
are taken from a sample. In other words, we treat the observed sample as
our “population” and repeatedly draw samples of the same size from it
with replacement. From there, we compute the statistic of interest on
each of these new samples and use the resulting values to approximate its
sampling distribution.
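Bootstrapping is generally used when the sampling distribution of the
statistic is hard to derive, when the original sample is too small to
trust an asymptotic approximation, or when the data are more complex.
There is one similarity with asymptotic distributions: for statistics
such as the sample mean, the bootstrap sampling distribution is also
approximately normal when the original sample is reasonably large. Taking
more bootstrap samples makes the approximation smoother, but it does not
change the shape the distribution settles on.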
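The resampling procedure described above can be sketched as a small reusable function. This is only a minimal sketch: the helper name `bootstrap_stat`, the toy data, the seed, and the number of replicates are all hypothetical and not taken from the assignment.

```r
# Hypothetical helper: bootstrap_stat() is an illustrative name.
bootstrap_stat <- function(x, stat, B = 1000) {
  n <- length(x)
  # Each replicate: resample n values with replacement, then apply the statistic.
  replicate(B, stat(sample(x, size = n, replace = TRUE)))
}

set.seed(2)                                        # arbitrary seed for reproducibility
toy <- c(3.1, 4.7, 2.8, 5.0, 3.9, 4.2, 6.1, 3.3)   # made-up data for illustration
boot_medians <- bootstrap_stat(toy, median, B = 2000)

sd(boot_medians)                         # bootstrap standard error of the median
quantile(boot_medians, c(0.025, 0.975))  # percentile interval for the median
```

The median is used here because its sampling distribution has no simple formula, which is the kind of situation where the bootstrap is most convenient.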
Problem 2
For this problem, we are interested in analyzing sample means and
variances via bootstrapping. To do this, we will use a data set that
consists of the volume of coffee sold daily at two different coffee
shops:
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)
Part A
I believe that we can use the central limit theorem to derive the
asymptotic distribution of the sample mean because our sample size
(n = 55) exceeds the common rule-of-thumb threshold of 30.
length(coffee)
[1] 55
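For illustration, the normal approximation implied by the central limit theorem can be written as a short helper. The function name `asymptotic_mean_ci` and the toy values are hypothetical. Using the values reported in this document (mean 5250, variance 5,030,833, n = 55), the standard error would be about sqrt(5030833 / 55) ≈ 302, which gives a 95% interval of roughly 5250 ± 593.

```r
# Hypothetical helper: asymptotic_mean_ci() is an illustrative name.
asymptotic_mean_ci <- function(x, level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                 # estimated standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)      # normal critical value, about 1.96 for 95%
  c(mean  = mean(x),
    se    = se,
    lower = mean(x) - z * se,
    upper = mean(x) + z * se)
}

# Made-up values for illustration; in this report the call
# would be asymptotic_mean_ci(coffee).
toy <- c(2850, 3200, 7800, 8100, 3300, 4000, 4200, 7700)
asymptotic_mean_ci(toy)
```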
Part B
For this part, we will draw bootstrap samples and calculate the sample
mean of each. The means will then be plotted in a histogram.
Afterwards we will generate and plot a kernel density estimate for the
sample means.
Let’s start by generating and plotting the bootstrapped sampling
distribution for the mean. For the sampling distribution, 1000 samples
will be used. We will also plot a kernel density curve on the plot:
original.sample = sample(coffee,          # the 55 observed daily coffee volumes
                         55,              # keep all 55 values
                         replace = FALSE  # without replacement, so this only shuffles the data
                         )
### Bootstrap sampling begins
bt.sample.mean.vec = numeric(1000)        # vector to hold the 1000 bootstrap sample means
for(i in 1:1000){                         # take 1000 bootstrap samples with n = 55
  ith.bt.sample = sample(x = original.sample, # original sample of 55 daily coffee volumes
                         size = 55,           # MUST be equal to the original sample size
                         replace = TRUE       # MUST sample WITH replacement
                         )                    # this is the i-th bootstrap sample
  bt.sample.mean.vec[i] = mean(ith.bt.sample) # mean of the i-th bootstrap sample,
                                              # saved in bt.sample.mean.vec
}
###
hist(bt.sample.mean.vec,                  # bootstrap sample means
     breaks = 14,                         # suggested number of bins
     probability = TRUE,                  # density scale, so the curve can be overlaid
     xlab = "Bootstrapped Sample Means",  # label of x-axis
     main = "Bootstrap Sampling Distribution of Sample Means",  # title of the histogram
     cex.main = 0.9, col = "lightblue"
     )
# overlay the kernel density estimate for the bootstrap sample means
lines(density(bt.sample.mean.vec, n = 1000), col = "red", lwd = 2)
The histogram shape looks fairly normal, although slightly skewed.
The kernel density (in red) looks rough, but does resemble a normal
curve. Let’s print a summary of the kernel density estimate separately
and compare:
density(bt.sample.mean.vec, n = 1000)
Call:
        density.default(x = bt.sample.mean.vec, n = 1000)
Data: bt.sample.mean.vec (1000 obs.);   Bandwidth 'bw' = 69.4
       x              y
 Min.   :4103   Min.   :6.470e-08
 1st Qu.:4667   1st Qu.:2.124e-05
 Median :5231   Median :2.768e-04
 Mean   :5231   Mean   :4.426e-04
 3rd Qu.:5796   3rd Qu.:8.520e-04
 Max.   :6360   Max.   :1.268e-03
mean(coffee)
[1] 5250
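The actual sample mean is 5250. The center of the kernel density
summary changes every time the code is rerun, but it stays relatively
close to the sample mean (usually between 5220 and 5280). Note that the x
column summarizes the grid on which the density is evaluated, so its mean
(5231 here) is the midpoint of that grid; a more direct bootstrap
estimate is mean(bt.sample.mean.vec). Either way, the bootstrap
distribution is centered near the sample mean, so it seems to do a good
job of estimating it.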
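As a possible complement to reading the density summary, the bootstrap replicates can be summarized directly by their mean, standard deviation, and percentile interval. The helper name `summarize_bootstrap`, the toy values, and the seed below are hypothetical and serve only as a sketch.

```r
# Hypothetical helper: summarize_bootstrap() is an illustrative name.
summarize_bootstrap <- function(boot_stats, level = 0.95) {
  alpha <- 1 - level
  ci <- quantile(boot_stats, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  c(center = mean(boot_stats),   # bootstrap estimate of the statistic
    se     = sd(boot_stats),     # bootstrap standard error
    lower  = ci[1],              # percentile interval, lower bound
    upper  = ci[2])              # percentile interval, upper bound
}

set.seed(3)                                               # arbitrary seed
toy <- c(2850, 3200, 7800, 8100, 3300, 4000, 4200, 7700)  # made-up values
boot_means <- replicate(1000, mean(sample(toy, replace = TRUE)))
summarize_bootstrap(boot_means)
```

In this report the equivalent call would be `summarize_bootstrap(bt.sample.mean.vec)`, and the resulting standard error could be compared with the asymptotic value `sd(coffee) / sqrt(55)`.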
Part C
For this part, we will repeat what we did in parts A and B to
estimate the sample variance.
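I still think that the central limit theorem can apply to the sample
variance, since it is also built from averages, although the convergence
is typically slower than for the mean and requires a finite fourth
moment. Our sample size isn’t changing (n = 55), and the number of
bootstrap samples we will take won’t change either (1000 samples).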
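For reference, the standard large-sample result for the sample variance (assuming independent observations with a finite fourth moment) is that sqrt(n) * (s^2 - sigma^2) is approximately normal with mean 0 and variance mu4 - sigma^4, where mu4 is the fourth central moment. A minimal plug-in sketch of the resulting standard error follows; the helper name `asymptotic_var_se` and the toy values are hypothetical.

```r
# Hypothetical helper: asymptotic_var_se() is an illustrative name.
asymptotic_var_se <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)                 # plug-in second central moment (divides by n)
  m4 <- mean(d^4)                 # plug-in fourth central moment
  sqrt((m4 - m2^2) / n)           # approximate standard error of the sample variance
}

# Made-up values for illustration; in this report the call
# would be asymptotic_var_se(coffee).
toy <- c(2850, 3200, 7800, 8100, 3300, 4000, 4200, 7700)
c(variance = var(toy), approx_se = asymptotic_var_se(toy))
```

The plug-in moments divide by n rather than n - 1, which keeps the quantity under the square root non-negative; this is a design choice of the sketch, not something specified in the assignment.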
Now for the bootstrapped samples and the kernel density estimate:
original.sample = sample(coffee,          # the 55 observed daily coffee volumes
                         55,              # keep all 55 values
                         replace = FALSE  # without replacement, so this only shuffles the data
                         )
### Bootstrap sampling begins
bt.sample.var.vec = numeric(1000)         # vector to hold the 1000 bootstrap sample variances
for(i in 1:1000){                         # take 1000 bootstrap samples with n = 55
  ith.bt.sample = sample(x = original.sample, # original sample of 55 daily coffee volumes
                         size = 55,           # MUST be equal to the original sample size
                         replace = TRUE       # MUST sample WITH replacement
                         )                    # this is the i-th bootstrap sample
  bt.sample.var.vec[i] = var(ith.bt.sample)   # variance of the i-th bootstrap sample,
                                              # saved in bt.sample.var.vec
}
###
hist(bt.sample.var.vec,                   # bootstrap sample variances
     breaks = 14,                         # suggested number of bins
     probability = TRUE,                  # density scale, so the curve can be overlaid
     xlab = "Bootstrap Sample Variance",  # label of x-axis
     main = "Bootstrap Sampling Distribution of Sample Variances",  # title of the histogram
     cex.main = 0.9,
     col.main = "navy", col = "lightblue")
# overlay the kernel density estimate for the bootstrap sample variances
lines(density(bt.sample.var.vec), col = "red", lwd = 2)
Although it resembles a normal distribution, there seems to be
noticeably more skew than in the sampling distribution of the mean. I
believe some of the apparent skew may come from how the histogram is
binned, so I don’t think the skew is as bad as it looks.
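One way to check the skew without relying on the appearance of the histogram is to compute a moment-based skewness for each bootstrap distribution. The sketch below uses a hypothetical helper `sample_skewness`, made-up data, and an arbitrary seed; it is not part of the original analysis.

```r
# Hypothetical helper: sample_skewness() is an illustrative name.
sample_skewness <- function(z) {
  d <- z - mean(z)
  mean(d^3) / mean(d^2)^1.5     # equals 0 for a perfectly symmetric set of values
}

set.seed(5)                                               # arbitrary seed
toy <- c(2850, 3200, 7800, 8100, 3300, 4000, 4200, 7700)  # made-up values
boot_means <- replicate(1000, mean(sample(toy, replace = TRUE)))
boot_vars  <- replicate(1000, var(sample(toy, replace = TRUE)))

c(mean_skew = sample_skewness(boot_means),
  var_skew  = sample_skewness(boot_vars))
```

In this report the comparison would be between `sample_skewness(bt.sample.mean.vec)` and `sample_skewness(bt.sample.var.vec)`; a value further from 0 for the variances would support the impression of extra skew.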
density(bt.sample.var.vec, n = 1000)
Call:
        density.default(x = bt.sample.var.vec, n = 1000)
Data: bt.sample.var.vec (1000 obs.);    Bandwidth 'bw' = 1.245e+05
       x                 y
 Min.   :1909410   Min.   :3.570e-11
 1st Qu.:3075557   1st Qu.:5.538e-09
 Median :4241705   Median :7.790e-08
 Mean   :4241705   Mean   :2.142e-07
 3rd Qu.:5407852   3rd Qu.:3.824e-07
 Max.   :6573999   Max.   :7.415e-07
var(coffee)
[1] 5030833
The actual sample variance is 5,030,833. The center of the kernel
density summary usually falls between 4,200,000 and 4,400,000, and in
rare cases it may exceed 4,400,000. That figure is the midpoint of the
grid on which the density is evaluated rather than a direct estimate,
which helps explain the gap. If we keep running the code, the peak of the
histogram usually floats around the 5,000,000 mark, so I think we did a
good enough job estimating the sample variance as well.
