1 Problem 1

For this problem, we are asked to summarize and compare Asymptotic and Bootstrap Sampling distributions. We are asked to outline the key assumptions, practical applications, and discuss when and why to use one over the other.

Asymptotic and bootstrap sampling distributions both approximate the sampling distribution of a statistic, such as a sample mean or standard deviation: that is, how the statistic would vary from one sample to the next. Both are useful when the distribution of the population is unknown and collecting more data is not possible.

1.1 Asymptotic Distributions

Asymptotic distributions rely on large-sample results such as the central limit theorem. The central limit theorem says that if we take a random sample from a population with finite variance, then as we increase the sample size (gather more data points), the distribution of the standardized sample mean approaches a normal distribution. Similar large-sample results hold for many other statistics that are smooth functions of sample averages, but not for every function of the sample.
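To make the central limit theorem concrete, here is a small simulation sketch. The Exp(1) population, the seed, the sample sizes, and the names skew, n.vec, n.rep, and sim.skew are illustrative assumptions and are not part of the assignment. It tracks the skewness of the simulated sample means, which should move toward 0 (the skewness of a normal distribution) as n grows:

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n.vec = c(5, 30, 200)             # illustrative sample sizes
n.rep = 2000                      # number of simulated samples per sample size

# moment-based skewness, used here as a rough measure of non-normality
skew = function(x) mean((x - mean(x))^3) / sd(x)^3

sim.skew = sapply(n.vec, function(n) {
  # draw n.rep samples of size n from a skewed Exp(1) population, keep each mean
  sim.means = replicate(n.rep, mean(rexp(n, rate = 1)))
  skew(sim.means)
})
names(sim.skew) = paste0("n=", n.vec)
sim.skew                          # skewness of the sample means for each n
```

Even though the population is strongly skewed, the sample means become more symmetric as the sample size increases.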

Asymptotic distributions are generally used if the sample size is “large”, or when the statistic is known to follow (at least approximately) a distribution we are familiar with, like a normal, chi-square, etc.

1.2 Bootstrapped Distributions

Bootstrapping is a nonparametric resampling method in which new samples are drawn from the observed sample. In other words, we treat the sample we have as our “population” and randomly sample from it with replacement, drawing as many observations as the original sample contains. From there, we compute the statistic of interest on each of these new samples and use the resulting values to approximate its sampling distribution.
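The resampling step described above can be sketched in a few lines. The data in toy.sample are made up for illustration, and the names one.resample, B, and bt.medians are hypothetical; the median is used only to show that the same recipe works for statistics other than the mean:

```r
set.seed(2)                                   # arbitrary seed
toy.sample = c(12, 15, 9, 20, 14, 11, 18)     # hypothetical observed sample (n = 7)

# one bootstrap resample: same size as the original, drawn WITH replacement
one.resample = sample(toy.sample, size = length(toy.sample), replace = TRUE)

# repeat B times and keep the statistic of interest from each resample
B = 500
bt.medians = replicate(B, median(sample(toy.sample,
                                        size = length(toy.sample),
                                        replace = TRUE)))

sd(bt.medians)    # bootstrap estimate of the standard error of the median
```

Every resample contains only values from the original sample, usually with some repeated and some left out.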

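Bootstrapping is generally used when the original sample is small or when the data are more complex, so that no convenient asymptotic formula is available. There is one connection with asymptotic distributions: when a statistic is asymptotically normal (as the sample mean is), its bootstrap sampling distribution will also look approximately normal once the original sample is large enough. Taking more bootstrap samples does not by itself make the distribution normal; it only reduces the simulation noise in the approximation.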

2 Problem 2

For this problem, we are interested in analyzing sample means and variances via bootstrapping. To do this, we will use a data set that consists of the volume of coffee sold daily at two different coffee shops:

coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)

2.1 Part A

I believe that we can use the central limit theorem to derive the asymptotic distribution of the sample mean: treating the daily volumes as a random sample from a population with finite variance, our sample size (n = 55) is above the common rule-of-thumb threshold of 30.

length(coffee)
[1] 55
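Under that assumption, the asymptotic distribution of the sample mean is approximately normal, centered at the population mean with standard error s / sqrt(n). A minimal sketch of the plug-in version follows; the names n, x.bar, se.asym, and ci.asym are introduced here for illustration only:

```r
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)

n       = length(coffee)            # sample size
x.bar   = mean(coffee)              # sample mean, plug-in estimate of the center
se.asym = sd(coffee) / sqrt(n)      # CLT-based standard error of the sample mean

# approximate 95% confidence interval from the normal approximation
ci.asym = x.bar + c(-1, 1) * qnorm(0.975) * se.asym

se.asym
ci.asym
```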

2.2 Part B

For this part, we will do the bootstrap samples and calculate the sample means for each. They will then be plotted via a histogram. Afterwards we will generate and plot a kernel density estimate for the sample means.

Let’s start by generating and plotting the bootstrapped sampling distribution for the mean. For the sampling distribution, 1000 samples will be used. We will also plot a kernel density curve on the plot:

original.sample = sample(coffee,          # the 55 observed daily coffee sales
                         55,              # sample size = 55 (all observations)
                         replace = FALSE  # without replacement: this only reorders coffee
                         )
### Bootstrap sampling begins 
bt.sample.mean.vec = NULL      # define an empty vector to hold the means of the bootstrap samples
for(i in 1:1000){              # for-loop to take 1000 bootstrap samples with n = 55
  ith.bt.sample = sample(x = original.sample, # original sample of 55 daily coffee sales
                         size = 55,           # MUST be equal to the original sample size!!
                         replace = TRUE       # MUST use WITH REPLACEMENT!!
                         )                    # this is the i-th bootstrap sample
  bt.sample.mean.vec[i] = mean(ith.bt.sample) # calculate the mean of the i-th bootstrap sample and
                                              # save it in the vector bt.sample.mean.vec
}
###

hist(bt.sample.mean.vec,                     # bootstrap sample means
     breaks = 14,                            # suggested number of bins
     probability = TRUE,                     # density scale, so the curve can be overlaid
     xlab = "Bootstrapped Sample Means",     # change the label of x-axis
     # add a title to the histogram
     main = "Bootstrap Sampling Distribution of Sample Means",
     cex.main = 0.9, col = "lightblue"
     )

lines(density(bt.sample.mean.vec, n = 1000), col = "red", lwd = 2)

# overlays the kernel density estimate for the sample mean

The histogram shape looks roughly normal, although slightly skewed. The kernel density estimate (in red) looks rough, but does resemble a normal curve. Let’s print the density object separately and compare it with the sample mean:

density(bt.sample.mean.vec, n = 1000)

Call:
    density.default(x = bt.sample.mean.vec, n = 1000)

Data: bt.sample.mean.vec (1000 obs.);   Bandwidth 'bw' = 69.4

       x              y            
 Min.   :4103   Min.   :6.470e-08  
 1st Qu.:4667   1st Qu.:2.124e-05  
 Median :5231   Median :2.768e-04  
 Mean   :5231   Mean   :4.426e-04  
 3rd Qu.:5796   3rd Qu.:8.520e-04  
 Max.   :6360   Max.   :1.268e-03  
mean(coffee)
[1] 5250

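The actual sample mean is 5250. Note that the x column in the density() summary describes the grid of points at which the density is evaluated, so its "Mean" of 5231 is the midpoint of that grid rather than an estimate of the mean. That midpoint changes every time the code is rerun, but it stays relatively close to the sample mean (usually between 5220 and 5280), which indicates that the bootstrap distribution is centered near 5250. The bootstrap therefore seems to do a good job of reproducing the sample mean.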
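A more direct check is to summarize the bootstrap means themselves instead of reading the density() summary. The sketch below regenerates the bootstrap distribution with replicate() so that it runs on its own; the seed and the names bt.means, bt.center, bt.se, and ci.boot are assumptions made for illustration:

```r
set.seed(123)                     # arbitrary seed, for reproducibility only
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)

# 1000 bootstrap means, each from a with-replacement resample of size 55
bt.means = replicate(1000, mean(sample(coffee, size = length(coffee), replace = TRUE)))

bt.center = mean(bt.means)                       # center of the bootstrap distribution
bt.se     = sd(bt.means)                         # bootstrap standard error of the mean
ci.boot   = quantile(bt.means, c(0.025, 0.975))  # 95% percentile interval

bt.center
bt.se
ci.boot
sd(coffee) / sqrt(length(coffee))                # CLT-based standard error, for comparison
```

If the normal approximation is reasonable, the bootstrap standard error and the CLT-based standard error should be of similar size.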

2.3 Part C

For this part, we will repeat what we did in parts A and B to estimate the sample variance.

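I still think that an asymptotic normal approximation can be used for the sample variance. The sample variance is also built from averages, so a central-limit-type result applies to it provided the population has a finite fourth moment, although the convergence is typically slower than for the mean. Our sample size is unchanged (n = 55), and the number of bootstrap samples we will take is unchanged as well (1000).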
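For reference, the standard large-sample result is that sqrt(n) * (s^2 - sigma^2) is approximately normal with mean 0 and variance mu4 - sigma^4, where mu4 is the fourth central moment of the population. A plug-in sketch of the resulting standard error is shown below; the names m2, m4, and se.var.asym are introduced here for illustration, and using the divide-by-n moments is a choice of this sketch:

```r
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)

n  = length(coffee)
m2 = mean((coffee - mean(coffee))^2)   # plug-in second central moment (divides by n)
m4 = mean((coffee - mean(coffee))^4)   # plug-in fourth central moment

# asymptotic standard error of the sample variance: sqrt((mu4 - sigma^4) / n)
se.var.asym = sqrt((m4 - m2^2) / n)
se.var.asym
```

This value can be compared with the spread of the bootstrap variances to judge how well the two approaches agree.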

Now for the bootstrapped samples and the kernel density estimate:

original.sample = sample(coffee,          # the 55 observed daily coffee sales
                         55,              # sample size = 55 (all observations)
                         replace = FALSE  # without replacement: this only reorders coffee
                         )
### Bootstrap sampling begins 
bt.sample.var.vec = NULL       # define an empty vector to hold the variances of the bootstrap samples
for(i in 1:1000){              # for-loop to take 1000 bootstrap samples with n = 55
  ith.bt.sample = sample(x = original.sample, # original sample of 55 daily coffee sales
                         size = 55,           # MUST be equal to the original sample size!!
                         replace = TRUE       # MUST use WITH REPLACEMENT!!
                         )                    # this is the i-th bootstrap sample
  bt.sample.var.vec[i] = var(ith.bt.sample)   # calculate the variance of the i-th bootstrap sample and
                                              # save it in the vector bt.sample.var.vec
}
###
hist(bt.sample.var.vec,                      # data used for histogram
     breaks = 14,                            # suggested number of bins
     probability = TRUE,
     xlab = "Bootstrap Sample Variance",     # change the label of x-axis
     # add a title to the histogram
     main = "Bootstrap Sampling Distribution of Sample Variances",
     cex.main = 0.9,
     col.main = "navy", col = "lightblue")
lines(density(bt.sample.var.vec), col = "red", lwd = 2)

Although it resembles a normal distribution, there is noticeably more skew than in the sampling distribution of the mean. Some of the apparent skew may come from how the histogram is binned and drawn, so I don’t think the skew is as bad as it looks.

density(bt.sample.var.vec, n = 1000)

Call:
    density.default(x = bt.sample.var.vec, n = 1000)

Data: bt.sample.var.vec (1000 obs.);    Bandwidth 'bw' = 1.245e+05

       x                 y            
 Min.   :1909410   Min.   :3.570e-11  
 1st Qu.:3075557   1st Qu.:5.538e-09  
 Median :4241705   Median :7.790e-08  
 Mean   :4241705   Mean   :2.142e-07  
 3rd Qu.:5407852   3rd Qu.:3.824e-07  
 Max.   :6573999   Max.   :7.415e-07  
var(coffee)
[1] 5030833

The actual sample variance is 5,030,833. The "Mean" in the x column of the density() summary (4,241,705 here) is the midpoint of the grid on which the kernel density is evaluated, not an estimate of the variance; it usually falls between 4,200,000 and 4,400,000 and only rarely exceeds 4,400,000. However, if we keep running the code, the peaks of the generated histograms usually sit around the 5,000,000 mark. So I think we did a good enough job estimating the sample variance as well.
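As with the mean, the bootstrap variances can be summarized directly. The following self-contained sketch uses replicate() in place of the for-loop; the seed and the names bt.vars, bt.var.center, bt.var.se, and ci.var.boot are hypothetical:

```r
set.seed(456)                     # arbitrary seed, for reproducibility only
coffee <- c(2850, 3200, 2900, 3100, 2950, 7800, 8100, 7900, 3300, 3050, 4000, 4200, 3150, 3400, 7700, 8200, 3250, 4400, 3100, 4200, 4500, 4800, 4300, 8500, 8200, 8900, 8700, 3250, 3000, 4600, 4100, 8400, 8800, 3350, 4700, 3100, 8100, 3050, 8300, 4100, 3100, 8300, 8900, 8200, 4400, 4500, 3250, 4600, 8400, 3300, 4200, 4500, 4800, 4300, 8500)

# 1000 bootstrap variances, each from a with-replacement resample of size 55
bt.vars = replicate(1000, var(sample(coffee, size = length(coffee), replace = TRUE)))

bt.var.center = mean(bt.vars)                       # center of the bootstrap distribution
bt.var.se     = sd(bt.vars)                         # bootstrap standard error of the variance
ci.var.boot   = quantile(bt.vars, c(0.025, 0.975))  # 95% percentile interval

bt.var.center
bt.var.se
ci.var.boot
var(coffee)                                         # sample variance, for comparison
```

The percentile interval makes no normality assumption, which is useful here because the bootstrap distribution of the variance is visibly skewed.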

