Introduction

The data set used in this analysis contains the percentage of protein intake from different types of food in 170 countries around the world. The analysis compares confidence intervals obtained by random sampling without replacement and by sampling with replacement, otherwise known as bootstrap sampling. The variable analyzed is the population count of each of the 170 countries.

Analysis

The following analysis approximates the sampling distribution of the mean of the population variable in two ways, by repeated random sampling and by bootstrap resampling, and constructs a 95% confidence interval from each set of sample means. Finally, two histograms show the distribution of the sample means produced by each method.

Confidence Interval for Original Sample

sample.mean.vec = numeric(1000)  # numeric vector to hold the sample means of repeated samples
for(i in 1:1000){                # for-loop to take repeated random samples with n = 85
  ith.sample = sample(protein$Population,       # populations of all 170 countries
                      85,                       # sample size = 85 values in the sample
                      replace = FALSE           # sample without replacement
                      )                         # this is the i-th random sample
  sample.mean.vec[i] = mean(ith.sample)         # calculate the mean of the i-th sample and save it
                                                # in sample.mean.vec
}

CI_sample <- quantile(sample.mean.vec, c(.025, .975)) # 2.5th and 97.5th percentiles of the 1000 sample means (95% interval)
CI_sample

##     2.5%    97.5% 
## 23261108 65920964

Using 1,000 random samples of 85 countries drawn without replacement, the 2.5th and 97.5th percentiles of the sample means give a 95% confidence interval with a lower bound of 23,261,108 and an upper bound of 65,920,964.
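The same percentile approach can be written more compactly with replicate(). The following is a minimal sketch, not the code used to produce the interval above: toy.population is a simulated, right-skewed stand-in for protein$Population, and all object names are hypothetical.

```r
# Hypothetical stand-in for protein$Population: 170 simulated, right-skewed values.
set.seed(1)
toy.population <- rlnorm(170, meanlog = 16, sdlog = 1.5)

# Draw 1000 samples of size 85 without replacement and keep the mean of each.
toy.means <- replicate(1000, mean(sample(toy.population, 85, replace = FALSE)))

# The middle 95% of the simulated sample means.
toy.interval <- quantile(toy.means, c(0.025, 0.975))
toy.interval
```

Because replicate() returns a numeric vector of the requested length, no empty vector has to be defined before the loop.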

Confidence Interval for Bootstrapped Sample

original.sample = sample(protein$Population,    # populations of all 170 countries
                         85,                    # sample size = 85 values in the sample
                         replace = FALSE        # sample without replacement
                         )                      # the single original sample

bt.sample.mean.vec = numeric(1000)  # numeric vector to hold the bootstrap sample means
for(i in 1:1000){                   # for-loop to take bootstrap samples with n = 85
  ith.bt.sample = sample(original.sample,       # original sample of 85 countries' populations
                         85,                    # resample size MUST equal the original sample size
                         replace = TRUE         # MUST sample WITH replacement
                         )                      # this is the i-th bootstrap sample
  bt.sample.mean.vec[i] = mean(ith.bt.sample)   # calculate the mean of the i-th bootstrap sample
                                                # and save it in bt.sample.mean.vec
}

CI_bt = quantile(bt.sample.mean.vec, c(0.025, 0.975)) #95% confidence interval for bootstrapped sample means
CI_bt

##      2.5%     97.5% 
##  18268782 106659037

Using 1,000 bootstrap samples drawn with replacement from a single original sample of 85 countries, the 2.5th and 97.5th percentiles of the bootstrapped sample means give a 95% confidence interval with a lower bound of 18,268,782 and an upper bound of 106,659,037.
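The bootstrap procedure can also be wrapped in a small reusable function. This is a sketch under stated assumptions: bootstrap_mean_ci is a hypothetical helper that is not part of the original analysis, and toy.sample is simulated data standing in for the original sample of 85 populations.

```r
# Hypothetical helper: percentile bootstrap interval for the mean of a numeric vector x.
bootstrap_mean_ci <- function(x, B = 1000, level = 0.95) {
  alpha <- 1 - level
  # Each resample has the same size as x and is drawn with replacement.
  boot.means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))
  quantile(boot.means, c(alpha / 2, 1 - alpha / 2))
}

set.seed(2)
toy.sample <- rlnorm(85, meanlog = 16, sdlog = 1.5)  # simulated stand-in data
toy.ci <- bootstrap_mean_ci(toy.sample)
toy.ci
```

Keeping the resample size tied to length(x) inside the function enforces the rule that each bootstrap sample must be the same size as the original sample.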

Histogram of Sample Means from the Sampling Distribution

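library(ggplot2)  # ggplot(), geom_histogram()
library(ggpubr)   # ggarrange(); assumed source, since the original does not show which package was loaded

# ggplot2 needs data frames, so coerce both vectors of sample means
sample.mean.vec <- as.data.frame(sample.mean.vec)
bt.sample.mean.vec <- as.data.frame(bt.sample.mean.vec)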

x1 <- ggplot(sample.mean.vec, aes(x=sample.mean.vec)) + 
                geom_histogram(fill = "lightblue", 
                                color = "darkblue") + 
                theme_minimal() + 
                labs(title = "Approximated Sampling Distribution \n of Sample Means", 
                      x = "Sample Means of Repeated Samples \n (Original)", 
                      y = "Frequency")

x2 <- ggplot(bt.sample.mean.vec, aes(x=bt.sample.mean.vec)) + 
                geom_histogram(fill = "gold2", 
                               color = "goldenrod4") + 
                theme_minimal() + 
                labs(title = "Approximated Sampling Distribution \n of Bootstrapped Sample Means", 
                     x = "Sample Means of Repeated Samples \n (Bootstrapped)", 
                     y = "")

figure <- ggarrange(x1, x2, nrow = 1, ncol = 2) #plot both histograms in the same graphic

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

figure

Conclusion

In this analysis two sampling methods were used: repeated random sampling without replacement and bootstrap sampling with replacement. For each method, the means of 1,000 samples were calculated to approximate the sampling distribution of the sample mean, and a 95% confidence interval was constructed from the 2.5th and 97.5th percentiles of those means. Random sampling yielded a confidence interval with the following lower and upper bounds: [23,261,108, 65,920,964]. Bootstrap sampling yielded the following confidence interval: [18,268,782, 106,659,037].

Comparing the two intervals, the confidence interval from random sampling without replacement is contained within the bootstrapped confidence interval; that is, it is a subset of the bootstrap interval. The bootstrap interval is wider: 95% of the bootstrapped sample means fall between 18,268,782 and 106,659,037. The interval from random sampling is much tighter, meaning that the bootstrapped sample means vary more than the means of samples drawn without replacement. This is expected, because each sample drawn without replacement covers 85 of the 170 countries, half of the population, which limits how far its mean can stray from the population mean, whereas every bootstrap sample is resampled from a single sample of 85 countries and inherits whatever extreme values that sample happens to contain.
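One way to quantify how tightly sample means cluster when 85 of 170 values are drawn without replacement is the finite population correction: for a simple random sample of size n from a population of size N, the standard error of the sample mean is sqrt(S^2 / n * (1 - n / N)), where S^2 is the population variance computed with an N - 1 denominator. The sketch below compares that formula with simulation. It uses a simulated stand-in population rather than the protein data, and all object names are hypothetical.

```r
set.seed(3)
N <- 170
n <- 85
toy.population <- rlnorm(N, meanlog = 16, sdlog = 1)  # simulated, not the protein data

# Theoretical standard error with the finite population correction (1 - n / N).
# var() uses the N - 1 denominator, which is what this formula expects.
se.fpc <- sqrt(var(toy.population) / n * (1 - n / N))

# Simulated standard error: spread of 5000 sample means drawn without replacement.
sim.means <- replicate(5000, mean(sample(toy.population, n, replace = FALSE)))
se.sim <- sd(sim.means)

c(theory = se.fpc, simulation = se.sim)
```

With n / N = 0.5, the correction halves the variance of the sample mean relative to S^2 / n, which is consistent with the narrower interval obtained from sampling without replacement.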

Assignment 1

Angelo Saporito

2023-02-03