I will compare confidence intervals for the mean computed with and without bootstrapping to decide which method is better for inference.
protein = read.csv("https://raw.githubusercontent.com/pengdsci/sta321/main/ww02/w02-Protein_Supply_Quantity_Data.csv", header = TRUE)
#head(protein)
#dim(protein)
The data set gives the percentage of protein intake from different foods around the world. I will analyze the Treenuts variable, which is the percentage of protein intake from tree nuts.
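orig.sample = sample(protein$Treenuts, #draw the original sample (named orig.sample so it does not mask sample())
170, #sample size
replace = FALSE) #no replacement
CI = quantile(orig.sample, c(0.025, 0.975)) #2.5th and 97.5th percentiles of the sampled values (spread of the data, not a CI of the mean)
CI #print the interval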
## 2.5% 97.5%
## 0.0000000 0.9948625
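For comparison, a conventional (non-bootstrap) confidence interval for the mean can be built from the t distribution rather than from percentiles of the raw values. The following is a minimal sketch under stated assumptions: `x` is a simulated stand-in vector, not the Treenuts data, and the names `x.bar`, `se`, `t.crit`, and `t.CI` are hypothetical.

```r
# Stand-in data (assumption): a skewed, non-negative vector playing the role of
# protein$Treenuts. Replace x with the real column once the data is loaded.
set.seed(123)
x <- rexp(170, rate = 4)

n      <- length(x)               # sample size
x.bar  <- mean(x)                 # sample mean
se     <- sd(x) / sqrt(n)         # standard error of the mean
t.crit <- qt(0.975, df = n - 1)   # critical value for a 95% interval

# 95% t-based confidence interval for the mean
t.CI <- c(lower = x.bar - t.crit * se, upper = x.bar + t.crit * se)
t.CI

# The built-in one-sample t test returns the same interval
t.test(x)$conf.int
```

The width of this interval shrinks as the sample size grows, which is the behavior expected of an interval that describes uncertainty about the mean.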
sample.mean.vec = NULL # empty vector for storing the bootstrap means
for(i in 1:1000){ #draw 1000 bootstrap samples, each of size 170
ith.sample = sample(protein$Treenuts, #draw the ith bootstrap sample
170, #sample size
replace = TRUE #WITH replacement, as the bootstrap requires
)
sample.mean.vec[i] = mean(ith.sample) #mean of ith sample saved in the empty vector
}
b.CI = quantile(sample.mean.vec, c(0.025, 0.975)) #confidence interval of the mean from bootstrapping
b.CI #printing bootstrap CI
## 2.5% 97.5%
## 0.2038416 0.2905259
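Because the bootstrap draws are random, the endpoints change slightly from run to run unless a seed is fixed. The following is a minimal sketch of the same percentile bootstrap written with `replicate()` and a fixed seed. It assumes a simulated stand-in vector `x` rather than the Treenuts data, and the seed value and the names `B`, `boot.means`, and `boot.CI` are hypothetical.

```r
# Fix the random number generator so the interval is reproducible
# (the seed value is an arbitrary choice).
set.seed(2024)

# Stand-in data (assumption): replace with protein$Treenuts in practice.
x <- rexp(170, rate = 4)

B <- 1000  # number of bootstrap samples

# Each replicate resamples x with replacement and records the mean
boot.means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

# 95% percentile bootstrap confidence interval for the mean
boot.CI <- quantile(boot.means, c(0.025, 0.975))
boot.CI
```

Using `replicate()` avoids growing a vector inside a loop, and rerunning the whole block with the same seed reproduces the same interval.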
hist(sample.mean.vec, #histogram of the bootstrap sample means
breaks = 20, #suggested number of bins
xlab = "Bootstrap sample means", #x axis label
main="Bootstrap Sampling Distribution \n of Sample Means") #title
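Without bootstrapping, the interval computed from the original sample is (0.0000000, 0.9948625). Note that these are the 2.5th and 97.5th percentiles of the individual Treenuts values, so the interval reflects the spread of the data rather than the uncertainty in the mean. Using the bootstrap method, the 95% confidence interval for the mean is (0.2038416, 0.2905259); the exact endpoints vary slightly between runs because no random seed is set. Because the bootstrap interval is much narrower and is built from the sampling distribution of the mean itself, it is the better choice for estimating the mean.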
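One way to see how the two kinds of interval relate is to compare their widths at several sample sizes. The following is a minimal sketch on simulated stand-in data (an assumption, not the Treenuts data); the sample sizes and the names `widths`, `data.range`, and `boot.width` are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# For each sample size, compare the width of the raw-data percentile range
# with the width of the percentile bootstrap interval for the mean.
widths <- sapply(c(50, 200, 800), function(n) {
  x <- rexp(n, rate = 4)  # stand-in skewed data

  # Width of the 2.5th to 97.5th percentile range of the raw values
  data.range <- diff(quantile(x, c(0.025, 0.975)))

  # Width of the 95% percentile bootstrap interval for the mean
  boot.means <- replicate(1000, mean(sample(x, n, replace = TRUE)))
  boot.width <- diff(quantile(boot.means, c(0.025, 0.975)))

  c(n = n, data = unname(data.range), boot = unname(boot.width))
})
widths
```

In this sketch the raw-data range stays roughly the same size as `n` grows, while the bootstrap interval for the mean narrows, which is why the two should not be compared as if they measured the same thing.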