Week 2

1) Data Set Description

The data set reports, for countries around the world, the percentage of protein intake that comes from different types of food. The last few columns also include obesity and COVID-19 case counts as percentages of the total population, for comparison purposes. For my analysis I chose to focus on the variable AnimalProducts. Across countries, animal products account on average for 21.23% of total protein intake. The minimum value is 4.46%, and the maximum is 35.79%.

summary(protein$AnimalProducts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.456  14.461  21.853  21.232  28.299  35.786

2) Confidence Interval of the Mean of AnimalProducts

# Store the mean of the variable AnimalProducts
apMean <- mean(protein$AnimalProducts)
# Store the sample size
apN <- length(protein$AnimalProducts)
# Store the standard deviation
apSD <- sd(protein$AnimalProducts)
# Compute the standard error of the mean
apStandardError <- apSD / sqrt(apN)
alpha <- 0.05
degrees_of_freedom <- apN - 1
# Upper-tail critical value of the t distribution
t_score <- qt(p = alpha / 2, df = degrees_of_freedom, lower.tail = FALSE)
margin_error <- t_score * apStandardError
# Calculate the lower and upper bounds
lower_bound <- apMean - margin_error
upper_bound <- apMean + margin_error
# Print the confidence interval
print(c(lower_bound, upper_bound))

## [1] 20.03275 22.43156

We see that the approximate 95% confidence interval for the mean value of the variable AnimalProducts is (20.03, 22.43).
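As a sanity check on the manual calculation above, base R's t.test() returns the same t-based interval in its conf.int element. The sketch below uses a simulated vector x as a stand-in, since the protein data are not reproduced here; the vector, its size, and its parameters are assumptions for illustration only.

```r
# Hypothetical stand-in for protein$AnimalProducts (simulated, not the real data)
set.seed(1)
x <- rnorm(100, mean = 21, sd = 8)

# Manual t-based 95% confidence interval, following the steps above
n <- length(x)
se <- sd(x) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
manual_ci <- c(mean(x) - t_crit * se, mean(x) + t_crit * se)

# Built-in equivalent: one-sample t-test with a 95% confidence level
builtin_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(manual_ci)
print(builtin_ci)
```

With the real data, replacing x by protein$AnimalProducts should reproduce the interval computed by hand.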

3) Bootstrap Confidence Interval of the Mean of AnimalProducts

# We begin the bootstrap process with our dataset "protein".

sample.mean.vec = NULL      # define an empty vector to hold sample means of repeated samples.
for(i in 1:1000){           # for-loop to take repeated random samples with n = 81
  ith.sample = sample( protein$AnimalProducts,       # all AnimalProducts values in the data set
                       81,                      # sample size = 81 values in the sample
                       replace = FALSE          # sample without replacement
                 )                              # this is the i-th random sample
  sample.mean.vec[i] = mean(ith.sample)         # calculate the mean of i-th sample and save it in
                                                # the vector sample.mean.vec
}

# Draw the single sample of 81 values that the bootstrap will resample from
original.sample = sample(protein$AnimalProducts,
                         81,
                         replace = FALSE
)

### Bootstrap sampling begins 
bt.sample.mean.vec = NULL      # define an empty vector to hold the means of the bootstrap samples.
for(i in 1:1000){              # for-loop to take bootstrap samples with n = 81
  ith.bt.sample = sample( original.sample,    # original sample of 81 AnimalProducts values
                       81,                    # sample size = 81 MUST be equal to the original sample size!!
                       replace = TRUE         # MUST use WITH REPLACEMENT!!
                 )                            # this is the i-th bootstrap sample
  bt.sample.mean.vec[i] = mean(ith.bt.sample) # calculate the mean of i-th bootstrap sample and 
                                              # save it in the vector bt.sample.mean.vec
}

# We construct a 95% two-sided bootstrap percentile confidence interval for the mean percentage of dietary protein that comes from AnimalProducts

CI = quantile(bt.sample.mean.vec, c(0.025, 0.975))
CI

##     2.5%    97.5% 
## 19.68046 22.88922

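The bootstrap percentile method returns a 95% confidence interval for the mean of AnimalProducts. Because the samples are drawn at random and no seed is set, the bounds change from run to run: the run printed above gives (19.680, 22.889), while the run used in the discussion that follows gave (20.07198, 23.44660), or (20.072, 23.447) after rounding.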
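Because sample() draws at random, a bootstrap interval changes from run to run unless the random seed is fixed. A minimal sketch of a reproducible version follows; the helper name boot_mean_ci, the seed values, and the simulated vector x are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: percentile bootstrap CI for the mean of a numeric vector
boot_mean_ci <- function(x, B = 1000, level = 0.95) {
  # Resample x with replacement B times, keeping the same sample size each time
  boot_means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_means, c(alpha / 2, 1 - alpha / 2))
}

# Simulated stand-in for the original sample of 81 values (not the real data)
set.seed(2024)
x <- rnorm(81, mean = 21, sd = 8)

set.seed(123)               # fixing the seed makes the interval repeatable
ci_first <- boot_mean_ci(x)
set.seed(123)
ci_second <- boot_mean_ci(x)

print(ci_first)
print(identical(ci_first, ci_second))
```

Calling set.seed() once before the sampling code in this section would have the same effect of making the reported interval repeatable.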

4) Histogram of the Bootstrap Sampling Distribution of the Sample Mean

hist(bt.sample.mean.vec,                                         # data used for histogram
     breaks = 14,                                                # suggested number of bins (R may adjust it)
     xlab = "Bootstrap sample means",                            # change the label of x-axis
     main="Bootstrap Sampling Distribution \n of Sample Means")   # add a title to the histogram
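
To connect the histogram to the percentile interval, the interval bounds can be drawn on top of it with abline(). A small sketch, using a simulated vector in place of bt.sample.mean.vec (the stand-in values are an assumption, not the real bootstrap means):

```r
# Simulated stand-in for bt.sample.mean.vec (not the real bootstrap means)
set.seed(42)
boot_means <- rnorm(1000, mean = 21, sd = 0.85)

# 95% percentile bounds of the bootstrap distribution
ci <- quantile(boot_means, c(0.025, 0.975))

# Histogram with the two bounds marked as dashed vertical lines
h <- hist(boot_means,
          breaks = 14,
          xlab = "Bootstrap sample means",
          main = "Bootstrap Sampling Distribution \n of Sample Means")
abline(v = ci, lty = 2, col = "red")
```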

5) Compare and Contrast

The t-based method for constructing confidence intervals gave us the interval (20.03, 22.43), while the bootstrap method gave us (20.072, 23.447). The two intervals have similar lower bounds, but the bootstrap interval has a larger upper bound and is therefore wider. The bootstrap interval is the range between the 2.5th and 97.5th percentiles of the bootstrap sampling distribution of the sample mean. Note that it was built from a single sample of 81 values, while the t-based interval used every observation in the data set, which may account for part of the difference in width. The bootstrap method is only useful insofar as the original sample carries enough information to approximate the true population distribution. Drawing repeated samples from the population, when that is possible, is preferable to bootstrapping, because the bootstrap only approximates the sampling distribution by resampling a single sample. Based on the t-based interval, we can be 95% confident that the true population mean lies somewhere between 20.03 and 22.43. The same interpretation applies to the bootstrap confidence interval with its own bounds.
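
The comparison of the two intervals can be made concrete by computing their widths. The sketch below uses only bounds reported earlier in this report; the variable names are illustrative.

```r
# Bounds reported in this analysis
t_ci    <- c(20.03275, 22.43156)   # t-based interval
boot_ci <- c(20.07198, 23.44660)   # bootstrap percentile interval quoted in the text

# Width of each interval (upper bound minus lower bound)
t_width    <- diff(t_ci)
boot_width <- diff(boot_ci)

print(c(t = t_width, bootstrap = boot_width))
# Ratio of the widths: how much wider the bootstrap interval is
print(boot_width / t_width)
```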