Introduction

Overview: In this lab exercise, you will learn how to create confidence intervals and explore the properties of these intervals.

Objectives: At the end of this lab you will be able to:

Part 0: Download and organize files

Many tasks and commands that were explained in the first lab will be used here with less direction. Refer to the first lab if more direction is needed.

Choosing a different “seed” number than someone else makes the computer generate a different sequence of random numbers.

Choosing the same seed number makes the computer generate the same sequence of random numbers.

Choose a seed and keep it the same throughout this assignment so that your answers don’t change every time you knit the document, but do not intentionally choose the same number as a classmate.

# Here, ensure to change the number "0" to a number of  
# your choosing before you proceed with this Lab.
set.seed(101)

Part 1: Creating confidence intervals for a proportion

In the code chunk below, load the nhanes.RData file and print it. Recall labs 1 and 2 on how to load a .RData file and print.

# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
##       id race  ethnicity sex     age familySize urban region
##    <int> <fct> <fct>     <fct> <int>      <int> <fct> <fct> 
##  1     1 black not hisp… fema…    56          1 metr… midwe…
##  2     2 white not hisp… fema…    73          1 other west  
##  3     3 white not hisp… fema…    25          2 metr… south 
##  4     4 white mexican-… fema…    53          2 other south 
##  5     5 white mexican-… fema…    68          2 other south 
##  6     6 white not hisp… fema…    44          3 other west  
##  7     7 black not hisp… fema…    28          2 metr… south 
##  8     8 white not hisp… male     74          2 other midwe…
##  9     9 white not hisp… fema…    65          1 other north…
## 10    10 white other hi… fema…    61          3 metr… west  
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## #   maritalStatus <fct>, healthStatus <ord>,
## #   heightInSelf <int>, weightLbSelf <int>, beer <int>,
## #   wine <int>, liquor <int>, everSmoke <fct>,
## #   smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## #   weightKg <dbl>, heightCm <dbl>, waist <dbl>, …

First, we will make a frequency table using the tally() function. In the code chunk below, make a frequency table for the variable smokeNow, which tells us if a subject is a smoker or not. See lab 2 for help.

# Enter code here
tally(~ smokeNow, data = nhanes)
## smokeNow
##   no  yes <NA> 
##   78   21    1

Note here that the proportions are computed based on n=99 (not 100) due to 1 missing value.

STOP! Answer Question 1 now.

We now have an estimate of the true proportion of smokers in the population, but would like to create a 95% confidence interval for this proportion. This can be achieved using the binom.test() funtion and reading the confidence interval output.

For example:

# Not evaluated
binom.test( ~ sex, data = mydata, conf.level = 0.80, ci.method = "Wilson", success="male")

creates a two-sided 80% confidence interval (using Wilson’s method) for the true population proportion of males from the variable sex (i.e., success = “male”).

In the code chunk below, create a two-sided 95% confidence interval (using Wilson’s method) for the population proportion of smokers (i.e., those who answered “yes”) from the variable smokeNow . nh

# Enter code here
binom.test( ~ smokeNow, data = nhanes, conf.level = 0.95, ci.method = "Wilson", success="yes")
## 
##  Exact binomial test (Score CI without continuity
##  correction)
## 
## data:  nhanes$smokeNow  [with success = yes]
## number of successes = 21, number of trials = 99,
## p-value = 6.882e-09
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1431354 0.3026135
## sample estimates:
## probability of success 
##              0.2121212

STOP! Answer Questions 2–4 now.

Now let us compute a two-sided 99% confidence interval for the estimate of the true proportion of smokers in the population.

In the code chunk below, create a two-sided 99% confidence interval (using Wilson’s method) for the population proportion of smokers (i.e., those who answered “yes”) from the variable smokeNow.

# Enter code here
binom.test( ~ smokeNow, data = nhanes, conf.level = 0.99, ci.method = "Wilson", success="yes")
## 
##  Exact binomial test (Score CI without continuity
##  correction)
## 
## data:  nhanes$smokeNow  [with success = yes]
## number of successes = 21, number of trials = 99,
## p-value = 6.882e-09
## alternative hypothesis: true probability of success is not equal to 0.5
## 99 percent confidence interval:
##  0.1261640 0.3342416
## sample estimates:
## probability of success 
##              0.2121212

STOP! Answer Question 5 now.

Part 2: Creating confidence intervals for a mean.

For continuous/numeric variables, we know that many different summary measures can be calculated to summarize a “typical” value (a measure of the center). Two continuous/numeric variables in the NHANES dataset that were measured in subjects’ blood samples were their red blood cell count (RBC) and their blood lead concentration (lead).

The favstats() command provides some useful summary statistics. In the code chunk below, use favstats() with the formula ~ RBC and ~ lead, and data nhanes to compute summary statistics. See lab 2 for help.

# Enter code here
favstats(~ RBC, data = nhanes)
##   min    Q1 median    Q3  max     mean        sd  n missing
##  3.43 4.285   4.59 4.985 6.15 4.618478 0.5013256 92       8
favstats(~ lead, data = nhanes)
##  min    Q1 median    Q3  max     mean       sd  n missing
##  0.7 2.075   3.05 4.725 13.3 3.693478 2.542855 92       8

In the code chunk below, generate histograms for RBC and lead. See labs 2 and 3 for help.

NOTE: For all histograms created in this lab, if necessary, adjust the width of “bins” using the binwidth argument to make the histograms look less noisy.

# Enter code here
histogram(~ RBC, data = nhanes, binwidth = 0.1)

histogram(~ lead, data = nhanes, binwidth = 0.5)

STOP! Answer Question 6 now.

With the obtained estimates of the true means of RBC and lead in the population, we can create a (1 - alpha)% confidence interval for true means in the population. The t.test() function enables us to obtain these confidence intervals.

For example:

# Not evaluated
t.test( ~ age, data = nhanes, alternative = "two.sided", conf.level = 0.80)

creates a two-sided 80% confidence interval for the variable age in the nhanes dataset.

In the code chunk below, create a two-sided 95% confidence interval using the t.test() function for the variables RBC and lead in the nhanes dataset.

# Enter code here
t.test(~ RBC, data = nhanes, alternative = "two.sided", conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  RBC
## t = 88.364, df = 91, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  4.514657 4.722300
## sample estimates:
## mean of x 
##  4.618478
t.test(~ lead, data = nhanes, alternative = "two.sided", conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  lead
## t = 13.932, df = 91, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.166868 4.220089
## sample estimates:
## mean of x 
##  3.693478

STOP! Answer Questions 7–8 now.

In the code chunk below, use favstats() with the formula ~ RBC and data nhanes to compute summary statistics. See lab 2 for help.

# Enter code here
favstats(~ RBC, data = nhanes)
##   min    Q1 median    Q3  max     mean        sd  n missing
##  3.43 4.285   4.59 4.985 6.15 4.618478 0.5013256 92       8

Recall that stardard error \(= \frac{SD}{\sqrt{n}}\), where SD is standard deviation and n is the sample size.

STOP! Answer Question 9 now.

Now create a two-sided 90% confidence interval for the true means of RBC and lead in the population.

In the code chunk below, create a two-sided 90% confidence interval using the t.test()function for the variables RBC and lead in the nhanes dataset.

# Enter code here
t.test(~ RBC, data = nhanes, alternative = "two.sided", conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  RBC
## t = 88.364, df = 91, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  4.531623 4.705334
## sample estimates:
## mean of x 
##  4.618478
t.test(~ lead, data = nhanes, alternative = "two.sided", conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  lead
## t = 13.932, df = 91, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  3.252925 4.134032
## sample estimates:
## mean of x 
##  3.693478

Notice that these 90% confidence intervals (CIs) are narrower than the 95% CIs you obtained earlier.

Part 3: Creating confidence intervals for a mean using resampling methods.

The confidence interval you created for the mean RBC level was made using the assumption that RBC values have a normal distribution in the population. An alternative method for creating confidence intervals is to use bootstrapping, also called the resampling method.

You already know from Question 8 that not all 100 people in the data set have RBC values – and they don’t all have lead values either. So before using R to create a bootstrap confidence interval, you will need to tell R to only use the subjects with values of RBC and lead.

There are 8 subjects who are missing RBC and lead values and need to be excluded before proceeding. The following R lines of code would enable us to do these exclusions in the dataset.

For example:

# Not evaluated
bootstrap.data <- mydata %>% select(id, age) %>% na.omit()

creates a dataset (bootstrap.data) without any subjects with missing values for the selected variables, id and age.

In the code chunk below, create a new dataset bootstrap.data without the subjects with missing RBC and lead.

# Enter code here
bootstrap.data <- nhanes %>% select(RBC, lead) %>% na.omit()

In the code chunk below, use favstats() with the formula ~ RBC and ~lead, and data bootstrap.data to compute summary statistics. See lab 2 for help.

# Enter code here
favstats(~ RBC, data = bootstrap.data)
##   min    Q1 median    Q3  max     mean        sd  n missing
##  3.43 4.285   4.59 4.985 6.15 4.618478 0.5013256 92       0
favstats(~ lead, data = bootstrap.data)
##  min    Q1 median    Q3  max     mean       sd  n missing
##  0.7 2.075   3.05 4.725 13.3 3.693478 2.542855 92       0

Recall from Lab 3 that since samples are drawn randomly from the population, the sample means are random variables as well.

In Lab 3, we used the sample() function to draw a sample (without replacement) from a population.

Now we will use the resample() function to draw a sample (with replacement) from the sample.

For example,

# Not evaluated
bootstrap.data %>% resample()

draws a sample with replacement from bootstrap.data of the same size as bootstrap.data.

Now we will draw many samples (with replacement) of each size so we can examine the distributions of these sample means.

For example

# Not evaluated
many.samples.height <- do(50) * mean( ~ height, data = bootstrap.data %>% resample())
head(many.samples.height)

computes the means for 50 samples drawn (with replacement) for height (variable) from bootstrap.data, each of the same size as bootstrap.data, stores the results as many.samples.height, and prints the first few (6) rows of the results using the head() function.

Similarly to what you did in lab 3, use the do() function to compute the sample means of the variable RBC from 1000 resampled versions of bootstrap.data, and store the results in an object called many.samples.RBC. Then, use head() to print the first few (6) lines of the result.

# Enter code here
many.samples.RBC <- do(1000) * mean(~ RBC, data = bootstrap.data %>% resample())
head(many.samples.RBC)
##       mean
## 1 4.587283
## 2 4.610761
## 3 4.734783
## 4 4.508696
## 5 4.587717
## 6 4.597826

Repeat the above with the variable lead, storing the results in many.samples.lead.

# Enter code here
many.samples.lead <- do(1000) * mean(~ lead, data = bootstrap.data %>% resample())
head(many.samples.lead)
##       mean
## 1 3.418478
## 2 3.604348
## 3 3.390217
## 4 3.516304
## 5 3.715217
## 6 4.007609

We are interested in the resampled (bootstrapped) values of the mean (i.e., the sample means). In the code chunk below, use favstats() with the formula ~ mean and data many.samples.RBC to compute summary statistics.

# Enter code here
favstats(~ mean, data = many.samples.RBC)
##       min       Q1  median       Q3      max     mean
##  4.441957 4.581712 4.61375 4.652772 4.782609 4.616899
##          sd    n missing
##  0.05198643 1000       0

Similarly, in the code chunk below, compute summary statistics for the sample means in the data many.samples.lead.

# Enter code here
favstats(~ mean, data = many.samples.lead)
##       min       Q1   median       Q3      max     mean
##  2.909783 3.501902 3.698913 3.871739 4.505435 3.693878
##        sd    n missing
##  0.265326 1000       0

STOP! Answer Question 10 now.

Also, in the code chunk below, generate histograms for the sample means of RBC and lead (separately). See labs 2 and 3 for help.

NOTE: For all histograms created in this lab, if necessary, adjust the width of “bins” using the binwidth argument to make the histograms look less noisy.

# Enter code here
histogram(~ mean, data = many.samples.RBC, binwidth = 0.01)

histogram(~ mean, data = many.samples.lead, binwidth = 0.05)

STOP! Answer Question 11 now.

The bootstrap sample means can be used to construct a bootstrap confidence interval. To compute a 80% (bootstrap) confidence interval for the mean of RBC, for example, we would want to find the middle 80% of bootstrap sample means, i.e., MUST exclude the bottom 10% (0.10) and top 10% (0.10).

We can do this by using the quantiles() function to find the 0.10 and 0.90 quantiles (the 10th and 90th percentiles), for example,

# Not evaluated
quantile( ~ mean, data = many.samples.RBC, probs = c(0.10, 0.90))

Here, note that 0.90 minus 0.10 equals 0.80 (i.e., the desired 80% bootstrap confidence interval).

In the code chunk below, determine the 95% bootstrap confidence interval for the means of RBC and of lead. Here, note that you MUST exclude the same percentage at the bottom and at the top, such that the difference in percentiles or quantiles will equal 0.95 = 95%.

# Enter code here
quantile(~ mean, data = many.samples.RBC, probs = c(0.025, 0.975))
##     2.5%    97.5% 
## 4.512804 4.716359
quantile(~ mean, data = many.samples.lead, probs = c(0.025, 0.975))
##     2.5%    97.5% 
## 3.209701 4.242418

STOP! Answer Questions 12–14 now.

Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.