Overview: In this lab exercise, you will learn how to create confidence intervals and explore the properties of these intervals.
Objectives: At the end of this lab you will be able to:
Many tasks and commands that were explained in the first lab will be used here with less direction. Refer to the first lab if more direction is needed.
Create a subdirectory named Lab 4
in the
PUBHBIO 2210 Labs
directory you created in your OneDrive
folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
lab-04-confints-blank.html
lab-04-confints-blank.Rmd
lab-04-confints-worksheet-blank.docx
nhanes.RData
If you have not downloaded all of these files, do so now.
Save the four downloaded files in the
PUBHBIO 2210 Labs/Lab 4
directory (i.e., save the
downloaded files in the Lab 4
directory or folder created).
When working on labs, it is important to keep all related files in the
same directory.
Change the author and date information in the lab header.
Replace 0
in the code chunk below with a number of
your choosing.
Choosing a different “seed” number than someone else makes the computer generate a different sequence of random numbers.
Choosing the same seed number makes the computer generate the same sequence of random numbers.
Choose a seed and keep it the same throughout this assignment so that your answers don’t change every time you knit the document, but do not intentionally choose the same number as a classmate.
# Here, ensure to change the number "0" to a number of
# your choosing before you proceed with this Lab.
set.seed(101)
In the code chunk below, load the nhanes.RData
file and
print it. Recall labs 1 and 2 on how to load a .RData
file
and print.
# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
## id race ethnicity sex age familySize urban region
## <int> <fct> <fct> <fct> <int> <int> <fct> <fct>
## 1 1 black not hisp… fema… 56 1 metr… midwe…
## 2 2 white not hisp… fema… 73 1 other west
## 3 3 white not hisp… fema… 25 2 metr… south
## 4 4 white mexican-… fema… 53 2 other south
## 5 5 white mexican-… fema… 68 2 other south
## 6 6 white not hisp… fema… 44 3 other west
## 7 7 black not hisp… fema… 28 2 metr… south
## 8 8 white not hisp… male 74 2 other midwe…
## 9 9 white not hisp… fema… 65 1 other north…
## 10 10 white other hi… fema… 61 3 metr… west
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## # maritalStatus <fct>, healthStatus <ord>,
## # heightInSelf <int>, weightLbSelf <int>, beer <int>,
## # wine <int>, liquor <int>, everSmoke <fct>,
## # smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## # weightKg <dbl>, heightCm <dbl>, waist <dbl>, …
First, we will make a frequency table using the tally()
function. In the code chunk below, make a frequency table for the
variable smokeNow
, which tells us if a subject is a smoker
or not. See lab 2 for help.
# Enter code here
tally(~ smokeNow, data = nhanes)
## smokeNow
## no yes <NA>
## 78 21 1
Note here that the proportions are computed based on n=99 (not 100) due to 1 missing value.
We now have an estimate of the true proportion of smokers in the
population, but would like to create a 95% confidence interval for this
proportion. This can be achieved using the binom.test()
funtion and reading the confidence interval output.
For example:
# Not evaluated
binom.test( ~ sex, data = mydata, conf.level = 0.80, ci.method = "Wilson", success="male")
creates a two-sided 80% confidence interval (using Wilson’s method)
for the true population proportion of males from the variable
sex
(i.e., success = “male”).
In the code chunk below, create a two-sided 95% confidence interval
(using Wilson’s method) for the population proportion of smokers (i.e.,
those who answered “yes”) from the variable smokeNow
.
nh
# Enter code here
binom.test( ~ smokeNow, data = nhanes, conf.level = 0.95, ci.method = "Wilson", success="yes")
##
## Exact binomial test (Score CI without continuity
## correction)
##
## data: nhanes$smokeNow [with success = yes]
## number of successes = 21, number of trials = 99,
## p-value = 6.882e-09
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.1431354 0.3026135
## sample estimates:
## probability of success
## 0.2121212
Now let us compute a two-sided 99% confidence interval for the estimate of the true proportion of smokers in the population.
In the code chunk below, create a two-sided 99% confidence interval
(using Wilson’s method) for the population proportion of smokers (i.e.,
those who answered “yes”) from the variable smokeNow
.
# Enter code here
binom.test( ~ smokeNow, data = nhanes, conf.level = 0.99, ci.method = "Wilson", success="yes")
##
## Exact binomial test (Score CI without continuity
## correction)
##
## data: nhanes$smokeNow [with success = yes]
## number of successes = 21, number of trials = 99,
## p-value = 6.882e-09
## alternative hypothesis: true probability of success is not equal to 0.5
## 99 percent confidence interval:
## 0.1261640 0.3342416
## sample estimates:
## probability of success
## 0.2121212
For continuous/numeric variables, we know that many different summary
measures can be calculated to summarize a “typical” value (a measure of
the center). Two continuous/numeric variables in the NHANES dataset that
were measured in subjects’ blood samples were their red blood cell count
(RBC
) and their blood lead concentration
(lead
).
The favstats()
command provides some useful summary
statistics. In the code chunk below, use favstats()
with
the formula ~ RBC
and ~ lead
, and data
nhanes
to compute summary statistics. See lab 2 for
help.
# Enter code here
favstats(~ RBC, data = nhanes)
## min Q1 median Q3 max mean sd n missing
## 3.43 4.285 4.59 4.985 6.15 4.618478 0.5013256 92 8
favstats(~ lead, data = nhanes)
## min Q1 median Q3 max mean sd n missing
## 0.7 2.075 3.05 4.725 13.3 3.693478 2.542855 92 8
In the code chunk below, generate histograms for RBC
and
lead
. See labs 2 and 3 for help.
NOTE: For all histograms created in this lab,
if necessary, adjust the width of “bins” using the
binwidth
argument to make the histograms look less
noisy.
# Enter code here
histogram(~ RBC, data = nhanes, binwidth = 0.1)
histogram(~ lead, data = nhanes, binwidth = 0.5)
With the obtained estimates of the true means of RBC
and
lead
in the population, we can create a (1 - alpha)%
confidence interval for true means in the population. The
t.test()
function enables us to obtain these confidence
intervals.
For example:
# Not evaluated
t.test( ~ age, data = nhanes, alternative = "two.sided", conf.level = 0.80)
creates a two-sided 80% confidence interval for the variable
age
in the nhanes
dataset.
In the code chunk below, create a two-sided 95% confidence interval
using the t.test()
function for the variables
RBC
and lead
in the nhanes
dataset.
# Enter code here
t.test(~ RBC, data = nhanes, alternative = "two.sided", conf.level = 0.95)
##
## One Sample t-test
##
## data: RBC
## t = 88.364, df = 91, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 4.514657 4.722300
## sample estimates:
## mean of x
## 4.618478
t.test(~ lead, data = nhanes, alternative = "two.sided", conf.level = 0.95)
##
## One Sample t-test
##
## data: lead
## t = 13.932, df = 91, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 3.166868 4.220089
## sample estimates:
## mean of x
## 3.693478
In the code chunk below, use favstats()
with the formula
~ RBC
and data nhanes
to compute summary
statistics. See lab 2 for help.
# Enter code here
favstats(~ RBC, data = nhanes)
## min Q1 median Q3 max mean sd n missing
## 3.43 4.285 4.59 4.985 6.15 4.618478 0.5013256 92 8
Recall that stardard error \(=
\frac{SD}{\sqrt{n}}\), where SD
is standard
deviation and n
is the sample size.
Now create a two-sided 90% confidence interval for the true means of
RBC
and lead
in the population.
In the code chunk below, create a two-sided 90% confidence interval
using the t.test()
function for the variables
RBC
and lead
in the nhanes
dataset.
# Enter code here
t.test(~ RBC, data = nhanes, alternative = "two.sided", conf.level = 0.90)
##
## One Sample t-test
##
## data: RBC
## t = 88.364, df = 91, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 4.531623 4.705334
## sample estimates:
## mean of x
## 4.618478
t.test(~ lead, data = nhanes, alternative = "two.sided", conf.level = 0.90)
##
## One Sample t-test
##
## data: lead
## t = 13.932, df = 91, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 3.252925 4.134032
## sample estimates:
## mean of x
## 3.693478
Notice that these 90% confidence intervals (CIs) are narrower than the 95% CIs you obtained earlier.
The confidence interval you created for the mean RBC level was made using the assumption that RBC values have a normal distribution in the population. An alternative method for creating confidence intervals is to use bootstrapping, also called the resampling method.
You already know from Question 8 that not all 100 people in the data
set have RBC
values – and they don’t all have
lead
values either. So before using R to
create a bootstrap confidence interval, you will need to tell
R to only use the subjects with values of
RBC
and lead
.
There are 8 subjects who are missing RBC and lead values and need to be excluded before proceeding. The following R lines of code would enable us to do these exclusions in the dataset.
For example:
# Not evaluated
bootstrap.data <- mydata %>% select(id, age) %>% na.omit()
creates a dataset (bootstrap.data
) without any subjects
with missing values for the selected variables, id
and
age
.
In the code chunk below, create a new dataset
bootstrap.data
without the subjects with missing
RBC
and lead
.
# Enter code here
bootstrap.data <- nhanes %>% select(RBC, lead) %>% na.omit()
In the code chunk below, use favstats()
with the formula
~ RBC
and ~lead
, and data
bootstrap.data
to compute summary statistics. See lab 2 for
help.
# Enter code here
favstats(~ RBC, data = bootstrap.data)
## min Q1 median Q3 max mean sd n missing
## 3.43 4.285 4.59 4.985 6.15 4.618478 0.5013256 92 0
favstats(~ lead, data = bootstrap.data)
## min Q1 median Q3 max mean sd n missing
## 0.7 2.075 3.05 4.725 13.3 3.693478 2.542855 92 0
Recall from Lab 3 that since samples are drawn randomly from the population, the sample means are random variables as well.
In Lab 3, we used the sample()
function to draw a sample
(without replacement) from a population.
Now we will use the resample()
function to draw a sample
(with replacement) from the sample.
For example,
# Not evaluated
bootstrap.data %>% resample()
draws a sample with replacement from bootstrap.data
of
the same size as bootstrap.data
.
Now we will draw many samples (with replacement) of each size so we can examine the distributions of these sample means.
For example
# Not evaluated
many.samples.height <- do(50) * mean( ~ height, data = bootstrap.data %>% resample())
head(many.samples.height)
computes the means for 50 samples drawn (with replacement) for height
(variable) from bootstrap.data
, each of the same size as
bootstrap.data
, stores the results as
many.samples.height
, and prints the first few (6) rows of
the results using the head()
function.
Similarly to what you did in lab 3, use the do()
function to compute the sample means of the variable RBC
from 1000 resampled versions of bootstrap.data
, and store
the results in an object called many.samples.RBC
. Then, use
head()
to print the first few (6) lines of the result.
# Enter code here
many.samples.RBC <- do(1000) * mean(~ RBC, data = bootstrap.data %>% resample())
head(many.samples.RBC)
## mean
## 1 4.587283
## 2 4.610761
## 3 4.734783
## 4 4.508696
## 5 4.587717
## 6 4.597826
Repeat the above with the variable lead
, storing the
results in many.samples.lead
.
# Enter code here
many.samples.lead <- do(1000) * mean(~ lead, data = bootstrap.data %>% resample())
head(many.samples.lead)
## mean
## 1 3.418478
## 2 3.604348
## 3 3.390217
## 4 3.516304
## 5 3.715217
## 6 4.007609
We are interested in the resampled (bootstrapped) values of the mean
(i.e., the sample means). In the code chunk below, use
favstats()
with the formula ~ mean
and data
many.samples.RBC
to compute summary statistics.
# Enter code here
favstats(~ mean, data = many.samples.RBC)
## min Q1 median Q3 max mean
## 4.441957 4.581712 4.61375 4.652772 4.782609 4.616899
## sd n missing
## 0.05198643 1000 0
Similarly, in the code chunk below, compute summary statistics for
the sample means in the data many.samples.lead
.
# Enter code here
favstats(~ mean, data = many.samples.lead)
## min Q1 median Q3 max mean
## 2.909783 3.501902 3.698913 3.871739 4.505435 3.693878
## sd n missing
## 0.265326 1000 0
Also, in the code chunk below, generate histograms for the sample
means of RBC
and lead
(separately). See labs 2
and 3 for help.
NOTE: For all histograms created in this lab,
if necessary, adjust the width of “bins” using the
binwidth
argument to make the histograms look less
noisy.
# Enter code here
histogram(~ mean, data = many.samples.RBC, binwidth = 0.01)
histogram(~ mean, data = many.samples.lead, binwidth = 0.05)
The bootstrap sample means can be used to construct a bootstrap
confidence interval. To compute a 80% (bootstrap) confidence interval
for the mean of RBC
, for example, we would want to find the
middle 80% of bootstrap sample means, i.e., MUST exclude the bottom 10%
(0.10) and top 10% (0.10).
We can do this by using the quantiles()
function to find
the 0.10 and 0.90 quantiles (the 10th and 90th percentiles), for
example,
# Not evaluated
quantile( ~ mean, data = many.samples.RBC, probs = c(0.10, 0.90))
Here, note that 0.90 minus 0.10 equals 0.80 (i.e., the desired 80% bootstrap confidence interval).
In the code chunk below, determine the 95% bootstrap confidence
interval for the means of RBC
and of lead
.
Here, note that you MUST exclude the same percentage at the bottom and
at the top, such that the difference in percentiles or quantiles will
equal 0.95 = 95%.
# Enter code here
quantile(~ mean, data = many.samples.RBC, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 4.512804 4.716359
quantile(~ mean, data = many.samples.lead, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 3.209701 4.242418
Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.