Overview: In this lab exercise, you will experiment with drawing samples from a population.
Objectives: At the end of this lab you will be able to:
Create a subdirectory named Lab 3 in the PUBHBIO 2210 Labs directory you created in your OneDrive folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
lab-03-sampling-blank.html
lab-03-sampling-blank.Rmd
lab-03-sampling-worksheet-blank.docx
nhanesPopulation.csv
If you have not downloaded all of these files, do so now.
Save the four downloaded files in the PUBHBIO 2210 Labs/Lab 3 directory (i.e., the Lab 3 folder you just created).
When working on labs, it is important to keep all related files in the
same directory.
Change the author and date information in the lab header.
Replace 0 in the code chunk below with a number of your choosing. Choosing a different “seed” number than someone else makes the computer generate a different sequence of random numbers.
Choosing the same seed number makes the computer generate the same sequence of random numbers.
Choose a seed and keep it the same throughout this assignment so that your answers don’t change every time you knit the document, but do not intentionally choose the same number as a classmate.
# Change the number 0 to a number of your choosing
# before you proceed with this lab.
set.seed(101)
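The effect of the seed can be checked with a minimal base-R sketch. The seed values 2210 and 2211 and the object names first.draw, second.draw, and other.draw are arbitrary choices made for this illustration; they are not part of the lab files.

```r
# Setting the same seed twice reproduces the same "random" draws.
set.seed(2210)                 # hypothetical seed, chosen only for this sketch
first.draw <- runif(3)

set.seed(2210)                 # reset to the same seed
second.draw <- runif(3)

set.seed(2211)                 # a different seed
other.draw <- runif(3)

identical(first.draw, second.draw)   # TRUE: same seed, same sequence
identical(first.draw, other.draw)    # compare against the draws from a different seed
```

This is why knitting the document repeatedly with one fixed seed should give the same answers each time.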
To help understand random sampling from an infinite population, we will first randomly sample cases from a large dataset. (A “case” is one row in a dataset and refers to all the measurements for a given individual or experimental unit.) The large dataset, nhanesPopulation.csv, is a small set of variables from the larger NHANES dataset, from which our nhanesSubset.csv dataset was randomly sampled. The large dataset consists of 17,341 cases; remember that the subset we have been working with consists of only 100 cases.
In the code chunk below, load the large dataset (nhanesPopulation.csv) and store it in an object called nhanes.pop, convert it to a tibble, and then print the dataset (i.e., the object). Refer to Lab 1 (especially) and Lab 2 for code chunks showing how to do these steps.
# Enter code here
nhanes.pop <- read.csv("nhanesPopulation.csv")
nhanes.pop <- as_tibble(nhanes.pop)
print(nhanes.pop)
## # A tibble: 17,341 × 4
## id heightCm lead everSmoke
## <int> <dbl> <dbl> <int>
## 1 1 179. 5 1
## 2 2 162. 0.7 1
## 3 3 157. 2.7 0
## 4 4 177. 2 1
## 5 5 168. 2.3 0
## 6 6 178. 5.2 0
## 7 7 159 4.9 0
## 8 8 156. 1 1
## 9 9 164. 2.1 0
## 10 10 172. 1.7 0
## # ℹ 17,331 more rows
Some rows have missing data that would cause problems for us. To remove all rows with any missing data, we can use the na.omit() function, for example:
# Not evaluated
mydata <- mydata %>% na.omit()
In the code chunk below, remove rows with any missing data from the NHANES population dataset (nhanes.pop). Then, print the dataset.
# Enter code here
nhanes.pop <- nhanes.pop %>% na.omit()
print(nhanes.pop)
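To see what na.omit() does to individual rows, here is a small base-R sketch on a made-up data frame. The object toy and its values are hypothetical and only borrow column names from the lab dataset; calling na.omit(toy) directly is the same operation as piping toy into na.omit().

```r
# Toy data frame (hypothetical values) with a missing entry in two rows.
toy <- data.frame(
  id       = 1:4,
  heightCm = c(170.2, NA, 158.9, 181.0),
  lead     = c(2.1, 3.4, NA, 1.7)
)

toy.complete <- na.omit(toy)   # drops every row that has at least one NA

nrow(toy)            # 4 rows before
nrow(toy.complete)   # 2 rows after: ids 1 and 4 remain
toy.complete
```

Note that a row is dropped even if only one of its variables is missing.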
There are only four variables in this dataset: id, heightCm, lead, and everSmoke.
The everSmoke variable is categorical (binary) and takes the values 1 = "yes" and 0 = "no". However, today we will leave everSmoke as a numeric variable so that R will compute the mean for us.
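The reason a numeric 0/1 coding is convenient is that the mean of such a variable equals the proportion of 1s. A minimal base-R sketch, using a made-up vector toy.smoke (hypothetical values, not taken from the NHANES data):

```r
# Hypothetical 0/1 responses: 1 = "yes", 0 = "no"
toy.smoke <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0)

mean(toy.smoke)                             # 0.4
sum(toy.smoke == 1) / length(toy.smoke)     # 0.4, the proportion of 1s
```

So a mean of everSmoke can be read as the proportion of cases coded "yes".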
The sample() function randomly selects rows from a dataset and the head() function prints the first 6 rows of the dataset.
For example,
# Not evaluated
mydata.sample.30 <- mydata.pop %>% sample(size = 30)
head(mydata.sample.30)
draws a sample of size 30 from the dataset mydata.pop, stores it in an object called mydata.sample.30, and prints the first 6 rows of mydata.sample.30.
In the code chunk below, draw a random sample of size 10 from the NHANES population dataset (nhanes.pop), and print the first 6 rows of the (random sample) dataset.
# Enter code here
nhanes.pop.sample.10 <- nhanes.pop %>% sample(size=10)
head(nhanes.pop.sample.10)
## # A tibble: 6 × 5
## id heightCm lead everSmoke orig.id
## <int> <dbl> <dbl> <int> <chr>
## 1 2881 176. 2.3 0 2873
## 2 13768 164 2.6 1 13742
## 3 10360 157. 4.9 1 10335
## 4 16404 189 2.5 0 16375
## 5 5576 178. 1.7 0 5562
## 6 8021 158. 0.7 1 7997
Note that only 6 rows are printed, because head() shows the first 6 of the 10 sampled rows, and that a new column, orig.id, has been added. The new column is the row number in the nhanes.pop dataset from which each row in the nhanes.pop.sample.10 dataset came. Here, orig.id is equal to or slightly lower than the id variable, but this is not always the case. It is the case here because in nhanesPopulation.csv the id variable matches the row number, and we then deleted a few rows with missing data to make nhanes.pop.
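The relationship between id and the original row number can be reproduced with a base-R sketch. This is not the lab's sample() call on a data frame; it mimics the idea by sampling row positions directly. The objects toy.pop, picked, and toy.sample, the column orig.row, and the seed are all hypothetical names and values chosen for illustration.

```r
set.seed(1)  # arbitrary seed for this sketch

# Hypothetical population of 20 cases; three rows are then dropped to mimic
# rows that were removed because of missing data.
toy.pop <- data.frame(id = 1:20, lead = round(runif(20, 0.5, 6), 1))
toy.pop <- toy.pop[-c(3, 7, 12), ]

# Pick 5 row positions at random, without replacement.
picked <- sample(nrow(toy.pop), size = 5)
toy.sample <- toy.pop[picked, ]
toy.sample$orig.row <- picked   # position in toy.pop, analogous to orig.id

toy.sample
```

Because rows were deleted before sampling, orig.row can never exceed id in this toy example, which mirrors the pattern described for orig.id.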
Rather than using favstats(), we can use mean() to get just the mean of a variable.
For example,
# Not evaluated
mean( ~ age, data = mydata.sample.30)
would produce the mean for the variable age in mydata.sample.30 (a sample of size 30).
Using the code chunk below, find the mean of lead in your sample of size 10 generated from nhanes.pop.
# Enter code here
mean( ~ lead, data = nhanes.pop.sample.10)
## [1] 3.13
Rather than creating a new dataset for a sample and then finding means, we can create a sample and find the mean in a single step.
For example,
# Not evaluated
mean( ~ age, data = mydata.pop %>% sample(size = 100))
finds the mean of the variable age in a sample of size 100 from mydata.pop.
Now, in one step, create a sample of size 50 from nhanes.pop and find the mean of lead in that sample.
# Enter code here
mean(~ lead, data=nhanes.pop %>% sample(size=50))
## [1] 4.148
We can use the do() command to repeat something a given number of times.
For example,
# Not evaluated
many.samples.30 <- do(5) * sd( ~ age, data = mydata.pop %>% sample(size = 30))
computes the standard deviations (sd) of 5 random samples of size 30 from mydata.pop and stores the results in many.samples.30.
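For comparison, base R's replicate() follows the same repeat-and-collect pattern. The sketch below is an analogue on a simulated population, not the lab's do() code; toy.lead, many.sds.30, and the seed are hypothetical. One difference to keep in mind: replicate() returns a plain numeric vector here, whereas the do() results in this lab are printed as a dataset with one column.

```r
set.seed(42)  # arbitrary seed for this sketch

# Simulated, right-skewed "population" of 10,000 values (hypothetical).
toy.lead <- rexp(10000, rate = 1 / 4)

# Repeat 5 times: draw 30 values without replacement, then compute their sd.
many.sds.30 <- replicate(5, sd(sample(toy.lead, size = 30)))

many.sds.30          # a numeric vector with one sd per sample
length(many.sds.30)  # 5
```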
Previously, you calculated two sample means: one from a sample of size 10, and one from a sample of size 50.
Since each sample is drawn randomly from the population, the sample means are random variables as well.
Now, use the do() command to draw 100 samples of size 10 from nhanes.pop, compute the mean of lead in each, and store the results as many.samples.10 so we can examine the distribution of these sample means. Then, print the first few rows of this dataset using the head() function to see what it looks like.
# Enter code here
many.samples.10 <- do(100) * mean( ~ lead, data = nhanes.pop %>% sample(size=10))
head(many.samples.10)
## mean
## 1 3.58
## 2 3.53
## 3 4.79
## 4 4.79
## 5 4.17
## 6 4.98
Now, do the same for 100 random samples of size 50.
# Enter code here
many.samples.50 <- do(100) * mean( ~ lead, data = nhanes.pop %>% sample(size=50))
head(many.samples.50)
## mean
## 1 3.288
## 2 4.130
## 3 3.746
## 4 3.232
## 5 3.512
## 6 4.594
Create two histograms of sample means: one from samples of size 10 and one from samples of size 50. Note that each of the two datasets you generated contains only one variable, called mean. Also, recall how to create histograms from Lab 2.
NOTE: For all histograms created in this lab, if necessary, adjust the width of the “bins” using the binwidth argument to make the histograms look less noisy.
# Enter code here
gf_histogram(~ mean, data = many.samples.10, binwidth = 1, title = "Distribution of Sample Means")
gf_histogram(~ mean, data = many.samples.50, binwidth = 0.5, title = "Distribution of Sample Means")
Create a histogram and find the mean of lead in the population, not your sample. Recall how to create histograms from Lab 2.
# Enter code here
gf_histogram(~ lead, data = nhanes.pop, binwidth = 2, title = "Distribution of Lead in the Population")
mean(~ lead, data = nhanes.pop)
## [1] 3.926821
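What to look for when comparing the sample means with the population can be sketched in base R on simulated data. Everything here is an assumption made for illustration: toy.pop is a made-up skewed population rather than the NHANES data, and 1000 repetitions are used instead of the lab's 100 so that the pattern is more stable.

```r
set.seed(7)  # arbitrary seed for this sketch

toy.pop <- rexp(20000, rate = 1 / 4)   # hypothetical right-skewed population

# 1000 sample means for each sample size
means.10 <- replicate(1000, mean(sample(toy.pop, size = 10)))
means.50 <- replicate(1000, mean(sample(toy.pop, size = 50)))

mean(toy.pop)                        # population mean
c(mean(means.10), mean(means.50))    # centers of the two sets of sample means
c(sd(means.10), sd(means.50))        # spreads of the two sets of sample means
sd(toy.pop) / sqrt(c(10, 50))        # approximate standard errors, sigma / sqrt(n)
```

In a sketch like this, both sets of sample means tend to center near the population mean, and the means from the larger samples vary less.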
Draw 100 random samples of size 50 from the population and compute the sample means of everSmoke. Produce a histogram of the sample means, and compute the mean of those sample means. Also produce a histogram of everSmoke in the population, and compute the population mean.
# Enter code here
many.samples.smoke <- do(100) * mean(~ everSmoke, data = nhanes.pop %>% sample(size = 50))
gf_histogram(~ mean, data = many.samples.smoke, binwidth = 0.05, title = "Distribution of Sample Means for everSmoke")
mean(~ mean, data = many.samples.smoke)
## [1] 0.5188
gf_histogram(~ everSmoke, data = nhanes.pop, title = "Distribution of everSmoke in the Population")
mean(~ everSmoke, data = nhanes.pop)
## [1] 0.5122177
Draw 100 random samples of size 50 from the population and compute the sample means of heightCm. Produce a histogram of the sample means, and compute the mean of those sample means. Also produce a histogram of heightCm in the population.
# Enter code here
many.samples.height <- do(100) * mean(~ heightCm, data = nhanes.pop %>% sample(size = 50))
gf_histogram(~ mean, data = many.samples.height, binwidth = 1, title = "Distribution of Sample Means for heightCm")
mean(~ mean, data = many.samples.height)
## [1] 166.2264
gf_histogram(~ heightCm, data = nhanes.pop, binwidth = 5, title = "Distribution of heightCm in the Population")
mean(~ heightCm, data = nhanes.pop)
## [1] 166.3403
Please turn in your completed worksheet (DOCX, i.e., Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three (3) files before you click on the “Submit Assignment” tab to complete your submission.