Introduction

Overview: In this lab exercise, you will experiment with drawing samples from a population.

Objectives: At the end of this lab you will be able to:

Part 0: Download and organize files

Choosing the same seed number makes the computer generate the same sequence of random numbers.

Choose a seed and keep it the same throughout this assignment so that your answers don’t change every time you knit the document, but do not intentionally choose the same number as a classmate.

# Change the number inside set.seed() to a number of
# your choosing before you proceed with this lab.
set.seed(101)
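
As a quick illustration of why the seed matters, the following base R sketch sets the same seed twice and checks that the random draws are identical. The seed 2024 and the object names first.draw and second.draw are arbitrary choices made for this example only.

```r
# Illustrative only: 2024 is an arbitrary seed, not one you should copy
set.seed(2024)
first.draw <- runif(3)   # three random numbers between 0 and 1

set.seed(2024)           # reset to the same seed
second.draw <- runif(3)  # the same three numbers are generated again

identical(first.draw, second.draw)  # TRUE: same seed, same sequence
```

Without the second set.seed() call, second.draw would continue the random sequence and would almost surely differ from first.draw.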

Part 1: Drawing a sample from an existing dataset (a “population”)

To help understand random sampling from an infinite population, we will first randomly sample cases from a large dataset. (A “case” is one row in a dataset and refers to all the measurements for a given individual or experimental unit.) The large dataset, nhanesPopulation.csv, contains a small set of variables from the larger NHANES dataset, and our nhanesSubset.csv dataset was randomly sampled from it. The large dataset consists of 17,341 cases; remember that the subset we have been working with consists of only 100 cases.

In the code chunk below, load the large dataset (nhanesPopulation.csv) and store it in an object called nhanes.pop, convert it to a tibble, and then print the dataset (i.e., the object). Refer to Lab 1 (especially) and Lab 2 for code chunks showing how to do this.

# Enter code here
nhanes.pop <- read.csv("nhanesPopulation.csv")
nhanes.pop <- as_tibble(nhanes.pop)
print(nhanes.pop)
## # A tibble: 17,341 × 4
##       id heightCm  lead everSmoke
##    <int>    <dbl> <dbl>     <int>
##  1     1     179.   5           1
##  2     2     162.   0.7         1
##  3     3     157.   2.7         0
##  4     4     177.   2           1
##  5     5     168.   2.3         0
##  6     6     178.   5.2         0
##  7     7     159    4.9         0
##  8     8     156.   1           1
##  9     9     164.   2.1         0
## 10    10     172.   1.7         0
## # ℹ 17,331 more rows

Some rows have missing data that would cause problems for us. To remove all rows with any missing data, we can use the na.omit() function, for example:

# Not evaluated
mydata <- mydata %>% na.omit()
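
To see what na.omit() does, here is a minimal sketch on a made-up dataset. The object toy and its values are hypothetical and are not part of the NHANES data.

```r
# Hypothetical toy dataset with two missing values (NA)
toy <- data.frame(
  id     = 1:4,
  height = c(170.2, NA, 165.0, 180.5),
  lead   = c(2.1, 3.4, NA, 1.7)
)

# Keep only the rows that have no NA in any column
toy.complete <- na.omit(toy)

nrow(toy)           # 4 rows before
nrow(toy.complete)  # 2 rows after (ids 1 and 4 remain)
```

A row is dropped if even one of its variables is missing, so the rows for ids 2 and 3 are both removed.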

In the code chunk below, remove rows with any missing data from the NHANES population dataset (nhanes.pop). Then, print the dataset.

# Enter code here
nhanes.pop <- nhanes.pop %>% na.omit()
print(nhanes.pop)

There are only four variables in this dataset: id, heightCm, lead, and everSmoke. The everSmoke variable is categorical (binary) and takes the values 1 = "yes" and 0 = "no". However, today we will leave everSmoke as a numeric variable so that R will compute the mean for us.
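
The reason a numeric 0/1 coding is convenient is that the mean of such a variable equals the proportion of 1s. The short sketch below shows this with five made-up responses; ever.smoke is a hypothetical vector, not the NHANES variable.

```r
# Hypothetical 0/1 responses for five people (1 = "yes", 0 = "no")
ever.smoke <- c(1, 0, 0, 1, 1)

mean(ever.smoke)                      # 0.6
sum(ever.smoke) / length(ever.smoke)  # same value: 3 "yes" out of 5
```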

The sample() function randomly selects rows from a dataset and the head() function prints the first 6 rows of the dataset.

For example,

# Not evaluated
mydata.sample.30 <- mydata.pop %>% sample(size = 30)
head(mydata.sample.30)

draws a sample of size 30 from the dataset mydata.pop, stores it in an object called mydata.sample.30, and prints the first 6 rows of mydata.sample.30.
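
For intuition about what drawing rows at random involves, the sketch below does it by hand in base R: choose 30 row numbers at random without replacement, then keep those rows. It illustrates the idea and is not the internal code of the sample() function used in this lab. The dataset toy.pop and its columns are made up for the example.

```r
set.seed(1)  # arbitrary seed so the example is reproducible

# Hypothetical population of 200 cases
toy.pop <- data.frame(id = 1:200,
                      age = round(runif(200, min = 18, max = 80)))

# Choose 30 distinct row numbers at random, then keep those rows
rows <- sample(nrow(toy.pop), size = 30)
toy.sample.30 <- toy.pop[rows, ]

head(toy.sample.30)  # first 6 of the 30 sampled rows
nrow(toy.sample.30)  # 30
```

Because the row numbers are drawn without replacement, no case can appear in the sample more than once.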

In the code chunk below, draw a random sample of size 10 from the NHANES population dataset (nhanes.pop), and print the first 6 rows of the (random sample) dataset.

# Enter code here
nhanes.pop.sample.10 <- nhanes.pop %>% sample(size=10)
head(nhanes.pop.sample.10)
## # A tibble: 6 × 5
##      id heightCm  lead everSmoke orig.id
##   <int>    <dbl> <dbl>     <int> <chr>  
## 1  2881     176.   2.3         0 2873   
## 2 13768     164    2.6         1 13742  
## 3 10360     157.   4.9         1 10335  
## 4 16404     189    2.5         0 16375  
## 5  5576     178.   1.7         0 5562   
## 6  8021     158.   0.7         1 7997

Note that only 6 rows are printed, because head() shows just the first 6 rows of the 10 in the sample, and that a new column, orig.id, has been added. The new column gives the row number in the nhanes.pop dataset from which each row of nhanes.pop.sample.10 was drawn. Here, orig.id is equal to or slightly lower than the id variable, but this is not always the case. It is the case here because in nhanesPopulation.csv the id variable matches the row number, and we then deleted a few rows with missing data to make nhanes.pop.

Rather than using favstats(), we can use mean() to get just the mean of a variable.

For example,

# Not evaluated
mean( ~ age, data = mydata.sample.30)

would produce the mean for the variable age in mydata.sample.30 (sample of size 30).

Using the code chunk below, find the mean of lead in your sample of size 10 generated from nhanes.pop.

# Enter code here
mean( ~ lead, data = nhanes.pop.sample.10)
## [1] 3.13

STOP! Answer Question 1 now.

Rather than creating a new dataset for a sample and then finding means, we can create a sample and find the mean in a single step.

For example,

# Not evaluated
mean( ~ age, data = mydata.pop %>% sample(size = 100))

finds the mean of the variable age in a sample of size 100 from mydata.pop.

Now, in one step, create a sample of size 50 from nhanes.pop and find the mean of lead in that sample.

# Enter code here
mean(~ lead, data=nhanes.pop %>% sample(size=50))
## [1] 4.148

STOP! Answer Question 2 now.

Part 2: Describing the distribution of the sample mean.

We can use the do() command to repeat something a given number of times.

For example,

# Not evaluated
many.samples.30 <- do(5) * sd( ~ age, data = mydata.pop %>% sample(size = 30))

computes the standard deviations (sd) of 5 random samples of size 30 from mydata.pop and stores the results in many.samples.30.
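
The same repeat-and-collect idea can be sketched in base R with replicate(), which evaluates an expression a given number of times and collects the results. This is an analogy, not a description of how do() works internally, and toy.pop is a made-up population.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical population of 1,000 ages
toy.pop <- data.frame(age = round(runif(1000, min = 18, max = 80)))

# Repeat 5 times: draw 30 ages at random and compute their standard deviation
many.sds <- replicate(5, sd(sample(toy.pop$age, size = 30)))

many.sds          # a numeric vector of 5 standard deviations
length(many.sds)  # 5
```

One difference to keep in mind: replicate() returns a plain vector here, whereas the do() results in this lab are stored as a dataset with a named column.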

Previously, you calculated two sample means: one from a sample of size 10, and one from a sample of size 50.

Since each sample is drawn randomly from the population, the sample means are random variables as well.

Now, use the do() command to draw 100 samples of size 10 from nhanes.pop, compute the mean of lead in each sample, and store the results as many.samples.10 so we can examine the distribution of these sample means. Then, print the first few rows of this dataset using the head() function to see what it looks like.

# Enter code here
many.samples.10 <- do(100) * mean( ~ lead, data = nhanes.pop %>% sample(size=10))
head(many.samples.10)
##   mean
## 1 3.58
## 2 3.53
## 3 4.79
## 4 4.79
## 5 4.17
## 6 4.98

Now, do the same for 100 random samples of size 50.

# Enter code here
many.samples.50 <- do(100) * mean( ~ lead, data = nhanes.pop %>% sample(size=50))
head(many.samples.50)
##    mean
## 1 3.288
## 2 4.130
## 3 3.746
## 4 3.232
## 5 3.512
## 6 4.594

Create two histograms of sample means: one from samples of size 10 and one from samples of size 50. Note that each of the two datasets you generated contains only one variable, called mean. Also, recall how to create histograms from Lab 2.

NOTE: For all histograms created in this lab, if necessary, adjust the width of “bins” using the binwidth argument to make the histograms look less noisy.
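
To see how bin width changes a histogram, the base R sketch below bins the same made-up values twice, once with narrow bins and once with wide bins, and compares the number of bins. It uses hist() with plot = FALSE only to count the bins; the values in x are simulated and are not from the NHANES data.

```r
set.seed(1)  # arbitrary seed for reproducibility
x <- rnorm(100, mean = 4, sd = 1)  # hypothetical sample means

# Bin edges covering the data, with narrow (0.25) and wide (1) bins
lo <- floor(min(x))
hi <- ceiling(max(x))
narrow <- hist(x, breaks = seq(lo, hi, by = 0.25), plot = FALSE)
wide   <- hist(x, breaks = seq(lo, hi, by = 1),    plot = FALSE)

length(narrow$counts)  # many bins, few values in each
length(wide$counts)    # fewer bins, more values in each
```

Narrow bins hold few observations each, so their heights jump around from bin to bin; wider bins smooth this out at the cost of some detail.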

# Enter code here
gf_histogram(~ mean, data = many.samples.10, binwidth = 1, title = "Distribution of Sample Means")

gf_histogram(~ mean, data = many.samples.50, binwidth = 0.5, title = "Distribution of Sample Means")

STOP! Answer Questions 3 and 4 now.

Part 3: Describing the distribution of variables in the population.

Create a histogram and find the mean of lead in the population, not your sample. Recall how to create histograms from Lab 2.

# Enter code here
gf_histogram(~ lead, data = nhanes.pop, binwidth = 2, title = "Distribution of Lead in the Population")

mean(~ lead, data = nhanes.pop)
## [1] 3.926821

STOP! Answer Questions 5–7 now.

Part 4: Other variables

Draw 100 random samples of size 50 from the population and compute the sample means of everSmoke. Produce a histogram of the sample means, and compute the mean of those sample means. Also produce a histogram of everSmoke in the population, and compute the population mean.

# Enter code here
many.samples.smoke <- do(100) * mean(~ everSmoke, data = nhanes.pop %>% sample(size = 50))

gf_histogram(~ mean, data = many.samples.smoke, binwidth = 0.05, title = "Distribution of Sample Means for everSmoke")

mean(~ mean, data = many.samples.smoke)
## [1] 0.5188
gf_histogram(~ everSmoke, data = nhanes.pop, title = "Distribution of everSmoke in the Population")

mean(~ everSmoke, data = nhanes.pop)
## [1] 0.5122177

STOP! Answer Questions 8–11 now.

Draw 100 random samples of size 50 from the population and compute the sample means of heightCm. Produce a histogram of the sample means, and compute the mean of those sample means. Also produce a histogram of heightCm in the population, and compute the population mean.

# Enter code here
many.samples.height <- do(100) * mean(~ heightCm, data = nhanes.pop %>% sample(size = 50))

gf_histogram(~ mean, data = many.samples.height, binwidth = 1, title = "Distribution of Sample Means for heightCm")

mean(~ mean, data = many.samples.height)
## [1] 166.2264
gf_histogram(~ heightCm, data = nhanes.pop, binwidth = 5, title = "Distribution of heightCm in the Population")

mean(~ heightCm, data = nhanes.pop)
## [1] 166.3403

STOP! Answer Questions 12–15 now.

Please turn in your completed worksheet (DOCX, i.e., Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three (3) files before you click on the “Submit Assignment” tab to complete your submission.