library(tidyverse)
rm(list=ls())Sampling Simulations
In class, we covered the central limit theorem and the law of large numbers. This exercise will demonstrate these concepts using a sample of coin flips from a fair coin.
General housekeeping items
Let’s begin by opening libraries and clearing the environment:
Simulate and plot data on coin flips
Imagine that 1,000 students on campus are each given a fair coin. Each student is asked to flip the coin 10 times and record the number of heads and tails (0=tails and 1=heads). In expectation, half of the flips should be tails and half of the flips should be heads.
We are going to simulate this scenario. Let’s store the number of coin flips per person in n_tries and the number of people flipping coins in n_samples. Also, let’s store a list of possible outcomes 0 = tails and 1 = heads. We are going to change these later.
n_tries <- 10
n_samples <- 1000
outcomes <- c(0,1)Next, we will build a simulation using a “for loop” where each person in n_samples flips a coin n_tries times. Each flip represents a random draw from a table called “values”, which in this case contains values 0 or 1 (where 0 represents tails and 1 represents heads). The results of each sample are appended to a dataset called “sim_data.”
values <- tibble(value = outcomes)
set.seed(42)
for (s in 1:n_samples) {
if (s == 1) {
sim_data <- data.frame(value = slice_sample(values, replace = T, n = n_tries),
sample = s)
} else {
sim_data <- bind_rows(sim_data,
data.frame(value = slice_sample(values, replace = T, n = n_tries),
sample = s))
}
print(s)
}We should have a new dataset called “sim_data” containing 2 variables: value - the outcome of each coin flip, and sample - an indicator for each person flipping a coin. Each observation represents a coin flip so we should have n_tries x n_samples total observations. Take a moment to look through the dataset.
Wrangle and plot the flips data
Next, lets wrangle the flips data such that we get one observation per person and a variable containing the average value of each person’s coin flips.
averages <- sim_data %>%
group_by(sample) %>%
summarize(average_value = mean(value))If you think about it, what we have now is the average proportion of heads for a whole bunch of individual samples. This is like a “sample of samples.” Next, let’s visualize the data to investigate the distribution of the sample averages:
ggplot(averages, aes(x=average_value)) +
geom_bar() +
scale_x_continuous(breaks=seq(0,1,0.1), limits = c(-0.1,1.1)) +
labs(title = 'Disribution of Means',
x = 'Average Values',
y = 'Count',
caption = paste("Note: Tails = 0 and Heads = 1. Each sample consists of",n_tries,"coin flips.", "There are",n_samples,"samples in total.")
)Notice that the most common averages fall around 0.5, but many (most!) are not exactly 0.5. In fact, some coin flippers had averages far away from 0.5 (e.g., 2 tails and 8 heads in 10 flips). However, as we get farther away from the “true mean” of 0.5, the instances become more rare… more precisely, the distribution of sample averages follows a normal distribution. This is the central limit theorem at work!
- Explore the law of large numbers by changing the value of n_tries and rerun the simulation. What happens to the distribution? Why?
- Change the values in n_samples and rerun the simulation. What happens to the distribution? Why?
- Advanced exercise: Can you alter the program above to simulate the outcomes from rolling two 6-sided dice?
Imagine that you were one of the coin flippers who flipped 9 heads and 1 tail. Suppose this is all the information that you were allowed to collect on your coin. What might you conclude about the “fairness” of your coin? Is your conclusion “reasonable” and are you “correct”?
These questions and concepts will come up later in the course.
Simulate sampling from the diamonds dataset
The example above may seem a little abstract. Let’s see these principles in action using the diamonds dataset. Here, let’s imagine that 1,000 auditors each take an independent sample from the diamonds dataset. Each auditor takes a sample of 30 diamonds and is asked to determine the average carat weight.
As above, let’s store our number of samples, sample size, and other information in objects (so that we can make easy changes later).
n_samples <- 1000
sample_size <- 30
sample_variable <- sym('carat')Next, let’s take a look at summary statistics and the distribution of carat weight.
diamonds %>%
select(sample_variable) %>%
summary()Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(sample_variable)
# Now:
data %>% select(all_of(sample_variable))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
carat
Min. :0.2000
1st Qu.:0.4000
Median :0.7000
Mean :0.7979
3rd Qu.:1.0400
Max. :5.0100
ggplot(diamonds, aes(x = !!sample_variable)) +
geom_histogram(bins = 50, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste("Distribution of",sample_variable),
x = sample_variable,
y = "Frequency") +
theme_minimal()Notice that the distribution of carat weight is positively skewed and has several breakpoints.
Let’s take our own sample of the diamonds dataset.
example_sample <- slice_sample(diamonds, replace = F, n = 30)
summary(example_sample) carat cut color clarity depth
Min. :0.300 Fair : 1 D:1 SI1 :10 Min. :58.20
1st Qu.:0.560 Good : 4 E:4 VS2 : 7 1st Qu.:60.90
Median :0.710 Very Good: 4 F:6 VS1 : 4 Median :61.85
Mean :0.827 Premium : 5 G:6 VVS2 : 4 Mean :61.60
3rd Qu.:1.028 Ideal :16 H:8 SI2 : 3 3rd Qu.:62.35
Max. :1.820 I:4 VVS1 : 2 Max. :63.60
J:1 (Other): 0
table price x y
Min. :54.00 Min. : 563 Min. :4.340 Min. :4.300
1st Qu.:56.00 1st Qu.: 1538 1st Qu.:5.282 1st Qu.:5.253
Median :57.50 Median : 2516 Median :5.785 Median :5.755
Mean :57.81 Mean : 3957 Mean :5.899 Mean :5.900
3rd Qu.:58.00 3rd Qu.: 4996 3rd Qu.:6.537 3rd Qu.:6.510
Max. :70.00 Max. :15025 Max. :7.680 Max. :7.750
z
Min. :2.640
1st Qu.:3.300
Median :3.535
Mean :3.634
3rd Qu.:4.010
Max. :4.840
- Take a look at your sample above. Change the options in slice_sample and see what changes.
- Type ?slice_sample into the console and run it. What are some of the features of slice_sample?
Following the coin flips example, let’s simulate taking independent samples, determining the sample average, and plotting the results.
set.seed(42)
for (s in 1:n_samples) {
if (s == 1) {
diamonds_samples <- slice_sample(diamonds, replace = F, n = sample_size) %>%
select(sample_variable) %>%
mutate(sample = s)
} else {
diamonds_samples <- bind_rows(diamonds_samples,
slice_sample(diamonds, replace = F, n = sample_size) %>%
select(sample_variable) %>%
mutate(sample = s))
}
print(s)
}
diamond_averages <- diamonds_samples %>%
group_by(sample) %>%
summarize(average_value = mean(!!sample_variable))ggplot(diamond_averages, aes(x = average_value)) +
geom_histogram(bins = 50, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste("Distribution of Sample Means for",sample_variable),
x = "Sample Mean of Variable",
y = "Frequency") +
theme_minimal()- What do you notice about the difference between the distribution of carat weight in the entire population versus the distribution of the averages from the samples? What forces are at work here?
- Tinker with the inputs to the simulation (e.g., n_samples, sample_size, and sample_variable). What happens to the distributions? Why?