Sampling Simulations

In class, we covered the central limit theorem and the law of large numbers. This exercise will demonstrate these concepts using simulated flips of a fair coin.

General housekeeping items

Let’s begin by loading libraries and clearing the environment:

library(tidyverse)
rm(list=ls())

Simulate and plot data on coin flips

Imagine that 1,000 students on campus are each given a fair coin. Each student is asked to flip the coin 10 times and record the outcome of each flip (0 = tails, 1 = heads). In expectation, half of the flips should be tails and half should be heads.

We are going to simulate this scenario. Let’s store the number of coin flips per person in n_tries and the number of people flipping coins in n_samples. Also, let’s store the possible outcomes (0 = tails, 1 = heads) in outcomes. We will change these values later.

n_tries <- 10
n_samples <- 1000
outcomes <- c(0,1)

Next, we will build a simulation using a “for loop” in which each of the n_samples people flips a coin n_tries times. Each flip is a random draw from a tibble called “values”, which in this case contains the values 0 and 1 (where 0 represents tails and 1 represents heads). The results of each sample are appended to a dataset called “sim_data.”

values <- tibble(value = outcomes)

set.seed(42)
for (s in 1:n_samples) {
  # Draw n_tries flips for person s: sample rows of `values` with replacement
  new_flips <- slice_sample(values, replace = TRUE, n = n_tries) %>% 
    mutate(sample = s)
  if (s == 1) {
    sim_data <- new_flips
  } else {
    sim_data <- bind_rows(sim_data, new_flips)
  }
  print(s)  # progress indicator
}

We should now have a dataset called “sim_data” containing 2 variables: value (the outcome of each coin flip) and sample (an identifier for the person who flipped the coin). Each observation represents one coin flip, so we should have n_tries x n_samples total observations. Take a moment to look through the dataset.
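
One quick way to check the structure and row count (assuming the loop above ran successfully):

glimpse(sim_data)
nrow(sim_data) == n_tries * n_samples  # should be TRUE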

Wrangle and plot the flips data

Next, let’s wrangle the flips data so that we get one observation per person and a variable containing the average value of each person’s coin flips.

averages <- sim_data %>%
  group_by(sample) %>% 
  summarize(average_value = mean(value))

What we have now is the proportion of heads (the sample mean) for each of many individual samples. This is like a “sample of samples.” Next, let’s visualize the distribution of these sample averages:

ggplot(averages, aes(x = average_value)) +
  geom_bar() + 
  scale_x_continuous(breaks = seq(0, 1, 0.1), limits = c(-0.1, 1.1)) +
  labs(title = 'Distribution of Means', 
       x = 'Average Values', 
       y = 'Count',
       caption = paste("Note: Tails = 0 and Heads = 1. Each sample consists of", n_tries, "coin flips.", "There are", n_samples, "samples in total.")
  )

Notice that the most common averages fall around 0.5, but many (most!) are not exactly 0.5. In fact, some coin flippers had averages far away from 0.5 (e.g., 2 tails and 8 heads in 10 flips). However, averages become rarer the farther they fall from the “true mean” of 0.5. More precisely, the distribution of the sample averages is approximately normal. This is the central limit theorem at work!
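
To see the normal approximation directly, we can overlay the density curve the CLT predicts on our simulated averages: for a fair coin, the sample mean of n_tries flips has mean p = 0.5 and standard deviation sqrt(p(1 - p)/n_tries). This is just a sketch; the bin width of 0.1 simply matches the possible averages when n_tries is 10.

p <- 0.5
ggplot(averages, aes(x = average_value)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1,
                 fill = "grey80", color = "black") +
  stat_function(fun = dnorm,
                args = list(mean = p, sd = sqrt(p * (1 - p) / n_tries)),
                color = "blue") +
  labs(title = "Sample Means vs. the Normal Approximation",
       x = "Average Values", y = "Density")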

Exercises:
  1. Explore the law of large numbers by changing the value of n_tries and rerunning the simulation. What happens to the distribution? Why?
  2. Change the value of n_samples and rerun the simulation. What happens to the distribution? Why?
  3. Advanced exercise: Can you alter the program above to simulate the outcomes from rolling two 6-sided dice?
Thought exercise:

Imagine that you were one of the coin flippers who flipped 9 heads and 1 tail. Suppose this is all the information that you were allowed to collect on your coin. What might you conclude about the “fairness” of your coin? Is your conclusion “reasonable” and are you “correct”?
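
For reference, R’s built-in binomial functions can tell us exactly how likely such a lopsided sample is under a truly fair coin:

# Probability of exactly 9 heads in 10 flips of a fair coin
dbinom(9, size = 10, prob = 0.5)  # ~0.01
# Probability of a result at least this lopsided (0, 1, 9, or 10 heads)
pbinom(1, 10, 0.5) + (1 - pbinom(8, 10, 0.5))  # ~0.02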

These questions and concepts will come up later in the course.

Simulate sampling from the diamonds dataset

The example above may seem a little abstract. Let’s see these principles in action using the diamonds dataset. Here, let’s imagine that 1,000 auditors each take an independent sample from the diamonds dataset. Each auditor takes a sample of 30 diamonds and is asked to determine the average carat weight.

As above, let’s store our number of samples, sample size, and other information in objects (so that we can make easy changes later).

n_samples <- 1000
sample_size <- 30
sample_variable <- sym('carat')  # stored as a symbol so it can be injected with !! below

Next, let’s take a look at summary statistics and the distribution of carat weight.

diamonds %>% 
  select(!!sample_variable) %>% 
  summary()
     carat       
 Min.   :0.2000  
 1st Qu.:0.4000  
 Median :0.7000  
 Mean   :0.7979  
 3rd Qu.:1.0400  
 Max.   :5.0100  
ggplot(diamonds, aes(x = !!sample_variable)) +
  geom_histogram(bins = 50, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = paste("Distribution of", sample_variable), 
       x = sample_variable, 
       y = "Frequency") +
  theme_minimal()

Notice that the distribution of carat weight is positively skewed and has several breakpoints: weights at or just above round values (e.g., 0.50, 0.70, 1.00) are far more common than their neighbors.
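
A quick frequency count, sorting carat weights by how often they appear, makes these breakpoints visible:

diamonds %>% 
  count(carat, sort = TRUE) %>% 
  head()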

Let’s take our own sample of the diamonds dataset.

example_sample <- slice_sample(diamonds, replace = FALSE, n = sample_size)
summary(example_sample)
     carat              cut     color    clarity       depth      
 Min.   :0.300   Fair     : 1   D:1   SI1    :10   Min.   :58.20  
 1st Qu.:0.560   Good     : 4   E:4   VS2    : 7   1st Qu.:60.90  
 Median :0.710   Very Good: 4   F:6   VS1    : 4   Median :61.85  
 Mean   :0.827   Premium  : 5   G:6   VVS2   : 4   Mean   :61.60  
 3rd Qu.:1.028   Ideal    :16   H:8   SI2    : 3   3rd Qu.:62.35  
 Max.   :1.820                  I:4   VVS1   : 2   Max.   :63.60  
                                J:1   (Other): 0                  
     table           price             x               y        
 Min.   :54.00   Min.   :  563   Min.   :4.340   Min.   :4.300  
 1st Qu.:56.00   1st Qu.: 1538   1st Qu.:5.282   1st Qu.:5.253  
 Median :57.50   Median : 2516   Median :5.785   Median :5.755  
 Mean   :57.81   Mean   : 3957   Mean   :5.899   Mean   :5.900  
 3rd Qu.:58.00   3rd Qu.: 4996   3rd Qu.:6.537   3rd Qu.:6.510  
 Max.   :70.00   Max.   :15025   Max.   :7.680   Max.   :7.750  
                                                                
       z        
 Min.   :2.640  
 1st Qu.:3.300  
 Median :3.535  
 Mean   :3.634  
 3rd Qu.:4.010  
 Max.   :4.840  
                
Exercises
  1. Take a look at your sample above. Change the options in slice_sample and see what changes (two options are sketched after this list).
  2. Type ?slice_sample into the console and run it. What are some of the features of slice_sample?
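
As a starting point, here is a sketch of two slice_sample arguments worth experimenting with (both are documented on its help page):

# Sample a proportion of rows instead of a fixed count
slice_sample(diamonds, prop = 0.01)

# Weighted sampling: heavier diamonds are more likely to be drawn,
# which biases the sample toward large carat weights
slice_sample(diamonds, n = 30, weight_by = carat)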

Following the coin flips example, let’s simulate taking independent samples, determining the sample average, and plotting the results.

set.seed(42)
for (s in 1:n_samples) {
  # Auditor s draws sample_size diamonds without replacement
  new_sample <- slice_sample(diamonds, replace = FALSE, n = sample_size) %>% 
    select(!!sample_variable) %>%
    mutate(sample = s)
  if (s == 1) {
    diamonds_samples <- new_sample
  } else {
    diamonds_samples <- bind_rows(diamonds_samples, new_sample)
  }
  print(s)  # progress indicator
}

diamond_averages <- diamonds_samples %>%
  group_by(sample) %>% 
  summarize(average_value = mean(!!sample_variable))

ggplot(diamond_averages, aes(x = average_value)) +
  geom_histogram(bins = 50, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = paste("Distribution of Sample Means for", sample_variable), 
       x = paste("Sample Mean of", sample_variable), 
       y = "Frequency") +
  theme_minimal()
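
The CLT also makes a quantitative prediction we can check against the simulation: the spread of the sample means should be close to the population standard deviation divided by the square root of the sample size (the standard error). A rough check, assuming the simulation above has been run:

# Population standard deviation of carat weight
sd(diamonds$carat)

# CLT prediction for the spread of the sample means
sd(diamonds$carat) / sqrt(sample_size)

# Observed spread of the simulated sample means
sd(diamond_averages$average_value)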

Exercises
  1. What do you notice about the difference between the distribution of carat weight in the entire population versus the distribution of the averages from the samples? What forces are at work here?
  2. Tinker with the inputs to the simulation (e.g., n_samples, sample_size, and sample_variable). What happens to the distributions? Why?