Simulating coin flips and visualizing distributions

Today we’re going to learn how to simulate coin flips and visualize distributions using scatterplots and histograms.

Coin flips are surprisingly useful processes to simulate, and the basic tools we’ll use will set us up to simulate many other kinds of processes. We’ll visualize the distributions of the processes we simulate with histograms (in the single-variable case) and scatterplots (in the two-variable case). Because we’re simulating a random variable (generating random numbers), everyone will get slightly different results. Part of the beauty of statistics is how similar the overall patterns will be despite the differences in details!

Before we get started, load the tidyverse. If you have it installed, also load the cowplot package. If you don’t have the cowplot package installed, no worries—cowplot has a very useful function, plot_grid(), which can combine plots. You can generate the figures separately and follow all the important details just fine. (I recommend installing cowplot when you can. plot_grid() is really useful and easy to use!)

library(tidyverse)
library(cowplot) # we'll only use this for plot_grid()

Single coin flips (Bernoulli distribution)

To start, let’s simulate a sequence of coin flips by random generating 1s and 0s. We’ll consider 1 as heads and 0 as tails. To do this, we’re going to use the rbinom() function.

rbinom() is short for random binomial, as it generates numbers drawn from a binomial distribution. A binomial distribution, which we’ll look at in the next section, is used to represent the process of flipping batches of coins. When the batch is just a single coin, it’s called a Bernoulli distribution after Jacob Bernoulli.

rbinom() takes three arguments: n, the number of batches of coins to flip; size, the number of coins in a batch being flipped; and prob, the probability a single coin comes up heads (i.e., probability of getting a 1). In stats/R lingo, n is the length of the vector, size is the number of trials, prob is the probability of success. To generate draws from a Bernoulli distribution, we’ll set size to 1.

Here’s an example of using rbinom() in this way to simulate 100 tosses of a single fair coin:

single_coin_flips <- rbinom(n=100, size=1, prob=0.5)
single_coin_flips
##   [1] 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0
##  [38] 1 1 1 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 0 1 1 0 0 1
##  [75] 1 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 0 0 1 1 0 0

To visualize the number of times the coin came up heads or tails, we’ll use geom_histogram(). But first we need to put single_coin_flips in a data frame (see the data structures section of Lab 0 for more on this).

coin_flips <- data.frame(bernoulli=single_coin_flips) # put the vector we generated into a data frame object

Exercise 1.1. Generate a histogram of the coin flips using geom_histogram().

single_flips_hist <- ggplot(data = coin_flips) + geom_histogram(aes(x=bernoulli))
single_flips_hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 1.2. Set the bins argument to 2.

single_flips_hist <- ggplot(data = coin_flips) + geom_histogram(aes(x=bernoulli), bins=2, color="blue", fill="skyblue")
single_flips_hist

Exercise 1.3. (Discuss as a group) What is the best value for bins when visualizing coin flips? Why?

Batches of coin flips (Binomial distribution)

Next, let’s simulate a sequence of batches of coin flips. All we need to do is make the size argument in the rbinom() function larger than one. For example, to simulate 100 batches of 25 fair coins being flipped,

batch_coin_flips <- rbinom(n=100, size=25, prob=0.5)
batch_coin_flips
##   [1] 14 13 12 15 11 11 11 14  9 13  9 12 17 17 12 16 12 13 10 12 15 13 10  9 12
##  [26] 13 12 11 13 12  7 17 11 12 11 14 16  9  9 15 13 11 13 11 12 17 13 13  9 16
##  [51] 13 15 13  7 17  8 16  5 13 15 13 16 10 14 13 15 13 10  9 14 16 12 11 10 11
##  [76] 13  8 12 13 17 17 10 12 11 13  9 16 11 11 12 11  8 15 13 12 13 10  9 11 10

Exercise 2.1. What are all the possible values for a single entry in batch_coin_flips? How do you know this?

Integers from 0 to 25 are possible. We know this because any combination of coins could come up tails or heads. If all come up tails, we get 0; if all come up heads, we get 25. In between are all the other possible combinations.

Exercise 2.2. Generate a histogram of the coin flips. (You’ll need to put the coin flips in a data frame first.)

many_coin_flips <- data.frame(binomial=batch_coin_flips)
binomial_flips <- ggplot(data = many_coin_flips) + geom_histogram(aes(x=binomial))
binomial_flips
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 2.3. (Discuss as a group) What is the best value for bins in this case, and why? How would your answer vary as we change n, size, and prob?

Visualizing joint distributions with scatterplots

Now let’s do some really fun stuff: generate a new variable based on some random variables, and examine their joint distribution with a scatterplot.

Exercise 3.1. Generate a sequence of batches of coin flips with n=1000, size=100, prob=0.25 and put it in a data frame named dfrm as a column named draws.

dfrm <- data.frame(draws=rbinom(n=1000, size=100, prob=0.25))

Exercise 3.2. Mutate a new variable into the data frame, new_variable = (100 - draws)/10 + log(draws)

dfrm <- dfrm %>% mutate(new_variable = ((100 - draws)/10 + log(draws) + runif(n=1000, min=-5, max=5)^0.75) )

Exercise 3.3. Generate histograms of draws and new_variable. What patterns do you observe in the distributions?

(Note: If you have the cowplot package installed and loaded, you can display the plots side-by-side using plot_grid(), e.g. if the two plots are named plot1 and plot2 you would run plot_grid(plot1, plot2, ncol=2). If not, you can just display them sequentially.)

draws_hist <- ggplot(data = dfrm) + geom_histogram(aes(x=draws))
new_variable_hist <- ggplot(data = dfrm) + geom_histogram(aes(x=new_variable))
plot_grid(draws_hist, new_variable_hist, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 500 rows containing non-finite values (stat_bin).

Exercise 3.4. Generate a scatterplot with draws on the x axis and new_variable on the y axis. What pattern do you observe?

(Hint: use geom_point() to create a scatterplot layer.)

scatter <- ggplot(data = dfrm) + geom_point(aes(x=draws, y=new_variable))
scatter
## Warning: Removed 500 rows containing missing values (geom_point).

Exercise 3.5. (Discuss as a group) What are the differences in information you get out of the pair of histograms versus the scatterplot? Does one give you more information than the other? When would you prefer to use the histograms or scatterplots?