Exercises to work on will be marked in numbered headings. You’ll have 35-40 minutes to work on these problems as a group. Professor Rao will visit each breakout group a few times to help clear up any questions. We will use the 25-30 minutes after to go over the solutions and discuss any questions. If your group finishes early, you may return to the main room and/or work on lab reports or problem sets.

Simulating coin flips and visualizing distributions

Today we’re going to learn how to simulate coin flips and visualize distributions using scatterplots and histograms.

Coin flips are surprisingly useful processes to simulate, and the basic tools we’ll use will set us up to simulate many other kinds of processes. We’ll visualize the distributions of the processes we simulate with histograms (in the single-variable case) and scatterplots (in the two-variable case). Because we’re simulating a random variable (generating random numbers), everyone will get slightly different results. Part of the beauty of statistics is how similar the overall patterns will be despite the differences in details!

Before we get started, load the tidyverse. If you have it installed, also load the cowplot package. If you don’t have the cowplot package installed, no worries—cowplot has a very useful function, plot_grid(), which can combine plots. You can generate the figures separately and follow all the important details just fine. (I recommend installing cowplot when you can. plot_grid() is really useful and easy to use!)

library(tidyverse)
library(cowplot) # we'll only use this for plot_grid()

Single coin flips (Bernoulli distribution)

To start, let’s simulate a sequence of coin flips by random generating 1s and 0s. We’ll consider 1 as heads and 0 as tails. To do this, we’re going to use the rbinom() function.

rbinom() is short for random binomial, as it generates numbers drawn from a binomial distribution. A binomial distribution, which we’ll look at in the next section, is used to represent the process of flipping batches of coins. When the batch is just a single coin, it’s called a Bernoulli distribution after Jacob Bernoulli.

rbinom() takes three arguments: n, the number of batches of coins to flip; size, the number of coins in a batch being flipped; and prob, the probability a single coin comes up heads (i.e., probability of getting a 1). In stats/R lingo, n is the length of the vector, size is the number of trials, prob is the probability of success. To generate draws from a Bernoulli distribution, we’ll set size to 1.

Here’s an example of using rbinom() in this way to simulate 100 tosses of a single fair coin:

single_coin_flips <- rbinom(n=100, size=1, prob=0.5)
single_coin_flips
##   [1] 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 1 0
##  [48] 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0 1 0 1 1 0 0 0 1 0
##  [95] 0 1 1 1 0 0

To visualize the number of times the coin came up heads or tails, we’ll use geom_histogram(). But first we need to put single_coin_flips in a data frame (see the data structures section of Lab 0 for more on this).

coin_flips <- data.frame(bernoulli=single_coin_flips) # put the vector we generated into a data frame object

Exercise 1.1. Generate a histogram of the coin flips using geom_histogram().

Exercise 1.2. Set the bins argument to 2.

Exercise 1.3. (Discuss as a group) What is the best value for bins when visualizing coin flips? Why?

Batches of coin flips (Binomial distribution)

Next, let’s simulate a sequence of batches of coin flips. All we need to do is make the size argument in the rbinom() function larger than one. For example, to simulate 100 batches of 25 fair coins being flipped,

batch_coin_flips <- rbinom(n=100, size=25, prob=0.5)
batch_coin_flips
##   [1] 14 12 17 12  9  8 12 15 14  8 10 13 15 13 13 15  8 11 17 11 11 15 16 13 15 10 15 16 13 14 14
##  [32] 10 14 12 10 15 13  7 16 10 11 14 14 14  8 12 12 11 15 12 14 14 16 14 10  8 15  9 11 18 12  9
##  [63] 16 11 14 15 13 11 17 11 11 11  9 11 17 11 12 10 13 12 10 13 15 14 10 16 12 13  8 14 11 11 11
##  [94] 14 13 11 14 18 15 13

Exercise 2.1. What are all the possible values for a single entry in batch_coin_flips? How do you know this?

Exercise 2.2. Generate a histogram of the coin flips. (You’ll need to put the coin flips in a data frame first.)

Exercise 2.3. (Discuss as a group) What is the best value for bins in this case, and why? How would your answer vary as we change n, size, and prob?

Visualizing joint distributions with scatterplots

Now let’s do some really fun stuff: generate a new variable based on some random variables, and examine their joint distribution with a scatterplot.

Exercise 3.1. Generate a sequence of batches of coin flips with n=1000, size=100, prob=0.25 and put it in a data frame as a column named draws.

Exercise 3.2. Mutate a new variable into the data frame, new_variable = ((100 - draws)/10 + log(draws) + runif(n=1000, min=-5, max=5)^0.75)

Exercise 3.3. Generate histograms of draws and new_variable. What patterns do you observe in the distributions?

(Note: If you have the cowplot package installed and loaded, you can display the plots side-by-side using plot_grid(), e.g. if the two plots are named plot1 and plot2 you would run plot_grid(plot1, plot2, ncol=2). If not, you can just display them sequentially.)

Exercise 3.4. Generate a scatterplot with draws on the x axis and new_variable on the y axis. What pattern do you observe?

(Hint: use geom_point() to create a scatterplot layer.)

Exercise 3.5. (Discuss as a group) What are the differences in information you get out of the pair of histograms versus the scatterplot? Does one give you more information than the other? When would you prefer to use the histograms or scatterplots?