This exercise is intended to teach you the following things:
- [...] R
- [...] data.frame() function
- [...] bins argument when making histograms
Today we’re going to learn how to simulate coin flips and visualize distributions using scatterplots and histograms.
Coin flips are surprisingly useful processes to simulate, and the basic tools we’ll use will set us up to simulate many other kinds of processes. We’ll visualize the distributions of the processes we simulate with histograms (in the single-variable case) and scatterplots (in the two-variable case). Because we’re simulating a random variable (generating random numbers), everyone will get slightly different results. Part of the beauty of statistics is how similar the overall patterns will be despite the differences in details!
Before we get started, load the tidyverse. If you have it installed,
also load the cowplot package. We’ll only use cowplot for one function,
plot_grid(), which combines plots into a single figure. If you don’t
have cowplot installed, no worries: you can generate the figures
separately and follow all the important details just fine. (I
recommend installing cowplot when you can. plot_grid() is
really useful and easy to use!)
library(tidyverse)
library(cowplot) # we'll only use this for plot_grid()
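If you are not sure whether cowplot is installed, one option is to load it only when it is available. This is a small sketch rather than part of the lab itself; has_cowplot is a hypothetical variable name introduced here for illustration.

```r
# Check for cowplot without raising an error if it is missing.
# `has_cowplot` is a hypothetical helper variable, not part of the lab.
has_cowplot <- requireNamespace("cowplot", quietly = TRUE)

if (has_cowplot) {
  library(cowplot)  # provides plot_grid()
} else {
  message("cowplot is not installed; display plots one at a time instead.")
}
```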
To start, let’s simulate a sequence of coin flips by randomly
generating 1s and 0s. We’ll consider 1 as heads and 0 as tails. To do
this, we’re going to use the rbinom() function.
rbinom() is short for random
binomial, as it generates numbers drawn from a binomial
distribution. A binomial distribution, which we’ll look at in the next
section, is used to represent the process of flipping batches of coins.
When the batch is just a single coin, it’s called a Bernoulli
distribution after Jacob
Bernoulli.
rbinom() takes three arguments: n, the
number of batches of coins to flip; size, the number of
coins in a batch being flipped; and prob, the probability a
single coin comes up heads (i.e., the probability of getting a 1). In
stats/R lingo, n is the length of the output
vector, size is the number of
trials, and prob is the probability of
success on each trial. To generate draws from a Bernoulli distribution, we’ll
set size to 1.
Here’s an example of using rbinom() in this way to
simulate 100 tosses of a single fair coin:
single_coin_flips <- rbinom(n=100, size=1, prob=0.5)
single_coin_flips
## [1] 0 0 1 1 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 1 1 0 1 0 0 0 1 1 0 1
## [48] 0 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 1 0 0 1 1 1 1 0 1 0 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1
## [95] 0 0 1 1 1 1
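Because the draws are random, your vector will differ from the one printed above. The sketch below shows one way to make a run repeatable and to count heads and tails directly. The seed value 42 and the name flips are arbitrary choices for illustration, not part of the lab.

```r
set.seed(42)  # arbitrary seed (an assumption); any fixed integer makes the run repeatable

flips <- rbinom(n = 100, size = 1, prob = 0.5)  # 100 Bernoulli draws

table(flips)  # counts of tails (0) and heads (1)
mean(flips)   # proportion of heads; should be near 0.5 for a fair coin
```

Calling set.seed() with the same value before rbinom() reproduces the same sequence, which is handy when you want to compare results with a classmate.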
To visualize the number of times the coin came up heads or tails,
we’ll use geom_histogram(). But first we need to put
single_coin_flips in a data frame (see the data structures
section of Lab 0 for more on this).
coin_flips <- data.frame(bernoulli=single_coin_flips) # put the vector we generated into a data frame object
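For reference, here is a minimal sketch of how a histogram layer can be attached to a data frame like coin_flips. It regenerates the data so that it runs on its own, and it assumes ggplot2 is available (it is loaded as part of the tidyverse). The object name flip_plot is hypothetical.

```r
library(ggplot2)

coin_flips <- data.frame(bernoulli = rbinom(n = 100, size = 1, prob = 0.5))

# Map the column to the x axis; geom_histogram() counts the rows in each bin.
flip_plot <- ggplot(coin_flips, aes(x = bernoulli)) +
  geom_histogram(bins = 2)

flip_plot
```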
- [...] geom_histogram().
- [...] bins argument to 2.
- [...] bins when visualizing coin flips? Why?
Next, let’s simulate a sequence of batches of coin flips. All we need
to do is make the size argument in the
rbinom() function larger than one. For example, to simulate
100 batches of 25 fair coins being flipped:
batch_coin_flips <- rbinom(n=100, size=25, prob=0.5)
batch_coin_flips
## [1] 10 14 14 9 10 17 13 7 14 11 10 12 13 12 13 12 11 8 12 14 13 12 9 13 15 9 8 14 13 14 14
## [32] 17 16 14 10 14 8 9 14 17 14 15 14 10 12 8 12 10 11 11 12 12 11 8 10 10 8 15 12 8 11 14
## [63] 12 15 16 11 11 14 7 11 14 15 14 14 13 15 7 18 15 10 10 13 14 11 15 9 15 17 15 13 11 8 13
## [94] 16 12 13 11 13 12 9
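A quick numerical check can complement the printed vector. For a binomial distribution with size trials and success probability prob, the mean is size * prob and the variance is size * prob * (1 - prob), so batches of 25 fair coins give 12.5 and 6.25. The sketch below compares these with sample values; the standalone size and prob variables are introduced here for illustration.

```r
size <- 25
prob <- 0.5
batch_coin_flips <- rbinom(n = 100, size = size, prob = prob)

# Theoretical values for a binomial distribution
size * prob               # mean: 12.5
size * prob * (1 - prob)  # variance: 6.25

# Sample values; these vary from run to run but should land nearby
mean(batch_coin_flips)
var(batch_coin_flips)
range(batch_coin_flips)   # every value lies between 0 and size
```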
- [...] batch_coin_flips? How do you know this?
- [...] bins in this case, and why? How would your answer vary as we change n, size, and prob?
Now let’s do some really fun stuff: generate a new variable based on some random variables, and examine their joint distribution with a scatterplot.
- [...] n=1000, size=100, prob=0.25 and put it in a data frame as a column named draws.
- [...] new_variable = ((100 - draws)/10 + log(draws) + runif(n=1000, min=-5, max=5)^0.75)
- [...] draws and new_variable. What patterns do you observe in the distributions? (Note: If you have the cowplot package installed and loaded, you can display the plots side-by-side using plot_grid(), e.g., if the two plots are named plot1 and plot2 you would run plot_grid(plot1, plot2, ncol=2). If not, you can just display them sequentially.)
- [...] draws on the x axis and new_variable on the y axis. What pattern do you observe? (Hint: use geom_point() to create a scatterplot layer.)
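One way the data-generating steps above might be set up is sketched below, using the sizes and formula given in the exercise. The names sim and scatter are hypothetical. One behavior worth knowing in advance: in R, raising a negative number to a non-integer power such as 0.75 returns NaN, so rows where the runif() draw is negative will have new_variable equal to NaN, and ggplot2 drops those rows with a warning.

```r
library(ggplot2)

# 1000 binomial draws, stored in a column named draws
sim <- data.frame(draws = rbinom(n = 1000, size = 100, prob = 0.25))

# Formula copied from the exercise; a negative runif() draw yields NaN here
sim$new_variable <- (100 - sim$draws) / 10 + log(sim$draws) +
  runif(n = 1000, min = -5, max = 5)^0.75

sum(is.nan(sim$new_variable))  # how many rows ended up as NaN

# Scatterplot layer: draws on x, new_variable on y
scatter <- ggplot(sim, aes(x = draws, y = new_variable)) +
  geom_point()
scatter
```

This is only a sketch under the stated assumptions. If your plot shows fewer than 1000 points, the NaN rows are the likely reason.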