Exercises to work on will be marked in numbered headings. You’ll have 35-40 minutes to work on these problems as a group. Professor Rao will visit each breakout group a few times to help clear up any questions. We will use the remaining time to go over the solutions and discuss any questions. If your group finishes early, you may return to the main room and/or work on lab reports or problem sets.

Calculating probabilities from data

Today we’re going to learn how to calculate probabilities from data, and compare these results to the value predicted by a Normal distribution. We’ll then see how these values change as the sample size increases. As usual, we’ll start by loading tidyverse and cowplot.

library(tidyverse)
library(cowplot) # we'll only use this for plot_grid()

Simulating the data

For this set of exercises you’ll need to create four different dataframes containing vectors of simulated data:

  1. the first, few_events, which is 20 Binomial trials of 1000 coin flips with probability 0.5;

  2. the second, some_events, which is 100 Binomial trials of 1000 coin flips with probability 0.5;

  3. the third, many_events, which is 500 Binomial trials of 1000 coin flips with probability 0.5;

  4. the fourth, funny_business, which is 500 Binomial trials of 1000 coin flips with probability 0.001.

In each case, the dataframe should have a single column named outcome. It may help to plot histograms of each vector before going forward, just to make sure you understand what the data look like.

Empirical probabilities

First, we’re going to calculate “empirical probabilities”. This is a fancy way of saying “we’re going to calculate the proportion of times an outcome occurs in the data”.

Here’s an example of how you would calculate the probability of observing fewer than 475 heads from a dataframe like some_events:

# Generate a vector similar to `some_events`
dfrm_like_some_events <- data.frame(outcome=rbinom(n=100, size=1000, prob=0.5))

# Calculate the empirical probability of seeing fewer than 475 heads
dfrm_like_some_events %>% 
  filter(outcome<475) %>% 
  summarise(probability = n()/nrow(dfrm_like_some_events))
##   probability
## 1        0.02

You’ll most likely see a different number every time you run this chunk and everyone in your group will have different numbers. Putting that aside for a moment, suppose you saw a probability of “0.02”.

Generating data
few_events <- data.frame(outcome=rbinom(n=20, size=1000, prob=0.5))
some_events <- data.frame(outcome=rbinom(n=100, size=1000, prob=0.5))
many_events <- data.frame(outcome=rbinom(n=500, size=1000, prob=0.5))
funny_business <- data.frame(outcome=rbinom(n=500, size=1000, prob=0.001))

hist_few <- ggplot(data=few_events, aes(x=outcome)) + geom_histogram(bins=15)
hist_some <- ggplot(data=some_events, aes(x=outcome)) + geom_histogram(bins=15)
hist_many <- ggplot(data=many_events, aes(x=outcome)) + geom_histogram(bins=15)
hist_funny <- ggplot(data=funny_business, aes(x=outcome)) + geom_histogram(bins=15)
plot_grid(hist_few, hist_some, hist_many, hist_funny, ncol=2)

Exercise 1.1. How many times was there a batch of coin flips with fewer than 475 heads if the empirical probability was 0.02?

Next, we’re going to work with the first three dataframes you created, few_events, some_events, and many_events.

Exercise 1.2. For the first three dataframes, calculate the empirical probability of observing more than 521 heads. Discuss the following statement as a group: “All of the probabilities we calculated are equally accurate”. Do you agree with the statement? What does the word “accurate” mean here?

(Looking at the histograms may help.)

few_events %>% 
  filter(outcome<521) %>% 
  summarise(probability = n()/nrow(few_events))
##   probability
## 1        0.85
some_events %>% 
  filter(outcome<521) %>% 
  summarise(probability = n()/nrow(some_events))
##   probability
## 1        0.92
many_events %>% 
  filter(outcome<521) %>% 
  summarise(probability = n()/nrow(many_events))
##   probability
## 1        0.89

Exercise 1.3. For the first three dataframes, calculate the probability of observing fewer than 475 heads or more than 527 heads (i.e. a single probability reflecting both events).

few_events %>% 
  filter(outcome>527 | outcome < 475) %>% 
  summarise(probability = n()/nrow(few_events))
##   probability
## 1        0.05
some_events %>% 
  filter(outcome>527 | outcome < 475) %>% 
  summarise(probability = n()/nrow(some_events))
##   probability
## 1        0.13
many_events %>% 
  filter(outcome>527 | outcome < 475) %>% 
  summarise(probability = n()/nrow(many_events))
##   probability
## 1       0.086

Finally, let’s look at the fourth dataframe, funny_business.

Exercise 1.4. For the funny_business dataframe only, calculate the probability of observing 1 or fewer heads.

funny_business %>% 
  filter(outcome<=1) %>% 
  summarise(probability = n()/nrow(funny_business))
##   probability
## 1       0.738

Theoretical probabilities

Now let’s turn to calculating theoretical probabilities. “Calculating theoretical probabilities” is a concise way of saying “calculate the probabilities under a specific distribution using the formula”.

We’ll do this using the formula for the Normal distribution. Fortunately R has some really handy functions for us. We’re going to be using pnorm here.

As an example, suppose I wanted to calculate the theoretical probability of observing fewer than 475 heads from a dataframe like some_events, assuming a Normal distribution was a good representation of the data. I would use the pnorm function as below:

# Calculate the theoretical probability of seeing fewer than 475 heads
pnorm(q = 475, mean = mean(dfrm_like_some_events$outcome), sd = sd(dfrm_like_some_events$outcome))
## [1] 0.04717064

If I wanted to calculate the probability of observing more than 475 heads, I would use 1 - pnorm as below:

# Calculate the theoretical probability of seeing more than 475 heads
1 - pnorm(q = 475, mean = mean(dfrm_like_some_events$outcome), sd = sd(dfrm_like_some_events$outcome))
## [1] 0.9528294

Notice that the two probabilities sum to 1.

Exercise 2.2. Discuss as a group: how many batches of coin flips would it take for the empirical and theoretical probabilities to match exactly?

Exercise 2.3. Calculate the theoretical probability of observing 1 or fewer heads for the fourth dataframe (funny_business). Discuss as a group: does this match the empirical probability as closely as the theoretical probability for some_events does? Why or why not?

pnorm(q = 1, mean = mean(funny_business$outcome), sd = sd(funny_business$outcome))
## [1] 0.5114643