The Birthday Problem

If we randomly selected a group of people, what are the chances that at least two people have the same birthday?

We will use Monte Carlo Simulation to calculate this.
- To simplify the calculation, we’ll assume no one was born on February 29th.

Birthdays can be represented by numbers between 1 and 365.
Below is a code for 50 random sample bithdays using sample() function:

n <- 50
bdays <- sample(1:365, n, replace = TRUE)

To check if there are at least two same birthday, we use the duplicate() function. This returns “true” when an element of a vector already appeared in that vector.
Here is an example of duplicate() function:

duplicated(c(1, 2, 3, 1, 4, 3, 5))

## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

Returns “true” on 1 & 3 since it already appeard in vector.

To check if there were duplicate birthdays, use any() and duplicate() function like this:

any(duplicated(bdays))

## [1] TRUE

Out of our 50 random samples, we did have duplciated birthdays.

Estimate Probability of Duplicate Birthdays

Use the Monte Carlo simulation to repeat the experiment we created above at least 10,000 times.

Use the replicate() function:

B <- 1000
results<- replicate(B, {
  bdays <- sample(1:365, n, replace = TRUE)
  any(duplicated(bdays))
})

Next, show the above result using the mean() function:

mean(results)

## [1] 0.973

The probability of at least 2 people having the same birthday in a group of 50 random samples is about 98%.

Were you expecting the probability to be so high? People usually underestimate these probabilities, so that’s when your chances of winning a bet can be much higher.

Note: For this Birthday example, if we have 365 random samples in our group, the probability will equal to 1.

Next, we’ll compute the probability of different group sizes and see how fast it gets close to 1.

S-apply

If you want to bet with friends about two people having the same birthday in a group of people, you should create a lookup table.
- We want to show, what are the chances larger than 50%? larger than 75%?
- We can create a function to compute this for any group.

Call the lookup table function: compute prob. Make the calculatins for the probability of two peoplw having the same birthday.
- Use a small Monte Carlo simulation to do this.

compute_prob <- function(n, B=10000){
  same_day <- replicate(B, {
    bdays <- sample(1:365, n, replace = TRUE)
    any(duplicated(bdays))
  })
  mean(same_day)
}
n<- seq(1,60)

Next, we want to apply the compute prob function we just created to several values of n. Let’s say from 1 to 60.
a. Define n as a sequence starting at 1 and ending at 60.
b. Use a for loop to apply this function to each value in n.
- Here’s the caveat. In R, using the for loop is rarely a good idea.

In R, operations are performed in the entire vectors. If we create a vector that has x equal 1 to 10, and we compute the square root of x, the arithmetic operation will compute it to every element in the vector.

x <- 1:10
sqrt(x)

##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##  [8] 2.828427 3.000000 3.162278

If we define y to be 1 to 10, and multiply x and y, it will multiply by every element in the vector.
So, there’s no need for loops in R.

y <- 1:10
x * y

##  [1]   1   4   9  16  25  36  49  64  81 100

However, not all functions in R work this way.
You can’t just send a vector to any function in R.
For example, the function we just created compute_prob(n) will not work. It would just return a “0”.

What we can do instead is to use the sapply() (“s” apply) function.
Sapply lets us perform element-wise operations on any function. - If we use sapply(), it applys the square root operation to each element in the vector “x”. Square root function already does this without sapply, so this is just an example.

x <- 1:10
sapply(x, sqrt)

##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##  [8] 2.828427 3.000000 3.162278

S-apply on Function

Using S-apply on a function: type prob equals sapply n. n is the vector.
- It assigns each element of prob (the probability) of two people having the same birthday for that n.

prob <- sapply(n, compute_prob)

Next, make a plot that shows the probability of two people having the same birthday.

plot(n, prob)

Compute Exact Probabilities

So instead of using a Monte Carlo simulation, compute the exact probabilities.

To make the math simper, compute the probability of it not happening instead of it happening. We can use the multiplication rule.

The probability that 1 person has a unique birthday = 1.
\(Pr(person 1 has a unique birthday) = 1\)
The probability that the 2nd person has a unique birthday given that person 1 already took one of the days = 364 / 365.
\(Pr(person 2 has a unique birthday | person 1 has a unique birthday) = 364/365\)

k <- 364
j <- 365
k/j

## [1] 0.9972603

sprintf("%.1f %%", 100*(k/j))

## [1] "99.7 %"

The probability that the 3rd person will have a unique birthday given that the two people already have unique birthdays = 363 / 365.
\(Pr(person 3 has a unique birthday | person 1 and 2 has a unique birthday) = 363/365\)

k <- 363
j <- 365
k/j

## [1] 0.9945205

sprintf("%.1f %%", 100*(k/j))

## [1] "99.5 %"

If we continue this way and find the chances of 50 people having unique birthfays, we would multiply 1 times 364/365, times 363/365, … all the way to the 50th value.
\(1 * 364/365 * 363/365 ... 365-n+1/365\)

There’s a function that will do this.
Lets call it exact prob. It takes n as a argument, and computes the probability using the code below:

exact_prob <- function(n){
  prob_unique <- seq(365, 365-n+1)/365
  1 - prod( prob_unique)
}

Now, we can compute each probability for each n using s-apply.

eprob <- sapply(n, exact_prob)

Next, plot the probabilities so we can see that the Monte Carlo simulations were almost exactly right. They were almost the exact approximations of the actual values.

plot(n, prob)
lines(n, eprob, col = "red")

How many Monte Carlo experiments are enough?

In our previous examples, 10,000 Monte carlo experiments provided us a very accurate estimate. But for more complex calculations, we may need more experiments.
On the other hand, 10,000 experiment may be more than we need (not computationally feasible).

We’ll use the letter B to represent the number of experiments. But to be able to answer how large B should be, we need advanced theoretical statistics training.

One practical approach we will learn is to check for the stability of the estimate.
Here is a birthday problem example.
- Use n = 22, there are 22 people.
- Run a simulation to compute or estimate the probability of two people having a certain birthday using different sizes of the Monte Carlo simulations.

B <- 10^seq(1, 5, len = 100)
compute_prob <- function(B, n=22){
  same_day <- replicate(B, {
    bdays <- sample(1:365, n, replace=TRUE)
    any(duplicated(bdays))
  })
  mean(same_day)
  
}

The value of “B” will go from 10, to 20, to 40, to 100, and so forth.
Next, compute the simulation and look at the values that we get for each simulation.

prob <- sapply(B, compute_prob)
plot(log10(B), prob, type = "l")