If we randomly selected a group of people, what are the chances that at least two people have the same birthday?
We will use Monte Carlo Simulation to calculate this.
- To simplify the calculation, we’ll assume no one was born on February 29th.
Birthdays can be represented by numbers between 1 and 365.
Below is a code for 50 random sample bithdays using sample() function:
n <- 50
bdays <- sample(1:365, n, replace = TRUE)
To check if there are at least two same birthday, we use the duplicate() function. This returns “true” when an element of a vector already appeared in that vector.
Here is an example of duplicate() function:
duplicated(c(1, 2, 3, 1, 4, 3, 5))
## [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
To check if there were duplicate birthdays, use any() and duplicate() function like this:
any(duplicated(bdays))
## [1] TRUE
Use the Monte Carlo simulation to repeat the experiment we created above at least 10,000 times.
Use the replicate() function:
B <- 1000
results<- replicate(B, {
bdays <- sample(1:365, n, replace = TRUE)
any(duplicated(bdays))
})
Next, show the above result using the mean() function:
mean(results)
## [1] 0.973
Were you expecting the probability to be so high? People usually underestimate these probabilities, so that’s when your chances of winning a bet can be much higher.
Note: For this Birthday example, if we have 365 random samples in our group, the probability will equal to 1.
Next, we’ll compute the probability of different group sizes and see how fast it gets close to 1.
If you want to bet with friends about two people having the same birthday in a group of people, you should create a lookup table.
- We want to show, what are the chances larger than 50%? larger than 75%?
- We can create a function to compute this for any group.
Call the lookup table function: compute prob. Make the calculatins for the probability of two peoplw having the same birthday.
- Use a small Monte Carlo simulation to do this.
compute_prob <- function(n, B=10000){
same_day <- replicate(B, {
bdays <- sample(1:365, n, replace = TRUE)
any(duplicated(bdays))
})
mean(same_day)
}
n<- seq(1,60)
Next, we want to apply the compute prob function we just created to several values of n. Let’s say from 1 to 60.
a. Define n as a sequence starting at 1 and ending at 60.
b. Use a for loop to apply this function to each value in n.
- Here’s the caveat. In R, using the for loop is rarely a good idea.
In R, operations are performed in the entire vectors. If we create a vector that has x equal 1 to 10, and we compute the square root of x, the arithmetic operation will compute it to every element in the vector.
x <- 1:10
sqrt(x)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278
If we define y to be 1 to 10, and multiply x and y, it will multiply by every element in the vector.
So, there’s no need for loops in R.
y <- 1:10
x * y
## [1] 1 4 9 16 25 36 49 64 81 100
However, not all functions in R work this way.
You can’t just send a vector to any function in R.
For example, the function we just created compute_prob(n) will not work. It would just return a “0”.
What we can do instead is to use the sapply() (“s” apply) function.
Sapply lets us perform element-wise operations on any function. - If we use sapply(), it applys the square root operation to each element in the vector “x”. Square root function already does this without sapply, so this is just an example.
x <- 1:10
sapply(x, sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278
Using S-apply on a function: type prob equals sapply n. n is the vector.
- It assigns each element of prob (the probability) of two people having the same birthday for that n.
prob <- sapply(n, compute_prob)
Next, make a plot that shows the probability of two people having the same birthday.
plot(n, prob)
So instead of using a Monte Carlo simulation, compute the exact probabilities.
The probability that 1 person has a unique birthday = 1.
\(Pr(person 1 has a unique birthday) = 1\)
The probability that the 2nd person has a unique birthday given that person 1 already took one of the days = 364 / 365.
\(Pr(person 2 has a unique birthday | person 1 has a unique birthday) = 364/365\)
k <- 364
j <- 365
k/j
## [1] 0.9972603
sprintf("%.1f %%", 100*(k/j))
## [1] "99.7 %"
k <- 363
j <- 365
k/j
## [1] 0.9945205
sprintf("%.1f %%", 100*(k/j))
## [1] "99.5 %"
exact_prob <- function(n){
prob_unique <- seq(365, 365-n+1)/365
1 - prod( prob_unique)
}
eprob <- sapply(n, exact_prob)
plot(n, prob)
lines(n, eprob, col = "red")
In our previous examples, 10,000 Monte carlo experiments provided us a very accurate estimate. But for more complex calculations, we may need more experiments.
On the other hand, 10,000 experiment may be more than we need (not computationally feasible).
We’ll use the letter B to represent the number of experiments. But to be able to answer how large B should be, we need advanced theoretical statistics training.
One practical approach we will learn is to check for the stability of the estimate.
Here is a birthday problem example.
- Use n = 22, there are 22 people.
- Run a simulation to compute or estimate the probability of two people having a certain birthday using different sizes of the Monte Carlo simulations.
B <- 10^seq(1, 5, len = 100)
compute_prob <- function(B, n=22){
same_day <- replicate(B, {
bdays <- sample(1:365, n, replace=TRUE)
any(duplicated(bdays))
})
mean(same_day)
}
prob <- sapply(B, compute_prob)
plot(log10(B), prob, type = "l")