Solving the Birthday Paradox: Simulation and Analysis

Luz Melo

9/26/2022

Suppose we are asked the following question:

In a class of N individuals, what is the probability that at least two students will share a birthday?.

In this article we will explain how to answer the birthday problem with both simulation and analytic tools. We will assume that birthdays are equally likely to fall on any day of the year, ignoring leap year.

Simulation Solution

Let’s take a look on how we can solve the birthday problem using simulation. First, we have to think on how we wish to represent the birthdays in a class. One way to do this would be by representing each birthday as a month/day/year format for each student. In this case, we could make a function that assigns a random month, day, and year to each student. However, we will think of shared birthdays as sharing only month and day, meaning that someone born in 1999 could share the same birthday with someone born in 1985, as long as the day and month coincide.

Now, we have all the information we needed to create the simulation. First, we begin by defining a function that will generate a class of N students and with random birthday each. To do this, we use a built-in function in R: the sample function. The sample function takes a sample of the specified size from the elements of x using either with or without replacement.

Key Parameters

Parameter Explanation
x A vector of elements from which to choose
size Number of items to choose
replace TRUE if sampling with replacement and FALSE if not
prob A vector of probability weights for obtaining the elements of the vector being sampled
generate_class <- function(N){
  sample(
      x = 1:365
    , size = N
    , replace = TRUE
    , prob = c(rep(1,365))
  )
}

In this case, since we are interested in obtaining the birthdays of N students with birthdays being uniformly distributed, the variable x = 1:365 accounts for every day of the year from January 1st to December 31st, ignoring leap year. The variable size = N represents the number of students being sampled and replace = TRUE since we allow the possibility of students to share birthdays. Also, we will ignore seasonal variation in infant mortality and assume that birthdays are uniformly distributed. Thus, the variable prob = c(rep(1,365)) is a vector of 1’s since each birthday is equally likely. For instance, generate_class(50) generates a class roster of 50 students and corresponding birthdays for each, which is represented by a vector of 50 elements each with value between 1 and 365:

sample_roster = generate_class(50)

sample_roster
##  [1] 237 316 214 339 360 324 177 246 327 179 176 109  18  81 118 272 218 267 364
## [20] 137 347 351  31 297 210 250  82 308 122 184 122 145  28  14 252  97 311  72
## [39] 215 276 217 255 347  45 115  91   2 264 353 145
check_shared <- function(roster){
  sum(duplicated(roster)) > 0
}

Next, we create the check_shared function, which indicates whether at least two students share a birthday in a class roster. To do this, we use a built-in function in R: the duplicated function. This function takes a vector of elements and returns a vector of logical indicators that illustrates which of our are not unique. The first occurrence of a non-unique element is set to FALSE, while the following non-unique elements are set to TRUE.

Then, since TRUE has a value of 1 and FALSE has a value 0, by adding the elements of this vector we obtain how many students share a birthday for the given class roster. If the value is greater than 1 then we know that at least two students share a birthday. For instance, using the sample_roster from before, the following command returns TRUE if at least two students share a birthday and FALSE otherwise:

check_shared(sample_roster)
## [1] TRUE

Now that we have created all the needed functions, it is time to estimate the desired probability for each size group of students. To do this, we use a built-in function in R: the replicate function. This function (along with the for loop) will generate 10000 different class roster for each fixed number of students.

out1 <- data.frame(N = seq(0,100,by=1), prob = NA)

for(i in 1:nrow(out1)){
out1[i,"prob"] <- replicate(10000, out1[i,"N"] %>% generate_class()%>%check_shared)%>%mean
}

That is, the first iteration will calculate the probability that two students share a birthday in a class of 1 student by simulating 10000 different possible class rosters of size 1. In this case, the probability will always be zero since there has to be at least two students in order to share a birthday with someone else in the first place.

simulation_graph <- ggplot(out1,aes(x=N,y=prob))+geom_line(linetype = "dashed",size = 0.75)+labs(x="Size of group",y="Probability of at least one shared birthday",title="Simulation solution")+theme(plot.title = element_text(hjust=0.5))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
simulation_graph

Advantages
  • Generally very intuitive
  • Does not require extensive knowledge of math functions
  • Yields insights into the importance of each variables
Disadvantages
  • Requires knowledge of programming language
  • The accuracy of our estimations depend on the quality of the model, amount of repetitions of the model
  • Programming mistakes are more prone
  • Could take a long time to run because of the repetitions

Analytical Solution

To solve this analytically, we can use the field of mathematics known as combinations, which deals with the likelihood of different combinations. To find the probability that in a group of N students at least two share a birthday directly can be quite challenging because there is many ways you can get a birthday match in a class roster. Instead, we will calculate the complement. That is, we will calculate the probability that in a group of N students all birthdays are unique. To do this, we start small. In a group of N = 1 student, then the probability of the birthday being unique is \(\frac{365}{365} = 1\), since there is no other student to match with. In a group of N = 2 students, then the probability is \(\frac{365}{365}\times \frac{364}{365} = 0.997\) since the second student added only has 364 possible days of the year to have a unique birthday from the first student. Then

N probability
1 \(\frac{365}{365}\)
2 \(\frac{365}{365}\times \frac{364}{365}\)
3 \(\frac{365}{365}\times \frac{364}{365}\times \frac{363}{365}\)
\(\vdots\) \(\vdots\)
n \(\frac{365}{365}\times \frac{364}{365}\times \frac{363}{365}\cdots \frac{365-n+1}{365}=\frac{365\times 364\times 363\times \cdots \times (365-n+1)}{365^n}=\frac{365!}{365^n(365-n)!}\)

Therefore, the probability of that in class of N students at least two students share a birthday is given by \(1 - \frac{365!}{365^n(365-n)!}\). To code this solution into R, we notice that factorial(365) is too big of a number for R to handle. Instead, we make use of the fact that log() computes natural logarithms (ln) and so log() and exp() are inverse functions of each other. Also, we use the fact that lfactorial() gives the natural logarithm factorial. Hence,

\[ \begin{aligned} 1 - \frac{365!}{365^n(365-n)!} &= e^{log\left(1 - \frac{365!}{365^n(365-n)!}\right)}\\ & = e^{log(1)}-e^{log\left(\frac{365!}{365^n(365-n)!}\right)}\\ & = e^0-e^{log(365!)-log(365^n(365-n)!)}\\ &=1-e^{log(365!)-[log(365^n)+log((365-n)!)]}\\ &=1-e^{log(365!)-nlog(365)-log((365-n)!)}\\ &=1 - exp(lfactorial(365)-n*log(365)-lfactorial(365-n)). \end{aligned} \]

out2 <- data.frame(N = seq(0,100,by=1), prob = NA)

for(n in 1:nrow(out2)){
out2[n,"prob"] <- 1 - exp(lfactorial(365)-n*log(365)-lfactorial(365-n))
}
analytic_graph<-ggplot(out2,aes(x=N,y=prob))+geom_line(size = 0.75, colour = "turquoise4")+labs(x="Size of group",y="Probability of at least one shared birthday",title="Analytic solution")+theme(plot.title = element_text(hjust=0.5))

analytic_graph

Advantages
  • Much quicker to implement
  • Does not require extensive knowledge of programming language
  • Transparency in which terms are interacting with each other
  • Exact answers
Disadvantages
  • Requires knowledge of math properties
  • Some problems are very hard (or impossible) to solve
plot_grid(simulation_graph, analytic_graph, labels = "AUTO")

df <- data.frame(N = seq(0,100,by=1),out1$prob,out2$prob)

ggplot(df, aes(N))+            
  geom_line(aes(y=out1.prob, color = "Simulation"),size= 0.75, linetype = "dashed")+
  geom_line(aes(y=out2.prob, color = "Analytic"),size= 0.75)+
  scale_colour_manual("Solution Type",
  breaks = c("Simulation", "Analytic"),
  values = c("black", "turquoise4"))+
  labs(x="Size of group",y="Probability of at least one shared birthday",title="Simulation solution vs Analytic solution")+
  theme(plot.title = element_text(hjust=0.5))

Note: In the last graph we see how the black line is under the red line. This means that the simulation solution is a under estimation of the desired probabilities. To make the estimations even more accurate we can increase the number of repetition at the expense of slower run time in R.