Problem

Suppose we have a sample of \(n\) people, let’s take the days of the year such that 1 = January 1st and 365 = December 31st (assuming a “normal” year), then considering the hypothesis that being born on any day of the year is equalt likely, what is the probability that AT LEAST two of the \(n\) people were born on the exact same day?

In the context of probability, it’s often surprising how seemingly unlikely events become highly probable in certain scenarios. One such scenario is the probability of two people in a group having the same birthday. For instance, in a group of at least 23 randomly chosen people, the probability of at least two people sharing a birthday is more than 50%. When the number of people is 42 or more, the probability increases to more than 90% In this code, we will calculate and simulate this probability using R.

Real Solution

The first person has a 100 % chance of a unique number (of course)

The second has a (1 - 1/365) chance (all but 1 number from the 365)

The third has a (1 - 2/365) chance (all but 2 numbers)….

To compute the probability \(p(n)\) that in a group of \(n\) people, at least two share the same birthday, we assume that all birthdays are equally likely and distributed uniformly over a year of 365 days.

To simplify the problem, we first calculate the probability that all \(n\) birthdays are different. If \(n\) is greater than 365, by the pigeonhole principle, at least two people must have the same birthday, so the probability is 0. When \(n\) is less than or equal to 365, we can calculate the probability as follows:

\[1-p(n) = 1 * (1 – \frac{1}{365}) * (1 – \frac{2}{365}) * \cdot \cdot \cdot * (1- \frac{n-1}{365}) = \frac{365 \cdot 364 \cdot {}\cdot{} \cdot {} (365 -n+1) }{365^n}=\frac{365! }{365^n(365-n)!} \]

To carry out approximations when n is getting larger and larger, we use the following lemma:

If \(p_n\) tends to 0 and \(n p_n\) tends to lambda as n approaches 0, then:

\[ (1 -p_n)^n = e^{-\lambda}\] Therefore, \[p^{c}_n = 1 \cdot e^{-1/365} \cdot e^{-2/365} . . . \cdot e^{-(n-1)/365} = e^{\frac{-n^2+n}{730}}\]

Graph of probabililty of no match against n:

library(ggplot2)

# Generate data
n <- 1:60
probs <- exp((-n^2 + n)/730)

# Find n values where probability is 50% and 90%
n_50 <- n[which.min(abs(probs - 0.5))]
n_90 <- n[which.min(abs(probs - 0.9))]
n_99 <- n[which.min(abs(probs - 0.99))]

# Create plot
ggplot(data = data.frame(n, probs), aes(x = n, y = probs)) +
  geom_line() +
  scale_y_log10() +
  geom_vline(xintercept = n_50, linetype = "dashed", color = "red") +
  geom_vline(xintercept = n_90, linetype = "dashed", color = "blue") +
  geom_vline(xintercept = n_99, linetype = "dashed", color = "lightblue") +
  geom_label(aes(x = n_50, y = probs[which.min(abs(n - n_50))] + 0.01,
                 label = paste0("n = ", n_50, "\n50%"), fill = "red"),
             vjust = 0, color = "white", size = 3, show.legend = FALSE,
             label.padding = unit(0.5, "lines"), label.size = NA) +
  geom_label(aes(x = n_90, y = probs[which.min(abs(n - n_90))] + 0.01,
                 label = paste0("n = ", n_90, "\n90%"), fill = "blue"),
             vjust = 0, color = "white", size = 3, show.legend = FALSE,
             label.padding = unit(0.5, "lines"), label.size = NA) +
  geom_label(aes(x = n_99, y = probs[which.min(abs(n - n_99))] + 0.01,
                 label = paste0("n = ", n_99, "\n99%"), fill = "lightblue"),
             vjust = 0, color = "white", size = 3, show.legend = FALSE,
             label.padding = unit(0.5, "lines"), label.size = NA) +
  labs(x = "n", y = expression(e^frac(-n^2 + n, 730)),
       title = "No Match Probabilities as a function of n") +
  theme_minimal() +
  theme(legend.position = "none")

Graph Matching Probabilities

library(ggplot2)

# Generate data
n <- 1:60
probs <- 1-exp((-n^2 + n)/730)

# Find n values where probability is 50% and 90%
n_50 <- n[which.min(abs(probs - 0.5))]
n_90 <- n[which.min(abs(probs - 0.9))]
n_99 <- n[which.min(abs(probs - 0.99))]

# Create plot
ggplot(data = data.frame(n, probs), aes(x = n, y = probs)) +
  geom_line() +
  scale_y_log10() +
  geom_vline(xintercept = n_50, linetype = "dashed", color = "red") +
  geom_vline(xintercept = n_90, linetype = "dashed", color = "blue") +
  geom_vline(xintercept = n_99, linetype = "dashed", color = "lightblue") +
  geom_label(aes(x = n_50, y = probs[which.min(abs(n - n_50))] + 0.01,
                 label = paste0("n = ", n_50, "\n50%"), fill = "red"),
             vjust = 0, color = "white", size = 3, show.legend = FALSE,
             label.padding = unit(0.5, "lines"), label.size = NA) +
  geom_label(aes(x = n_90, y = probs[which.min(abs(n - n_90))] + 0.01,
                 label = paste0("n = ", n_90, "\n90%"), fill = "blue"),
             vjust = 0, color = "white", size = 3, show.legend = FALSE,
             label.padding = unit(0.5, "lines"), label.size = NA) +
  geom_label(aes(x = n_99, y = probs[which.min(abs(n - n_99))] + 0.01,
                 label = paste0("n = ", n_99, "\n99%"), fill = "lightblue"),
             vjust = 0, color = "white", size = 3, show.legend = FALSE,
             label.padding = unit(0.5, "lines"), label.size = NA) +
  labs(x = "n", y = expression(e^frac(-n^2 + n, 730)),
       title = "No Match Probabilities as a function of n") +
  theme_minimal() +
  theme(legend.position = "none")
## Warning: Transformation introduced infinite values in continuous y-axis

Simulation solution

Let us consider \(n=180\). The first question we should ask is: how do we define a success? To answer this, we generate a sample of size \(n\) (representing the individuals) with replacement from the numbers 1 to 365.

n<-180 ; set.seed(2)
M<-sample(x = 1:365,   #vector 1:365
          size = n,    #n size sample
          replace = T) #with replacement.
M
##   [1] 341 198 262 273 349 204 297 178  75 131 306 311  63 136 231 289  54 361
##  [19] 112 171  38 361 110 144  45 238 208 134 339   9 350 130 244   3 129 304
##  [37] 297 301 289 274   8 164 350  37 226 149 205 327 242 358  44 276 156   9
##  [55] 106 175 306 326 182 224 271  13 328 189  96 333 166 265  53 143  36  17
##  [73] 241  80 127 267 329  79 362  78 204 153 192  23  28   1 196 342 185 221
##  [91] 102 151 348  71 275 293 212 239 269 133 154 163  91 352 225 191 142  25
## [109]  75 197 291  38   1 172 318 297 337 343 136 324  94  34  72  77  81 176
## [127] 231  29  51 200 289  29 298  82 212 109  12 222 208  54  76 195  53 315
## [145] 229 242 175  96 118 163 180 107 258 247  14 354  56 222 332 347 139 318
## [163]  35 342 217  78 344  70 345  95 150 103 299 272 216  38 109 318 267 306

Within this vector, we need to check if we have at least one duplicate value, which would mean that two people have the same exact birthday. To achieve this, we will use the function which, if there is a duplicate value, returns the id (or position within the vector) of the first duplicated value, i.e., if an “x” has already appeared before, it returns the id of the next immediate value identical to “x”; if there are no duplicates, then it will return “0”.

id<-anyDuplicated(x = M) ; id
## [1] 22
#Which?
which(M==M[id])
## [1] 18 22

In this case, the number in the 18th entry was repeated in the 22st entry, as indicated by the value returned by the function.

Therefore, we will have a success if \(\neq 0\) (as this would indicate that at least two people were born on the same day in our sample). Let’s mark a success as 1 and a failure as 0:

try<-anyDuplicated(x = M)

#Success=1, Failure=0
success<-ifelse(test = try!=0,yes = 1,no = 0) ; success
## [1] 1

To estimate the desired probability empirically, we need to repeat the experiment of generating random samples and checking for duplicates many times and calculate the proportion of successful trials out of total trials. To repeat the experiment, we can use the function replicate, which repeats a certain experiment a specified number of times. For this problem, we can consider a sample size of \(n=7\) and a number of trials \(Trials=100,000\):

set.seed(66); n<-7 ; trials<-100000

M<-replicate(n = trials, # "trialt" times, n:
             expr = sample(x = 1:365, size = n, replace = T))

So, Our object its a \(n\times\text{trials}\) matrix with numbers between 1 and 365 (distributed uniformly), which tells us that we have “trials” of \(n\) subjects each, and each column is a sample. We apply per column:

# try per columns:
try<-apply(X = M,                #Matrix M,
              MARGIN = 2,           #PER COLUMNS,
              FUN = anyDuplicated)  #Function anyDuplicated.

We turn 1 into successes, and 0 to failures and we can estimate the probability as:

#Entry by entry, if success,1, if not, 0.
successes<-ifelse(test = try!=0,yes = 1,no = 0)
#length(successes)

#Estimated Probabilty:
no.successes<-sum(successes)
no.successes/trials
## [1] 0.05572

theoretically, for \(n\), the real probability is:

1- exp((-n^2 + n)/730)
## [1] 0.05591044

Now, lets try more values of \(n\) by using two functions. One for estimating and other for “real” probabilities.

prob.estimated<-function(n=100,trials=10000){
  
  M<-replicate(n = trials, #do "trials" times the following expression:
               expr = sample(x = 1:365, size = n, replace = T))
  #per columns:
  try<-apply(X = M,                # Matrix M,
                MARGIN = 2,           #per col,
                FUN = anyDuplicated)  #anyDuplicated.
  
  
  successes<-ifelse(test = try!=0,yes = 1,no = 0)
  
  #Estimated:
  no.successes<-sum(successes)
  return(no.successes/trials)
  
}

#REAL FUNCTION:

prob.aprox<-function(n){
  
  return(1- exp((-n^2 + n)/730))
  
}
n<-seq(from = 5, # 5
       to = 100,  # 50
       by = 5)   #.



#Aprox
Real_Approximation<-sapply(X = n, 
                  FUN = prob.aprox) 

#Estimated
set.seed(21)
Estimated<-sapply(X = n, 
                  FUN = prob.estimated) 


Res<-data.frame(n,Real_Approximation,Estimated)
library(knitr) ; kable(Res)
n Real_Approximation Estimated
5 0.0270254 0.0256
10 0.1159907 0.1210
15 0.2499919 0.2505
20 0.4058051 0.4075
25 0.5604122 0.5654
30 0.6963200 0.7096
35 0.8040973 0.8070
40 0.8819900 0.8847
45 0.9336180 0.9415
50 0.9651313 0.9684
55 0.9828969 0.9859
60 0.9921663 0.9961
65 0.9966494 0.9972
70 0.9986618 0.9995
75 0.9995009 0.9996
80 0.9998262 0.9998
85 0.9999435 1.0000
90 0.9999828 1.0000
95 0.9999951 1.0000
100 0.9999987 1.0000
#Graph
plot(Real_Approximation~Estimated, 
     main="Real vs Estimation",
     col="red", #,
     lwd=2)     


abline(a=0,      
       b = 1,     
       col="black", 
       lwd=2)