Suppose we have a sample of \(n\) people, let’s take the days of the year such that 1 = January 1st and 365 = December 31st (assuming a “normal” year), then considering the hypothesis that being born on any day of the year is equalt likely, what is the probability that AT LEAST two of the \(n\) people were born on the exact same day?
In the context of probability, it’s often surprising how seemingly unlikely events become highly probable in certain scenarios. One such scenario is the probability of two people in a group having the same birthday. For instance, in a group of at least 23 randomly chosen people, the probability of at least two people sharing a birthday is more than 50%. When the number of people is 42 or more, the probability increases to more than 90% In this code, we will calculate and simulate this probability using R.
The first person has a 100 % chance of a unique number (of course)
The second has a (1 - 1/365) chance (all but 1 number from the 365)
The third has a (1 - 2/365) chance (all but 2 numbers)….
To compute the probability \(p(n)\) that in a group of \(n\) people, at least two share the same birthday, we assume that all birthdays are equally likely and distributed uniformly over a year of 365 days.
To simplify the problem, we first calculate the probability that all \(n\) birthdays are different. If \(n\) is greater than 365, by the pigeonhole principle, at least two people must have the same birthday, so the probability is 0. When \(n\) is less than or equal to 365, we can calculate the probability as follows:
\[1-p(n) = 1 * (1 – \frac{1}{365}) * (1 – \frac{2}{365}) * \cdot \cdot \cdot * (1- \frac{n-1}{365}) = \frac{365 \cdot 364 \cdot {}\cdot{} \cdot {} (365 -n+1) }{365^n}=\frac{365! }{365^n(365-n)!} \]
To carry out approximations when n is getting larger and larger, we use the following lemma:
If \(p_n\) tends to 0 and \(n p_n\) tends to lambda as n approaches 0, then:
\[ (1 -p_n)^n = e^{-\lambda}\] Therefore, \[p^{c}_n = 1 \cdot e^{-1/365} \cdot e^{-2/365} . . . \cdot e^{-(n-1)/365} = e^{\frac{-n^2+n}{730}}\]
Graph of probabililty of no match against n:
library(ggplot2)
# Generate data
n <- 1:60
probs <- exp((-n^2 + n)/730)
# Find n values where probability is 50% and 90%
n_50 <- n[which.min(abs(probs - 0.5))]
n_90 <- n[which.min(abs(probs - 0.9))]
n_99 <- n[which.min(abs(probs - 0.99))]
# Create plot
ggplot(data = data.frame(n, probs), aes(x = n, y = probs)) +
geom_line() +
scale_y_log10() +
geom_vline(xintercept = n_50, linetype = "dashed", color = "red") +
geom_vline(xintercept = n_90, linetype = "dashed", color = "blue") +
geom_vline(xintercept = n_99, linetype = "dashed", color = "lightblue") +
geom_label(aes(x = n_50, y = probs[which.min(abs(n - n_50))] + 0.01,
label = paste0("n = ", n_50, "\n50%"), fill = "red"),
vjust = 0, color = "white", size = 3, show.legend = FALSE,
label.padding = unit(0.5, "lines"), label.size = NA) +
geom_label(aes(x = n_90, y = probs[which.min(abs(n - n_90))] + 0.01,
label = paste0("n = ", n_90, "\n90%"), fill = "blue"),
vjust = 0, color = "white", size = 3, show.legend = FALSE,
label.padding = unit(0.5, "lines"), label.size = NA) +
geom_label(aes(x = n_99, y = probs[which.min(abs(n - n_99))] + 0.01,
label = paste0("n = ", n_99, "\n99%"), fill = "lightblue"),
vjust = 0, color = "white", size = 3, show.legend = FALSE,
label.padding = unit(0.5, "lines"), label.size = NA) +
labs(x = "n", y = expression(e^frac(-n^2 + n, 730)),
title = "No Match Probabilities as a function of n") +
theme_minimal() +
theme(legend.position = "none")
Graph Matching Probabilities
library(ggplot2)
# Generate data
n <- 1:60
probs <- 1-exp((-n^2 + n)/730)
# Find n values where probability is 50% and 90%
n_50 <- n[which.min(abs(probs - 0.5))]
n_90 <- n[which.min(abs(probs - 0.9))]
n_99 <- n[which.min(abs(probs - 0.99))]
# Create plot
ggplot(data = data.frame(n, probs), aes(x = n, y = probs)) +
geom_line() +
scale_y_log10() +
geom_vline(xintercept = n_50, linetype = "dashed", color = "red") +
geom_vline(xintercept = n_90, linetype = "dashed", color = "blue") +
geom_vline(xintercept = n_99, linetype = "dashed", color = "lightblue") +
geom_label(aes(x = n_50, y = probs[which.min(abs(n - n_50))] + 0.01,
label = paste0("n = ", n_50, "\n50%"), fill = "red"),
vjust = 0, color = "white", size = 3, show.legend = FALSE,
label.padding = unit(0.5, "lines"), label.size = NA) +
geom_label(aes(x = n_90, y = probs[which.min(abs(n - n_90))] + 0.01,
label = paste0("n = ", n_90, "\n90%"), fill = "blue"),
vjust = 0, color = "white", size = 3, show.legend = FALSE,
label.padding = unit(0.5, "lines"), label.size = NA) +
geom_label(aes(x = n_99, y = probs[which.min(abs(n - n_99))] + 0.01,
label = paste0("n = ", n_99, "\n99%"), fill = "lightblue"),
vjust = 0, color = "white", size = 3, show.legend = FALSE,
label.padding = unit(0.5, "lines"), label.size = NA) +
labs(x = "n", y = expression(e^frac(-n^2 + n, 730)),
title = "No Match Probabilities as a function of n") +
theme_minimal() +
theme(legend.position = "none")
## Warning: Transformation introduced infinite values in continuous y-axis
Let us consider \(n=180\). The first question we should ask is: how do we define a success? To answer this, we generate a sample of size \(n\) (representing the individuals) with replacement from the numbers 1 to 365.
n<-180 ; set.seed(2)
M<-sample(x = 1:365, #vector 1:365
size = n, #n size sample
replace = T) #with replacement.
M
## [1] 341 198 262 273 349 204 297 178 75 131 306 311 63 136 231 289 54 361
## [19] 112 171 38 361 110 144 45 238 208 134 339 9 350 130 244 3 129 304
## [37] 297 301 289 274 8 164 350 37 226 149 205 327 242 358 44 276 156 9
## [55] 106 175 306 326 182 224 271 13 328 189 96 333 166 265 53 143 36 17
## [73] 241 80 127 267 329 79 362 78 204 153 192 23 28 1 196 342 185 221
## [91] 102 151 348 71 275 293 212 239 269 133 154 163 91 352 225 191 142 25
## [109] 75 197 291 38 1 172 318 297 337 343 136 324 94 34 72 77 81 176
## [127] 231 29 51 200 289 29 298 82 212 109 12 222 208 54 76 195 53 315
## [145] 229 242 175 96 118 163 180 107 258 247 14 354 56 222 332 347 139 318
## [163] 35 342 217 78 344 70 345 95 150 103 299 272 216 38 109 318 267 306
Within this vector, we need to check if we have at least one duplicate value, which would mean that two people have the same exact birthday. To achieve this, we will use the function which, if there is a duplicate value, returns the id (or position within the vector) of the first duplicated value, i.e., if an “x” has already appeared before, it returns the id of the next immediate value identical to “x”; if there are no duplicates, then it will return “0”.
id<-anyDuplicated(x = M) ; id
## [1] 22
#Which?
which(M==M[id])
## [1] 18 22
In this case, the number in the 18th entry was repeated in the 22st entry, as indicated by the value returned by the function.
Therefore, we will have a success if \(\neq 0\) (as this would indicate that at least two people were born on the same day in our sample). Let’s mark a success as 1 and a failure as 0:
try<-anyDuplicated(x = M)
#Success=1, Failure=0
success<-ifelse(test = try!=0,yes = 1,no = 0) ; success
## [1] 1
To estimate the desired probability empirically, we need to repeat
the experiment of generating random samples and checking for duplicates
many times and calculate the proportion of successful trials out of
total trials. To repeat the experiment, we can use the function
replicate, which repeats a certain experiment a specified
number of times. For this problem, we can consider a sample size of
\(n=7\) and a number of trials \(Trials=100,000\):
set.seed(66); n<-7 ; trials<-100000
M<-replicate(n = trials, # "trialt" times, n:
expr = sample(x = 1:365, size = n, replace = T))
So, Our object its a \(n\times\text{trials}\) matrix with numbers between 1 and 365 (distributed uniformly), which tells us that we have “trials” of \(n\) subjects each, and each column is a sample. We apply per column:
# try per columns:
try<-apply(X = M, #Matrix M,
MARGIN = 2, #PER COLUMNS,
FUN = anyDuplicated) #Function anyDuplicated.
We turn 1 into successes, and 0 to failures and we can estimate the probability as:
#Entry by entry, if success,1, if not, 0.
successes<-ifelse(test = try!=0,yes = 1,no = 0)
#length(successes)
#Estimated Probabilty:
no.successes<-sum(successes)
no.successes/trials
## [1] 0.05572
theoretically, for \(n\), the real probability is:
1- exp((-n^2 + n)/730)
## [1] 0.05591044
Now, lets try more values of \(n\) by using two functions. One for estimating and other for “real” probabilities.
prob.estimated<-function(n=100,trials=10000){
M<-replicate(n = trials, #do "trials" times the following expression:
expr = sample(x = 1:365, size = n, replace = T))
#per columns:
try<-apply(X = M, # Matrix M,
MARGIN = 2, #per col,
FUN = anyDuplicated) #anyDuplicated.
successes<-ifelse(test = try!=0,yes = 1,no = 0)
#Estimated:
no.successes<-sum(successes)
return(no.successes/trials)
}
#REAL FUNCTION:
prob.aprox<-function(n){
return(1- exp((-n^2 + n)/730))
}
n<-seq(from = 5, # 5
to = 100, # 50
by = 5) #.
#Aprox
Real_Approximation<-sapply(X = n,
FUN = prob.aprox)
#Estimated
set.seed(21)
Estimated<-sapply(X = n,
FUN = prob.estimated)
Res<-data.frame(n,Real_Approximation,Estimated)
library(knitr) ; kable(Res)
| n | Real_Approximation | Estimated |
|---|---|---|
| 5 | 0.0270254 | 0.0256 |
| 10 | 0.1159907 | 0.1210 |
| 15 | 0.2499919 | 0.2505 |
| 20 | 0.4058051 | 0.4075 |
| 25 | 0.5604122 | 0.5654 |
| 30 | 0.6963200 | 0.7096 |
| 35 | 0.8040973 | 0.8070 |
| 40 | 0.8819900 | 0.8847 |
| 45 | 0.9336180 | 0.9415 |
| 50 | 0.9651313 | 0.9684 |
| 55 | 0.9828969 | 0.9859 |
| 60 | 0.9921663 | 0.9961 |
| 65 | 0.9966494 | 0.9972 |
| 70 | 0.9986618 | 0.9995 |
| 75 | 0.9995009 | 0.9996 |
| 80 | 0.9998262 | 0.9998 |
| 85 | 0.9999435 | 1.0000 |
| 90 | 0.9999828 | 1.0000 |
| 95 | 0.9999951 | 1.0000 |
| 100 | 0.9999987 | 1.0000 |
#Graph
plot(Real_Approximation~Estimated,
main="Real vs Estimation",
col="red", #,
lwd=2)
abline(a=0,
b = 1,
col="black",
lwd=2)