Proba Sim 24 - Session 9

c.f. https://rpubs.com/janselmi/TD2 for exercises and code of this session
c.f. https://github.com/jonatha-anselmi/INFO4-PS for course page
c.f. https://github.com/davidelegacci/probasim24 for tutorial page

Mr. and Mrs. Smith

Directly

Label the possible configurations as $1 = FF, 2 = FM, 3 = MF, 4 = MM$
Draw $N$ random integers between 1 and 4 with uniform probability, allowing repetition (replace = TRUE)
every drawn integer is a family configuration (two children, either male or female)

N=10

S=sample(x=1:4,size=N,replace=T)
print(S)

##  [1] 2 4 2 1 2 3 1 1 3 2

We want to se how many times

given that one is a girl (that is, eliminating all the 4s from the list)
the other is a boy (that is, how many 2s and 3s are left)

print(S[S<4])

## [1] 2 2 1 2 3 1 1 3 2

S[S < 4]: This filters the elements in S where the condition S < 4 is TRUE. In other words, it keeps only those elements in S that are less than 4, that in this example are 9 over the initial 10.
Once there are only 1s, 2s and 3s in S, we need to calculate the ratios of 2s and 3s.

print(S[S<4]!=1)

## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE

print(  mean(S[S<4]!=1)  )

## [1] 0.6666667

(S[S < 4] != 1 is a logical vector; TRUE is evaluated as 1 and FALSE as 0
mean on this logical vector returns the ratios of 2s and 3s in the list, that is 6 over 9, i.e. 0.6666667.

Putting all together and running the experiment many times,

N=100000
S=sample(x=1:4,size=N,replace=T)

prob=mean(S[S<4]!=1)
print(prob)

## [1] 0.6652858

Alternatively, mirroring more explicitly the formula for the conditional probability \[P(A | B) = \frac{P(A \text{ and } B)}{P(B)}\]

prob = length( S[S<4 & S > 1] ) / length( S[S<4] )
print(prob)

## [1] 0.6652858

Some practice with dataframes

# Number of experiments, that is family instances
N = 10

Create a Data Frame where each family (row) is assigned a possible configuration (the column values)

df = data.frame(
    C1 = sample(x = c("F", "G"), size = N, replace = TRUE),
    C2 = sample(x = c("F", "G"), size = N, replace = TRUE)
)

This creates a data frame df with two columns, C1 and C2, each containing N random samples of either "F" (female) or "G" (male).
sample(x = c("F", "G"), size = N, replace = TRUE) randomly samples "F" or "G" N times, allowing repetitions (replace = TRUE), meaning each child can independently be male or female.
Each row in df represents a “family” with two children, where C1 is the gender of the first child and C2 is the gender of the second child.

print(df)

##    C1 C2
## 1   G  F
## 2   G  F
## 3   G  F
## 4   F  F
## 5   F  F
## 6   F  G
## 7   G  F
## 8   F  G
## 9   G  F
## 10  G  F

next, filter the families with at Least one female

df = df[df$C1 == "F" | df$C2 == "F", ]

This line filters df to include only families where at least one child is female.
df$C1 == "F" | df$C2 == "F" creates a logical vector that is TRUE for rows where C1 or C2 is "F".
df[...] subsets df to include only the rows that meet this condition.
After this line, df contains only families with at least one female child.

print(df)

##    C1 C2
## 1   G  F
## 2   G  F
## 3   G  F
## 4   F  F
## 5   F  F
## 6   F  G
## 7   G  F
## 8   F  G
## 9   G  F
## 10  G  F

next calculate the probability of having a boy in these families

prob = nrow(df[df$C1 == "G" | df$C2 == "G", ]) / nrow(df)

df[df$C1 == "G" | df$C2 == "G", ] filters df again, but this time it keeps only rows where at least one child is male (C1 == "G" | C2 == "G").
nrow(df[df$C1 == "G" | df$C2 == "G", ]) gives the number of families with at least one boy, among those with at least one female
nrow(df) gives the total number of families with at least one female
The division nrow(...) / nrow(...) calculates the probability that a family with at least one female also has a male, that is

prob

## [1] 0.8

Putting all together and letting N big,

N=1000000
df=data.frame(C1=sample(x=c("F","G"),size=N,replace=T),
              C2=sample(x=c("F","G"),size=N,replace=T));
df=df[df$C1=="F" | df$C2=="F",] # only families with at least a female
prob=nrow(df[df$C1=="G" | df$C2=="G",])/nrow(df) # how many families with a boy (among the ones with at least a female)
prob

## [1] 0.6660227

A winner among K

$\newcommand{\winner}{i^{\ast}}$ $\newcommand{\players}{K}$ Along the lines of the solution in https://rpubs.com/janselmi/TD2, the sample space is $\Omega = \{0, 1\}^k$ endowed with the uniform probability (with $k \geq 3$). That is, let $\omega = (\omega_i)_{i = 1, \dots, k} \in \Omega$ be the profile of coin tosses, where $\omega_i \in {0, 1}$ is the outcome of player’s $i$ toss, for all $i \in \players := \{1, \dots, k\}$. We have $|{\Omega}| = 2^k$, and $p(\omega) = \frac{1}{2^k}$. The event “there is a winner” is \[W \subset \Omega = \{\omega \in \Omega: \text{there exists precisely one } \winner \in \players \text{ such that } \omega_\winner \neq \omega_i \forall i \in \players - \{\winner\} \} \] We can enumerate the points in $W$:

1: the first player tosses $0$ and all other toss $1$
2: the second player tosses $0$ and all other toss $1$
…
k: the $k$-th player tosses $0$ and all other toss $1$
k+1: the first player tosses $1$ and all other toss $0$
k+2: the second player tosses $1$ and all other toss $0$
…
2k: the $k$-th player tosses $1$ and all other toss $0$

Thus we have $|W| = 2k$, and \[p(W) = \frac{2k}{2^k} = \frac{k}{2^{k-1}}\]

Let $T_k$ be the RV denoting the number of games to play for observing a winner. What are the law and expectation of this RV?

$\newcommand{pber}{p_k}$ Each play is a Bernoulli event with probability $\pber$, thus by definition $T_k$ follows a geometric distribution with parameter $\pber)$, i.e. for all $t = 1, 2, \dots$ we have

\[ P(T_k = t) = \pber(1-\pber)^{t-1} \]

with expected value $\frac{1}{\pber}$.

R code

Direct solution

Set the Number of Simulations and Players

N = 10
K = 5

for each of the N trials generate a game instance, that is a binary vector of size K
represent it as a matrix with N rows and K columns.

S=replicate(K, sample(x=0:1,N,replace=T)) 
print(S)

##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    1    1    0    0
##  [2,]    0    0    1    0    0
##  [3,]    0    0    0    1    1
##  [4,]    0    1    0    0    0
##  [5,]    0    1    1    0    1
##  [6,]    1    1    0    0    0
##  [7,]    1    0    1    0    0
##  [8,]    1    0    0    0    0
##  [9,]    1    1    1    1    0
## [10,]    1    0    1    1    1

Next, calculate the probability of having exactly 1 or (K - 1) ones in each row:

p_K = mean(rowSums(S)==1 | rowSums(S)==K-1)

rowSums(S) calculates the sum of each row in the matrix S, which gives the count of 1s in each row.
rowSums(S) == 1 | rowSums(S) == K - 1 is a logical vector that is TRUE for rows where the sum is exactly 1 or K - 1 (in this case, 5 out of 10).
mean(...) calculates the average of this logical vector, effectively giving the probability that a row has exactly 1 or K - 1 ones.

p_K

## [1] 0.5

Putting all together for N large,

N=1000000
K=5
S=replicate(K, sample(x=0:1,N,replace=T)) 
p_K=mean(rowSums(S)==1 | rowSums(S)==K-1)
p_K

## [1] 0.312423

Notice that the theoretical value is $\frac{5 }{ 2^{4} } = 0.3125$.

Solution with dataframes

Set the Number of Simulations N and Players K

N = 10
K = 5

First, we represent the history of plays of one player as a data frame df_parties with one column, called x1, and N rows, containing random binary values (0 or 1).

# generates a sample of `0`s and `1`s of length `N`.
df_parties = data.frame(x1 = sample(x = c(0,1), size = N, replace = TRUE))
print(df_parties)

##    x1
## 1   0
## 2   1
## 3   0
## 4   1
## 5   1
## 6   0
## 7   0
## 8   0
## 9   1
## 10  1

Then, we add the history of plays of the other players (that is, we add K-1 columns of the dataframe):

for(i in 2:K) { 
    df_parties = cbind(df_parties, data.frame(x = sample(x = c(0,1), size = N, replace = TRUE)))
    names(df_parties)[i] = paste0("x", i)
}
print(df_parties)

##    x1 x2 x3 x4 x5
## 1   0  1  0  0  0
## 2   1  0  1  1  0
## 3   0  1  0  0  0
## 4   1  1  0  0  1
## 5   1  0  0  0  0
## 6   0  0  0  0  1
## 7   0  1  1  0  1
## 8   0  0  1  1  0
## 9   1  0  0  1  1
## 10  1  1  1  0  1

This loop adds K-1 additional columns to df_parties, each filled with random binary values (0 or 1).
In each iteration, we add a new column to df_parties using cbind() and rename it as "x2", "x3", etc., using paste0("x", i).

Initialize a Sum Column

We add a column called sum to df_parties and initialize it to 0. This column will store the row-wise sum of the values in each x column.

df_parties$sum = 0
print(df_parties)

##    x1 x2 x3 x4 x5 sum
## 1   0  1  0  0  0   0
## 2   1  0  1  1  0   0
## 3   0  1  0  0  0   0
## 4   1  1  0  0  1   0
## 5   1  0  0  0  0   0
## 6   0  0  0  0  1   0
## 7   0  1  1  0  1   0
## 8   0  0  1  1  0   0
## 9   1  0  0  1  1   0
## 10  1  1  1  0  1   0

Calculate the Sum of Each Row and store it in the sum column

for(i in 1:K) { 
    df_parties$sum = df_parties$sum + df_parties[, paste0("x", i)]
}
print(df_parties)

##    x1 x2 x3 x4 x5 sum
## 1   0  1  0  0  0   1
## 2   1  0  1  1  0   3
## 3   0  1  0  0  0   1
## 4   1  1  0  0  1   3
## 5   1  0  0  0  0   1
## 6   0  0  0  0  1   1
## 7   0  1  1  0  1   3
## 8   0  0  1  1  0   2
## 9   1  0  0  1  1   3
## 10  1  1  1  0  1   4

This loop iterates over each column from x1 to xK, adding each value to the sum column.
The result is that df_parties$sum contains the total number of 1s in each row.

Calculate the Probability

Finally, calculate the probability that each row has exactly 1 or K-1 (in this case 4) ones
mean(df_parties$sum == 1 | df_parties$sum == K - 1) calculates the fraction of rows where sum is 1 or K-1.

p_K = mean(df_parties$sum == 1 | df_parties$sum == K - 1)
p_K

## [1] 0.5