The Binomial Distribution

Introduction

In this document, we will review the binomial distribution, which is a probability distribution that governs a set of trials when three conditions are met: [i] binary outcomes, [ii] independence (the probability of A is not conditional of previous realizations) , and [iii] a constant probability (i.e. probability of outcome A and B stays fixed).

Consider a classic example of the binomial distribution: what is the probability of a certain number of boys out of a certain number of offspring? This question has only two discrete outcomes¹, a fixed probability for boys and girls, and independent set of trials (i.e. previously having a boy does not say anything about the probability of the sex of the next child). In our motivating example, we are going to assume a mother is going to have 3 children. We can therefore ask: what is the probability of having \(k\) boys, i.e. \(P(X=k)\).

Suppose \(k=2\), how many combinations of two boys out of three children can we find: \(\{MMF\},\{MFM\},\{FMM\}\). Clearly three. Note that this the combination of boys does not consider order. The number of permutations does consider order, in which case we would have six outcomes \(\{M_1M_2F_1\},\{M_2F_1M_1\},...,\{F_1M_2M_1\}\). We therefore introduce our first two formulas: \[\text{Combination: } \hspace{1em} C(n,k)=\frac{n!}{n!(n-k)!}\] \[\text{Permutation: } \hspace{1em} P(n,k)=\frac{n!}{(n-k)!}\]

Returning to our specific example, if there are three combinations of having two boys out of three children, then the probability of having two boys is: \(3 \cdot p(Boy)\cdot p(Boy) \cdot p(Not Boy)=3\cdot (\frac{1}{2})^3=\frac{3}{8}\).² We can now introduce our binomial distribution formula. As the equation below shows, \(Binomial(n=3,k=2,p=0.5)\) will result in \(3/8\).

\[\text{Binomial(n,k,p): } \hspace{1em} C(n,k)*p^{k}*(1-p)^{n-k}\]

When in doubt, use pseudo-random numbers

In the previous section I asserted the definition of the binomial distribution, and given the source of the authorship you are no doubt right to be concerned about its empirical validity. Therefore, lets use R to simulative the relative frequencies of six children, and compare this to

f <- 1000 # Number of samples
k <- 6 # Number of children
# If f families have k kids, what percent of the families will have m boys of k kids?
f.sampl <- 1:f %>% lapply(.,function(f) runif(k) %>% round(0))
f.count <- f.sampl %>% lapply(sum) %>% unlist # Get the frequency table
f.freq <- table(f.count)/length(f.count) # Get the relative frequency table
f.df <- data.frame(f.freq) # Turn into data.frame for ggplot
# Create the ggplot
m.births.gg <- ggplot(f.df,aes(x=f.count,y=Freq*100)) + 
                  geom_bar(stat='identity',fill='blue') + 
                  ylab('Frequency (%)') + xlab('Number of boys (out of 6)') + 
                  ggtitle('Chart1: Simulated Distribution of Male Births')

As Chart 1 below shows, we get a nice bell-shaped distribution of boys, with three boys occuring the plurarilty of the time (~31%). We can confirm this number by using our probability and cumulative distribution function, found in the R code attached at the bottom of the page (p.bn and P.bn).

# pdf/cdf table
pcdf.table <- mapply(function(k1,k2) c(p.bn(k1,n=6,p=0.5),P.bn(k2,n=6,p=0.5)),1:6,1:6) %>% t %>%
                    data.frame(row.names=1:6) %>% round(3) %>% setNames(c('pdf','cdf'))

##     pdf   cdf
## 1 0.094 0.984
## 2 0.234 0.891
## 3 0.312 0.656
## 4 0.234 0.344
## 5 0.094 0.109
## 6 0.016 0.016

### Inference

This is a good point to segway to the likelihood function of our binomial distribution. A likelihood function is an expression of a probability distribution as a function of the parameters that govern the distribution, conditional on a set of data: \(L(p|X)\). Up until now, we have thought of our binomial distribution as a function of \(k\), as \(Binomial(.)=f(k;n,p)\), as \(n\) and \(p\) were assumed to be known (we knew how many children there were and the probability of the sexes). The likelihood considers the opposite scenario where we have observed the data (i.e. we know how many children (\(n\)) were born and how many of them were male (\(k\))) and we want to ask how likely is a given assumption of \(p\) given the data we have seen. In other words, each \(L(p_i;n,k)\) says given the null hypothesis of \(p_i\), what is the probability of observing this data? In the example below, we’ll have 10 boys be born out of 25 children, and plot the likelihood function.

k0 <- 10 # Number of boys
n0 <- 25 # Number of children
## ggplot
likeli.gg <- ggplot(data.frame(x=c(0,1)),aes(x)) +
                stat_function(fun=p.bn,color='blue',geom='line',size=1.5,args=list(k=k0,n=n0)) +
                ylab('Likelihood') + xlab('p') + 
                ggtitle('Chart 2: Likelihood for 10 boys (of 25 kids)') +
                geom_vline(xintercept=k0/n0,color='red',linetype='dashed',size=1.5)

As Chart 2 shows, the likelihood function peaks at \(p=0.4\), which is just the frequency of boys to children in this sample. We can then ask, is the difference between the two null hypothesis \(L(p_{max}=0.4)\) and the fair-sex hypothesis \(L(p_{fair}=0.5)\) sufficiently different to warrant changing our view of an equal number of boys and girls?

This is a simplifcation as some babies are born intersex.↩
Note that the order of the product is irrelevent of course, and we could have written: \(p(Boy) \cdot p(Not Boy) \cdot p(Boy)\).↩