Binomial Distribution

Author

Dr Andrew Dalby

Typically the binomial distribution is introduced by using coin tosses. The binomial is used to calculate the number of successes (head or tails whichever you prefer) in a designated number of flips. The problem with this is that the probability is symmetric. By that I mean the probability for a success is the same as that for a failure and this can result in some confusion and so I am going to use asymmetric examples where the probabilities are different.

Formal Mathematical Definition

The binomial distribution gives the probability of the number of successes (k) from n Bernoulli trials. Where a Bernoulli trial is a random experiment with two possible outcomes that are defined as success or failure.

These trials are named after Jacob Bernoulli (there are a lot of Bernoulli family members involved in maths).

It is defined by two parameters, n - the number of trials and p the probability of a success.

\[ f(k,n,p)= P(X=k) = {n \choose k }p^{k}(1-p)^{n-k} \]

Sometimes 1-p is written as q.

The binomial distribution is:

A discrete random variable (total probability is 1)
The probability of a success is constant.
Only valid for a fixed number of trials (n)

We have already seen an example of the binomial distribution in practice in the example I gave for a discrete random variable, where a success was finding a cytosine in a window of 3 nucleotides of sequence if all the nucleotides have the same abundance (n=3 for that example and p=0.25). This was also used to prove that the distribution is a random variable and that the probabilities all add up to one.

Rather than using the formula it is easier to use the built in binomial distribution function within R to calculate the probabilities.

From the previous calculations:

P(X=0) = P(3 [A,T,G]) = \((\frac{3}{4})^{3} = \frac{27}{64}\)

P(X=1) = P(C, 2 [A,T,G]) = \(3(\frac{1}{4}(\frac{3}{4})^{2}) = \frac{27}{64}\)

P(X=2) = P(CC, [ATG]) = \(3((\frac{1}{4})^{2}\frac{3}{4})=\frac{9}{64}\)

P(X=3) =P(CCC) = \((\frac{1}{4})^{3} = \frac{1}{64}\)

library(ggplot2)

n <- c(0:3)
p <- c(27/64,27/64,9/64,1/64)
data <- data.frame(n,p)
plot <- ggplot(data=data, aes(x=n, y=p))+
  geom_bar(stat="identity", fill="cornflowerblue") +
  labs(title="Probability Distribution for the Number of \n Cytosines Found in a Three Nucleotide Window Assuming Equal Abundance", x="Number of Cytosines", y="Probability")
plot

library(ggplot2)

x <- c(0:3)
y <- dbinom(x, 3, 0.25)
data <- data.frame(x,y)
plot <- ggplot(data=data, aes(x=x, y=y))+
  geom_bar(stat="identity", fill="cadetblue") +
  labs(title="Probability Distribution for the Number of \n Cytosines Found in a Three Nucleotide Window Assuming Equal Abundance", x="Number of Cytosines", y="Probability")
plot

This is exactly the same plot as we calculated before.

Example Question

It is known that 20% of a population of laboratory mice are infertile.

a) if 5 mice are chosen at random, find the probability that exactly 2 of them are infertile

b) Find the least number of mice that need to chosen for the probability of choosing at least one infertile mouse to be greater than 0.99

For part a we can use R

dbinom(2, 5, 0.2)

[1] 0.2048

For the second part we have to calculate the number of trials which is a bit more complex.

\(P(X \ge 1) > 0.99 = 1 - P(X=0) = 1-(0.8)^{n}\)

\((0.8)^{n} < 0.01\)

\(n log(0.8) < log(0.01)\)

\(n > \frac{log(0.01)}{log(0.8)}\)

log(0.01)/log(0.8)

[1] 20.6377

So \(n \ge 21\)

More Examples

Assuming that the sex of any child born to a particular couple is equally likely to be male or female, find the probability that if they have 4 children they are all girls.

dbinom(4,4,0.5)

[1] 0.0625

It is known that 15% of the apples produced by an orchard are blemished. The apples are packed into boxes of 10. Find the probability that a box contains more than two blemished apples.

To calculate this you need to work out the probability of having 2 or less blemished apples and subtract it from 1.

1-pbinom(2,10,0.15)

[1] 0.1798035

In a large population of aphids, 10% are type A and the remainder are type B. If five aphids are selected at random, what is the probability that they are all of type B?

dbinom(0,5,0.1)

[1] 0.59049

In the same population of aphids what is the size of the sample that must be selected from the population so that the probability of there being at least one type A aphid is greater than 0.9?

P(X>1) > 0.9

\((0.9)^{n}<0.1\)

\(n log(0.9) < log(0.1)\)

log(0.1)/log(0.9)

[1] 21.85435

\(n \ge 22\)

These calculations are VERY IMPORTANT for sample size calculations when you have a population containing a small number of the people affected by the condition that you want to study. If the numbers are very low then a case-control study makes more sense rather than a cohort study.

The final part of this series is going to cover the geometric/exponential distribution and the Poisson Distribution which are closely connected to one another.

Appendix - Example Binomial Distributions

library(ggplot2)
library(RColorBrewer)
library(knitr)

x <- c(0:4)
success <- x
success <- as.factor(success)
prob <- dbinom(x, 4, 0.2)
data <- data.frame(success,prob)

#| label: tbl-binomial1
#| tbl-cap: "Binomial n=4 p=0.2 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )

Successes	Probability
0	0.4096
1	0.4096
2	0.1536
3	0.0256
4	0.0016

p <- ggplot(data=data, aes(x=success, y=prob, fill=success))+
  geom_bar(stat="identity", color="black") +
  scale_fill_brewer(palette = "Blues") +
  theme(legend.position="none") +
  labs(title="The n=4 p=0.2 Binomial Distribition" , x="Successes", y="Probability")

p

library(ggplot2)
library(RColorBrewer)
library(knitr)

x <- c(0:4)
success <- x
success <- as.factor(success)
prob <- dbinom(x, 4, 0.5)
data <- data.frame(success,prob)

#| label: tbl-binomial1
#| tbl-cap: "Binomial n=4 p=0.5 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )

Successes	Probability
0	0.0625
1	0.2500
2	0.3750
3	0.2500
4	0.0625

p <- ggplot(data=data, aes(x=success, y=prob, fill=success))+
  geom_bar(stat="identity", color="black") +
  scale_fill_brewer(palette = "Purples") +
  theme(legend.position="none") +
  labs(title="The n=4 p=0.5 Binomial Distribition" , x="Successes", y="Probability")

p

library(ggplot2)
library(RColorBrewer)
library(knitr)

x <- c(0:3)
success <- x
success <- as.factor(success)
prob <- dbinom(x, 3, 0.4)
data <- data.frame(success,prob)

#| label: tbl-binomial1
#| tbl-cap: "Binomial n=3 p=0.4 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )

Successes	Probability
0	0.216
1	0.432
2	0.288
3	0.064

p <- ggplot(data=data, aes(x=success, y=prob, fill=success))+
  geom_bar(stat="identity", color="black") +
  scale_fill_brewer(palette = "Greens") +
  theme(legend.position="none") +
  labs(title="The n=3 p=0.4 Binomial Distribition" , x="Successes", y="Probability")

p

library(ggplot2)
library(RColorBrewer)
library(knitr)

x <- c(0:7)
success <- x
success <- as.factor(success)
prob <- dbinom(x, 7, 0.4)
data <- data.frame(success,prob)

#| label: tbl-binomial1
#| tbl-cap: "Binomial n=7 p=0.4 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )

Successes	Probability
0	0.0279936
1	0.1306368
2	0.2612736
3	0.2903040
4	0.1935360
5	0.0774144
6	0.0172032
7	0.0016384

p <- ggplot(data=data, aes(x=success, y=prob, fill=success))+
  geom_bar(stat="identity", color="black") +
  scale_fill_brewer(palette = "Oranges") +
  theme(legend.position="none") +
  labs(title="The n=7 p=0.4 Binomial Distribition" , x="Successes", y="Probability")

p

library(ggplot2)
library(RColorBrewer)
library(knitr)

x <- c(0:10)
success <- x
success <- as.factor(success)
prob <- dbinom(x, 10, 0.4)
data <- data.frame(success,prob)

#| label: tbl-binomial1
#| tbl-cap: "Binomial n=10 p=0.4 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )

Successes	Probability
0	0.0060466
1	0.0403108
2	0.1209324
3	0.2149908
4	0.2508227
5	0.2006581
6	0.1114767
7	0.0424673
8	0.0106168
9	0.0015729
10	0.0001049

p <- ggplot(data=data, aes(x=success, y=prob, fill=success))+
  geom_bar(stat="identity", color="black") +
  scale_fill_brewer(palette = "BrBG") +
  theme(legend.position="none") +
  labs(title="The n=10 p=0.4 Binomial Distribition" , x="Successes", y="Probability")

p