library(ggplot2)
<- c(0:3)
n <- c(27/64,27/64,9/64,1/64)
p <- data.frame(n,p)
data <- ggplot(data=data, aes(x=n, y=p))+
plot geom_bar(stat="identity", fill="cornflowerblue") +
labs(title="Probability Distribution for the Number of \n Cytosines Found in a Three Nucleotide Window Assuming Equal Abundance", x="Number of Cytosines", y="Probability")
plot
Binomial Distribution
Typically the binomial distribution is introduced by using coin tosses. The binomial is used to calculate the number of successes (head or tails whichever you prefer) in a designated number of flips. The problem with this is that the probability is symmetric. By that I mean the probability for a success is the same as that for a failure and this can result in some confusion and so I am going to use asymmetric examples where the probabilities are different.
Formal Mathematical Definition
The binomial distribution gives the probability of the number of successes (k) from n Bernoulli trials. Where a Bernoulli trial is a random experiment with two possible outcomes that are defined as success or failure.
These trials are named after Jacob Bernoulli (there are a lot of Bernoulli family members involved in maths).
It is defined by two parameters, n - the number of trials and p the probability of a success.
\[ f(k,n,p)= P(X=k) = {n \choose k }p^{k}(1-p)^{n-k} \]
Sometimes 1-p is written as q.
The binomial distribution is:
A discrete random variable (total probability is 1)
The probability of a success is constant.
Only valid for a fixed number of trials (n)
We have already seen an example of the binomial distribution in practice in the example I gave for a discrete random variable, where a success was finding a cytosine in a window of 3 nucleotides of sequence if all the nucleotides have the same abundance (n=3 for that example and p=0.25). This was also used to prove that the distribution is a random variable and that the probabilities all add up to one.
Rather than using the formula it is easier to use the built in binomial distribution function within R to calculate the probabilities.
From the previous calculations:
P(X=0) = P(3 [A,T,G]) = \((\frac{3}{4})^{3} = \frac{27}{64}\)
P(X=1) = P(C, 2 [A,T,G]) = \(3(\frac{1}{4}(\frac{3}{4})^{2}) = \frac{27}{64}\)
P(X=2) = P(CC, [ATG]) = \(3((\frac{1}{4})^{2}\frac{3}{4})=\frac{9}{64}\)
P(X=3) =P(CCC) = \((\frac{1}{4})^{3} = \frac{1}{64}\)
library(ggplot2)
<- c(0:3)
x <- dbinom(x, 3, 0.25)
y <- data.frame(x,y)
data <- ggplot(data=data, aes(x=x, y=y))+
plot geom_bar(stat="identity", fill="cadetblue") +
labs(title="Probability Distribution for the Number of \n Cytosines Found in a Three Nucleotide Window Assuming Equal Abundance", x="Number of Cytosines", y="Probability")
plot
This is exactly the same plot as we calculated before.
Example Question
It is known that 20% of a population of laboratory mice are infertile.
a) if 5 mice are chosen at random, find the probability that exactly 2 of them are infertile
b) Find the least number of mice that need to chosen for the probability of choosing at least one infertile mouse to be greater than 0.99
For part a we can use R
dbinom(2, 5, 0.2)
[1] 0.2048
For the second part we have to calculate the number of trials which is a bit more complex.
\(P(X \ge 1) > 0.99 = 1 - P(X=0) = 1-(0.8)^{n}\)
\((0.8)^{n} < 0.01\)
\(n log(0.8) < log(0.01)\)
\(n > \frac{log(0.01)}{log(0.8)}\)
log(0.01)/log(0.8)
[1] 20.6377
So \(n \ge 21\)
More Examples
- Assuming that the sex of any child born to a particular couple is equally likely to be male or female, find the probability that if they have 4 children they are all girls.
dbinom(4,4,0.5)
[1] 0.0625
- It is known that 15% of the apples produced by an orchard are blemished. The apples are packed into boxes of 10. Find the probability that a box contains more than two blemished apples.
To calculate this you need to work out the probability of having 2 or less blemished apples and subtract it from 1.
1-pbinom(2,10,0.15)
[1] 0.1798035
- In a large population of aphids, 10% are type A and the remainder are type B. If five aphids are selected at random, what is the probability that they are all of type B?
dbinom(0,5,0.1)
[1] 0.59049
- In the same population of aphids what is the size of the sample that must be selected from the population so that the probability of there being at least one type A aphid is greater than 0.9?
P(X>1) > 0.9
\((0.9)^{n}<0.1\)
\(n log(0.9) < log(0.1)\)
log(0.1)/log(0.9)
[1] 21.85435
\(n \ge 22\)
These calculations are VERY IMPORTANT for sample size calculations when you have a population containing a small number of the people affected by the condition that you want to study. If the numbers are very low then a case-control study makes more sense rather than a cohort study.
The final part of this series is going to cover the geometric/exponential distribution and the Poisson Distribution which are closely connected to one another.
Appendix - Example Binomial Distributions
library(ggplot2)
library(RColorBrewer)
library(knitr)
<- c(0:4)
x <- x
success <- as.factor(success)
success <- dbinom(x, 4, 0.2)
prob <- data.frame(success,prob)
data
#| label: tbl-binomial1
#| tbl-cap: "Binomial n=4 p=0.2 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )
Successes | Probability |
---|---|
0 | 0.4096 |
1 | 0.4096 |
2 | 0.1536 |
3 | 0.0256 |
4 | 0.0016 |
<- ggplot(data=data, aes(x=success, y=prob, fill=success))+
p geom_bar(stat="identity", color="black") +
scale_fill_brewer(palette = "Blues") +
theme(legend.position="none") +
labs(title="The n=4 p=0.2 Binomial Distribition" , x="Successes", y="Probability")
p
library(ggplot2)
library(RColorBrewer)
library(knitr)
<- c(0:4)
x <- x
success <- as.factor(success)
success <- dbinom(x, 4, 0.5)
prob <- data.frame(success,prob)
data
#| label: tbl-binomial1
#| tbl-cap: "Binomial n=4 p=0.5 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )
Successes | Probability |
---|---|
0 | 0.0625 |
1 | 0.2500 |
2 | 0.3750 |
3 | 0.2500 |
4 | 0.0625 |
<- ggplot(data=data, aes(x=success, y=prob, fill=success))+
p geom_bar(stat="identity", color="black") +
scale_fill_brewer(palette = "Purples") +
theme(legend.position="none") +
labs(title="The n=4 p=0.5 Binomial Distribition" , x="Successes", y="Probability")
p
library(ggplot2)
library(RColorBrewer)
library(knitr)
<- c(0:3)
x <- x
success <- as.factor(success)
success <- dbinom(x, 3, 0.4)
prob <- data.frame(success,prob)
data
#| label: tbl-binomial1
#| tbl-cap: "Binomial n=3 p=0.4 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )
Successes | Probability |
---|---|
0 | 0.216 |
1 | 0.432 |
2 | 0.288 |
3 | 0.064 |
<- ggplot(data=data, aes(x=success, y=prob, fill=success))+
p geom_bar(stat="identity", color="black") +
scale_fill_brewer(palette = "Greens") +
theme(legend.position="none") +
labs(title="The n=3 p=0.4 Binomial Distribition" , x="Successes", y="Probability")
p
library(ggplot2)
library(RColorBrewer)
library(knitr)
<- c(0:7)
x <- x
success <- as.factor(success)
success <- dbinom(x, 7, 0.4)
prob <- data.frame(success,prob)
data
#| label: tbl-binomial1
#| tbl-cap: "Binomial n=7 p=0.4 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )
Successes | Probability |
---|---|
0 | 0.0279936 |
1 | 0.1306368 |
2 | 0.2612736 |
3 | 0.2903040 |
4 | 0.1935360 |
5 | 0.0774144 |
6 | 0.0172032 |
7 | 0.0016384 |
<- ggplot(data=data, aes(x=success, y=prob, fill=success))+
p geom_bar(stat="identity", color="black") +
scale_fill_brewer(palette = "Oranges") +
theme(legend.position="none") +
labs(title="The n=7 p=0.4 Binomial Distribition" , x="Successes", y="Probability")
p
library(ggplot2)
library(RColorBrewer)
library(knitr)
<- c(0:10)
x <- x
success <- as.factor(success)
success <- dbinom(x, 10, 0.4)
prob <- data.frame(success,prob)
data
#| label: tbl-binomial1
#| tbl-cap: "Binomial n=10 p=0.4 Distribition"
#| tbl-colwidths: [10,20]
#| echo: fenced
kable(data, col.names = c("Successes","Probability") )
Successes | Probability |
---|---|
0 | 0.0060466 |
1 | 0.0403108 |
2 | 0.1209324 |
3 | 0.2149908 |
4 | 0.2508227 |
5 | 0.2006581 |
6 | 0.1114767 |
7 | 0.0424673 |
8 | 0.0106168 |
9 | 0.0015729 |
10 | 0.0001049 |
<- ggplot(data=data, aes(x=success, y=prob, fill=success))+
p geom_bar(stat="identity", color="black") +
scale_fill_brewer(palette = "BrBG") +
theme(legend.position="none") +
labs(title="The n=10 p=0.4 Binomial Distribition" , x="Successes", y="Probability")
p