library(ggplot2)
# Number of cytosines in the window and the probability of each count
n <- c(0:3)
p <- c(27/64,27/64,9/64,1/64)
data <- data.frame(n,p)
# Bar chart of the probability distribution
plot <- ggplot(data=data, aes(x=n, y=p))+
  geom_bar(stat="identity", fill="cornflowerblue") +
  labs(title="Probability Distribution for the Number of \n Cytosines Found in a Three Nucleotide Window Assuming Equal Abundance", x="Number of Cytosines", y="Probability")
plot
Discrete Random Variables
This is going to be a much more text-heavy and much less coder-friendly page, as it introduces some of the theoretical framework and background.
As a personal note this is also based on the Further Mechanics and Probability Curriculum that I studied for my A-levels in the late 1980s.
Sometimes you need to have definitions in place in order to understand what is happening.
Definitions
A variable is an element that can take multiple different values.
These values can be numerical or descriptive.
If they are numerical then they can be discrete or continuous; if they are descriptive then they form discrete categories.
If they are descriptive they can be ordered or unordered.
A variable is represented by a capital letter e.g. X
The values of a variable are represented by lower case letters e.g. x
The possible values of x are defined for each variable.
Examples of Variables
The number of sixes on ten dice (discrete numerical)
The heights of students (continuous numerical)
Number of siblings (discrete numerical - but maybe more complex)
Favourite colours (discrete descriptive - categorical - nominal)
Size of t-shirts (discrete descriptive - categorical - ordinal)
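As a rough illustration of how these kinds of variable might be stored in R, the sketch below uses integer and numeric vectors for the numerical cases and factors for the descriptive ones. All of the values and variable names are made up for the example and are not from the text.

```r
# Illustrative values only; none of these data come from the text.
sixes   <- c(0L, 2L, 1L)                    # discrete numerical: counts
heights <- c(1.62, 1.75, 1.80)              # continuous numerical: metres
colours <- factor(c("red", "blue", "red"))  # nominal: unordered categories
sizes   <- factor(c("M", "S", "L"),
                  levels = c("S", "M", "L"),
                  ordered = TRUE)           # ordinal: ordered categories

is.ordered(colours)  # FALSE: favourite colours have no natural order
is.ordered(sizes)    # TRUE: S < M < L
sizes[2] < sizes[1]  # TRUE: "S" is smaller than "M" in the declared order
```

Comparisons such as `<` are only meaningful for the ordered factor, which mirrors the ordinal versus nominal distinction above.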
Examples of Variable Notation
X is the number of sixes on three dice, X = x for x = 0,1,2,3
X is a student’s favourite colour, X = x for x in {red, green, blue, orange, …}
X is the value from a single die roll, X = x for x=1,2,3,4,5,6
Already you can see that there are issues with these definitions and models where the reality is more complex. For example, there are many different variations of the major colours. In the case of siblings, how do you deal with step-siblings and half-siblings?
Reality is messy but we will keep it simple with clearly defined values for now.
The Properties of a Discrete Random Variable
Each of the possible values of x must be mutually exclusive.
The total probability for all of the possible values of x must be 1.
If I am counting the total number of cytosine nucleotides in a codon there can be 0, 1, 2 or 3. That defines all of the possibilities. If the total is 1 it cannot also be 2 or 3, so the events are mutually exclusive.
The probability for the events does not have to be equal. In some cases we can work out the probability by working out all of the possible alternative ways of getting x and counting them (working them out by enumeration).
In biology we often don’t know what the probabilities will be even if we do know all the possible values of x. In the worst possible case we don’t even know all the possible values for x.
Mathematically we can write this definition as:
If X is a random variable:
\[ \sum^{n}_{i=1}P(X=x_{i})=1 \]
If there are a small number of possible values of x, as in the examples, then we can often enumerate the probabilities and calculate the sum to check if this is true. We can also calculate this sum if the probabilities form a geometric progression.
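For a small set of values this check is easy to do in R. The sketch below is a minimal example, assuming the probabilities are supplied as a numeric vector; the helper name `is_valid_distribution` and the tolerance are hypothetical choices for this illustration.

```r
# Hypothetical helper: TRUE if every probability lies in [0, 1]
# and the probabilities sum to 1 (within a small tolerance).
is_valid_distribution <- function(p, tol = 1e-9) {
  all(p >= 0) && all(p <= 1) && abs(sum(p) - 1) < tol
}

die <- rep(1/6, 6)                  # single fair die, x = 1,2,3,4,5,6
is_valid_distribution(die)          # TRUE
is_valid_distribution(c(0.5, 0.4))  # FALSE: the total is only 0.9
```

A tolerance is used rather than an exact comparison because fractions such as 1/6 are not stored exactly in floating point.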
The Geometric Case
Imagine that the probability of an actin filament extending is 0.6 at each time step. If the random variable X is “the number of attempts needed to successfully add an actin monomer”, this can take any value from 1 to infinity. This makes it impossible to list the probability for every value of x one by one.
P(X=1) = 0.6
P(X=2) = (0.4)(0.6)
P(X=3) = (0.4)(0.4)(0.6) …
\[ \sum^{\infty}_{x=1}P(X=x)=\sum^{\infty}_{n=0}(0.6)(0.4)^{n} \]
The sum to infinity of a geometric progression with first term a and common ratio r (where |r| < 1) is given by:
\[ \frac{a}{1-r} \]
In this case a=0.6 and r=0.4 and so the sum is 1.
Therefore the probabilities sum to 1, and the number of attempts for the successful addition of an actin monomer is a valid discrete random variable.
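The geometric sum can also be checked numerically. The sketch below is illustrative only: it computes P(X = x) for the first 50 attempts, shows the running total approaching 1, and compares the result with the closed form a/(1-r). The cut-off of 50 terms and the variable names are assumptions made for the example. R's built-in `dgeom` counts failures before the first success, so it is called with x - 1.

```r
p_success <- 0.6
x <- 1:50                                   # first 50 attempts only

# P(X = x): x - 1 failures followed by one success
p_x <- p_success * (1 - p_success)^(x - 1)
p_x[1:3]                                    # P(X=1), P(X=2), P(X=3)

# Running total of the probabilities approaches 1
cumsum(p_x)[c(1, 5, 10, 50)]

# Closed-form sum to infinity with a = 0.6 and r = 0.4
a <- 0.6
r <- 0.4
a / (1 - r)

# Same probabilities from the built-in geometric distribution
all.equal(p_x, dgeom(x - 1, prob = p_success))
```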
Probability Distributions
If I have a moving window of three nucleotides along a biological sequence, and I assume that all four nucleotides have the same abundance within that sequence, then X, the number of cytosines in the window, can take the values 0, 1, 2 and 3.
For one cytosine this could be in the first, second or third position. There are therefore three ways of getting a single cytosine.
For two cytosines the single other base (A, T or G) could be in the first, second or third position, and so again there are three ways of getting two cytosines and a single other base.
P(X=0) = P(3 [A,T,G]) = \((\frac{3}{4})^{3} = \frac{27}{64}\)
P(X=1) = P(C, 2 [A,T,G]) = \(3(\frac{1}{4}(\frac{3}{4})^{2}) = \frac{27}{64}\)
P(X=2) = P(CC, [ATG]) = \(3((\frac{1}{4})^{2}\frac{3}{4})=\frac{9}{64}\)
P(X=3) =P(CCC) = \((\frac{1}{4})^{3} = \frac{1}{64}\)
We should do a quick “sanity check” to make sure that it is a random variable and that all the probabilities add up to 1.
The numerators sum to 27 + 27 + 9 + 1 = 64, so the probabilities sum to 64/64 = 1.
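The same counts can be reached by brute-force enumeration. The sketch below is an illustration, not part of the original page: it lists all 4^3 = 64 possible three-base windows with `expand.grid` and counts the cytosines in each one. The variable names are assumptions made for the example.

```r
bases <- c("A", "C", "G", "T")

# Every possible three-base window: 4 * 4 * 4 = 64 rows
windows <- expand.grid(first = bases, second = bases, third = bases,
                       stringsAsFactors = FALSE)

# Number of cytosines in each window
n_c <- rowSums(windows == "C")

# How many windows contain 0, 1, 2 or 3 cytosines
counts <- table(n_c)
counts

# With equal abundance every window is equally likely
probs <- counts / nrow(windows)
sum(probs)
```

The table of counts should match the numerators 27, 27, 9 and 1 worked out by hand above.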
The probability distribution can be plotted as a bar chart.
You might see some textbooks say that this is plotted as a histogram, but from my perspective a histogram is only for continuous data: it should not have gaps along the x-axis, and it can also have different widths for the ranges. This is an example of where there are differences of opinion. This does not mean that there is one right and one wrong answer; it means that different communities use the terms in different ways.
Probability Density Functions (p.d.f.s)
In the case above all of the probabilities were calculated in a similar way, using the probability of having a cytosine and the probability of not having a cytosine. I could write a formula for the calculation of the probability for each value of x.
P(X=x) = \({3\choose x} (\frac{1}{4})^{x}(\frac{3}{4})^{3-x}\)
You might recognise this as a specific example of the binomial distribution but we will come back to that later.
Notice also that this only applies for x = 0,1,2,3 and so the probability density function should be written as:
P(X=x) = \({3\choose x} (\frac{1}{4})^{x}(\frac{3}{4})^{3-x}\) for x = 0,1,2,3
or, alternatively and more formally:
\[ f(x)=\Biggl\{ \begin{array}{cc} {3\choose x} (\frac{1}{4})^{x}(\frac{3}{4})^{3-x} & 0 \le x \le 3\\ 0 & \text{otherwise} \\ \end{array} \]
The values of x are not infinite and they have a well defined range.
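The formula can be evaluated directly in R and compared with the built-in binomial function. This is a small sketch for checking the numbers; `dbinom` is used here only as a cross-check and the variable names are assumptions made for the example.

```r
x <- 0:3

# The formula written out: (3 choose x) (1/4)^x (3/4)^(3-x)
by_formula <- choose(3, x) * (1/4)^x * (3/4)^(3 - x)

# Multiply by 64 to recover the numerators 27, 27, 9, 1
by_formula * 64

# R's binomial probability function gives the same values
by_dbinom <- dbinom(x, size = 3, prob = 1/4)
all.equal(by_formula, by_dbinom)
```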
Sometimes it is not possible to use a single function to describe the probability density, and so it can be given as a multi-part (piecewise) function. A simple example of this is the sum of two dice.
x | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
P(X=x) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |
\[ f(x) = \Biggl \{ \begin{array}{ccc} \frac{x-1}{36} & \text{for} & 2 \le x \le 7 \\ \frac{13-x}{36} & \text{for} & 8 \le x \le 12 \\ 0 & \text{otherwise} & \end{array} \]
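The two-part function can be compared against a direct enumeration of the 36 outcomes. The sketch below is illustrative: it builds every pair of die rolls with `outer`, tabulates the totals, and checks them against f(x). The function name `f` follows the notation above; the other names are assumptions made for the example.

```r
# All 36 totals from two fair dice
totals <- outer(1:6, 1:6, "+")
counts <- table(totals)
counts

# The two-part function from the text
f <- function(x) {
  ifelse(x >= 2 & x <= 7, (x - 1) / 36,
         ifelse(x >= 8 & x <= 12, (13 - x) / 36, 0))
}

x <- 2:12
all.equal(as.vector(counts) / 36, f(x))  # enumeration agrees with f(x)
sum(f(x))                                # total probability
```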
I was looking at the textbook examples and there is a very annoying problem on page 358 (Bostock and Chandler 1985).
The probability density function of a random variable Y is \(f(y)=y^{2}(k-y)\) for y = 1,2,3. Find the value of k.
I got the answer right as \(\frac{37}{14}\) but why is that so frustrating to me?
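As a check on the arithmetic, the sketch below (not from the textbook or the original page) solves for k using the requirement that the three probabilities sum to 1, and then evaluates f(y) at each value of y. One thing it shows is that, with k = 37/14, the individual values of f(y) do not all lie between 0 and 1 even though they sum to 1. The variable names are assumptions made for the example.

```r
y <- 1:3

# Sum over y of y^2 (k - y) = k * sum(y^2) - sum(y^3) = 1
# so k = (1 + sum(y^3)) / sum(y^2) = (1 + 36) / 14
k <- (1 + sum(y^3)) / sum(y^2)
k                      # 37/14, roughly 2.64

# Evaluate f(y) = y^2 (k - y) for y = 1, 2, 3
f_y <- y^2 * (k - y)
f_y * 14               # numerators over 14: 23, 36, -45
sum(f_y)               # the total is 1

# Are all the values valid probabilities?
all(f_y >= 0 & f_y <= 1)
```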
I got a D in Further Mathematics and I am beginning to understand why. Some things are just an abomination to me, they feel wrong. I have too much intuition and not enough logic.
The next part of this series will cover the Binomial Distribution which is widely used in biology.