Probability review

Probability theory provides a set of formal rules to deal with uncertainty and randomness (stochasticity).

Basics of probability

Probability is a number between 0 and 1 (inclusive) representing

  • long term relative frequencies (proportions, fractions, rates) of events - called frequentist probability
  • degree of belief that an event occurs - called Bayesian probability

While frequentist and Bayesian probability is motivated differently, they are treated exactly in the same way and together simply referred to as probability.

Random variable

Random variable is a variable that can take different values randomly.

In statistics, a random variable is often denoted by an uppercase letter such as \(X\) and the values it can take in lowercase letters such as \(x_1, x_2, \ldots\) or simply \(x\).

In machine learning, a random variable as well as its possible values are often denoted by lowercase letters \(x, \mathbf{x}, \mathit{x}\) (possibly in different font) depending on the author. Though this may create a confusion, especially to the inexperienced, it is usually clear from the context whether the lowercase \(x\) is meant for a random variable or its value.

(In this course we will follow the statistical notation as much as possible though possibly not always.)

Random variables (r.v.) may be

  • discrete - have finite (or countably infinite) number of possible values (states), e.g. integers, labels or categories
  • continuous - take values (states) in the real space, i.e. real numbers

Probability distribution

Probability distribution is a description of how likely the random variable is to take on each of its possible values.

We typically denote probability by capital \(P\) so that the probability that the variable \(X\) takes a value \(x\) is \(P(x)\) or \(P(X = x)\) if we make the random variable explicit.

For a r.v. \(X\) we use the symbol \(\sim\) to specify which distribution it follows (from which it is generated). For example, \(X \sim N(0,1)\) means the r.v. \(X\) follows the standard normal distribution.

Probability mass function

For discrete r.v. the probability distribution can be described by a probability mass function (pmf). The pmf \(p(x)\) assigns to each possible state \(x\) of the variable \(X\) a probability in between \([0,1]\). \[p(x) = P(X = x), \text{ for all x}\] Warning: We often use the letter \(p\) to indicate the pmf for different random variables with possibly different random distributions. \(p(x)\) and \(p(y)\) are usually not the same functions and their identity is recognized by the name of the random variable \(X, Y\) rather then the function \(p\). Sometimes people use \(p_X(x)\) to be more specific.

For a function \(p\) to be a pmf of a random variable \(X\)

  • the domain of \(p\) must be the set of all possible values of \(X\)
  • for any state \(x\) we have \(0 \leq p(x) \leq 1\)
  • The probability that the variable \(X\) will take at least one of its possible values is 1 (it has to take at least one value), \(p(x_1 \text{ or } x_2 \text{ or } x_3 \text{ or } \ldots \text{ or } x_k) = \sum_{x \in X} p(x) = 1\).

For example, discrete uniform distribution on \(X\) places equal probability on each of the possible values. For \(k\) possible values we have \[p(X = x_i) = \frac{1}{k}, \quad \forall i=1, \ldots, k \\ \sum_i p(X = x_i) = \sum_i \frac{1}{k} = \frac{k}{k} = 1\]

Notation: \(\forall\) indicates for all.

Probability density function

For continuous r.v. the probability distribution can be described by a probability density function (pdf). The pdf \(p(x)\) assigns to each value \(x\) of a continuous r.v. \(X\) a probability density.

For a function \(p\) to be a pdf of a random variable \(X\)

  • the domain of \(p\) must be the set of all possible values of \(X\)
  • for any state \(x\) we have \(p(x) \geq 0\), note we do not require \(p(x) \leq 1\)
  • \(\int p(x) dx = 1\).

Notation: \(\int\) indicates an integral. Integral is similar to sum but when working with function values over continuous domain instead of sequences. You use \(\sum_{n=1}^k a_n\) to sum all values in a sequence \(\{ a_n \}\) for \(n = 1, 2, \dots\). You use \(\int_a^b f(x) dx\) to sum all values \(f(x)\) when \(x \in (a,b)\). You use \(\int f(x) dx\) to sum all values \(f(x)\) across the whole domain of \(f\).

Warning: For continuous r.v. \(p(x)\) gives a density for each point \(x\) instead of a probability. This is because each \(x\) is a real number and therefore the probability of falling exactly onto it is actually zero (formally it is of measure zero and therefore has zero probability). However, we can use the probability density function to find the probability of points in an interval \(P(X \in (a,b)) = \int_{a}^b p(x) dx\).

For example, the continuous uniform distribution on \(X\) with values in an interval \((a, b)\), \(X \in (a,b)\) has a density function \(p(x) = \frac{1}{a-b}, \forall x \in (a,b)\). \[P(x_1 \leq X \leq x_2) = \int_{x_1}^{x_2} p(x) dx \geq 0\\ \int_a^b p(x) dx = 1\]

We can use both pmf and pdf to describe the distribution generating the random variable and say \(X \sim p(x)\) (where it is clear from context whether \(p(x)\) is a pmf or pdf).

Marginal, joint, conditional probability

When discussing more than one random variable (e.g. \(X, Y, Z\) or \(X_1, X_2, \ldots, X_n\)) we need to be clear about what is the probability referring to.

A random vector (or vector random variable) is composed of multiple scalar random variables as elements for each dimension, e.g. the n-dimensional r.v. \(\mathbf{X} = (X_1, X_2, \ldots, X_n), \ \mathbf{X} \in \mathbb{R}^n\) with each \(X_i \in \mathbb{R}\).

Joint probability

The joint probability distribution of a random vector describes the probability that the whole random vector will take its possible values. Equivalently, it describes the probability that all of its elements will take some values simultaneously.

For example, the joint probability for 2-dimensional random vector \(\mathbf{Z} = (X, Y)\) is the probability of \(X\) taking a value \(x\) and at the same time \(Y\) taking a value \(y\), \(P(\mathbf{Z} = \mathbf{z}) = P(X = x, Y = x) = P(x, y)\) .

The joint pmf and pdf are indicated as \(p(x,y)\).

For more than two random variables we define the joint probability distribution in analogy as the distribution of all the variables taking a value simultaneously, e.g. \(p(X =x, Y=y, Z=z) = p(x,y,z)\) or \(p(X_1 = x_1, X_2 = x_2, \ldots , X_n, = x_n) = p(x_1, x_2, \ldots, x_n)\).

Marginal probability

When dealing with vector r.v. \(\mathbf{X} = (X_1, X_2, \ldots, X_n)\), we may still be interested in the probability distribution of each individual r.v. \(X_i\). We call these the marginal probability distributions and correspondingly either the marginal pmf or pdf \(p(x_i)\), for all \(i = 1, \ldots n\).

We can calculate the marginal pmf and pdf by marginalization as follows:

For 2-dimensional random vector \((X, Y)\) of discrete r.v. we have \[p(x) = \sum_y p(x, y), \qquad \text{ and } \qquad p(y) = \sum_x p(x, y), \] For 3-dimensional random vector \((X_1, X_2, X_3)\) of discrete r.v. we have \[p(x_1, x_2) = \sum_{x_3} p(x_1, x_2, x_3), \qquad \text{ and } \qquad p(x_1) = \sum_{x_3} \sum_{x_2} p(x_1, x_2, x_3) \enspace ,\] and in analogy for the other pmf such as \(p(x_2, x_3)\) etc.

For continuous random vectors we need to integrate over the whole continuous space (instead of summing over the discrete space) and therefore have \[p(x) = \int p(x, y) dy, \qquad \text{ and } \qquad p(y) = \int p(x, y) dx, \] and \[p(x_1, x_2) = \int p(x_1, x_2, x_3) dx_3, \qquad \text{ and } \qquad p(x_1) = \int\int p(x_1, x_2, x_3) dx_2 dx_3 \enspace .\] We apply similar strategy to marginalize pmf or pdf of higher dimensional random vectors.

Conditional probability

Often times, knowledge about the value of one r.v. \(X\) can give us some information about the value of another r.v. \(Y\).

The conditional probability \(P(Y = y \, | \, X = x) = P(y \, | \, x)\) is the probability that \(Y\) takes the value \(y\) given that \(X\) took the value \(x\). It is defined as \[P(Y = y \, | \, X = x) = \frac{P(Y = y , X = x)}{P(X = x)}, \qquad \text{ if } P(X = x) > 0 \enspace .\] Similarly, the conditional pmf and pdf are defined as \[p(y \, | \, x) = \frac{p(y , x)}{p(x)}, \qquad \text{ if } p(x) > 0 \enspace .\] ### Chain rule of probability

We can extend the above conditioning to more than 2 r.v. First observe from the above that \[p(y , x) = p(x) \, p( y \, | \, x) \enspace .\] For an \(n\)-dimensional random vector we then have \[\textbf{chain rule of probability: } \quad p(x_1, x_2, \ldots, x_n) = p(x_1) \, \prod_{i=2}^n p( x_i \, | \, x_1, \ldots x_{i-1}) \enspace .\] For example for 3-dimensional \((X_1, X_2, X_3)\) we have one factorization \[p(x_1, x_2, x_3) = p(x_1 \, | \, x_2, x_3) \, p(x_2, x_3) \quad \text{treating } (x_2, x_3) \text{ as a single r.v.}\\ p(x_2, x_3) = p(x_2 \, | \, x_3) p(x_3) \\ p(x_1, x_2, x_3) = p(x_1 \, | \, x_2, x_3) p(x_2 \, | \, x_3) p(x_3) \enspace .\]

Another possible factorization is \[p(x_1, x_2, x_3) = p(x_3 \, | \, x_2, x_1) \, p(x_2, x_1) \quad \text{treating } (x_2, x_1) \text{ as a single r.v.}\\ p(x_2, x_1) = p(x_2 \, | \, x_1) p(x_1) \\ p(x_1, x_2, x_3) = p(x_3 \, | \, x_2, x_1) p(x_2 \, | \, x_1) p(x_1) \enspace .\]


Two r.v. \(X\) and \(Y\) are called independent (denoted as \(X \perp Y\)) if for their joint pmf or pdf and marginal pmfs or pdfs we have \[p(x, y) = p(x) p(y)\] This implies for the conditional probability \[p(x \, | \,y) = \frac{p(x, y)}{p(y)} = \frac{p(x) p(y)}{p(y)} = p(x) \enspace ,\] because the probability of the variable \(X\) does not depend on what values \(Y\) takes.

Conditional independence

Two r.v. \(X\) and \(Y\) are called conditionally independent given r.v. \(Z\) ((denoted as \(X \perp Y \, | \, Z\))) if the conditional pmf or pdf factorizes as \[p(x, y \, | \, z) = p(x \,| \, z) \, p(y \, | \, z) \enspace .\]

Bayes rule

Using the defintion of conditional probabability we can derive an important rule for updating probabilities. For 2 r.v. \(A\) and \(B\) we have \[\textbf{Bayes rule:} \quad p(a \,|\, b) = \frac{p(b \,|\, a) \, p(a)}{p(b)} \enspace .\] The rule is easy to proof: first note that \(p(a \,|\, b)p(b) = p(a, b) = p(b \,|\, a) p(a)\) then devide all by \(p(b)\) to get the Bayes rule above.

There is a specific terminology related to the Bayes rule:

  • \(p(a)\) is called the prior (a priori) and describes the probability of the r.v. \(A\) without any additional information.
  • \(p(a \,|\, b)\) is called the posterior (a posteriori) and describes the probability of the r.v. \(A\) given that we know that \(b\) happend.
  • \(p(b \,|\, a)\) is called the likelihood of \(b\) given that \(a\) is true.
  • \(p(b)\) is called the evidence or marginal likelihood and it is the total probability of observing \(b\) (irrespective of whether \(a\) is true or not).

Expectation, variance, covariance


The expectation or expected value of some random variable \(X\) (denoted \(\mu_X = \mathbb{E} [X]\)) is the average (mean) of the possible values weighted with respect to the probability distribution of \(X\). \[\mu_X = \mathbb{E} [X] = \begin{cases} \sum_x x \, p(x) = \sum_x x \, P(X = x) & \text{ for discrete } X \\ \int x \, p(x) dx & \text{ for continuous } X \\ \end{cases} \enspace ,\] provided the sum or integral exists.

For a function \(g(x)\) of the random variable \(X\) the expectation is \[\mu_{g(x)} = \mathbb{E} [g(x)] = \begin{cases} \sum_x g(x) \, p(x) = \sum_x g(x) \, P(X = x) & \text{ for discrete } X \\ \int g(x) \, p(x) dx & \text{ for continuous } X \\ \end{cases} \enspace ,\] provided the sum or integral exists.

Expectation is linear so that for any constants \(a, b, c\) we have \[\mathbb{E} [a g(x) + b h(x) + c] = a \mathbb{E} [g(x)] + b \mathbb{E} [h(x)] + c \enspace .\]


The variance (denoted as \(\sigma_X^2 = \text{Var}[X]\)) is a measure of how much the values of the random variable \(X\) vary around its expectation \(\mu_X = \mathbb{E} [x]\). It is defined as \[\sigma_X^2 = \text{Var}[X] = \mathbb{E} \big[ (x - \mathbb{E} [x])^2 \big] = \mathbb{E} [x^2] - (\mathbb{E} [x])^2 \enspace .\]

For a function \(g(x)\) the variance is \[\text{Var}[g(x)] = \mathbb{E} \big[ (g(x) - \mathbb{E} [g(x)])^2 \big] = \mathbb{E} [g(x)^2] - (\mathbb{E} [g(x)])^2 \enspace .\]

Variance is not linear so that for any constants \(a, b\) we have \[\text{Var} [a g(x) + b ] = a^2 \text{Var} [g(x)] \enspace .\]

The square root of variance is known as standard deviation, \(\sigma_X\).


The covariance gives an indication of how much are two r.v. \(X\) and \(Y\) linearly related. It is defined as \[\text{Cov}(X,Y) = \mathbb{E} \big[ (X - \mu_X) (Y - \mu_Y) \big] = \mathbb{E} [X Y] - \mu_X \mu_Y \enspace .\] High positive values of covariance indicate that both variables tend to take relatively high values compared to their respective means simultaneously (or small values simultaneously). High negative values indicate that one of the variables takes relatively low values when the other takes relatively high values and vice versa.

The correlation normalizes the contribution of the variables so that the result is not influenced by the scaling of the variables \[\text{Corr}(X,Y) = \frac{\mathbb{E} \big[ (X - \mu_X) (Y - \mu_Y) \big]}{\sigma_X \sigma_Y} \enspace .\] The correlation is always within \(-1 \leq \text{Corr}(X,Y) \leq 1\), with \(1\) and \(-1\) indicating perfect linear relationship (positive or negative).

The covariance and correlation of two functions \(g(x), h(y)\) are defined in analogy.

When two variables are independent we have \(\text{Cov}(X,Y) = \text{Corr}(X,Y) = 0\). However, this is not true the other way round. Even with \(\text{Cov}(X,Y) = \text{Corr}(X,Y) = 0\) the variables can still be non-linearly dependent.

The covariance matrix is a matrix of pair-wise covariances between elements of a random vector \(\mathbf{X} = (X_1, X_2, \ldots, X_n), \ \mathbf{X} \in \mathbb{R}^n\). The \(n \times n\) covariance matrix has the elements \[\text{Cov}(\mathbf{X})_{ij} = \text{Cov}(X_i, X_j) \\ \text{Cov}(\mathbf{X})_{ii} = \text{Var}(X_i) \]

Common probability distributions

Here we list only the most common probability distributions.

Gaussian (normal) distribution

The Gaussian or normal probability distribution (denoted as \(N(\mu, \sigma^2)\)) is the most commonly used for continuous r.v.

Its probability density function has the form \[ p(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \enspace ,\] where \(\mu, \sigma^2\) are the parameters of the distribution representing its expectation \(\mathbb{E}[X]\) and variance \(\text{Var}[X]\).

In this graph you can see the plot of the normal pdf for different values of \(\mu = m\) and \(\sigma = s\) and its evaluations \(p(a)\).

Bernoulli distribution

The Bernoulli distribution (denoted Ber\((\phi)\)) is a distribution over a single binary r.v. \(X\) (variable that can take only the values \(\{0, 1\}\)). It is controlled by a single parameter \(\phi \in [0,1]\) and its pmf has the form \[p(1) = P(X = 1) = \phi\\ p(0) = P(X = 0) = 1 - \phi \enspace . \] This can also be written as \[p(x) = P(X = x) = \phi^x \, (1-\phi)^x\] Further \[\mathbb{E}[X] = \phi \\ \text{Var}[X] = \phi \, (1-\phi)\]

Categorical distribution

The categorical distribution (denoted Cat\((k)\)) is a distribution over a single discrete r.v. \(X\) with \(k\) possible states (values). It is parametrized by a vector \(\phi \in [0,1]^{k-1}\), where \(\phi_i\) gives the probability of the i-th state. The final state’s probability is given as the remainder to 1. \[p(x_i) = P(X = x_i) = \begin{cases} \phi_i & \text{ for } i=1, \ldots, k-1 \\ 1 - \sum_{j=1}^{k-1} \phi_j & \text{ for } i=k \end{cases} \] The possible values (states) of the variable \(X\) are simply categories, often not even numerical and therefore the expectation and variance of the distribution are not of great interest.