**Probability theory** provides a set of formal rules to deal with uncertainty and randomness (stochasticity).

**Probability** is a number between 0 and 1 (inclusive) representing

- long-term relative frequencies (proportions, fractions, rates) of events, called *frequentist* probability
- degree of belief that an event occurs, called *Bayesian* probability

While *frequentist* and *Bayesian* probabilities are motivated differently, they follow exactly the same rules and are together simply referred to as *probability*.

**Random variable** is a variable that can take different values randomly.

In *statistics*, a random variable is often denoted by an uppercase letter such as \(X\) and the values it can take in lowercase letters such as \(x_1, x_2, \ldots\) or simply \(x\).

In *machine learning*, a random variable as well as its possible values are often denoted by lowercase letters \(x, \mathbf{x}, \mathit{x}\) (possibly in a different font) depending on the author. Though this may cause confusion, especially for the inexperienced, it is usually clear from the context whether the lowercase \(x\) stands for a random variable or for its value.

(In this course we will follow the statistical notation as much as possible though possibly not always.)

Random variables (r.v.) may be

- **discrete**: have a finite (or countably infinite) number of possible values (states), e.g. integers, labels or categories
- **continuous**: take values (states) in the real space, i.e. real numbers

**Probability distribution** is a description of how likely the random variable is to take on each of its possible values.

We typically denote **probability** by capital \(P\) so that the probability that the variable \(X\) takes a value \(x\) is \(P(x)\) or \(P(X = x)\) if we make the random variable explicit.

For a r.v. \(X\) we use the symbol \(\sim\) to specify which distribution it follows (from which it is generated). For example, \(X \sim N(0,1)\) means the r.v. \(X\) follows the standard normal distribution.

For **discrete r.v.** the probability distribution can be described by a **probability mass function** (pmf). The pmf \(p(x)\) assigns to each possible state \(x\) of the variable \(X\) a probability in the interval \([0,1]\). \[p(x) = P(X = x), \text{ for all } x\]

**Warning:** We often use the letter \(p\) to indicate the pmf of different random variables with possibly different probability distributions. \(p(x)\) and \(p(y)\) are usually not the same function; which one is meant is recognized by the name of the random variable \(X, Y\) rather than by the function name \(p\). Sometimes people write \(p_X(x)\) to be more specific.

For a function \(p\) to be a pmf of a random variable \(X\)

- the domain of \(p\) must be the set of all possible values of \(X\)
- for any state \(x\) we have \(0 \leq p(x) \leq 1\)
- The probability that the variable \(X\) will take at least one of its possible values is 1 (it has to take at least one value), \(p(x_1 \text{ or } x_2 \text{ or } x_3 \text{ or } \ldots \text{ or } x_k) = \sum_{x \in X} p(x) = 1\).

For example, **discrete uniform distribution** on \(X\) places equal probability on each of the possible values. For \(k\) possible values we have \[p(X = x_i) = \frac{1}{k}, \quad \forall i=1, \ldots, k \\
\sum_i p(X = x_i) = \sum_i \frac{1}{k} = \frac{k}{k} = 1\]
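The pmf conditions listed above can be checked mechanically. The following is a minimal sketch, assuming the pmf is stored as a Python dictionary from values to probabilities; the helper names `is_valid_pmf` and `discrete_uniform` are illustrative and not part of the course material.

```python
import math


def is_valid_pmf(pmf, tol=1e-9):
    """Check the pmf conditions: every p(x) lies in [0, 1] and the values sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in pmf.values())
    sums_to_one = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return in_range and sums_to_one


def discrete_uniform(values):
    """Discrete uniform pmf: each of the k values gets probability 1/k."""
    k = len(values)
    return {x: 1.0 / k for x in values}


die = discrete_uniform([1, 2, 3, 4, 5, 6])   # a fair six-sided die
print(is_valid_pmf(die))                      # the uniform pmf passes both checks
print(is_valid_pmf({"a": 0.7, "b": 0.7}))     # fails: the values sum to 1.4
```

The tolerance `tol` is there only because floating-point sums such as six times `1/6` may differ from 1 in the last digit.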

**Notation:** \(\forall\) indicates *for all*.

For **continuous r.v.** the probability distribution can be described by a **probability density function** (pdf). The pdf \(p(x)\) assigns to each value \(x\) of a continuous r.v. \(X\) a probability density.

For a function \(p\) to be a pdf of a random variable \(X\)

- the domain of \(p\) must be the set of all possible values of \(X\)
- for any state \(x\) we have \(p(x) \geq 0\); **note** that we do not require \(p(x) \leq 1\)
- \(\int p(x) dx = 1\).

**Notation:** \(\int\) indicates an integral. An integral is similar to a sum, but it works with function values over a continuous domain instead of the elements of a sequence. You use \(\sum_{n=1}^k a_n\) to sum the values of a sequence \(\{ a_n \}\) for \(n = 1, 2, \dots, k\). You use \(\int_a^b f(x) dx\) to sum all values \(f(x)\) when \(x \in (a,b)\). You use \(\int f(x) dx\) to sum all values \(f(x)\) across the whole domain of \(f\).

**Warning:** For continuous r.v. \(p(x)\) gives a **density** for each point \(x\) instead of a probability. This is because each \(x\) is a real number and therefore the probability of falling **exactly** onto it is actually zero (formally it is of measure zero and therefore has zero probability). However, we can use the probability density function to find the probability of points in an interval \(P(X \in (a,b)) = \int_{a}^b p(x) dx\).

For example, the **continuous uniform distribution** on \(X\) with values in an interval \((a, b)\), \(X \in (a,b)\), has the density function \(p(x) = \frac{1}{b-a}, \forall x \in (a,b)\). \[P(x_1 \leq X \leq x_2) = \int_{x_1}^{x_2} p(x) dx \geq 0\\
\int_a^b p(x) dx = \int_a^b \frac{1}{b-a} dx = \frac{b-a}{b-a} = 1\]
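To make the difference between a density and a probability concrete, here is a small numerical sketch for the continuous uniform distribution. It assumes a simple midpoint-rule approximation of the integral; the function names and the interval \((0, 0.5)\) are chosen only for illustration.

```python
def uniform_pdf(x, a, b):
    """Density of the continuous uniform distribution on (a, b)."""
    return 1.0 / (b - a) if a < x < b else 0.0


def integrate(f, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of f over (lo, hi)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


a, b = 0.0, 0.5
density = uniform_pdf(0.25, a, b)                             # 2.0: a density may exceed 1
total = integrate(lambda x: uniform_pdf(x, a, b), a, b)       # close to 1
prob = integrate(lambda x: uniform_pdf(x, a, b), 0.1, 0.2)    # P(0.1 <= X <= 0.2), close to 0.2
print(density, total, prob)
```

The density value 2.0 is not a probability; only the integrals over intervals are.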

We can use both pmf and pdf to describe the distribution generating the random variable and say \(X \sim p(x)\) (where it is clear from context whether \(p(x)\) is a pmf or pdf).

When discussing more than one random variable (e.g. \(X, Y, Z\) or \(X_1, X_2, \ldots, X_n\)) we need to be clear about which variables the probability refers to.

A **random vector** (or vector random variable) is composed of multiple scalar random variables as elements for each dimension, e.g. the n-dimensional r.v. \(\mathbf{X} = (X_1, X_2, \ldots, X_n), \ \mathbf{X} \in \mathbb{R}^n\) with each \(X_i \in \mathbb{R}\).

The **joint probability distribution** of a random vector describes the probability that the whole random vector will take its possible values. Equivalently, it describes the probability that all of its elements will take some values *simultaneously*.

For example, the joint probability for a 2-dimensional random vector \(\mathbf{Z} = (X, Y)\) is the probability of \(X\) taking a value \(x\) *and at the same time* \(Y\) taking a value \(y\), \(P(\mathbf{Z} = \mathbf{z}) = P(X = x, Y = y) = P(x, y)\).

The joint pmf and pdf are indicated as \(p(x,y)\).

For more than two random variables we define the joint probability distribution in analogy as the distribution of all the variables taking a value simultaneously, e.g. \(p(X =x, Y=y, Z=z) = p(x,y,z)\) or \(p(X_1 = x_1, X_2 = x_2, \ldots , X_n = x_n) = p(x_1, x_2, \ldots, x_n)\).

When dealing with vector r.v. \(\mathbf{X} = (X_1, X_2, \ldots, X_n)\), we may still be interested in the probability distribution of each individual r.v. \(X_i\). We call these the **marginal probability distributions** and correspondingly either the *marginal* pmf or pdf \(p(x_i)\), for all \(i = 1, \ldots n\).

We can calculate the marginal pmf and pdf by **marginalization** as follows:

For 2-dimensional random vector \((X, Y)\) of discrete r.v. we have \[p(x) = \sum_y p(x, y), \qquad \text{ and } \qquad p(y) = \sum_x p(x, y), \] For 3-dimensional random vector \((X_1, X_2, X_3)\) of discrete r.v. we have \[p(x_1, x_2) = \sum_{x_3} p(x_1, x_2, x_3), \qquad \text{ and } \qquad p(x_1) = \sum_{x_3} \sum_{x_2} p(x_1, x_2, x_3) \enspace ,\] and in analogy for the other pmf such as \(p(x_2, x_3)\) etc.

For continuous random vectors we need to integrate over the whole continuous space (instead of summing over the discrete space) and therefore have \[p(x) = \int p(x, y) dy, \qquad \text{ and } \qquad p(y) = \int p(x, y) dx, \] and \[p(x_1, x_2) = \int p(x_1, x_2, x_3) dx_3, \qquad \text{ and } \qquad p(x_1) = \int\int p(x_1, x_2, x_3) dx_2 dx_3 \enspace .\] We apply similar strategy to marginalize pmf or pdf of **higher dimensional** random vectors.
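For the discrete case, marginalization is just a sum over the unwanted coordinate. The sketch below assumes a joint pmf stored as a dictionary keyed by tuples; the table values and the helper `marginal` are hypothetical.

```python
from collections import defaultdict

# Hypothetical joint pmf p(x, y) for X in {0, 1} and Y in {"a", "b", "c"}.
joint = {
    (0, "a"): 0.10, (0, "b"): 0.20, (0, "c"): 0.10,
    (1, "a"): 0.30, (1, "b"): 0.05, (1, "c"): 0.25,
}


def marginal(joint, axis):
    """Sum the joint pmf over every coordinate except the one at index `axis`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[axis]] += p
    return dict(out)


p_x = marginal(joint, 0)   # p(x) = sum_y p(x, y)
p_y = marginal(joint, 1)   # p(y) = sum_x p(x, y)
print(p_x)
print(p_y)
```

Each marginal is again a valid pmf, so its values sum to 1 (up to floating-point rounding).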

Often, knowledge about the value of one r.v. \(X\) gives us some information about the value of another r.v. \(Y\).

The **conditional probability** \(P(Y = y \, | \, X = x) = P(y \, | \, x)\) is the probability that \(Y\) takes the value \(y\) given that \(X\) took the value \(x\). It is defined as \[P(Y = y \, | \, X = x) = \frac{P(Y = y , X = x)}{P(X = x)}, \qquad \text{ if } P(X = x) > 0 \enspace .\] Similarly, the conditional pmf and pdf are defined as \[p(y \, | \, x) = \frac{p(y , x)}{p(x)}, \qquad \text{ if } p(x) > 0 \enspace .\]

### Chain rule of probability

We can extend the above conditioning to more than 2 r.v. First observe from the above that \[p(y , x) = p(x) \, p( y \, | \, x) \enspace .\] For an \(n\)-dimensional random vector we then have \[\textbf{chain rule of probability: } \quad p(x_1, x_2, \ldots, x_n) = p(x_1) \, \prod_{i=2}^n p( x_i \, | \, x_1, \ldots x_{i-1}) \enspace .\] For example for 3-dimensional \((X_1, X_2, X_3)\) we have one factorization \[p(x_1, x_2, x_3) = p(x_1 \, | \, x_2, x_3) \, p(x_2, x_3) \quad \text{treating } (x_2, x_3) \text{ as a single r.v.}\\ p(x_2, x_3) = p(x_2 \, | \, x_3) p(x_3) \\ p(x_1, x_2, x_3) = p(x_1 \, | \, x_2, x_3) p(x_2 \, | \, x_3) p(x_3) \enspace .\]

Another possible factorization is \[p(x_1, x_2, x_3) = p(x_3 \, | \, x_2, x_1) \, p(x_2, x_1) \quad \text{treating } (x_2, x_1) \text{ as a single r.v.}\\ p(x_2, x_1) = p(x_2 \, | \, x_1) p(x_1) \\ p(x_1, x_2, x_3) = p(x_3 \, | \, x_2, x_1) p(x_2 \, | \, x_1) p(x_1) \enspace .\]
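The chain rule can be verified numerically on a small example. This is a sketch under stated assumptions: the joint pmf over three binary variables uses arbitrary made-up weights, and the helpers `prob`, `cond` and `chain_rule` are illustrative names.

```python
from itertools import product

# Hypothetical joint pmf over three binary variables (X1, X2, X3); the weights are arbitrary.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = list(product([0, 1], repeat=3))
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}


def prob(fixed):
    """Marginal probability that the variables at the given positions take the given values.

    `fixed` maps a position (0, 1 or 2) to a value; all other positions are summed out.
    """
    return sum(p for o, p in joint.items()
               if all(o[i] == v for i, v in fixed.items()))


def cond(target, given):
    """Conditional probability p(target | given) = p(target, given) / p(given)."""
    return prob({**target, **given}) / prob(given)


def chain_rule(x1, x2, x3):
    """Right-hand side of the chain rule: p(x1) p(x2 | x1) p(x3 | x1, x2)."""
    return (prob({0: x1})
            * cond({1: x2}, {0: x1})
            * cond({2: x3}, {0: x1, 1: x2}))


for o in outcomes:
    print(o, joint[o], chain_rule(*o))   # the two numbers agree up to rounding
```

Because every outcome here has positive probability, none of the conditionals divides by zero; a general implementation would need to handle that case.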

Two r.v. \(X\) and \(Y\) are called **independent** (denoted as \(X \perp Y\)) if for their joint pmf or pdf and marginal pmfs or pdfs we have \[p(x, y) = p(x) p(y)\] This implies for the conditional probability \[p(x \, | \,y) = \frac{p(x, y)}{p(y)} = \frac{p(x) p(y)}{p(y)} = p(x) \enspace ,\] because the probability of the variable \(X\) does not depend on what values \(Y\) takes.
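The factorization \(p(x, y) = p(x) p(y)\) gives a direct test for independence of two discrete r.v. The sketch below assumes a joint pmf stored as a dictionary keyed by \((x, y)\) pairs; both example tables are hypothetical.

```python
import math
from collections import defaultdict


def marginals(joint):
    """Return the marginal pmfs p(x) and p(y) of a joint pmf stored as {(x, y): p}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return dict(p_x), dict(p_y)


def is_independent(joint, tol=1e-9):
    """True if p(x, y) = p(x) p(y) holds for every pair of values."""
    p_x, p_y = marginals(joint)
    return all(
        math.isclose(joint.get((x, y), 0.0), p_x[x] * p_y[y], abs_tol=tol)
        for x in p_x for y in p_y
    )


# Hypothetical example 1: two independent coin flips with different biases.
coins = {(x, y): px * py
         for x, px in {0: 0.3, 1: 0.7}.items()
         for y, py in {0: 0.6, 1: 0.4}.items()}

# Hypothetical example 2: Y always copies X, so the two are fully dependent.
copies = {(0, 0): 0.5, (1, 1): 0.5}

print(is_independent(coins))    # True
print(is_independent(copies))   # False
```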

Two r.v. \(X\) and \(Y\) are called **conditionally independent** given r.v. \(Z\) (denoted as \(X \perp Y \, | \, Z\)) if the conditional pmf or pdf factorizes as \[p(x, y \, | \, z) = p(x \,| \, z) \, p(y \, | \, z) \enspace .\]

Using the definition of conditional probability we can derive an important rule for updating probabilities. For 2 r.v. \(A\) and \(B\) we have \[\textbf{Bayes rule:} \quad p(a \,|\, b) = \frac{p(b \,|\, a) \, p(a)}{p(b)} \enspace .\] The rule is easy to prove: first note that \(p(a \,|\, b)p(b) = p(a, b) = p(b \,|\, a) p(a)\), then divide both sides by \(p(b)\) (assuming \(p(b) > 0\)) to get the Bayes rule above.

There is a specific terminology related to the Bayes rule:

- \(p(a)\) is called the **prior** (a priori) and describes the probability of the r.v. \(A\) without any additional information.
- \(p(a \,|\, b)\) is called the **posterior** (a posteriori) and describes the probability of the r.v. \(A\) given that we know that \(b\) happened.
- \(p(b \,|\, a)\) is called the **likelihood** of \(b\) given that \(a\) is true.
- \(p(b)\) is called the **evidence** or **marginal likelihood** and it is the total probability of observing \(b\) (irrespective of whether \(a\) is true or not).
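A short worked example may help to connect the four terms. The sketch assumes a binary \(A\) (so the evidence is a sum over "\(a\)" and "not \(a\)"); the function name `posterior` and all the numbers are hypothetical.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes rule for a binary A: p(a | b) = p(b | a) p(a) / p(b).

    The evidence p(b) is obtained by marginalizing over a and not-a:
    p(b) = p(b | a) p(a) + p(b | not a) (1 - p(a)).
    """
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: a is "has the condition", b is "test is positive".
p_a = 0.01               # prior p(a)
p_b_given_a = 0.95       # likelihood p(b | a)
p_b_given_not_a = 0.05   # p(b | not a)

print(posterior(p_a, p_b_given_a, p_b_given_not_a))   # roughly 0.16
```

Even with a high likelihood, the posterior stays fairly small here because the prior is small and the evidence is dominated by the "not \(a\)" case.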

The **expectation** or **expected value** of some random variable \(X\) (denoted \(\mu_X = \mathbb{E} [X]\)) is the average (mean) of the possible values weighted with respect to the probability distribution of \(X\). \[\mu_X = \mathbb{E} [X] =
\begin{cases}
\sum_x x \, p(x) = \sum_x x \, P(X = x) & \text{ for discrete } X \\
\int x \, p(x) dx & \text{ for continuous } X \\
\end{cases} \enspace ,\] provided the sum or integral exists.

For a function \(g(x)\) of the random variable \(X\) the **expectation** is \[\mu_{g(x)} = \mathbb{E} [g(x)] =
\begin{cases}
\sum_x g(x) \, p(x) = \sum_x g(x) \, P(X = x) & \text{ for discrete } X \\
\int g(x) \, p(x) dx & \text{ for continuous } X \\
\end{cases} \enspace ,\] provided the sum or integral exists.

Expectation is **linear** so that for any constants \(a, b, c\) we have \[\mathbb{E} [a g(x) + b h(x) + c] = a \mathbb{E} [g(x)] + b \mathbb{E} [h(x)] + c \enspace .\]
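The discrete formulas for \(\mathbb{E}[X]\) and \(\mathbb{E}[g(x)]\), and the linearity property, can be sketched in a few lines. The pmf is assumed to be a dictionary from values to probabilities; the fair die and the constants are only an illustration.

```python
def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum_x g(x) p(x) for a discrete pmf stored as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


die = {x: 1 / 6 for x in range(1, 7)}           # fair six-sided die

mean = expectation(die)                          # E[X], 3.5 up to rounding
mean_sq = expectation(die, lambda x: x ** 2)     # E[X^2] = 91/6

# Linearity: E[a g(X) + b h(X) + c] = a E[g(X)] + b E[h(X)] + c
a, b, c = 2.0, -3.0, 5.0
lhs = expectation(die, lambda x: a * x ** 2 + b * x + c)
rhs = a * mean_sq + b * mean + c
print(mean, mean_sq, lhs, rhs)
```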

The **variance** (denoted as \(\sigma_X^2 = \text{Var}[X]\)) is a measure of how much the values of the random variable \(X\) vary around its expectation \(\mu_X = \mathbb{E} [x]\). It is defined as \[\sigma_X^2 = \text{Var}[X] = \mathbb{E} \big[ (x - \mathbb{E} [x])^2 \big] = \mathbb{E} [x^2] - (\mathbb{E} [x])^2 \enspace .\]

For a function \(g(x)\) the **variance** is \[\text{Var}[g(x)] = \mathbb{E} \big[ (g(x) - \mathbb{E} [g(x)])^2 \big] = \mathbb{E} [g(x)^2] - (\mathbb{E} [g(x)])^2 \enspace .\]

Variance is **not linear** so that for any constants \(a, b\) we have \[\text{Var} [a g(x) + b ] = a^2 \text{Var} [g(x)] \enspace .\]

The square root of variance is known as **standard deviation**, \(\sigma_X\).
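The shortcut formula for the variance and the scaling rule \(\text{Var}[a g(x) + b] = a^2 \text{Var}[g(x)]\) can be checked the same way. This is a minimal sketch for a discrete pmf stored as a dictionary; the fair die and the constants are illustrative.

```python
import math


def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete pmf stored as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


def variance(pmf, g=lambda x: x):
    """Var[g(X)] = E[g(X)^2] - (E[g(X)])^2."""
    return expectation(pmf, lambda x: g(x) ** 2) - expectation(pmf, g) ** 2


die = {x: 1 / 6 for x in range(1, 7)}             # fair six-sided die

var_x = variance(die)                              # 35/12 for a fair die
std_x = math.sqrt(var_x)                           # standard deviation

a, b = 3.0, 10.0
var_scaled = variance(die, lambda x: a * x + b)    # equals a^2 Var[X]; the shift b drops out
print(var_x, std_x, var_scaled)
```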

The **covariance** gives an indication of how strongly two r.v. \(X\) and \(Y\) are linearly related. It is defined as \[\text{Cov}(X,Y) = \mathbb{E} \big[ (X - \mu_X) (Y - \mu_Y) \big] = \mathbb{E} [X Y] - \mu_X \mu_Y \enspace .\] High positive values of covariance indicate that both variables tend to take relatively high values compared to their respective means simultaneously (or relatively low values simultaneously). High negative values indicate that one of the variables tends to take relatively low values when the other takes relatively high values and vice versa.

The **correlation** normalizes the contribution of the variables so that the result is not influenced by the scaling of the variables \[\text{Corr}(X,Y) = \frac{\mathbb{E} \big[ (X - \mu_X) (Y - \mu_Y) \big]}{\sigma_X \sigma_Y} \enspace .\] The correlation is always within \(-1 \leq \text{Corr}(X,Y) \leq 1\), with \(1\) and \(-1\) indicating perfect linear relationship (positive or negative).

The covariance and correlation of two functions \(g(x), h(y)\) are defined in analogy.

When two variables are **independent** we have \(\text{Cov}(X,Y) = \text{Corr}(X,Y) = 0\). However, this is not true the other way round. Even with \(\text{Cov}(X,Y) = \text{Corr}(X,Y) = 0\) the variables can still be **non-linearly dependent**.
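A standard small example of zero covariance without independence takes \(X\) uniform on \(\{-1, 0, 1\}\) and \(Y = X^2\). The sketch below computes covariance and correlation from a joint pmf stored as a dictionary; the helper names are illustrative.

```python
import math


def expect(joint, f):
    """E[f(X, Y)] for a joint pmf stored as {(x, y): probability}."""
    return sum(f(x, y) * p for (x, y), p in joint.items())


def covariance(joint):
    """Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]."""
    mu_x = expect(joint, lambda x, y: x)
    mu_y = expect(joint, lambda x, y: y)
    return expect(joint, lambda x, y: (x - mu_x) * (y - mu_y))


def correlation(joint):
    """Corr(X, Y) = Cov(X, Y) / (sigma_X sigma_Y)."""
    mu_x = expect(joint, lambda x, y: x)
    mu_y = expect(joint, lambda x, y: y)
    var_x = expect(joint, lambda x, y: (x - mu_x) ** 2)
    var_y = expect(joint, lambda x, y: (y - mu_y) ** 2)
    return covariance(joint) / math.sqrt(var_x * var_y)


# In both examples X is uniform on {-1, 0, 1}.
linear = {(x, 2 * x + 1): 1 / 3 for x in (-1, 0, 1)}   # Y = 2X + 1
squared = {(x, x ** 2): 1 / 3 for x in (-1, 0, 1)}     # Y = X^2

print(correlation(linear))    # 1 up to rounding: perfect positive linear relationship
print(covariance(squared))    # 0 up to rounding, although Y is a function of X
```

In the second example \(p(X = 0, Y = 0) = 1/3\) while \(p(X = 0)\, p(Y = 0) = 1/9\), so the variables are dependent despite the zero covariance.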

The **covariance matrix** is a matrix of pair-wise covariances between elements of a random vector \(\mathbf{X} = (X_1, X_2, \ldots, X_n), \ \mathbf{X} \in \mathbb{R}^n\). The \(n \times n\) covariance matrix has the elements \[\text{Cov}(\mathbf{X})_{ij} = \text{Cov}(X_i, X_j) \\
\text{Cov}(\mathbf{X})_{ii} = \text{Var}(X_i)
\]

Here we list only the most common probability distributions.

The **Gaussian** or **normal** probability distribution (denoted as \(N(\mu, \sigma^2)\)) is the most commonly used for continuous r.v.

Its probability density function has the form \[ p(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \enspace ,\] where \(\mu, \sigma^2\) are the parameters of the distribution representing its expectation \(\mathbb{E}[X]\) and variance \(\text{Var}[X]\).

In this graph you can see the plot of the normal pdf for different values of \(\mu = m\) and \(\sigma = s\) and its evaluations \(p(a)\).
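The normal pdf is easy to evaluate directly from its formula. The following sketch assumes a midpoint-rule approximation for the integrals; note that the hypothetical helper `normal_pdf` takes the variance \(\sigma^2\), not the standard deviation, as its second parameter.

```python
import math


def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma2); sigma2 is the variance, not the standard deviation."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over (lo, hi)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


total = integrate(normal_pdf, -10.0, 10.0)      # close to 1 for the standard normal
central = integrate(normal_pdf, -1.96, 1.96)    # close to 0.95
peak = normal_pdf(0.0, mu=0.0, sigma2=0.01)     # about 3.99: a density value above 1
print(total, central, peak)
```

Integrating over \((-10, 10)\) rather than the whole real line is an approximation; the tails beyond ten standard deviations contribute a negligible amount.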

The **Bernoulli** distribution (denoted Ber\((\phi)\)) is a distribution over a single binary r.v. \(X\) (a variable that can take only the values \(\{0, 1\}\)). It is controlled by a single parameter \(\phi \in [0,1]\) and its pmf has the form \[p(1) = P(X = 1) = \phi\\
p(0) = P(X = 0) = 1 - \phi \enspace .
\] This can also be written as \[p(x) = P(X = x) = \phi^x \, (1-\phi)^{1-x} \enspace ,\] since \(x = 1\) gives \(\phi\) and \(x = 0\) gives \(1 - \phi\). Further \[\mathbb{E}[X] = \phi \\
\text{Var}[X] = \phi \, (1-\phi)\]
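The Bernoulli pmf and its moments can be illustrated by simulation. This sketch uses the compact form \(\phi^x (1-\phi)^{1-x}\) and Python's standard `random` module; the value \(\phi = 0.3\), the sample size and the seed are arbitrary choices.

```python
import random


def bernoulli_pmf(x, phi):
    """Compact form of the Bernoulli pmf: phi^x (1 - phi)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return phi ** x * (1.0 - phi) ** (1 - x)


def sample_bernoulli(phi, n, seed=0):
    """Draw n samples: 1 with probability phi, otherwise 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < phi else 0 for _ in range(n)]


phi = 0.3
draws = sample_bernoulli(phi, 100_000)
sample_mean = sum(draws) / len(draws)                                   # near E[X] = phi
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)    # near phi (1 - phi)
print(bernoulli_pmf(1, phi), bernoulli_pmf(0, phi), sample_mean, sample_var)
```

The sample mean and variance only approximate \(\phi\) and \(\phi(1-\phi)\); the agreement improves as the number of draws grows.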

The **categorical** distribution (denoted Cat\((k)\)) is a distribution over a single discrete r.v. \(X\) with \(k\) possible states (values). It is parametrized by a vector \(\phi \in [0,1]^{k-1}\), where \(\phi_i\) gives the probability of the i-th state. The final state’s probability is given as the remainder to 1. \[p(x_i) = P(X = x_i) =
\begin{cases}
\phi_i & \text{ for } i=1, \ldots, k-1 \\
1 - \sum_{j=1}^{k-1} \phi_j & \text{ for } i=k
\end{cases}
\] The possible values (states) of the variable \(X\) are simply categories, often not even numerical and therefore the expectation and variance of the distribution are not of great interest.
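The parametrization with \(k-1\) free probabilities can be sketched as follows. The state labels, the parameter values and the helper names are hypothetical; sampling uses `random.choices` from the standard library.

```python
import random
from collections import Counter


def categorical_pmf(states, phi):
    """Build the full pmf from k states and the k-1 free parameters phi.

    The probability of the last state is the remainder 1 - sum(phi).
    """
    if len(phi) != len(states) - 1:
        raise ValueError("phi must have one entry fewer than states")
    last = 1.0 - sum(phi)
    if any(p < 0.0 for p in phi) or last < 0.0:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    return dict(zip(states, list(phi) + [last]))


def sample_categorical(pmf, n, seed=0):
    """Draw n states with probabilities given by the pmf."""
    rng = random.Random(seed)
    return rng.choices(list(pmf), weights=list(pmf.values()), k=n)


pmf = categorical_pmf(["red", "green", "blue"], [0.5, 0.3])   # blue gets the remaining 0.2
n = 50_000
counts = Counter(sample_categorical(pmf, n))
freq = {s: counts[s] / n for s in pmf}    # relative frequencies, close to the pmf
print(pmf)
print(freq)
```

Since the states are labels rather than numbers, the code reports relative frequencies instead of a mean or variance.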