The Basics of Probability Theory

Objective: This webpage aims to make the foundational principles of Data Science more accessible to non-statisticians. Many definitions and examples are based on the book Fundamentals of Probability with Stochastic Processes (Ghahramani 2016).

1 Introduction

As a child, I loved playing soccer. Even now, whenever I ride my bicycle past a soccer field, I feel a sense of nostalgia, especially when I see very young children just starting out. They clumsily miss the ball, pass it in the wrong direction, or get distracted and forget all about the play. And that is perfectly natural.

Probability theory is very much the same. In the beginning, we stumble a lot, staring at proofs for hours without truly understanding them. When I first came across Ghahramani’s book Fundamentals of Probability with Stochastic processes, I must admit I felt anxious; it contained over 600 pages and even had thunderbolts on the cover. But here is the point: do not feel discouraged when learning probability. Like soccer, it takes practice, mistakes, and persistence before it finally clicks. And once it does, you begin to see its true beauty.

Mathematics as a whole is beautiful, but probability theory holds a special place because it forms the foundation of so many other fields: data science and machine learning are two important examples. If we imagine data science and machine learning as a house or apartment building, then probability forms the supporting pillars of the structure. Any variable or feature collected and analyzed by data scientists arises from a random or stochastic process. Randomness or stochasticity in this context means that there is uncertainty in how data is collected.

Whether we are studying the health of a penguin population on the Antarctic Peninsula or, in a very different example, the satisfaction rate of web users who purchase your product, it is often impossible to measure every penguin or gather feedback from every single customer. That is why data scientists collect data from a small, representative sample of the population and use statistical methods to draw conclusions about the whole. But to do that properly, we first need to understand the basics of probability theory.

Let us begin.

2 Three key ingredients

The Sample Space

We will start with a very simple example, the toss of a coin. Still widely used at the start of soccer, American football and cricket matches, the coin toss really never becomes old fashioned. In this, let us call it an “experiment”, there are two possible outcomes: heads (the portrait side of the coin) or tails (the opposite side).

In probability theory a possible outcome of an experiment is called a sample. Therefore it makes sense to define the sample space $S$ as the set of all possible outcomes of an experiment. The word “space” here does not literally refer to a physical space, but if it helps you, you can visualize a small cardboard box in your mind that contains both possible outcomes. Take a local coin and imagine the portrait side on one end of the imaginary box and the opposite side of the coin on the other end. In this way you can get a visual image of the sample space in your mind and it will help you to also better understand the mathematical notation.

Following Saeed Ghahramani’s book (Ghahramani 2016) I define the sample space $S$ of the toss of a coin in the following way

\[S = \{ \text{Heads}, \text{Tails} \}\] Now consider another example. If the coin lands on tails on the first toss, a dice is rolled. If it lands on heads, the coin is tossed again. Then we will have the following sample space

\[S = \{ (\text{Tails}, 1), (\text{Tails}, 2), (\text{Tails}, 3), (\text{Tails}, 4), (\text{Tails}, 5), (\text{Tails}, 6), (\text{Heads}, \text{Tails}), (\text{Heads}, \text{Heads})\}\] Again try to imagine a cardboard box with all the possible outcomes of the experiment, this will make the mathematical notation come alive!

The Event Space

Now we have a good understanding of the first key ingredient of probability theory, namely the sample space $S$, let us introduce the second one: the event space $\mathcal{F}$. In mathematics we like to use abbreviations like $S$ and $\mathcal{F}$, because they not only make the text more concise, but also turn mathematics into a universal language. When everyone around the world agrees on the same notation for a mathematical concept, it becomes much easier to communicate ideas and results with one another. The event space $\mathcal{F}$ is the set of all possible events. Try to imagine an event as a question that you could ask about the sample space. For example, coming back to the experiment with the coin and the dice roll, what are the possible outcomes if the first coin toss landed on heads? Let’s call this event $A$. Or what are the possible outcomes if the first coin toss landed on tails and the die roll is an odd number? Let us define this event as $B$. Then event $A$ and event $B$ are described by the following two sets of outcomes

\[ \begin{aligned} A &= \{(\text{Heads}, \text{Tails}), (\text{Heads}, \text{Heads}) \} \\[0.5ex] B &= \{(\text{Tails},1), (\text{Tails}, 3), (\text{Tails},5) \} \end{aligned} \] Also the empty set $\emptyset$ is called an event (what if nothing happens at all?) and the sample space $S$ itself (what are all possible outcomes?). The set or the collection of all possible events is the event space,

\[ \mathcal{F} = \{\emptyset, (\text{Tails}, 1), (\text{Tails}, 2), (\text{Tails}, 3), (\text{Tails}, 4), (\text{Tails}, 5), (\text{Tails}, 6), A, B, \ ... \ , S \} \] The event set might be a little bit more difficult to imagine, but perhaps you could imagine a television screen that every time you change the channel shows a different possible event. For example, the event that the coin landed on tails and the number eyes rolled on the dice was 3, or the event that the coin landed on tails and the number eyes on die roll was even etc. etc.

To make it more concrete, any feasible question about a data set can be described as an event. For example, we have a data set containing a large number of transactions. The data set could be considered the sample space $S$ and a question of interest, let’s say “which transactions are classified as fraudulent? as an event $A$ from the event space $\mathcal{F}$.

Let us now come back to a simpler example, namely the roll of a dice. Let us define $A$ (note that any alphabetical letter can be used to define events) as the event that the number of eyes rolled on the dice is smaller than four. And let us define $B$ as the event of rolling an odd number on the dice. Then events $A$ and $B$ are represented by the following sets

\[ \begin{aligned} A &= \{1,2,3 \} \\[0.5ex] B &= \{1,3,5 \} \end{aligned} \] Now, there are four concepts that I would like to discuss regarding sets: unions, intersections, differences and complements. In itself they are actually also called events. First of all, an event is called a union between two events $A$ and $B$ if it occurs whenever at least one of them occurs and is denoted by $A \cup B$. For example, the union between the events $A$ and $B$ that we just defined in our dice example would be

\[ A \cup B = \{1,2,3,5 \} \] You may see an event also as a light bulb that turns on or a bell that starts ringing whenever it occurs. In the case of the union $A \cup B$ the bell would start to ring whenever $A$ occurs or $B$ occurs or both $A$ and $B$ occur at the same time. An event is called the intersection between two events $A$ and $B$ if it occurs only whenever these two events happen simultaneously. In our example the intersection between event $A$ and $B$ denoted by $A \cap B$ is

\[ A \cap B = \{1,3\} \] An event is called the difference between two events $A$ and $B$ if it occurs whenever $B$ occurs but $A$ does not and is denoted by $B - A$. In our example the difference $B - A$ would be represented by

\[ B - A = \{5\} \] An event is called the complement of an event $A$ if it only occurs whenever $A$ does not occur and it is denoted by $A^c$. In our dice example $A^c$ would be give by

\[ A^c = \{4,5,6\} \] So $A^c$ is actually the event of rolling a number on the dice that is larger than 3. A simple way of understanding how events relate to one another is by drawing so called Venn diagrams, introduced by John Venn in 1880 in the Philosophical Magazine and Journal of Science. On the Venn diagrams shown below it can be seen very clearly what a union is, an intersection, a difference and a complement.

Based on these diagrams we can actually find new ways of relating events to one another. For example, looking at the Venn diagram that shows the difference $B - A$ we can see that actually $B - A = B \cap A^c$. Also note that the complement of the complement of $A$ is equal to $A$, that is $(A^c)^c = A$. And further note that the complement of the union of $A$ and $B$ equals the intersection between A complement and B complement $(A \cup B)^c = A^c \cap B^c$ and the complement of the intersection between $A$ and $B$ equals the union between $A$ complement and $B$ complement $(A \cap B)^c = A^c \cup B^c$, which are also called Morgan’s first and second law. Interesting, right? Lastly, there are also ways to relate three events or more to one another. For example, it can be proven that $(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$ and $(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$, which are called the distributive laws.

Let us conclude Venn diagrams by making an exercise. Prove that the following relation holds

\[ (A \cup B - A \cap B) = A \cap B^c \cup A^c \cap B \] To not get confused with the $\cup$ and $\cap$ notation we will use the shorthand notation $A \cap B = AB$ for intersections. Then the relation that we wish to prove becomes

\[ (A \cup B - A B) = A B^c \cup A^c B \] By using the relation $B - A = B \cap A^c$ we can rewrite the left part of the equation as

\[ (A \cup B - A B) = (A \cup B)(AB)^c \] Then, by using Morgan’s second law we can rewrite this as

\[ (A \cup B)(AB)^c = (A \cup B)(A^c \cup B^c) \] Now we can use the distributive laws to rewrite the right part as

\[ (A \cup B)(A^c \cup B^c) = A(A^c \cup B^c) \cup B(A^c \cup B^c) \] And again we can use the same distributive laws to rewrite this as

\[ A(A^c \cup B^c) \cup B(A^c \cup B^c) = AA^c \cup AB^c \cup BA^c \cup BB^c \] Since $AA^c = \emptyset$, $BB^c = \emptyset$ and $BA^c = A^cB$ we get our final result

\[ AA^c \cup AB^c \cup BA^c \cup BB^c = \emptyset \cup AB^c \cup A^cB \cup \emptyset = AB^c \cup A^cB \quad \blacksquare \] The black square indicates that the proof is complete. One of the joys of mathematics is that it often allows you to discover new ways to prove statements just by looking at your work from a different angle, even upside down! For example, you could start from the final statement and trace your steps backward all the way to the beginning, and it still makes perfect sense. Welcome to the world of mathematics; feel free to explore, experiment, and make yourself at home!

Probabilities

Then a last and final ingredient is missing, namely we still need to assign probabilities to all the possible events in the event set. Before doing so we should define the cornerstones on which probability theory is build, the so called probability axioms. Axioms are simple, indisputable and consistent statements that do not require any justification. To be able to read the axioms I should introduce some mathematical notation. First of all, there is a simple of writing sums using the capital greek letter $\Sigma$ (sigma). For example,

\[ 1 + 2 + 3 = \sum^3_{i=1} i \] Here, $i = 1$ below $\Sigma$ indicates where the summation should start and the $3$ on top of $\Sigma$ indicates where the summation should stop. Meaning that $i$ takes the values $1$, $2$ and $3$. The formula that we should evaluate here is simply $i$. Filling in $i = 1$ we get $1$, filling in $i = 2$ we get $2$ and filling in $i = 3$ we get 3. Adding up these three terms we get $1 + 2 + 3$. I hope it is clear how we can use the $\Sigma$ to write summations in a more compact way. It especially proves its usefulness for very large summations and even we can use summations to write out infinite sums. For example,

\[ 1 + 2 + 3 \,+ \, ... \, = \sum^{\infty}_{i=1} i \] Here, the summation starts at $i = 1$, but $i$ can take any positive integer value up to infinity. In the probability theory context we can do exactly that with unions and intersections. Let us write the union between three events $A_1$, $A_2$ and $A_3$ (which is just another way of defining three events $A$, $B$ and $C$) in a more compact way

\[ A_1 \cup A_2 \cup A_3 = \bigcup^{3}_{i=1} A_i \] And in the same way as the summations we could also write out an infinite union between the infinitely many events $A_1$, $A_2$, $A_3$, $A_4$ …

\[ A_1 \cup A_2 \cup A_3 \cup A_4 \, \cup \, ... \, = \bigcup^{\infty}_{i=1} A_i\] Next to some mathematical notation there is one last concept that I should explain before at last getting into the axioms, which is the concept of mutually exclusivity. For two events to be mutually exclusive is simply means that their intersection is the empty set $\emptyset$. Let me give an easy example, imagine again that we are rolling a dice. Let us define $A_1$ as the event of rolling an even number and $A_2$ as the event of rolling an odd number

\[ \begin{aligned} A_1 &= \{2,4,6 \} \\[0.5ex] A_2 &= \{1,3,5 \} \end{aligned} \] I think it is very clear from this example that when rolling a dice we cannot get an even number as well as an odd number in a single roll. Either the number is even or it is odd. Therefore, $A_1 \cap A_2 = \emptyset$ or we can abbreviate the expression as $A_1 A_2 = \emptyset$. To make the concept of mutually exclusiveness crystal clear I also made a Venn diagram.

As you can see the two circles do not intersect one another, which gives a visual representation of mutually exclusiveness. A set of events $\{A_1, A_2, A_3, A_4 \,...\,\}$ is called mutually exclusive if and only if every pair of them is mutually exclusive, meaning that for every $i\neq j$, $A_i A_j = \emptyset$. So for example $A_1 A_2 = \emptyset$, $A_1 A_3 = \emptyset$, $A_1 A_4 = \emptyset$, $A_2 A_3 = \emptyset$ etc. etc.

Now we are ready for the most important laws or axioms of probability theory. Let $S$ be the sample space of a random phenomenon. Suppose that to each event $A$ of the event space $\mathcal{F}$, a number denoted by $P(A)$ is associated. If $P$ satisfies the following axioms, then it is called a probability and the number $P(A)$ is said to be the probability of $A$

\[ \begin{aligned} \text{Axiom 1.} \quad & \mathbf{P}(A) \geq 0. \\ \text{Axiom 2.} \quad & \mathbf{P}(S) = 1. \\ \text{Axiom 3.} \quad & \text{If } \{A_1,A_2,A_3, ... \} \text{ is a set of mutually exclusive events} \, (A_i A_j = \emptyset \, \text{for every} \, i\neq j), \\ & \text{then } \mathbf{P}\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty}\mathbf{P}(A_i) \end{aligned} \]

Many results can be derived from these three axioms and I will just list the top of the iceberg that I find important for the rest of the topics that we are going to discuss.

Result 1: Let $\{A_1, A_2, A_3, A_4, \, ..., A_n \}$ be a set of mutually exclusive events. Then from axiom 3 it can be proven that

\[ P\left(\bigcup^n_{i=1} A_i\right) = \sum^n_{i=1} P(A_i) \] I find the formal proof in Ghahramani’s book quite beautiful, it uses the fact that the probability of the empty set $\emptyset$ equals $0$, $P(\emptyset) = 0$. Without giving a formal proof here, this intuitively makes a lot of sense if we look at axiom 2. Axiom 2 states that if we perform an experiment with sample space $S$, such as rolling a dice, it is certain that one of the possible outcomes in $S$ will occur. If we roll a dice for instance, it is guaranteed that we will roll either 1, 2, 3, 4, 5 or 6, which resembles the sample space in this experiment. If it is certain that one of the possible outcomes in $S$ will occur then we also know it is impossible for the empty set $\emptyset$ to occur, which is the complement of the sample space $S^c$.

Proof: Let us define all events $i > n$ to be the empty set, that is $P(A_i) = \emptyset$ for $i > n$ and suppose that we have an infinite series of mutually exclusive events. Then because all event $i > n$ are the empty set we can write

\[ P\left(\bigcup^n_{i=1}A_i\right) = P\left(\bigcup^{\infty}_{i=1}A_i\right), \] since adding up zeros will not change the equation. Then, because we have a series of mutually exclusive events we can use axiom 3 and the fact that all events $i > n$ have probability $0$ to obtain

\[ P\left(\bigcup^{\infty}_{i=1}A_i\right) = \sum^{\infty}_{i=1} P(A_i) = \sum^{n}_{i=1} P(A_i) \ + \sum^{\infty}_{i=n+1} P(A_i) = \sum^{n}_{i=1} P(A_i) \ + 0 = \sum^{n}_{i=1} P(A_i) \quad \blacksquare\] Therefore, for example if we have a set $\{A_1, A_2, A_3\}$ of mutually exclusive events, then

\[P(A_1 \cup A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3)\]

It is, however, important to note that if the set $\{A_1, A_2, A_3\}$ is not mutually exclusive we can not write the probability of the union of three events $A_1$, $A_2$ and $A_3$ as just the sum of the probabilities of the individual events. Let us first look back at the four Venn diagrams of the previous chapter. Note that it does not make any difference if we write two events as $A$ and $B$ or $A_1$ and $A_2$. Notice that if we add the area of $A$ and the area of $B$ together that we count their intersection twice. Therefore, we need to correct for that and the probability of the union of two events $A$ and $B$ that are not mutually exclusive becomes

\[ P(A \cup B) = P(A) + P(B) - P(AB) \] Now, for three events $A$, $B$ and $C$ that are not mutually exclusive we get the following Venn diagram

Notice that now we have four intersections between the three events: $AB$, $AC$, $BC$ and $ABC$. If we want to add up the areas of $A$, $B$ and $C$ we will again count the intersection between $A$ and $B$ twice. However, we will also count the intersection between $A$ and $C$ and the intersection between $B$ and $C$ twice. The dark blue part in the middle, intersection $ABC$ we will count three times in total if we add up $A$, $B$ and $C$. And therefore if we correct for the double counting of $AB$, $AC$ and $BC$, we actually subtract the middle dark blue part three times as well. So in order to keep the intersection between $A$, $B$ and $C$ we will have to add it up once more which gives the following formula

\[ P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC) \] Note again that this is the same as writing

\[ P(A_1 \cup A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3) - P(A_1A_2) - P(A_1A_3) - P(A_2A_3) + P(A_1A_2A_3) \] Result 2: For any event $A$, \[ P(A^c) = 1 - P(A) \] Since the proof is fairly easy to grasp I will show it here.

Proof: Since $AA^c = \emptyset$ we can use result 1 to get

\[ P(A \cup A^c) = P(A) + P(A^c) \] Since $A \cup A^c = S$ and using axiom 2 we get that

\[ 1 = P(S) = P(A \cup A^c) = P(A) + P(A^c) \] We have now that $1 = P(A) + P(A^c)$ and therefore we can rewrite the expression as

\[ P(A^c) = 1 - P(A) \quad \blacksquare \] Result 3. Let $S$ be the sample space of an experiment. If $S$ has $N$ points that are all equally likely to occur then for any event $A$ of $S$,

\[ P(A) = \frac{N(A)}{N},\] where $N(A)$ is the number of points of $A$.

Proof: Suppose that a sample space $S$ has $N$ sample points

\[ S = \{ s_1, s_2, ..., s_N\} \] Then, if all sample points are equally likely to occur, then any sample point $s_i$ has the probability $P(s_i) = 1/N$. Suppose now that we have an event $A$ that consists of $N(A)$ sample points. Here, I will define $A$ as

\[ A = \{ a_1, a_2, ..., a_{N(A)}\} \subseteq S \]

$a_1, a_2, ...,a_{N(A)}$ can be equal to any combination of sample points in $S$, for example $A = \{s_1, s_3, s_5\}$, where $a_1 = s_1$, $a_2 = s_3$ and $a_3 = a_{N(A)} = s_5$. Then, because $\{ a_1, a_2, ...,a_{N(A)} \}$ is a set mutually exclusive events we can use result 1 to obtain

\[ P(A) = P(a_1 \cup a_2 \cup ... \cup a_{N(A)}) = P(a_1) + P(a_2) + ... + P(a_{N(A)}) = P(A) = \underbrace{\frac{1}{N} + \frac{1}{N} + \dots + \frac{1}{N}}_{N(A) \text{ terms}} = \frac{N(A)}{N} \quad \blacksquare \] To conclude, the outcome or sample space $S$, the event space $\mathcal{F}$ and the probability $\mathbf{P}$ that is assigned to each and every event in $\mathcal{F}$ form together the universe of this exciting topic of probability theory.

3 Permutations and Combinations

Imagine we have a standard deck of 52 playing cards. If you are dealt 5 cards, what is the probability that at least one of them is a queen?

For most of us, including myself during my university studies, learning about permutations and combinations requires a lot of contemplation and practice. Therefore, I will introduce all the necessary ingredients that are needed to grasp them step by step. At the end of this chapter you will be able to answer questions about playing cards and many other examples with much ease.

First of all, we will discuss the counting principle. Suppose that there are three routes from town A to town B, and three routes from town B to town C. If we decide to go from town A to town C passing through B, then for each route that we choose from A to B, we have three choices to get from B to C. This might sound confusing. Personally, concepts become much clearer when I visualize them. Let us take a look at the following tree diagram.

Our starting point is clearly town A. Then, as you can see there are three ways to go from town A to town B, namely via route 1, 2 and 3. Then again, from town B there are three ways to travel to town C, namely via route 4, 5 and 6. This results in nine possible ways to travel from town A to town C passing through B: from town A to B we could take route 1 and from town B to C we could take route 4, from A to B we could take route 1 and from town B to C we could take route 5, from town A to B we could take route 1 and from town B to C we could take route 6 etc. etc. Let it be clear that for each route that we choose from town A to B, there are again three routes that we can choose from town B to C. This results in a multiplication of the number of options offered in the first choice by the number of options offered in the second choice. The counting principle states that if there are two independent choices to be made, with the first offering $a$ options and the second offering $b$ options, then the total number of ways to choose is $a \times b$. In our traveling example we were offered three different options to go from town A to town B and three different options to go from town B to C, which results in $3 \times 3 = 9$ ways to go from town A to town C via town B.

There is a more formal definition of the counting principle, but since the purpose of this webpage is to make probability theory more accessible for non-statisticians, I will try to avoid abstract definitions whenever possible. However, I do believe that formal definitions are very important. They prevent ambiguity and provide a framework for proving certain statements and building on top of these proven statements. That said, these formal definitions are less relevant for data scientists, so we will not go into them here.

Next, we are going to extend the counting principle to the generalized counting principle. Suppose now there is a fourth town D that we can reach from town A by passing through town B and town C and suppose that again there are three different ways in which we can travel from town C to town D. Then the generalized counting principle states that the total number of ways to choose can be calculated by multiplying the number of options offered in the first choice by the number of options offered in the second choice by the number of options offered in the third choice. Therefore, there are $3 \times 3 \times 3 = 3^3 = 27$ ways to travel from town A to D via town B and C. We could repeat this process for a fifth town E and a sixth town F etc., but we would just keep multiplying the number of options there are to travel to each town. If the generalized counting principle is not clear, then have look again at the tree diagram from before. Imagine that for each of the nine options to travel from town A to town C via B there are again three options to travel from town C to town D. Then, by simply applying the counting principle we could say that traveling from town A to town C via B is one independent choice and traveling from town C to town D is one independent choice. This results in $9 \times 3 = 27$ ways. Therefore, another way of looking at the generalized counting principle is that it actually applies the counting principle multiple times.

Choosing ones Breakfast

Let us discuss another example to make the generalized counting principle crystal clear. Suppose, like me, every morning you make porridge for breakfast. I like to add a few spices to my porridge, such as cinnamon powder and turmeric powder. I also like to add different kinds of fruit, such as blueberries, green apple, wild peach, or strawberries. Lastly, there are various toppings I like, such as organic peanut butter, flax seeds, or chia seeds. Now, suppose that every morning you must choose one out of three different spices, one out of four different kinds of fruit, and one out of seven toppings. Using the generalized counting principle, and without having to draw a huge tree diagram, we can immediately see that there are

\[ 3 \times 4 \times 7 = 84 \]

different ways to prepare your porridge. This simple multiplication allows us to count all possibilities quickly and efficiently.

Lunchtime

Suppose now that for lunch we are making pasta and while preparing the sauce for the pasta the order in which we put all the ingredients matters. Let us define a set $I$ that contains ten ingredients that one could use to prepare a delicious pasta sauce

\[ I = \{\text{Tomato}, \text{Aubergine}, \text{Paprika}, \text{Salt}, \text{Black Pepper}, \text{Paprika Powder},\\ \text{Dried Oregano}, \text{Fresh Basil leaves}, \text{Grated Cheese}, \text{A few Spoons of Olive Oil}\} \] Suppose we take three ingredients out of set $I$, then in how many different ways can we order them? Suppose we take the tomato, dried oregano and a grated cheese and list the different orderings

\[ \ \{ \text{Tomato}, \text{Dried Oregano}, \text{Grated Cheese} \} \\ \ \{ \text{Tomato}, \text{Grated Cheese}, \text{Dried Oregano} \} \\ \ \{ \text{Dried Oregano}, \text{Grated Cheese}, \text{Tomato} \} \\ \ \{ \text{Dried Oregano}, \text{Tomato}, \text{Grated Cheese} \} \\ \ \{ \text{Grated Cheese}, \text{Dried Oregano}, \text{Tomato} \} \\ \ \{ \text{Grated Cheese}, \text{Tomato}, \text{Dried Oregano} \} \] Therefore, there are six different ways to order this list of three ingredients. Formally, we call the lists above 3-element permutations of a set containing 3 objects, also denoted by $\,_{3}P_{3}$. When the sets get bigger, for example we would like to know the number of 5-element permutations of a set of 5 ingredients, then we might not be able to list all the possibilities by hand. Therefore, we use the generalized counting principle again. Yet, this time as we choose an option the set becomes smaller. For example, for the first ingredient that we wish to put in the sauce we can choose one out of the three ingredient items, which give us three options to choose from. Then with one ingredient item already chosen, we have two options left. And finally there is only one ingredient left to add to the sauce, which is our remaining option. Using the generalized counting principle, this gives us $3 \times 2 \times 1 = 6$ ways to add the three pasta ingredients. Notice that this exactly corresponds to the number of orderings we made by hand!

Let us now add paprika powder and a few spoons of olive oil to the set

\[ \{\text{Tomato}, \text{Paprika Powder}, \text{Dried Oregano}, \text{Grated Cheese}, \text{A few Spoons of Olive Oil}\} \] In how many ways can we order this list of ingredients? By the generalized counting principle we know that there are $5 \times 4 \times 3 \times 2 \times 1 = 120$ different ways to order this list of five ingredients. Formally we call this 5-element permutations of a set of 5 objects. Having paid close attention, you might have noticed that there is a general pattern in these calculations. If not, then do not worry, this is what we will discuss next. The patterns $3 \times 2 \times 1$ and $5 \times 4 \times 3 \times 2 \times 1$ are called factorials and are denoted by $3!$ and $5!$, respectively. I can imagine that you might not be familiar with the factorial or surprise notation “!”. To give a little bit of context, it was introduced by Christian Kramp in 1808 and according to Saeed Grahrahmani that was perhaps because $n!$ (here $n$ could be 0 or any positive integer) gets surprisingly large even for small numbers (Ghahramani 2016). Here are some examples

\[ \begin{aligned} 0! &= 1 \\ 1! &= 1 \\ 2! &= 2 \times 1 = 2 \\ 3! &= 3 \times 2 \times 1 = 6 \\ 4! &= 4 \times 3 \times 2 \times 1 = 24 \\ 5! &= 5 \times 4 \times 3 \times 2 \times 1 = 120 \\ 6! &= 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720 \\ 7! &= 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 5040 \\ 8! &= 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 40320 \\ 9! &= 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 362880 \\ 10! &= 10 \times 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 3628800 \end{aligned} \]

So as you can see the general pattern is

\[ n! = n(n-1)(n-2) \cdots 3 \cdot 2 \cdot 1, \]

for any positive integer $n$, and $0!$ is being defined as equal to 1. Also observe that for example $2! = 2 \times 1!$, $3! = 3 \times 2!$, $4! = 4 \times 3!$ etc., which can be generalized as $n! = n \times (n-1)!$.

Although the concept of the formula above first appeared in the works of Persian and Arab mathematicians in the twelfth century, there are indications that the mathematicians of India were aware of this rule already a few hundred years before Christ. Amazing, right?

Now let us add the final ingredient before we can fully understand permutations. Suppose we would like to know out of the list of 10 pasta ingredients that was mentioned before how many smaller sets of three ingredients we can make, where the order in which the ingredients are chosen matters. Then, we simply have 10 ingredients that we could choose from at first. Secondly, for each of these 10 possible choices of ingredients we have 9 ingredients left to choose from (one ingredient was already chosen in the first choice and we can not choose it again) and for each of those 9 ingredients we have 8 ingredients left to choose from, which results in $10 \times 9 \times 8 = 720$ sets of three ingredients where the order matters. These are called 3-element permutations of a set of 10 objects, denoted by $\,_{10}P_{3}$. Now we will introduce a general formula for the number of $r$-element permutations of a set containing $n$ objects

\[ \begin{aligned} \,_nP_r &= n(n-1)(n-2) \dots (n - (r-1)) \\ &= n(n-1)(n-2) \dots (n - r + 1) \end{aligned} \] which we can rewrite as

\[ \begin{aligned} \,_nP_r \cdot (n-r)! &= [n(n-1)(n-2) \cdots (n - r + 1)][(n-r)(n-r-1) \cdots 3 \cdot 2 \cdot 1] \\ &= n! \end{aligned} \] Let us take $\,_{10}P_{3}$ as an example to understand why this is the case. According to the general formula we can write $\,_{10}P_{3}$ out as

\[ \,_{10}P_{3} = 10 \cdot 9 \cdot 8, \] where $n = 10$ and $r = 3$. Now if we multiply $\,_{10}P_{3}$ by $(n-r)! = (10-3)! = 7!$ we get

\[ \,_{10}P_{3} \cdot 7! = 10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdots 3 \cdot 2 \cdot 1 = 10! \] I hope it clear now why the multiplication of $(n-r)!$ works.

By rewriting $\,_nP_r \cdot (n-r)! = n!$ we get

\[ \,_nP_r = \frac{n!}{(n-r)!} \]

And also by filling in $r = n$ we have that

\[ \begin{aligned} \,_nP_n &= \frac{n!}{(n-n)!} \\ &= \frac{n!}{(\color{red}{n} - \color{red}{n})!} \\ &= \frac{n!}{0!} \\ &= n! \end{aligned} \] since $0! = 1$. Great!

Before we move on to combinations I will give one last permutations example based on Ghahramani’s book. Suppose two mathematics books, four computer science books, three statistics, three biology, and five music books are put randomly on a bookshelf. In how many ways the books of the same subject can be put together?

First note that based on what we learned about permutations we know that two mathematics books can be ordered in $\,_{2}P_{2} = 2! = 2 \cdot 1 = 2$ different ways, namely $\{\text{Mathematics 1}, \text{Mathematics 2}\}$ and $\{\text{Mathematics 2}, \text{Mathematics 1}\}$. The four computer science books can be ordered in $\,_{4}P_{4} = 4! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$ different ways, three statistics books can be ordered in $\,_{3}P_{3} = 3! = 3 \cdot 2 \cdot 1 = 6$ ways, the three biology books can also be ordered in $\,_{3}P_{3} = 3! = 3 \cdot 2 \cdot 1 = 6$ ways and lastly five music books can be ordered in $\,_{5}P_{5} = 5! = 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 120$ ways. Now it is important to see that for each permutation of the mathematics books (for example $\{\text{Mathematics 1}, \text{Mathematics 2}\}$) there are 24 ways to order the computer science books, 6 ways to order the statistics books, 6 ways to order the biology books and 120 ways to order the music books. According to the generalized counting principle there are therefore $2 \cdot 24 \cdot 6 \cdot 6 \cdot 120 = 207360$ permutations or different ways in which we can order the books on the bookshelf where each subject is together and where the order of the subjects is mathematics, computer science, statistics, biology and music. So the last thing we need to take into account is that we can also order the subjects in different ways on the shelf. Since we have 5 subjects there are $\,_{5}P_{5} = 5! = 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 120$ ways to order the different subjects on the shelf. To conclude, this gives us $120 \cdot (2 \cdot 24 \cdot 6 \cdot 6 \cdot 120) = 24883200$ different ways we can order the books on the bookshelf where the books of each subject are together. I can imagine now that librarians have decision stress!

Introducing Combinations

Next we will discuss combinations and how they differ from permutations. Suppose we have a mini deck of four playing cards: 11 (hearts), Jack (hearts), 11 (spades) and Jack (spades). If two cards are dealt in how many different hands can we get exactly one jack? Let us visualize the different possible hands in a tree diagram

The tree diagram shows that first we need to choose whether we want to have the 11 card first or the Jack card, this corresponds to $\,_{2}P_2 = 2! = 2 \times 1 = 2$ permutations. The second choice is to choose the colour of the first card and since we have only two colours in this mini deck this corresponds to $\,_{2}P_1 = 2$ permutations. Lastly we need to choose the colour of the remaining card, which also corresponds to $\,_{2}P_1 = 2$ permutations. By using the generalized counting principle we know that this results in $2 \times 2 \times 2 = 2^3 = 8$ different hands of two cards where one of them is a jack. Those who are familiar with playing cards will have noticed that receiving 11 hearts first and jack hearts second and vice versa do not have different meanings in the game. This will prove to be the key point in explaining the difference between permutations and combinations. Let me illustrate this using our previous bookshelf example.

We could see the different ranks of the playing cards as the different books of each subject and the colours as the different subjects. Suppose we have two mathematics books and two computer science books. Now out of the four books we can choose only two to put on the bookshelf and one of them has to be a mathematics book for example. In that case both of the subjects will be on the shelf and thus the first choice we need to make is to put the mathematics book first or the computer science book. This gives us $\,_{2}P_{2} = 2! = 2 \times 1 = 2$ options. Next, note that we have two different mathematics books and two different computer science books, from which we can both make $\,_{2}P_{1} = 2$ permutations. So in the end we get $2 \times 2 \times 2 = 2^3 = 8$ ways to put two books on the shelf where the subject mathematics is included.

However, there is an important difference between the bookshelf example and the playing card example. Whereas in the bookshelf example putting one mathematics book first or the other has a significance, in the playing card example there is actually no significance in the game between getting the 11 card first and the jack card second or vice versa. Therefore, the number of hands can be reduced and we can see that the upper part and the lower part of the tree diagram shown before are actually the distinguishable. Thus, there are 4 different hands of two cards with exactly one jack and formally we call these different hands combinations. In the case of combinations the order of the elements in a set does not matter or in other words, different orders have the same meaning. In order to calculate the number of combinations in our mini deck example we simply divide every $r$-element permutations by $r!$ (the number of ways we can permute a set of r elements). Thus, the calculation of the number hands we can receive a jack becomes

\[ \frac{\,_{2}P_{2}}{2!} \times \frac{\,_{2}P_{1}}{1!} \times \frac{\,_{2}P_{1}}{1!} = \frac{2}{2} \times \frac{2}{1} \times \frac{2}{1} = 1 \times 2 \times 2 = 4 \] Here it can be seen very clearly that the first choice between having the 11 card or jack card first is redundant, since the first part of the calculation reduces to 1. The general formula of the number of $r$-element combinations of a set of $n$ objects is given by

\[ \,_nC_r = \frac{n!}{(n-r)! \, r!} \]

Closely observe here that this is exactly what we get if we divide $\,_{n}P_{r}$ by $r!$

\[ \frac{\,_{n}P_{r}}{r!} = \,_{n}P_{r} \cdot \frac{1}{r!} = \frac{n!}{(n-r)!} \cdot \frac{1}{r!} = \frac{n!}{(n-r)!r!} = \,_{n}C_{r} \] There is a special notation for the $r$-element combination formula

\[ \,_nC_r = \frac{n!}{(n-r)! \, r!} = \binom{n}{r} \] which is pronounced as “choose r out of n” and we will use this notation in the rest of this chapter.

Let’s practice!

Let us end the chapter with two permutation examples and two combination examples to make both of these concepts crystal clear. Also I will show how we can use permutations and combinations to calculate probabilities.

Exercise 1: Ariana puts one piece of fruit in her child’s lunch bag every day to school. (a) If she has two bananas and three pears for the next five school days, in how many ways can she do this? (b) What is the probability of not having the same fruit on two consecutive school days?

To answer permutation questions I always first like to picture an example of one of the possible outcomes. In this way it becomes clear how all the possible outcomes should look like. One way to distribute the fruit over the week is

\[ \{\text{Mon: Banana}, \text{Tue: Banana}, \text{Wed: Banana}, \text{Thu: Pear}, \text{Fri: Pear} \} \] We can order this set of five fruits in $\,_{5}P_{5} = 5! = 120$ different ways. However, some of these permutations will not be distinguishable. Imagine all possible orders as a big tree diagram. Some branches lead to identical outcomes because some the bananas and the pears amongst one another are indistinguishable. By ‘pruning’ these branches, we account for overcounting, which mathematically shows up as dividing by the number of permutations of the identical fruits. Therefore, we should divide the 120 options by the number of permutations we can make with the bananas, which is $\,_{3}P_{3} = 3! = 6$ and divide by the number of permutations we can make with the pears, which is $\,_{2}P_{2} = 2! = 2$. This gives

\[ \frac{\,_{5}P_{5}}{\,_{3}P_{3}\,_{2}P_{2}} = \frac{5!}{3!\,2!} = \frac{120}{12} = 10 \ \text{ways} \]

To answer part (b) I will again picture one of the possible outcomes

\[ \{\text{Mon: Banana}, \text{Tue: Pear}, \text{Wed: Banana}, \text{Thu: Pear}, \text{Fri: Banana} \} \] Since the fruits are interchangeable, this is actually the only way in which we could have not the same fruit on two consecutive days. Let us define event $A$ as such and calculate the probability

\[ P(A) = \frac{N(A)}{N} = \frac{1}{10} = 0.1 \] In conclusion, there is a 10% probability of not having the same fruit on two consecutive days.

Exercise 2: Ginny puts at most one piece of fruit in her child’s lunch bag every day. (a) If she has three bananas and two pears for the next eight lunches of her child, in how many ways can she do this? (b) What is the probability of having no fruit on three consecutive school days?

Again I will picture one of the possible outcomes

\[ \{\text{Mon: Banana}, \text{Tue: Banana}, \text{Wed: Banana}, \text{Thu: Pear}, \text{Fri: Pear},\\ \text{Mon: No Fruit}, \text{Tue: No Fruit}, \text{Wed: No Fruit} \} \] We could tackle this permutation problem again by seeing that there are $\,_{8}P_{8} = 8! = 40320$ possible ways to order a set of eight objects. However, we should divide by the number of ways we can permute the bananas, the pears and the days with no fruit. This results in the following calculation

\[ \frac{\,_{8}P_{8}}{\,_{3}P_{3}\,_{2}P_{2}\,_{3}P_{3}} = \frac{8!}{3!\,2!\,3!} = \frac{40320}{72} = 560 \ \text{ways} \] To answer part (b) I will picture three examples of having no fruit on three consecutive school days. Here is example 1:

\[ \{\text{Mon:} \ \textbf{No Fruit}, \text{Tue:} \ \textbf{No Fruit}, \text{Wed:} \ \textbf{No Fruit}, \text{Thu: Pear}, \text{Fri: Pear},\\ \text{Mon: Banana}, \text{Tue: Banana}, \text{Wed: Banana} \} \] Here is example 2:

\[ \{\text{Mon: Banana}, \text{Tue:} \ \textbf{No Fruit}, \text{Wed:} \ \textbf{No Fruit}, \text{Thu:} \ \textbf{No Fruit}, \text{Fri: Pear},\\ \text{Mon: Pear}, \text{Tue: Banana}, \text{Wed: Banana} \} \] And lastly example 3:

\[ \{\text{Mon: Banana}, \text{Tue: Banana}, \text{Wed:} \ \textbf{No Fruit}, \text{Thu:} \ \textbf{No Fruit}, \text{Fri:} \ \textbf{No Fruit},\\ \text{Mon: Pear}, \text{Tue: Pear}, \text{Wed: Banana} \} \] Notice that we can shift this block of three days without fruit six times (1. Mon, Tue, Wed - 2. Tue, Wed, Thu - 3. Wed, Thu, Fri - 4. Thu, Fri, Mon - 5. Fri, Mon, Tue - 6. Mon, Tue, Wed). Having chosen the three consecutive days without having fruit, we still have to choose on which of the remaining five days we give bananas and on which days we give the pears. This we have already done in exercise 1 and we know that the answer is 10. Therefore, for each of the six possible blocks of three days on which the child does not receive fruit there are 10 different ways to order the fruits. This give $6 \times 10 = 60$ ways of having no fruit on three consecutive days. Let us define the event $A$ as such, then the probability $P(A)$ becomes

\[ P(A) = \frac{N(A)}{N} = \frac{60}{560} \approx 0.107 \] In conclusion, there is a 11% probability of receiving no fruit on three consecutive school days. I would advise Ginny to buy some extra fruits so we can prevent this from happening!

Exercise 3: Suppose we have a mini deck of 6 cards: 10 (hearts), 11 (hearts), Jack (hearts), 10 (spades), 11 (spades) and Jack (spades). (a) When three cards are dealt, in how many different hands can we get exactly one Jack and (b) What is the probability of obtaining exactly one Jack?

First of all, we should derive from the context whether the order of the hands matter. This is not the case in a playing card example. Therefore, we know we should calculate the number of hands using combinations and not permutations. The first choice that needs to be made is what the first card will be in the hand that we receive. Since the order does not matter, let us assume this is the jack. In order to receive a jack there are

\[ \,_{2}C_{1} = \binom{2}{1} = \frac{2!}{1! \cdot 1!} = \frac{2 \times 1}{1} = 2 \ \text{options}, \] or we get a jack (hearts) or we get a jack (spades).

The second choice we need to make is the next two cards. Since there are only four cards left to choose from (we can not choose the jack anymore, since we want exactly one jack) and since these two cards can be of any type or rank, this gives

\[ \,_{4}C_{2} = \binom{4}{2} = \frac{4!}{2!2!} = \frac{4 \times 3 \times \color{red}{2 \times 1}}{\color{red}{2 \times 1} \times 2 \times 1} = \frac{12}{2} = 6 \ \text{options} \] Therefore, using the counting principle there are $2 \times 6 = 12$ different hands of three cards with exactly one jack.

To answer question (b) there are

\[ \,_{6}C_{3} = \binom{6}{3} = \frac{6!}{3!3!} = \frac{6 \times 5 \times 4 \times \color{red}{3 \times 2 \times 1}}{\color{red}{3 \times 2 \times 1} \times 3 \times 2 \times 1} = \frac{120}{6} = 20 \ \text{options} \] or different hands if from a mini deck of six playing cards where three cards are dealt. Therefore, if we define $A$ as the event of obtaining exactly one jack in a hand of three cards, the probability $P(A)$ of obtaining one jack is

\[ P(A) = \frac{N(A)}{N} = \frac{12}{20} = \frac{3}{5} = 0.6 \]

Exercise 4: Suppose we have a standard deck of 52 playing cards. (a) If five cards are dealt, how many different hands contain at least one queen? (b) What is the probability of obtaining at least one queen?.

It is important to note the words “at least”, since it means that a hand could have one, two, three or four queens (there are four queens in a standard deck). The number of different hands that contain one queen is equal to choosing one out of four queens

\[ \binom{4}{1} = 4 \] And the other four cards will be chosen from the remaining 48 cards that do not contain any queens

\[ \binom{48}{4} = 194580 \] By the counting principle there are

\[ \binom{4}{1} \cdot \binom{48}{4} = 778320 \] different ways of obtaining one queen in a hand of five cards. In the same way there are

\[ \binom{4}{2} \cdot \binom{48}{3} = 103776 \] ways of obtaining two queens and a similar calculation holds for three queens and four queens. Summing up the number of different hands in which we get one, two, three or four queens we have

\[ \binom{4}{1} \cdot \binom{48}{4} + \binom{4}{2} \cdot \binom{48}{3} + \binom{4}{3} \cdot \binom{48}{2} + \binom{4}{4} \cdot \binom{48}{1} = \sum_{q=1}^{n} \binom{4}{q} \cdot \binom{48}{5-q} \] where the number of queens $q = 1, 2, 3, 4$. Then, in order to find the probability we simply divide this number by the total number of different hands of five cards in a standard deck of 52 playing cards. Let $A$ be the event of having at least one queen in a hand of five cards, then $P(A)$ is

\[ P(A) = \frac{\sum_{q=1}^{n} \binom{4}{q} \cdot \binom{48}{5-q}}{\binom{52}{5}} \approx 0.34 \] This probability can also be obtained by using the complement rule. We could calculate the number of different hands that do not contain any queens and calculate the probability of obtaining at least one queen as 1 minus the probability of obtaining no queens. This results in the following calculation

\[ P(A) = 1 - \frac{N(A^c)}{N} = 1 - \frac{\binom{48}{5}}{\binom{52}{5}} \approx 0.34 \] yielding the same result.

4 Conditional Probability (*)

Two other important concepts in probability theory are conditional probabilities and independence. They also play a vital role in data science as we will see later on. Suppose that all the freshman of the bachelor Econometrics & Operations Research at the Vrije Universiteit Amsterdam took probability theory and linear algebra during the last semester. Suppose that 55% of the students passed probability theory, 70% passed linear algebra, and 45% passed both. If a randomly selected freshman is found to have passed linear algebra last semester, what is then the probability that he or she also passed probability theory?

Given that we know certain information, or in other words that a certain event has occurred, probabilities may change. Let us call $A$ the event that a randomly selected student passed probability theory and $B$ the event that he or she passed linear algebra. Then $P(A) = 0.55$, $P(B) = 0.70$ and $P(A \cap B) = 0.45$ are the probabilities of a student passing probability theory, a student passing linear algebra and the probability that he or she passed both, respectively. Knowing that $B$ has occurred changes the chances of the occurrence of $A$. This probability is denoted as $P(A \ | \ B)$ and it should be read as the probability of $A$ given $B$.

Now, let’s say we have 100 students, then out of 100 students 70 would have passed linear algebra, 55 would have passed probability theory and 45 of them would have passed both. Then given that $B$ occurred, we know already that he or she is in the group of 70 students. Further, we know that 45 students in that group also passed probability theory. This translates into the proportion $45/70 \approx 0.64$ of linear algebra graduates that also graduated for probability theory. So, the probability of passing probability theory is somewhat higher if you have already passed linear algebra.

5 Independence (*)

If the probability of passing probability theory and the probability of passing linear algebra were to be independent form one another, then $P(A) \cdot P(B) = P(A \cap B)$ and $P(A \ | \ B) = P(A)$. Which makes a lot of sense intuitively, knowing that $B$ occurred does not change the probability of $A$, since these two events are independent from one another. Independence also comes from $P(A \cap B) = P(A \ | \ B) \cdot P(B) = P(A) \cdot P(B)$.

6 Random Variables (RV’s)

Introducing Random Variables

The easiest way to define a random or stochastic variable is perhaps how Grhahramani states it in his book: it is a quantity that does not have a fixed value. He gives the example of the number of babies born in a certain hospital. But of course there are numerous examples of quantities without fixed values. To illustrate a few other examples: the amount of rainfall today in your hometown, the end score of the soccer match final of your favorite soccer team, the amount of traffic you experience on your way to work and even the thoughts and emotions that you experience on a given day follow a stochastic process. In short, any quantity where uncertainty plays a roll we call a random variable.

Following the book I will give one illustration and then the formal definition of a random variable $X$. Imagine again two dice with the following sample space $S$

\[ S = \{(x_1,x_2): x_1,x_2 \in \{1,2,3,4,5,6\} \}, \] which can be read as that $S$ contains elements of the form $(x_1,x_2)$ such that $x_1$ and $x_2$ are numbers from the set $\{1,2,3,4,5,6\}$. Examples are: $(1,1),(1,2),(1,3),(2,5),(4,5),(6,6)$ etc. etc. In other words, the sample space $S$ is limited by the fact that a dice can only take six different values. Further, $x_1$ and $x_2$ each represent one roll and $(x_1,x_2)$ represents the outcome of the two rolls. I hope this representation of the sample space of two dice rolls is crystal clear now.

Then let us define a random variable $X$ as the sum of two dice rolls. Note that $X$ has the possible outcomes $\{2,3,4,5,6,7, ...,12\}$ which is also called the range of $X$. The minimum value of $X$ is equal to $2$ (throwing two 1’s) and the maximum value is $12$ (throwing two 6’s). The random variable $X$ takes an element from the sample space as its input and transforms it according to how $X$ is defined. This phenomenon is also called a mapping in mathematics.

To illustrate what a mapping is I will use a very simple function that you might remember from high school

\[ y(x) = \frac{1}{2}x \] In mathematics we write lower case letters for quantities that are deterministic and upper case letters for quantities that are random. In high school we were always dealing with deterministic variables and therefore we write expressions with lower case letters, like in the example above. The mapping of this function or expression is illustrated by the following picture

The two oval spheres represent the sets of values that $x$ and $y(x)$ can take respectively. $x$ can take any real number, which is represented in mathematics by $\mathbf{R}$. Any number between $-\infty$ and $\infty$ is a real number. The same holds for $y(x)$, which can also take any real number. Formally, we would write out this expression as

\[ y: \mathbf{R} \to \mathbf{R} \] As you can see on the picture any real value or number $x$ that you enter into $y(x)$ (or abbreviated as $y$) gets transformed to $1/2x$: $2$ becomes $1$, $4$ becomes $2$, $6$ becomes $3$, $8$ becomes $4$ etc. But note that you can also enter decimal numbers: $1.5$ becomes $0.75$, $1.42$ becomes $0.71$ etc. etc. Decimal numbers are also part of $\mathbf{R}$.

Being a little more familiar with mappings it is time to introduce the formal definition of a random variable $X$. Let $S$ be the sample space of an experiment. A real-valued function $X: S \to \mathbf{R}$ is called a random variable of the experiment if, for each interval $I \subseteq \mathbf{R}$, the set $\{s: X(s) \in I\}$ is an event. This might look a little bit complicated at first hand, but let us go through the definition step by step. First of all, $X: S \to \mathbf{R}$ is easy to understand, since it is very similar to the mapping that we have seen before. However, the elements that are entered into the transformation or mapping are now coming from the sample space $S$ and they get transformed by $X(s)$ or simply $X$ to a real-valued number. Again consider two dice rolls and let $X$ be the sum of the two rolls. Then, we have the following mapping,

\[ X: S = \{(x_1,x_2): x_1,x_2 \in \{1,2,3,4,5,6\} \} \to \{2,3,4,5,6,7,...,12\} \] Again I will show a picture to make the idea of a random variable very clear

Here $(1,1)$ (rolling two 1’s) gets transformed into $2$ (the sum of two dice rolls with the outcome 1), $(2,2)$ gets mapped into $4$ and $(3,3)$ into $6$ etc.

The second part of the definition may be challenging to grasp: for each interval $I \subseteq \mathbf{R}$, the set $\{s: X(s) \in I\}$ is an event. First of all, note that an event, let’s say $A$, is always a subset of the sample space $S$ and $A = \{s: X(s) \in I\}$ is just another way of defining an event. $A$ is not a subset or interval of the range of $X$. Suppose we do have a subset or interval $I$ — a subset being discrete, e.g. $\{1,2,3,4,5,6\}$ and an interval being continuous, e.g. $[1,6]$, which includes decimal numbers — then the outcomes of the sample space $S$ that correspond to the values in $I$ should together form an event. I will give a simple example for the discrete case. Let us take $X$ again as the sum of two dice rolls and let us see if the condition for random variables holds. In this example it means that for each subset $I \subseteq \{2,3,4,5,6,7,...,12\}$ — so for each subset of the range of $X$ — the set of corresponding outcomes of the sample space $\{s: X(s) \in I\}$ is an event. If we take $I = \{2,3\}$, the corresponding outcomes of the sample space would be $A = \{(1,1),(1,2),(2,1)\}$, which is the event that the sum of the two dice rolls is smaller than $4$. And if $I = \{11,12\}$, then $A = \{s: X(s) \in I\} = \{(5,6),(6,5),(6,6)\}$, i.e. the event that the sum of the two dice rolls is larger than 10.

Distribution functions (*)

As we have seen, probability theory is built on three key ingredients: the sample space $S$, the event space $\mathcal{F}$ and a probability measure $\mathbf{P}$ that assigns a probability to each event in $\mathcal{F}$. Random variables provide a convenient way to express and compute the probabilities of events. Let us return to our favorite example: rolling two dice. Define the random variable $X$ as the sum of the numbers showing on the two dice. Then,

\[ \begin{aligned} P(X = 2) &= P(\{(1,1)\} = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36} \\ P(X = 3) &= P(\{(1,2),(2,1)\}) = \frac{1}{36} + \frac{1}{36} = \frac{2}{36} = \frac{1}{18} \\ P(X = 4) &= P(\{(2,2),(1,3),(3,1)\}) = \frac{1}{36} + \frac{1}{36} + \frac{1}{36} = \frac{3}{36} = \frac{1}{12} \\ \vdots \\ P(X = 12) &= P(\{(6,6\}) = \frac{1}{36} \end{aligned} \] This example illustrates how random variables allow us to define events concisely, without having to necessarily list every individual outcome. The calculations rely on the fact that all outcomes in the sample space $S$ are mutually exclusive. Thus, the probability of an event involving multiple outcomes is simply the sum of the probabilities of the individual outcomes. Often we are not only interested in probabilities of the form $P(X = a)$, where $a \in Range(X)$, but also in probabilities of the form $P(X < a)$. For example, what is the probability that the sum of two dice rolls is at least 8? Mathematicians have designed a special function for this purpose. If $X$ is a random variable, then the function $F$ defined on $(-\infty,\infty)$ by $F(t) = P(X \leq t)$ is called the distribution function of $X$. Since $F$ “accumulates” all of the probabilities of the values of $X$ up to and including $t$, sometimes it is called the cumulative distribution function (or CDF) of $X$. Let us compute the CDF of $X$, where $X$ is the sum of two dice rolls. Remember that de range of $X$ is $\{2,3,4, ..., 12\}$

\[ \begin{aligned} F(2) = P(X \leq 2) &= P(X = 2) = \mathbf{\frac{1}{36}} \\[1.5ex] F(3) = P(X \leq 3) &= P(\{X = 2\} \cup \{X = 3\}) \\[1.5ex] &= P(X = 2) + P(X = 3) \\[0.5ex] &= \frac{1}{36} + \frac{2}{36} = \frac{3}{36} = \mathbf{\frac{1}{12}} \\[1.5ex] F(4) = P(X \leq 4) &= P(\{X = 2\} \cup \{X = 3\} \cup \{X = 4\}) \\[1.5ex] &= P(X = 2) + P(X = 3) + P(X = 4) \\[0.5ex] &=\frac{1}{36} + \frac{2}{36}+ \frac{3}{36} = \frac{6}{36} = \mathbf{\frac{1}{6}} \\[0.5ex] \vdots \\[0.5ex] F(12) = P(X \leq 12) &= P(\bigcup^{12}_{a=2} \{X = a\}) = P(S) = \mathbf{1} \end{aligned} \] From these calculations is can be clearly seen how $F(t)$ accumulates the probabilities of events. It can also be shown graphically

The CDF of a discrete random variable $X$ — meaning that the sample space of $X$ is countable or countably infinite, e.g. $\{1,2,3,4,5,6\}$ or $\{1,2,3,...\}$ — usually looks like a step function similar to the one I have created above. The reason is that if $X$ takes a countable or countably infinite sample space it can also only have a range that is countable or countably infinite. Therefore, when we visualize a CDF of a discrete random variable there will be jumps at the points where $X$ is defined and the CDF will not look continuous or smooth. For example, we have seen that $P(X = 2) = 1/36$ and $P(X = 3) = 1/12$. Therefore $F(3-) = P(X < 3) = 1/36$, but $F(3) = P(X \leq 3) = 1/36 + 1/12 = 1/9$. The minus here indicates that $3$ itself is not included in the inequality.

CDF’s have certain important qualities that we will discuss now

F is nondecreasing: if $t < u$, then $F(t) \leq F(u)$. If we take for example $t = 2$ and $u = 3$, then the event $\{X \leq 2\} = \{(1,1)\}$ implies the event $\{X \leq 3\} = \{(1,1),(1,2),(2,1)\}$, since $(1,1)$ is an element of $\{X \leq 3\}$. Therefore, it is easy to see that for any $t < u$, $\{X \leq t\} \subseteq \{X \leq u\}$, since $F(t)$ not only accumulates probabilities, but the events associated with those probabilities also get accumulated. If $\{X \leq t\} \subseteq \{X \leq u\}$, then $P(X \leq t) \leq P(X \leq u)$. This is equivalent to $F(t) \leq F(u)$. To understand this clearly take any event $A$ and $B$. Then if $A \subseteq B$, $P(A) \leq P(B)$. Here $B = A \cup (B - A)$ and by using the fact that $A \cap (B - A) = \emptyset$, we get that $P(B) = P(A) + P(B - A) \geq P(A)$, because of axiom 1: $P(A) \geq 0$ for any event $A$.
$\mathbf{\lim_{t \to \infty} F(t) = 1}$: this statement seems very analytical and of course it is. Mathematical analysis is a tough and abstract subject and therefore I will try my best to keep any analytical proofs as concise as possible using easy to grasp language and notation. The statement can be read in the following way: when t approaches infinity, $F(t)$ converges to $1$. A sequence $\{A_n, n \geq 1\}$ of events of a sample space is called increasing if

\[ A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots \subseteq A_n \subseteq A_{n+1} \cdots \] For an increasing sequence of events $\{A_n, n \geq 1\}$, by $\lim_{n \to \infty} A_n$ we mean the event that at least one $A_i$, $1 \leq i \leq \infty$ occurs. Therefore,

\[ \lim_{n \to \infty} A_n = \bigcup^{\infty}_{n = 1} A_n \]

Then suppose $B_1 = A_1$, $B_2 = A_2 - A_1$, $B_3 = A_3 - A_2$, …, $B_n = A_n - A_{n-1}$, $\dots$. Clearly, $\{B_m, m \geq 1\}$ is a mutually exclusive set of events, since $A_{i-1} \subseteq A_i$, where $i \geq 2$. Or in more simple words, when you subtract $A_2$ from $A_3$, you basically also subtract $A_1$ from $A_3$, since it is included in $A_2$. Therefore, we get the following relations

\[ \begin{aligned} \bigcup^n_{i = 1} B_i &= \bigcup^n_{i = 1} A_i = A_n, \qquad n = 1,2,3,...,\\[1.5ex] \bigcup^{\infty}_{i = 1} B_i &= \bigcup^{\infty}_{i = 1} A_i \end{aligned} \]

Hence,

\[ \begin{aligned} P(\lim_{n \to \infty} A_n) &= P(\bigcup^{\infty}_{i=1} A_i) = P(\bigcup^{\infty}_{i=1} B_i) = \sum^{\infty}_{i = 1} P(B_i) \\ &= \lim_{n \to \infty} \sum^n_{i = 1} P(B_i) = \lim_{n \to \infty} P(\bigcup^n_{i = 1} B_i) = \lim_{n \to \infty} P(\bigcup^n_{i = 1} A_i) \\ &= \lim_{n \to \infty} P(A_n) \quad \blacksquare \end{aligned} \] Using this result we can show that for any increasing sequence $\{t_n\}$ of real numbers that converges to $\infty$

\[ \lim_{n \to \infty} P(X \leq t_n) = P(\lim_{n \to \infty}\{X \leq t_n\}) = P\left(\bigcup^{\infty}_{n=1}\{X \leq t_n \}\right) = P(X < \infty) = 1 \]

which means that

\[ \lim_{t \to \infty} F(t) = 1 \]

$\mathbf{\lim_{t \to -\infty} F(t) = 0}$. The proof of this is similar to the proof that $\mathbf{\lim_{t \to \infty} F(t) = 1}$.
F is right continuous. That is, for every $t \in \mathbb{R}$, $F(t+) = F(t)$. This means that if $t_n$ is a decreasing sequence of real numbers converging to $t$, then

\[ \lim_{n \to \infty} F(t_n) = F(t) \] To prove this, note that since $t_n$ decreases to $t$, the events $\{X \leq t_n\}$ form a decreasing sequence that converges to the event $\bigcap^{\infty}_{n=1}\{X \leq t_n\} = \{X \leq t\}$. Thus, by the continuity property of the probability function,

\[ \lim_{n \to \infty} P(X \leq t_n) = \bigcap^{\infty}_{n=1}\{X \leq t_n\} = \{X \leq t\} \]

which means that,

\[ \lim_{n \to \infty} F(t_n) = F(t). \]

7 Expectations and Variances

Expectations and Variances of Discrete Random Variables

The expected value or expectation of a random variable $X$ is a fundamental concept in probability and statistics. For discrete random variables it is a weighted average of $X$. Each possible value $x \in \{x_1,x_2,x_3,...\}$ — the discrete set of values that $X$ can take — is weighted by a so called probability mass function (pmf). Formally, the pmf, denoted by $p$, is a function from the real numbers $\mathbf{R}$ to $\mathbf{R}$ that satisfies the following properties:

\[ \begin{alignedat}{2} \textbf{(a)} \quad & p(x) = 0 \ \ \text{if } x \notin \{x_1,x_2,x_3,...\} \\[1.5ex] \textbf{(b)} \quad & p(x_i) = P(X = x_i) \ \ \text{and hence } p(x_i) \ge 0 \ (i = 1,2,3,...) \\[1.5ex] \textbf{(c)} \quad & \sum_{i=1}^{\infty} p(x_i) = 1 \end{alignedat} \]

For example, $X$ denotes the number of heads (the portrait side of a coin) in two flips of a fair coin. Here, heads and tails are equally likely to occur. Then, we know that the sample space of this experiment is

\[S = \{(T,T),(H,T),(T,H),(H,H)\},\] And therefore we can calculate $p(x)$, where $x \in \{x_1, x_2, x_3\} = \{0,1,2\}$ easily in the following way

\[ \begin{aligned} p(0) &= P(X = 0) = \frac{N(\{(T,T)\})}{N(S)} = \frac{1}{4} \\[1.5ex] p(1) &= P(X = 1) = \frac{N(\{(H,T),(T,H)\})}{N(S)} = \frac{2}{4} = \frac{1}{2} \\[1.5ex] p(2) &= P(X = 2) = \frac{N(\{(H,H)\})}{N(S)} = \frac{1}{4}, \end{aligned} \] where $N(.)$ stands for the number of possible outcomes in an event.

If $x \notin \{0,1,2\}$ then $p(x) = 0$ ($0$,$1$ and $2$ are the only possible outcomes). Lastly, $\sum^3_{i=1}p(x_i) = 1/4 + 1/2 + 1/4 = 1$. Therefore, this example satisfies the three conditions (a), (b) and (c).

Let us now visualize the pmf of $X$ graphically

On the horizontal axis the possible outcomes of $X$ are depicted ($0$,$1$ or $2$ heads) and the vertical axis shows the value of $p(x)$ associated with the possible outcomes of $X$. Theoretically speaking, there is a $1/4$ probability of obtaining zero heads, a $1/2$ probability of obtaining a single head and a $1/4$ probability of obtaining two heads when flipping two fair coins as we have seen in our calculations of $p(0)$, $p(1)$ and $p(2)$ above. If we want to compute the expectation of $X$ we simply compute the weighted average of $X$, using $p(x_i)$, $i = 1, 2, 3$ as weights. Therefore,

\[ \begin{aligned} E(X) &= 0 \times p(0) + 1 \times p(1) + 2 \times p(2) \\ &= 0 \times \frac{1}{4} + 1 \times \frac{1}{2} + 2 \times \frac{1}{4} \\ &= 1 \end{aligned} \] Suppose we have a wooden stick weighing $1$ kilogram, with the weight distributed as follows: $250$ grams on each end and $500$ grams in the middle. Where would you place your index finger to balance the stick? In other words, where is its center of gravity? It will be right in the middle. Now, suppose $250$ grams from the middle shift toward the right end. In that case, the center of gravity moves to the right as well. You can try this yourself: for example, when balancing a spoon on your index finger or thumb, you will notice that the center of gravity is not exactly in the middle but slightly toward the curved side of the spoon, because most of its weight is concentrated there.

The general formula of the expectation of $X$, $E(X)$ is given by

\[ E(X) = \sum_{x \ \in \ Range(X)} x \cdot p(x) \]

If we would repeat the experiment of tossing two fair coins an infinite amount of times (the number of repetitions $n \to \infty$) the sample average value of $X$, $\bar{X}_n$, will converge to what we call the expectation of $X$, denoted by $E(X)$. Here, $\bar{X}_n$ is simply the arithmetic mean of the outcomes of the $n$ repetitions,

\[ \bar{X}_n = \frac{1}{n} \sum^{n}_{i = 1} X_i \]

In practice, it is not possible to repeat an experiment an infinite amount of times, so let us just assume that $n$ is very large, for example $n = 10000$. I will show to you that for large $n$ the sample average $\bar{X}_n$ converges to the expectation $E(X)$ by using a programming language called R. In the following lines of code I have simulated 10000 coin flips for two fair coins and I have calculated the proportion of zero heads, one heads and two heads. Also I have computed the sample mean. It is not necessary to be able to follow every line of code, it is merely here as an illustration of how simulation can be done with the help of programming language like R.

# Making sure my results are reproducible by setting a seed
set.seed(1256)

# Creating a R function that...
simulate_coins_n <- function(n) {
  # 1. Simulates n coin tosses for two fair coins      
  coin1 <- sample(c(0,1), n, replace = TRUE, prob = c(0.5,0.5))
  coin2 <- sample(c(0,1), n, replace = TRUE, prob = c(0.5,0.5))
  
  # 2. Collects the outcomes of these n repetitions in a data frame
  fair_coins <- data.frame(coin1 = coin1, coin2 = coin2)
  
  # 3. Counts the number of (T,T), (T,H), (H,T) and (H,H) outcomes
  counts <- table(fair_coins)
  
  # 4. Calculates the proportion of zero heads, one heads and two heads
  zero_heads <- counts[1] / n
  one_heads <- (counts[2] + counts[3]) / n
  two_heads <- counts[4] / n
  
  # 5. Collects the proportions in a vector and computes the sample mean
  px <- c(zero_heads, one_heads, two_heads)
  mean_heads <- sum(px * c(0,1,2))
  
  # 6. Returns the results in a list
  return(list(px = px, mean = mean_heads))
}

# Applying the R function that we have just created to n = 10000
sims <- simulate_coins_n(10000)

# Retrieving the proportions p(x) from the list of results
sims$px

## [1] 0.2571 0.4925 0.2504

As you can see the outcomes are pretty close to what the probability mass function is showing us ($1/4$ probability for zero heads, $1/2$ probability for one heads and $1/4$ probability for two heads).

In our simulation example we can retrieve the sample average of the 10000 repetitions in the following way where the proportions of zero, one and two heads are used as weights

# We can retrieve the sample average from our list of results
X_bar <- sims$mean

#or by computing
X_bar <- sum(sims$px * c(0,1,2))

#or equivalently by computing
X_bar <- sims$px[1]*0 + sims$px[2]*1 + sims$px[3]*2

X_bar

## [1] 0.9933

Notice that this is exactly the same as computing the arithmetic mean

# Making sure my results are reproducible by setting a seed
set.seed(1256)

# Simulating 10000 coin tosses for two fair coins
coin1 <- sample(c(0,1), 10000, replace = TRUE, prob = c(0.5,0.5))
coin2 <- sample(c(0,1), 10000, replace = TRUE, prob = c(0.5,0.5))

# Calculating the sample average of the number of heads
mean(coin1 + coin2)

## [1] 0.9933

This is interesting, right? It seems that we have found a relation between the sample average and the estimated expectation, that is,

\[ \hat{E}(X) = \sum_{x \in \{0,1,2\}} x \cdot \hat{p}(x) = \frac{1}{n} \sum^{n}_{i = 1} X_i = \bar{X} \]

In order to proof this for our experiment let us define a function $f$: $\{0,1,2\} \to \mathbf{R}$, where

\[ f(x) = \#\{i: X = x\} \] For example if $n = 100$ and we have found that in 30 repetitions of the experiment we found zero heads, in 50 repetitions we found one head and in 20 repetitions we found two heads, then $f(0) = 30$, $f(1) = 50$ and $f(2) = 20$, receptively. Then, if we multiply $f(x)$ by $x$ and take the sum over the possible outcomes of $x$ we will find that

\[ \sum^2_{x=0} x \cdot f(x) = \sum^n_{i = 1} X_i, \]

which is the total number of heads in $n$ experiments. Also we have that

\[ \hat{p}(x) = \frac{f(x)}{n}, \] which is exactly what we did in the R function to calculate the proportion of zero, one and two heads.

Then, we get

\[ \hat{E}(X) = \sum^{2}_{x = 0} x \cdot \hat{p}(x) = \sum^{2}_{x = 0} x \cdot \frac{f(x)}{n} = \frac{1}{n} \sum^{2}_{x = 0} x \cdot f(x) = \frac{1}{n} \sum^n_{i = 1} X_i = \bar{X}_n \quad \blacksquare \]

This proof describes very beautifully the relation between probability theory and statistics.

Now, when we increase the number of repetitions $n$, we will see that the average number of heads in two coin flips will converge to the theoretical expectation $E(X) = 1$. This is actually one of the most fundamental laws in statistics called the Law of Large Numbers (LLN). The following figure shows how it functions

Here you can see exactly the same figure of the pmf of $X$ as in the beginning of this chapter, but I have added the results of four simulations in a darker blue color, where $n$ takes the values $10$, $100$, $1000$ and $10000$. As $n$ increases, the proportion of zero, one and two heads come closer and closer to the theoretical pmf values and the estimation of the expectation $\hat{E}(X)$ — which is none other than the sample mean — converges to the theoretical expectation $E(X) = 1$.

If you look closely at the figure, you’ll notice that for $n = 10000$ we obtained a slightly different estimate of $\hat{E}(X)$ this time, namely $0.998$ instead of $0.993$. This happens because the samples are generated using a pseudo-random number generator (PRNG). A PRNG produces sequences of numbers that depend on a starting point, called the seed. By fixing the seed before running your code, you can ensure that the exact same sequence of numbers — and therefore the same simulation results — are reproduced. However, when we run simulations for different sample sizes ($n = 10, 100, 1000, 10000$), we want to obtain independent samples. Setting the same seed each time would just regenerate the same sequence of numbers, which is not what we want. That is why the sample used for $n = 10000$ came from a different seed than before, because the initial seed had already been used for $n = 10$.

This leads to a key point: there is always sample variability in simulations. In practice, this also makes a lot of sense. Imagine tossing two fair coins $100$ times today, and then repeating the same $100$ tosses of two coins tomorrow. You would not expect to get exactly the same sequence of heads and tails both times. The PRNG is designed to mimic this phenomenon. Even though the experiment we are simulating stays the same, different random samples will give slightly different values of $\hat{E}(X)$. As $n$ grows, these values typically get closer to the true expectation, but they will never be exactly the same across independent simulations. The next example illustrates this point even more clearly.

Another way of showing the LLN in action is to create $m$ independent simulations, each consisting of $n$ repetitions of two fair coin tosses. For each simulation $j$, $j = 1, \dots, m$, we compute a sample mean $\bar{X}^{(j)}_n$ (equivalently written as $\hat{E}^{(j)}(X)$). The boxplot below shows the sample means from these $m$ simulations for different values of $n$. You can see that the values of the sample means vary quite a lot when $n$ is small, but as $n$ increases this variability becomes smaller and smaller. By the time we reach $n = 10000$, the box is so narrow that it almost looks like a single line at the true expectation $E(X) = 1$.

Suppose we want to predict the value of a discrete random variable $X$ and let $X$ be the number of heads in two coin tosses. Let us say that the prediction is the value $t$. Then, the error of the prediction is given by $\epsilon = X - t$ (the Greek letter $\epsilon$ is commonly used to denote prediction errors). For example, if we make the prediction $t = 1$ and the outcome of the experiment is $X = 0$, then the error is $ = X - t = -1$ (a negative error). And if our prediction is $t = 1$ and the outcome is $X = 2$, then the error is $\epsilon = X - t = 1$ (a positive error). For our prediction $t = 1$ we have the following probability mass function $p$ for the error $\epsilon$,

\[ \begin{aligned} p(-1) &= P(X - t = -1) = P(X = 0) = \frac{1}{4} \\[1.5ex] p(0) &= P(X - t = 0) = P(X = 1) = \frac{1}{2} \\[1.5ex] p(1) &= P(X - t = 1) = P(X = 2) = \frac{1}{4} \end{aligned} \] If we measure a quantity a very large number of times (for example 10000 tosses of two coins), the positive and negative errors of the same magnitudes will occur with equal probability. This is again the Law of Large Numbers at work. Let $\mathbf{1}_{\{X-t = k\}}$ be an indicator function that is equal to $1$ if $X-t = k$ and equal to $0$ if $X-t \neq k$. So if $n = 10000$, $\sum^n_{i= 1} \mathbf{1}_{\{X_i-t = k\}}$ simply sums up how many prediction errors are equal to $k$, where $k = -1,0,1$ in the experiment of tossing two fair coins and calculating the number of heads. Then according to the LLN,

\[ \begin{aligned} \frac{1}{n} \sum^n_{i= 1} \mathbf{1}_{\{X_i-t = 1\}} \xrightarrow[\text{n}\to\infty]{\text{LLN}} E[\mathbf{1}_{\{X_i-t = 1\}}] &= 1 \cdot P(X - t = 1) + 0 \cdot P(X - t \neq 1) = P(X - t = 1) = \frac{1}{4} \\[1.5ex] \frac{1}{n} \sum^n_{i= 1} \mathbf{1}_{\{X_i-t = 0\}} \xrightarrow[\text{n}\to\infty]{\text{LLN}} E[\mathbf{1}_{\{X_i-t = 0\}}] &= 0 \cdot P(X - t = 0) + 0 \cdot P(X - t \neq 0) = P(X - t = 0) = \frac{1}{2} \\[1.5ex] \frac{1}{n} \sum^n_{i= 1} \mathbf{1}_{\{X_i-t = -1\}} \xrightarrow[\text{n}\to\infty]{\text{LLN}} E[\mathbf{1}_{\{X_i-t = -1\}}] &= -1 \cdot P(X - t = -1) + 0 \cdot P(X - t \neq -1) = P(X - t = -1) = \frac{1}{4} \end{aligned} \] Therefore, the empirical expected value or sample mean of the error in this experiment will be close to 0, which is not very informative. That is why we might consider $E[(X - t)^2]$, the expected value of the squared error. Minimizing $E[(X - t)^2]$ with respect to $t$ gives

\[ \begin{aligned} \frac{d}{dt} E[(X - t)^2] &= \frac{d}{dt} \sum_{x \in \{x_1,x_2,x_3,...\}} (x-t)^2 \, p(x) \\[1.5ex] &= \sum_{x \in \{x_1,x_2,x_3,...\}} -2(x-t) \, p(x) \\[1.5ex] &= 2 \sum_{x \in \{x_1,x_2,x_3,...\}} t \, p(x) - 2 \sum_{x \in \{x_1,x_2,x_3,...\}} x \, p(x) \\[1.5ex] &= 2t \underbrace{\sum_{x \in \{x_1,x_2,x_3,...\}} p(x)}_{=\,1} - 2 E(X) \\[1.5ex] &= 2t - 2E(X) = 0 \\[1.5ex] & \Leftrightarrow \\[1.5ex] & t = E(X) \end{aligned} \] Thus, $\text{Var}(X) = \displaystyle \min_{t} E[(X - t)^2]$, where $t = E(X)$ is the unique minimizer of $E[(X - t)^2]$. Thus, the variance measures how spread out $X$ is around its expectation $E(X)$: the smaller the variance $\text{Var}(X)$, the more reliable the expectation is as a predictor for $X$.

Let $X$ be a discrete random variable with a set of possible values $\{x_1,x_2,x_3,...\}$, probability mass function $p(x)$ and $E(X) = \mu$. Then $\sigma_X$ and $\text{Var}(X)$, called the standard deviation and variance, respectively, are defined by

\[ \sigma_X = \sqrt{E[(X - \mu)^2]} \quad \text{and} \quad \text{Var}(X) = E[(X - \mu)^2]. \]

Then, by the law of the unconscious statistician

\[ \text{Var}(X) = E[(X - \mu)^2] = \sum_{x \in \{x_1,x_2,x_3,...\}} (x - \mu)^2 \ p(x) \]

The variance of $X$ can also be written as

\[ \begin{aligned} \text{Var}(X) &= E[(X - \mu)^2] \\ &= E[(X^2 - 2 \mu X + \mu^2)] \\ &= E[X^2] - 2 \mu E[X] + \mu^2 \\ &= E[X^2] - 2 E[X]^2 + E[X]^2 \\ &= E[X^2] - E[X]^2 \end{aligned} \] Since $\text{Var}(X) = E[(X - \mu)^2] \geq 0$ ($(X - \mu)^2$ is nonnegative number, because of the square) the result above implies that

\[ E(X^2) \geq [E(X)]^2 \] Let $X$ be a discrete random variable. Then, for constants $a$ and $b$ we have that

\[ \text{Var}(aX + b) = a^2\, \text{Var}(X),\\ \sigma_{aX + b} = |a|\, \sigma_{X} \]

Expectations and Variances of Continuous Random Variables

To understand how we can calculate probabilities for continuous random variables, for instance the probability that a random watermelon in a certain field weighs between $6$ and $9$ kilograms, we need a basic understanding of measure theory. For discrete random variables, and especially those with countable sample spaces, it was not necessary to introduce formal definitions for sets and measures. For calculating probabilities we could simply divide the number of possible outcomes in a certain event by the number of possible outcomes in the entire sample space. However, for the continuous case there is no way around it. We have seen before that the universe of probability theory consists of three key elements: the sample space $S$, the event space $\mathcal{F}$ and the probability measure $\mathbf{P}$ that assigns probabilities to all subsets in $\mathcal{F}$. More generally speaking, $(X, \mathcal{A}, \mu)$ is called a measure space, where $X$ is simply a set, $\mathcal{A}$ is a collection of measurable sets called the $\sigma$-algebra over the set $X$ and $\mu: \mathcal{A} \to [0,\infty]$ is a measure defined on $\mathcal{A}$. When we take the sample space $S$ as our set, the event space $\mathcal{F}$ as our $\sigma$-algebra over $S$ and define the measure $\mu$ as $\mathbf{P}: \mathcal{F} \to [0,1]$, we get the probability space $(S, \mathcal{F}, \mathbf{P})$ that we are familiar with. So, the probability space $(S, \mathcal{F}, \mathbf{P})$ is actually an application of a more general concept, called the measure space. If you are not familiar with measure theory, it is normal that some of the concepts above may not be immediately clear. Let us therefore step by step introduce some definitions that are important to understand the basic principles of measure theory.

Let us say we have a set $X$, not necessarily in the context of probability theory, but any set of countable (or uncountable) elements. Let’s take a look at a simple example,

\[ X = \{a,b\} \] Then the power set of $X$, denoted by $\mathcal{P}(X)$, is the set of all possible subsets that we can form out of $X$. Here, the empty set $\emptyset$ and $X$ itself are always considered as subsets of $X$. Therefore, the power set $\mathcal{P}(X)$ of this first example becomes,

\[ \mathcal{P}(X) = \{\emptyset, \{a\}, \{b\}, X \} \]

Suppose now we look at a slightly bigger set $X$, namely

\[ X = \{a,b,c,d\} \]

Then the power set $\mathcal{P}(X)$ consists of the empty set $\emptyset$, $X$ itself, the individual elements of the set, but also all combinations that we can form using these individual elements, such as $\{a,b\}$ and $\{a,b,c\}$. Therefore, we get

\[ \mathcal{P}(x) = \{\emptyset, X, \{a\}, \{b\}, \{c\}, \{d\}, \{a,b\}, \{a,c\}, \{a,d\}, \{b,c\}, \{b,d\}, \{c,d\}, \{a,b,c\}, \{a,b,d\}, \{a,c,d\}, \{b,c,d\} \} \] The power set grows exponentially by the number of elements in $X$. When $X$ contains just $2$ elements, the power set consists of $4$ subsets. When $X$ contains $4$ elements, the power set consists of $16$ subsets. If you remember the chapter on combinations it is easy to verify that the number of subsets in the power set above is,

\[ \binom{4}{0} + \binom{4}{1} + \binom{4}{2} + \binom{4}{3} + \binom{4}{4} = \sum^{4}_{k=0}\binom{4}{k} = 2^4 = 16 \] At first we choose the empty set $\emptyset$ (choosing zero elements out of a set containing four elements), secondly we choose individual elements of the form $\{x\}$, we choose subsets of two elements $\{x_1, x_2\}$, subsets of three elements $\{x_1, x_2, x_3\}$ and lastly the set $X$ itself (choosing four elements out of a set containing four elements). This results in a sum, which can be evaluated by using the binomial expansion theorem that states,

\[ (x + y)^n = \sum^n_{k=0}\binom{n}{k}x^{n-k}y^k, \] for any $n \geq 0$.

Filling in $x=1$, $y=1$ and $n=4$ we get

\[ 2^4 = (1 + 1)^4 = \sum^4_{k=0} \binom{4}{k}1^{n-k}1^k = \sum^4_{k=0} \binom{4}{k} \] In general, we have that the power set of a set $X$ containing $n$ elements consists of $\sum^n_{k=0} \binom{n}{k} = 2^n$ subsets. Note that for a collection or family of subsets a different style of letters is used, namely of the form $\mathcal{A}, \mathcal{B}$ or as for the power set, $\mathcal{P}$. This is all done to make it perfectly clear when we are speaking about a set of individual elements and when we speak about a collection of subsets. For a mathematician, it is essential to remain constantly aware that the same notation, for example, $X$, can carry different meanings in different contexts. In probability theory, $X$ typically denotes a random variable, that is, as we will come to understand later, a measurable function defined on a probability space. In measure theory, however, $X$ might simply represent a set on which measures are defined. Thus, flexibility of thought is indispensable for understanding mathematical texts.

Now suppose we have defined a certain measure on a set $X$. We may in practice only be able to measure a certain collection of subsets and not to the entire power set $\mathcal{P}(X)$. That collection of measurable subsets for now we call $\mathcal{A}$. Then, $\mathcal{A} \subseteq \mathcal{P}(X)$ is called a $\sigma$-algebra if

a. $\emptyset, X \in \mathcal{A}$
b. if $A \in \mathcal{A}$, then $A^c := (X - A) \in \mathcal{A}$
c. $A_i \in \mathcal{A}$, $i \in \mathbb{N}$, then $\bigcup^{\infty}_{i=1} A_i \in \mathcal{A}$

Then, any set $A \in \mathcal{A}$ is called a measurable set with respect to the $\sigma$-algebra $\mathcal{A}$. Now, it is not so difficult to show that if $(\mathcal{A}_i)_{i \in I}$ are $\sigma$-algebras on $X$, where $I$ is just an arbitrary index set, for example $\{1,2,3,4\}$ or $\mathbb{N}$, then the intersection of each $\sigma$-algebra $\mathcal{A}_i$

\[ \displaystyle \bigcap_{i \in I} \mathcal{A}_i, \] is also a $\sigma$-algebra on the set $X$.

Proof:
Let $(\mathcal{A}_i)_{i \in I}$ be a collection of $\sigma$-algebras on the set $X$, with $I$ being a arbitrary index set. Now, we know from the definition of a $\sigma$-algebra that

\[ \emptyset, S \in \mathcal{A}_i \ \text{for each} \ i \in I \ \text{and therefore} \ \emptyset, S \in \displaystyle \bigcap_{i \in I} \mathcal{A}_i, \] Now, let us take an arbitrary

\[ A \in \displaystyle \bigcap_{i \in I} \mathcal{A}_i, \ \text{then} \ A \in \mathcal{A}_i \ \text{for each} \ i \in I \] Because $(\mathcal{A}_i)_{i \in I}$ are $\sigma$-algebras we get that

\[ A^c \in \mathcal{A}_i \ \text{for each} \ i \in I \ \text{and therefore} \ A^c \in \displaystyle \bigcap_{i \in I} \mathcal{A}_i \] Lastly, suppose we have a sequence $(A_j)_{j \in \mathbb{N}}$, then if

\[ A_j \in \displaystyle \bigcap_{i \in I} \mathcal{A}_i, \ \text{for every} \ j \in \mathbb{N}, \ \text{then for each} \ i \in I, \ A_j \in \mathcal{A}_i \ \text{for every} \ j \in \mathbb{N} \] Because each $\mathcal{A}_i$ is a $\sigma$-algebra we have that

\[ \bigcup^{\infty}_{j=1} A_j \in \mathcal{A}_i \ \text{for each} \ i \in I \ \text{and therefore} \ \bigcup^{\infty}_{j=1} A_j \in \displaystyle \bigcap_{i \in I} \mathcal{A}_i \] Thus, $\displaystyle \bigcap_{i \in I} \mathcal{A}_i$ satisfies all three conditions of a $\sigma$-algebra and hence itself is a $\sigma$-algebra. $\quad \blacksquare$

For a collection of subsets $\mathcal{M} \subseteq \mathcal{P}(S)$ — where $\mathcal{M}$ is not a $\sigma$-algebra — there is a smallest $\sigma$-algebra that contains $\mathcal{M}$,

\[ \sigma(\mathcal{M}) := \displaystyle \bigcap_{\substack{ \mathcal{A}_i \text{ is a σ-algebra on } X \\ \mathcal{M} \subseteq \mathcal{A}_i}} \mathcal{A}_i \] From our last proof we know that the intersection of $\sigma$-algebras is itself a $\sigma$-algebra. The only difference here is that each $\sigma$-algebra should contain $\mathcal{M}$. Suppose again we have the example where $X = \{a, b, c, d\}$. Let us then take a collection of subsets $\mathcal{M} = \{a, b\}$ and create the smallest possible $\sigma$-algebra with $\mathcal{M}$, that is $\sigma(\mathcal{M})$. We know there are three conditions a., b. and c. that need to be fulfilled. So the first obvious step would be to add the empty set $\emptyset$ and $X$ itself

\[ \sigma(\mathcal{M}) = \{\emptyset, X, \{a\}, \{b\}, \, ...\}, \] where the dots ($...$) indicate that we have not reached a $\sigma$-algebra yet. The most efficient next step would be to add the countable unions of every subset in the set above, which just gives

\[ \sigma(\mathcal{M}) = \{\emptyset, X, \{a\}, \{b\}, \{a,b\}, \, ... \}, \] since, $\emptyset \cup X = X$, $X \cup \{a\} = X$ etc. Then, the last step would be to add the complements of all the subset, yielding the final result

\[ \sigma(\mathcal{M}) = \{\emptyset, X, \{a\}, \{b\}, \{a,b\}, \{b,c,d\}, \{a,c,d\}, \{c,d\}\}. \] Remember, however, that continuous random variables do not have countable sample spaces $S$. Therefore, we need to consider measurable spaces where $S$ contains infinitely many elements. Suppose that $S = \mathbb{R} = (-\infty,\infty)$. In this case, the entire power set $\mathcal{P}(\mathbb{R})$ can not be taken as measurable, since not every subset of $\mathbb{R}$ is measurable. On the other hand, the non-measurable subsets of $\mathbb{R}$ are quite unusual and are rarely encountered in practical applications. In the context of data science, they are essentially irrelevant. Nevertheless, it is important to know that we cannot define a measure on the full power set of $\mathbb{R}$, even though we often work with measures (like the Lebesgue measure that we will discuss later) that cover all practically relevant subsets.

A natural choice for a $\sigma$-algebra on $\mathbb{R}$ is the so-called Borel $\sigma$-algebra, denoted by $\mathcal{B}(\mathbb{R})$. This is the $\sigma$-algebra generated by all the open intervals in $\mathbb{R}$ of the form $(a,b)$, with $a < b$. Here, each subset in $\mathcal{B}(\mathbb{R})$ is called a Borel set. Because $\mathcal{B}(\mathbb{R})$ is a $\sigma$-algebra, it is closed under complements, countable unions, and countable intersections. In order to see that a $\sigma$-algebra is also closed under countable intersections we need to revise De Morgan’s first law. Suppose we choose two subsets $A_1$ and $A_2$ of a $\sigma$-algebra $\mathcal{A}$. Then, by the definition of a $\sigma$-algebra, $A_1^c$ and $A_2^c$ are also contained in $\mathcal{A}$. Because a $\sigma$-algebra is closed under countable unions, we get that $A_1^c \cup A_2^c \in \mathcal{A}$. By using again that $\sigma$-algebras are closed under complements and using De Morgan’s first law we get that

\[ (A_1^c \cup A_2^c)^c = A_1 \cap A_2 \in \mathcal{A} \] Thus, the Borel $\sigma$-algebra on $\mathbb{R}$ also contains: complements of open intervals, for instance, sets of the form $(-\infty, a] \cup [b,\infty)$, countable unions and intersections of open intervals, singletons, closed intervals of the form $[a,b]$ and more complex sets built from complements, countable unions and intersections.

In practice, the Borel $\sigma$-algebra contains all sets that are relevant for real-valued continuous random variables, and together with $\mathbb{R}$ it forms the standard measurable space for probability theory on $\mathbb{R}$.

Now, it is time to define what exactly a measure is. Let $(X, \mathcal{A})$ be a measurable space. Then, the map $\mu: \mathcal{A} \to [0,\infty)\cup\{\infty\} = [0,\infty]$ is called a measure if it satisfies:

a. $\mu(\emptyset) = 0$
b. $\mu(\bigcup^{\infty}_{i=1} A_i) = \sum^{\infty}_{i=1} \mu(A_i) \ \text{with} \ A_i \cap A_j = \emptyset, \, i \neq j \ \text{for all} \ A_i \in \mathcal{A}$ ($\sigma$-additivity)

These conditions may look very familiar if you look back at the three most important axioms in probability theory. Further, note that $\bigcup^{\infty}_{i=1} A_i$ is always part of a $\sigma$-algebra (see part c. of the definition of a $\sigma$-algebra) and therefore it is always measurable.

Now, we can define $(X, \mathcal{A}, \mu)$ as the measure space. Note that it does not matter for the definition if we interchange the letter $\mathcal{A}$ with the letter $\mathcal{F}$ to indicate that the $\sigma$-algebra that we are using corresponds to the event space in probability theory. The same holds for interchanging $X$ with the sample space $S$. Then, we have the measure space $(S, \mathcal{F}, \mu)$. The last step in order to reach the probability space that we are familiar with is to get a better understanding of the measure $\mu$.

Ideally we would like to have a measure $\mu$ on the entire power set of $\mathcal{P}(\mathbb{R})$ that satisfies the following properties:

a. $\mu([a,b]) = b-a, \, b>a$
b. $\mu(x + A) = \mu(A), \, A \in \mathcal{P}(\mathbb{R}) \ \text{and} \ x \in \mathbb{R}$ (translation invariance)

However, for more than 100 years it is known that this measure problem is not solvable for the entire power set $\mathcal{P}(\mathbb{R})$. The famous Vitali set that Giuseppe Vitali introduced in 1905 serves as a counterargument.

The Lebesgue measure — introduced by Henri Lebesgue in 1902 — defined on the Borel $\sigma$-algebra is a measure that satisfies a. and b. and it is indicated by the letter $\lambda$. We have that

1. $\lambda([a,b]) = b-a, \, b>a$
2. $\lambda(x + A) = \lambda(A), \, A \in \mathcal{B}(\mathbb{R}) \ \text{and} \ x \in \mathbb{R}$

Some examples of the Lebesgue measure are

\[ \begin{align*} \lambda([0,1]) &= 1 - 0 = 1 \\[1.5ex] \lambda(1+[0,1]) &= \lambda([1,2]) = 2 - 1 = 1 \\[1.5ex] \lambda(\{1\}) &= 0 \\[1.5ex] \lambda(\{1,2,3,4,5\}) &= \lambda(\{1\}) + \lambda(\{2\}) + \lambda(\{3\}) + \lambda(\{4\}) + \lambda(\{5\}) = 5 \times 0 = 0 \\[1.5ex] \lambda(\mathbb{N}) &= \lambda(\bigcup^{\infty}_{i=1} \{i\}) = \sum^{\infty}_{i=1} \lambda(\{i\}) = 0 \end{align*} \] And now we arrive at the final theorem that brings everything together. This theorem provides a rigorous framework for computing probabilities of continuous random variables, connecting measure theory, probability measures, and the so-called probability density function in a single elegant result.

Let $(X, \mathcal{A})$ be a measurable space, and let $\nu$ and $\mu$ be measures defined on $(X, \mathcal{A})$. If $\nu$ is absolutely continuous with respect to $\mu$ (denoted as $\nu \ll \mu$), that is

\[ \mu(A) = 0 \quad\Rightarrow \quad \nu(A) = 0 \quad \text{for all} \ A \in \mathcal{A}, \] then there exists a measurable function $f: X \to [0,\infty)$ such that for every $A \in \mathcal{A}$,

\[ \nu(A) = \int_A f \, d \mu \] The function $f$ is called the Radon-Nikodym derivative or density of $\nu$ with respect to $\mu$ and is denoted by,

\[ f = \frac{d\nu}{d\mu} \] In probability theory we can apply the Radon-Nikodym theorem to the case that $\nu$ equals the probability measure $\mathbf{P}$ and $\mu$ equals the Lebesgue measure $\lambda$. In order to have that $\mathbf{P} \ll \lambda$, we need that for a measurable space $(S, \mathcal{F})$, where $S$ is the outcome space of an experiment and $\mathcal{F}$ is the event space, that

\[ \lambda(A) = 0 \quad \Rightarrow \quad \mathbf{P}(A) = 0 \quad \text{for all} \ \text{events} \ A \in \mathcal{F} \] Therefore, for continuous random variables we cannot measure single points or countable sets of outcomes, since the Lebesgue measure of such sets is equal to 0. If $\mathbf{P} \ll \lambda$, we get that there exists a measurable function $f: S \to [0,\infty)$ such that for every $A \in \mathcal{F}$,

\[ P(A) = \int_A f \, d \lambda \] The function $f$ is now called the probability density function (pdf) of $\mathbf{P}$ with respect to $\lambda$ and is denoted by,

\[ f = \frac{d\mathbf{P}}{d\lambda} \] Because $[a,b]$ is a Borel set and therefore measurable by $\mathbf{P}$, and $f$ is continuous on $\mathbb{R}$ we have that

\[ P([a,b]) = \int^b_a f(x) \, d\lambda(x) = \int^b_a f(x) \, dx \] Here, the more general Lebesgue integral and the Riemann-integral that we know from high school coincide.

Similar to the discrete case, the expectation of a continuous random variable $X$ can be calculated as

\[ E(X) = \int^{\infty}_{-\infty} x f \, dx \] Moreover, the variance and standard deviation can be calculated as

\[ \begin{align*} \text{Var}(X) &= E(X^2) - [E(X)]^2 \\ &= \int^{\infty}_{-\infty} x^2 f \, dx - \left[\int^{\infty}_{-\infty} x f \, dx\right]^2, \end{align*} \] and the standard deviation $\sigma_X$ (not to be confused with the $\sigma$-algebras) is equal to the square root of the variance, that is,

\[ \sigma_X = \sqrt{\text{Var}(X)} \] Let us now recall the watermelon example. Let $X$ denote the weight of a random watermelon in kilograms (now $X$ is again a random variable). Suppose we know that the expected value $E(X)$ of a watermelon is approximately 8 kg and that the standard deviation $\sigma_X$ is 1.5 kg. Then, the probability distribution function (pdf) of $X$ may look like the following figure

Here is an example of how we can calculate the probability that a random watermelon weighs between $6$ and $9$ kilograms. As we know from measure theory, the probability measure can be calculated by integrating the probability density function over a certain interval. And as you may remember from high school, integrating corresponds to calculating the area under the curve. This is represented by the following figure.

So the area under the curve and thereby the probability of obtaining a watermelon between $6$ and $9$ kilograms is

\[ \int^9_6 f(x) \, dx \] The only ingredient we are missing here is of course the function $f$ itself. In the next chapter we will learn a lot about common probability density functions and I will provide some examples of calculating probabilities for continuous random variables.

We know from measure theory and by simply looking at the figures above that for continuous random variables we can not compute the probability of a single point, since the area under the PDF for a single point is zero. Therefore, the probability of a single point will always be zero. However, given a small number $\delta > 0$, we can compute the probability of a continuous random variable being in the neighborhood of a certain point $a$, that is $P([a - \delta, a + \delta])$. This gives some intuition behind the value of $f(x)$.

If $\delta$ is sufficiently small, then the area under the curve can be approximated by $f(a) \cdot \Delta x$ and $f(b) \cdot \Delta x$, where $\Delta x = 2 \cdot \delta$. Therefore, as we can see in the figure above, it is more likely to find a watermelon in the neighborhood of $9$ kilograms than to find one in the neighborhood of $6$ kilograms.

8 Common Distributions

In this chapter I will discuss three common discrete and three common continuous distributions that are widely used in data science. We start with the discrete case.

Bernoulli distribution

In this section we will finally discuss the Bernoulli trial, which is perhaps the simplest type of random variable. A Bernoulli trial has only two possible outcomes: success (s) or failure (f). Therefore, the sample space always contains two points, s and f

\[ S = \{s,f\} \] Let us define the random variable $X$ as a Bernoulli random variable if it has the following two possible outcomes

\[ X(s) = 1 \ \text{and} \ X(f) = 0 \] If $p$ is the probability of a success, then we can define $1-p$ as the probability of a failure. Hence, the probability mass function of $X$ is as follows

\[ p(x) = \begin{cases} 1-p & \text{if} \ x = 0 \\ p & \text{if} \ x = 1 \\ 0 & \text{otherwise} \end{cases} \] In the figure below I give two examples

The cumulative distribution function of a Bernoulli random variable is as follows

\[ F(x) = \begin{cases} 0 & \text{if} \ x < 0 \\ 1-p & \text{if} \ 0 \leq x < 1 \\ 1 & \text{if} \ x \geq 1 \\ \end{cases} \] Below I have displayed two examples of cdf’s of Bernoulli random variables

The last thing we need to know in order to understand Bernoulli random variables to its fullest is to calculate its moments, that is, the expectation of $X$ or also called the first moment, $E(X)$, the second moment $E(X^2)$, the variance $\text{Var}(X)$ and the standard deviation $\sigma_X$. This is fairly straightforward if we use the definitions that we have discussed in the chapter on expectations and variances of discrete random variables.

\[ \begin{align*} E(X) &= 0 \cdot P(X = 0) + 1 \cdot P(X = 1) \\ &= 0 \cdot (1-p) + 1 \cdot p \\ &= p \\[1.5ex] E(X^2) &= 0^2 \cdot P(X = 0) + 1^2 \cdot P(X = 1) \\ &= 0 \cdot (1-p) + 1 \cdot p \\ &= p \\[1.5ex] \text{Var}(X) &= E(X^2) - [E(X)]^2 \\ &= p - p^2 \\ &= p\,(1-p) \\[1.5ex] \sigma_X &= \sqrt{p\,(1-p)} \end{align*} \] I will end every section by showing an example or completing an exercise from Ghahramani’s book.

If in a throw of a fair die the event of obtaining $4$ or $6$ is called a success, and the event of obtaining $1,2,3$, or $5$ is called a failure, then describe the probability mass function, the expected value and the variance of such a random variable.

Let $X$ be the number of eyes rolled in a single throw of a fair dice and let $X$ be a Bernoulli random variable. Then,

\[ X = \begin{cases} 1 & \text{if a} \ 4 \ \text{or} \ 6 \ \text{is thrown} \\ 0 & \text{otherwise} \end{cases} \] The pmf of $X$ is then given by

\[ p(x) = \begin{cases} 2/3 & \text{if} \ x = 0\\ 1/3 & \text{if} \ x = 1\\ 0 & \text{elsewhere}\\ \end{cases} \] Now, the expectation and variance of $X$ are simply given by

\[ E(X) = p = 1/3, \ \text{Var}(X) = 1/3(1-1/3) = 2/9. \] As we begin to explore, modeling a random variable with a common, well-known probability distribution is highly convenient, since the expectation, variance, and higher moments are already derived. Therefore, in practical applications, we typically start by determining whether such a distribution is appropriate for the situation at hand.

Binomial distribution

The Binomial random variable naturally arises from the Bernoulli random variable. Let $X_1, X_2, X_3,...$ be a sequence of Bernoulli random variables. If, for all $j_i \in \{0,1\}$, the sequence of events $\{X_1 = j_1\}, \{X_2 = j_2\}, \{X_3 = j_3\},..$ are independent, we say that $\{X_1,X_2,X_3,...\}$ and the corresponding Bernoulli trials are independent. Consider an experiment where $n$ Bernoulli trials are performed independently. The sample space, $S$, of such an experiment is the set of different orderings of length $n$ with $x (x = 0,1,...,n)$ successes (s’s) and $n-x$ failures (f’s). For example, if, in an experiment, three Bernoulli trials are performed independently, then the sample space is

\[ \{fff, sff,fsf,ffs,fss,sfs,ssf,sss\} \] Here, using our knowledge about combinations we can see that there are

\[ \binom{3}{0} = 1, \ \binom{3}{1} = 3,\ \binom{3}{2} = 3,\ \binom{3}{3} = 1 \] ways of obtaining zero, one, two and three successes. If $n$ Bernoulli trials all with probability of success $p$ are performed independently, then $X$, the number of successes, is one of the most important random variables. It is called a binomial with parameters $n$ and $p$. The set of possible values $X$ is $\{0,1,2,...,n\}$, it is described on the set $S$ described previously, and its probability mass function is given by

\[ p(x) = P(X=x) = \begin{cases} \binom{n}{x}\,p^x\,(1-p)^{n-x} & \text{if} \ x = 0,1,2,...,n \\ 0 & \text{elsewhere} \end{cases} \] Suppose we take the example of three coin tosses. Then the sample space is

\[ \{TTT, HTT,THT,TTH,THH,HTH,HHT,HHH\} \] If we call heads (H) as a success and tails (T) as a failure, we can calculate the probability of obtaining zero, one, two and three successes as follows

\[ \begin{align*} p(0) &= P(X=0) = \binom{3}{0} (1/2)^0\cdot(1/2)^3 = (1/2)^3 = 0.125 \\ p(1) &= P(X=1) = \binom{3}{1} (1/2)^1\cdot(1/2)^2 = 3 \cdot (1/2)^3 = 3 \cdot 0.125 = 0.375 \\ p(2) &= P(X=2) = \binom{3}{2} (1/2)^2\cdot(1/2)^1 = 3 \cdot (1/2)^3 = 3 \cdot 0.125 = 0.375 \\ p(3) &= P(X=3) = \binom{3}{3} (1/2)^3\cdot(1/2)^0 = (1/2)^3 = 0.125 \\ \end{align*} \]

Note, that this is equivalent to using the formula

\[ P(A) = \frac{\# \text{elements in}\ A}{\# \text{elements in}\ S}, \] that we have used previously to calculate probabilities.

$p(x)$ is called binomial, because the binomial expansion theorem guarantees that it is a probability mass function

\[ \sum^n_{x=0} p(x) = \sum^n_{x=0} \binom{n}{x} p^x(1-p)^{n-x} = [(1-p) + p]^n = 1^n = 1. \]

A few examples of pmf’s of Binomial random variables are displayed below

The cumulative distribution function of a Binomial random variable is given by \[ F(x) = \begin{cases} 0 & x < 0 \\[2mm] \displaystyle \sum_{k=0}^{\lfloor x \rfloor} \binom{n}{k} p^k (1-p)^{\,n-k} & 0 \le x \le n \\[1mm] 1 & x > n, \end{cases} \] where $\lfloor x \rfloor$ rounds $x$ down to an integer. Below, a few examples

\[ \begin{align*} E(X) &= \sum^n_{x=0} x \cdot P(X = x) \\ &= \sum^n_{x=0} x \cdot \binom{n}{x}\,p^{x}(1-p)^{n-x} \\ &= \sum^n_{x=1} x \cdot \binom{n}{x}\,p^{x}(1-p)^{n-x} \quad \color{blue}{\left[0 \cdot \binom{n}{0}\,p^0(1-p)^n = 0 \cdot (1-p)^n = 0\right]} \\ &\stackrel{*}{=} \sum^n_{x=1} n \cdot \binom{n-1}{x-1}\,p^{x}(1-p)^{n-x} \\ &= \sum^n_{x=1} n \cdot \binom{n-1}{x-1}\,p\,p^{x-1}(1-p)^{(n-1)-(x-1)} \quad \color{blue}{\left[p^x = p^{1 + (x-1)} = p\,p^{x-1}\right]} \\ &= np \sum^n_{x=1} \cdot \binom{n-1}{x-1}\,p^{x-1}(1-p)^{(n-1)-(x-1)} \\ &= np \sum^{n-1}_{y=0} \cdot \binom{n-1}{y}\,p^{y}(1-p)^{(n-1)-y}\\ &= np [(1-p) + p]^{n-1} \quad \color{blue}{\left[\text{Binomial expansion theorem}\right]}\\ &= np \end{align*} \] The (*) equality holds because

\[ \begin{align*} x \cdot \binom{n}{x} &= x \cdot \frac{n!}{x!\,(n-x)!} \\ &= \color{red}{x} \cdot \frac{n!}{\color{red}{x}\,(x-1)!\,(n-x)!} \\ &= \frac{n(n-1)!}{(x-1)![(n-1)-(x-1)]!} \quad \color{blue}{\left[n-x = n - 1 + 1 - x = (n-1) - (x-1)\right]} \\ &= n \cdot \frac{(n-1)!}{(x-1)![(n-1)-(x-1)]!} \\ &= n \cdot \binom{n-1}{x-1} \end{align*} \] A similar derivation holds for the second moment of $X$

\[ \begin{align*} E(X^2) &= \sum^n_{x=0} x^2 \cdot P(X = x) \\ &= \sum^n_{x=0} x^2 \cdot \binom{n}{x}\,p^{x}(1-p)^{n-x} \\ &= \sum^n_{x=0} [x(x-1)+x] \cdot \binom{n}{x}\,p^{x}(1-p)^{n-x} \quad \color{blue}{\left[x^2 = x(x-1)+x\right]}\\ &= \sum^n_{x=0} x(x-1) \cdot \binom{n}{x}\,p^{x}(1-p)^{n-x} + \sum^n_{x=0} x \cdot \binom{n}{x}\,p^{x}(1-p)^{n-x} \\ &= \sum^n_{\color{blue}{x=2}} x(x-1) \cdot \binom{n}{x}\,p^{x}(1-p)^{n-x} + \color{blue}{np} \\ &\stackrel{*}{=} \sum^n_{x=1} n(n-1) \cdot \binom{n-2}{x-2}\,p^{x}(1-p)^{n-x} + np\\ &= \sum^n_{x=2} n(n-1) \cdot \binom{n-2}{x-2}\,p^2\,p^{x-2}(1-p)^{(n-2)-(x-2)} + np \quad \color{blue}{\left[p^x = p^{2 + (x-2)} = p^2\,p^{x-2}\right]} \\ &= n(n-1)p^2 \sum^n_{\color{blue}{y=0}} \binom{n-2}{\color{blue}{y}}\,p^{\color{blue}{y}}(1-p)^{(n-2)-\color{blue}{y}} + np \\ &= n(n-1)p^2 \cdot [(1-p) + p]^{n-2} + np \quad \color{blue}{\left[\text{Binomial expansion theorem}\right]}\\ &= n(n-1)p^2 + np \end{align*} \]

The (*) equality holds because

\[ \begin{align*} x(x-1) \cdot \binom{n}{x} &= x(x-1) \cdot \frac{n!}{x!\,(n-x)!} \\ &= \color{red}{x(x-1)} \cdot \frac{n!}{\color{red}{x(x-1)}\,(x-2)!\,(n-x)!} \\ &= \frac{n(n-1)(n-2)!}{(x-2)![(n-2)-(x-2)]!} \quad \color{blue}{\left[n-x = n - 2 + 2 - x = (n-2) - (x-2)\right]} \\ &= n(n-1) \cdot \frac{(n-2)!}{(x-2)![(n-2)-(x-2)]!} \\ &= n(n-1) \cdot \binom{n-2}{x-2} \end{align*} \]

Therefore, the variance of $X$ is given by

\[ \text{Var}(X) = E(X^2) - [E(X)]^2 = n(n-1)p^2 + np - n^2p^2 = np - np^2 = np(1-p) \] From an ordinary deck of 52 cards, cards are drawn at random and with replacement. What is the probability that, of the first eight cards drawn, four are spades?

Let $X$ be a Binomial random variable with $n = 8$ and $p = 1/4$, that is

\[ X \sim \text{Bin}(8, 1/4) \] Here, we consider obtaining a spades card as a success. Since there are four suits and the cards are drawn with replacement, the probability of obtaining a spades card for eac independent draw is $1/4$. $8$ cards are drawn and therefore $n=8$. To obtain the probability that four cards are spades (four successes) we can simply fill in the pmf of $X$

\[ P(\text{Four spades}) = P(X = 4) = \binom{8}{4}(1/4)^4(1-1/4)^{8-4} \approx 0.087 \]

Poisson distribution

The French mathematician Simeon-Denis Poisson introduced in 1837 the following procedure to obtain the formula that approximates the binomial probability mass function when the number of trials is large ($n \to \infty$), the probability of success is small ($p \to 0$) and the average number of successes remains a constant value ,$np = \lambda$, that is not extremely large or small. Note that when $n$ get large, $p$ has to take smaller and smaller values to obtain a value of $\lambda$ that is not too extreme. Let $X$ be a binomial random variable with parameters $(n,p)$, then

\[ \begin{align*} P(X = x) &= \binom{n}{x}p^x(1-p)^{n-x} = \frac{n!}{(n-x)!\,x!} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n-x} \quad \color{blue}{\left[p = \lambda/n\right]} \\[1.5ex] &= \frac{n(n-1)(n-2)\cdots(n-x+1)}{x!} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n-x} \quad \color{blue}{\left[ \frac{n!}{(n-x)!} = \,_{n}P_{x} = n(n-1)(n-2)\cdots(n-x+1) \right]} \\[1.5ex] &= \frac{n(n-1)(n-2)\cdots(n-x+1)}{x!} \frac{\lambda^x}{n^x} \left(1 - \frac{\lambda}{n}\right)^{n-x} \quad \color{blue}{\left[ (\frac{a}{b})^x = \underbrace{\frac{a}{b} \cdot \frac{a}{b} \cdots \frac{a}{b}}_{x \ \text{terms}} = \frac{a \cdot a \cdots a}{b \cdot b \cdots b} = \frac{a^x}{b^x} \right]} \\[1.5ex] &= \frac{n(n-1)(n-2)\cdots(n-x+1)}{x!} \frac{\lambda^x}{n^x} \frac{\left(1-\frac{\lambda}{n}\right)^n}{\left(1-\frac{\lambda}{n}\right)^x} \quad \color{blue}{\left[ a^{n-x} = a^n \cdot a^{-x} = \frac{a^n}{a^x} \right]} \\[1.5ex] &= \frac{n(n-1)(n-2)\cdots(n-x+1)}{n^x} \frac{\lambda^x}{x!} \frac{\left(1-\frac{\lambda}{n}\right)^n}{\left(1-\frac{\lambda}{n}\right)^x} \quad \color{blue}{\left[ \frac{a}{c} \cdot \frac{b}{d} = \frac{a \cdot b}{c \cdot d} = \frac{a \cdot b}{d \cdot c} = \frac{a}{d} \cdot \frac{b}{c} \right]} \end{align*} \] Then, we have that

\[ \frac{n(n-1)(n-2)\cdots(n-x+1)}{n^x} = \prod^{x-1}_{k=0} \frac{n-k}{n} = \prod^{x-1}_{k=0} 1 - \frac{k}{n} \to 1, \\ \] as $n \to \infty$;

\[ \left(1-\frac{\lambda}{n}\right)^x \to 1, \] as $n \to \infty$; and

\[ \left(1-\frac{\lambda}{n}\right)^n \to e^{-\lambda}, \] as $n \to \infty$. The last result can be derived from a famous result from calculus that states

\[ \left(1+\frac{1}{n}\right)^n \to e, \] as $n \to \infty$.

Therefore, as $n \to \infty$

\[ P(X = x) \to \frac{e^{-\lambda}\lambda^x}{x!} \] Three years after his book was published, Poisson died. It is regrettable that Simeon-Denis did not live to witness the profound influence of his work on probability theory. Although his Recherches sur la probabilité des jugements (1837) introduced what would later be known as the Poisson distribution, the significance of his approximation remained largely unacknowledged for several decades. It was not until 1889 that the German-Russian mathematician L. V. Bortkiewicz demonstrated its theoretical and practical value, most famously through his study of fatal horse-kick accidents among soldiers in the Prussian army. Poisson’s ideas, well ahead of their time, were thus recognized only long after his death.

Bortkiewicz argued that since

\[ \begin{align*} \sum^{\infty}_{x=0} \frac{e^{-\lambda}\lambda^x}{x!} &= e^{-\lambda} \sum^{\infty}_{x=0} \frac{\lambda^x}{x!} \\[1.5ex] &\stackrel{*}{=} e^{-\lambda}e^{\lambda} \\[1.5ex] &= e^{-\lambda+\lambda} \\[1.5ex] &= e^0 \\[1.5ex] &= 1, \end{align*} \] Poisson’s approximation by itself is a probability mass function. The (*) equality follows from the fact that there are two equivalent ways to define the number $e$, namely

\[ \displaystyle \lim_{n \to \infty} \left(1 + \frac{1}{n} \right)^n = \sum^{\infty}_{x=0} \frac{1}{x!} = e, \] which can be proven by using the binomial expansion theorem that we have seen before and the dominated convergence theorem (for those who are interested).

Let $X$ be a discrete random variable with possible values $0, 1, 2, 3,...$, then $X$ is called Poisson with parameter $\lambda$, $\lambda > 0$, if

\[ P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0,1,2,3,... \] Below I displayed a few examples of Poisson pmf’s with different parameters $\lambda$,

The cumulative distribution function of a Poisson random variable is given by \[ F(x) = \begin{cases} 0 & x < 0 \\[2mm] \displaystyle \sum_{k=0}^{\lfloor x \rfloor} \frac{e^{-\lambda}\lambda^k}{k!} & x \geq 0 \end{cases} \] where $\lfloor x \rfloor$ rounds $x$ down to an integer. Below, a few examples

It is reasonable to think that the expected value of a Poisson random variable is $\lambda$, since the Poisson pmf is an approximation of the binomial random variable, whose expected value is $np = \lambda$. Let us prove it algebraically,

\[ \begin{align*} E(X) &= \sum^{\infty}_{x=0} x \, P(X=x) \\ &= \sum^{\infty}_{\color{blue}{x=1}} x \,\frac{e^{-\lambda} \lambda^x}{x!} \\ &= \lambda e^{-\lambda} \sum^{\infty}_{x=1} \frac{\color{red}{x}\,\lambda^{x-1}}{\color{red}{x}(x-1)!} \\ &= \lambda e^{-\lambda} \sum^{\infty}_{x=1} \frac{\lambda^{x-1}}{(x-1)!} \\ &= \lambda e^{-\lambda} \sum^{\infty}_{\color{blue}{i=0}} \frac{\lambda^{\color{blue}{i}}}{\color{blue}{i}!} \quad \color{blue}{\left[i = x-1\right]} \\ &= \lambda e^{-\lambda} e^{\lambda} \\ &= \lambda \end{align*} \] The second moment of $X$ follows from a similar procedure in addition with the intelligent application of a derivative

\[ \begin{align*} E(X^2) &= \sum^{\infty}_{x=0} x^2 \, P(X=x) \\ &= \sum^{\infty}_{\color{blue}{x=1}} x^2 \,\frac{e^{-\lambda} \lambda^x}{x!} \\ &= \lambda e^{-\lambda} \sum^{\infty}_{x=1} \frac{\color{red}{x}\cdot x\,\lambda^{x-1}}{\color{red}{x}(x-1)!} \\ &= \lambda e^{-\lambda} \sum^{\infty}_{x=1} \frac{1}{(x-1)!} \color{blue}{\frac{d}{d\lambda}}(\lambda^x) \\ &= \lambda e^{-\lambda} \color{blue}{\frac{d}{d\lambda}} \left[\sum^{\infty}_{x=1} \frac{\lambda^x}{(x-1)!}\right] \\ &= \lambda e^{-\lambda} \frac{d}{d\lambda} \left[\lambda \sum^{\infty}_{x=1} \frac{\lambda^{x-1}}{(x-1)!}\right] \\ &= \lambda e^{-\lambda} \frac{d}{d\lambda}(\lambda e^{\lambda}) \\ &= \lambda e^{-\lambda} (e^{\lambda} + \lambda e^{\lambda}) \\ &= \lambda + \lambda^2 \end{align*} \]

And finally, this gives us the variance of $X$

\[ \text{Var}(X) = E(X^2) - [E(X)]^2 = \lambda + \lambda^2 - \lambda^2 = \lambda, \quad \sigma_X = \sqrt{\lambda}. \] Suppose that $3$% of the families in a large city have an annual income of over $60,000. What is the probability that, of $\,60$ random families, at most three have an annual income of over $60,000?

Let $X$ be the number of families with an annual income of $60,000. $X$ is then Binomially distributed with parameters $n = 60$ and $p = 0.03$. Since $n$ is large, $p$ is small and $np = 60 \times 0.03 = 1.8$ is of a moderate value, we can approximate $X$ by a Poisson distribution with parameter $\lambda = np = 1.8$. The probability that at most three families have an annual income of $60,000 is therefore given by

\[ P(X \leq 3) = F(3) = \sum_{k=0}^{3} \frac{e^{-\lambda}\lambda^k}{k!} \approx 0.161 \]

To conclude, there is a 16.1% probability that at most three families have an annual income of $60.000.

As we have seen in this section, binomial probabilities can be approximated by Poisson probabilities. Such approximations are generally good if $p<0.1$ and $np = \lambda \leq 10$. If $\lambda > 10$, it would be more appropriate to use the normal approximation that will be discussed in the coming sections.

Normal distribution

The normal or Gaussian distribution is a very essential distribution that is widely used in practical applications such as data science and econometrics. As we will see later, under certain conditions the normal distribution models averages very well regardless of the underlying distribution of the observations or the outcomes of a number of trials. One of the first applications — the modeling of errors of observations in astronomy — was given by Gauss in 1809 and therefore the normal distribution is sometimes called after him. The probability density function is given by the following expression

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty \] or equivalently to improve readability

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \text{exp}\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \quad -\infty < x < \infty \] Here $\mu$ and $\sigma$ represent the expected value of the random variable $X$, $E(X)$, and the standard deviation of $X$, $\sigma_X$, respectively.

Below I show a figure of three normal distributions with different values of $\mu$ and $\sigma$. The case where $\mu = 0$ and $\sigma = 1$ is called the standard normal distribution.

$f(x)$ for these different values of $\mu$ and $\sigma$ are pdf’s, since the integral over the whole range of $x \in (-\infty, \infty)$ evaluates to one. For example, it can be proven using polar coordinates — for those who are interested — that the integral,

\[ I = \int^{\infty}_{-\infty}\text{exp}\left[-\frac{1}{2}x^2\right] dx = \sqrt{2\pi} \] Therefore, for the standard normal distribution where $\mu = 0$ and $\sigma = 1$ we have that

\[ \frac{1}{\sqrt{2\pi}}\int^{\infty}_{-\infty}\text{exp}\left[-\frac{1}{2}x^2\right] dx = \frac{1}{\sqrt{2\pi}} \cdot \sqrt{2\pi} = 1. \]

Then, the same can be proven for $y = (x - \mu)/2\sigma^2$, simply by a change of variables, since $\mu$ and $\sigma$ are “known” constants. I put known between parenthesis, because in practical applications $\mu$ and $\sigma$ usually need to be estimated and in fact they remain unknown constants. Nevertheless, since they are constants, known or unknown, we can use the change of variables trick to evaluate the integral above for different values of $\mu$ and $\sigma$.

The Central Limit Theorem (CLT) is perhaps the most important result in the history of probability theory and statistics when it comes to practical applications like data science. What makes the CLT so useful is that no assumption about the distribution of a family of random variables $(X_i)_{i=1...n}$ is required. We just need that $(X_i)_{i=1...n}$ are independently and identically distributed and that the variance of $X_1$, $\text{Var}(X_1) < \infty$. This is equivalent to saying that the variances of $X_2, X_3, X_4 ... X_n$ also exist, since $(X_i)_{i=1...n}$ are identically distributed. This further means that the first and second moment, $E(X_1)$ and $E(X_1^2)$ exist, because the variance follows from these two quantities. Remember,

\[ \text{Var}(X_1) = E(X_1^2) - [E(X_1)]^2 \] So, if both of these quantities are finite, the variance of $X_1$ is also finite. The CLT tells us something about the averages of random variables and therefore let us derive the expectation and variance of such an average. Since $E(X_1) = E(X_2) = E(X_3) = ... = E(X_n)$ and since expectations have the additivity property $E(X_1 + X_2 + X_3 + ... + X_n) = E(X_1) + E(X_2) + E(X_3) + ... + E(X_n)$ we get that the expectation of the average of a family of random variables is,

\[ E(\bar{X}_n) = E(\frac{1}{n}\sum^n_{i=1} X_i) = \frac{1}{n}\sum^n_{i=1}E(X_i) = \frac{1}{n} \sum^n_{i=1} E(X_1) = \frac{1}{n} \cdot n \,E(X_1) = E(X_1) \]

Because $(X_i)_{i=1...n}$ are independently distributed (covariances between the random variables can be neglected) and identically distributed, the variance is derived as follows

\[ \text{Var}(\bar{X}_n) = \text{Var}(\frac{1}{n}\sum^n_{i=1} X_i) = \frac{1}{n^2} \sum^n_{i=1} \text{Var}(X_i) = \frac{1}{n^2} \sum^n_{i=1} \text{Var}(X_1) = \frac{1}{n^2} \cdot n \, \text{Var}(X_1) = \frac{\text{Var}(X_1)}{n} \] It is interesting to note that as $n$ increases the variance of the average of $n$ random variables decreases. This means that as $n$ increases the distribution of the averages becomes thinner and thinner. The CLT now tells us that if we standardize $\bar{X}_n$, its distribution converges to the standard normal distribution. Standardizing means that we subtract the mean and divide the whole by the standard deviation of $\bar{X}_n$, which gives

\[ \frac{\bar{X}_n - E[X_1]}{\frac{\sigma_{X_1}}{\sqrt{n}}} \stackrel{d}{\to} N(0,1) \] as $n \to \infty$. Or more formally, let

\[ Z_n = \frac{\bar{X}_n - E[X_1]}{\frac{\sigma_{X_1}}{\sqrt{n}}} \] Then,

\[ P(Z_n \leq z) = \frac{1}{\sqrt{2\pi}} \int^z_{-\infty} e^{-\frac{1}{2}z^2}dz = \Phi(z) \] as $n \to \infty$, where $\Phi(z)$ is the cumulative distribution function of a standard normal distribution. This result is immensely powerful, because no assumptions have to be made about the underlying distribution of $X_1, X_2, X_3, ..., X_n$. We just need that they are independently and identically distributed and that $\text{Var}(X_1) < \infty$.

To illustrate the CLT, let’s suppose that we are playing a board game where it is favourable to throw six eyes on a dice. Since there are six possible outcomes in this “experiment” and each of them is equally likely to occur, the probability $p$ of obtaining six eyes is equal to $1/6$. Suppose we roll the dice $n$ times, then $X_1, X_2, X_3, ..., X_n$ are all Bernoulli distributed with success probability $p = 1/6$. Or we suppose that $X = X_1 + X_2 + X_3 + ... + X_n$ is Binomially distributed with number of trials $n$ and success probability $p = 1/6$. Both of these descriptions of the experiment are equivalent. Note that $X_1, X_2, X_3, ..., X_n$ are independently and identically distributed. Moreover, $\text{Var}(X_1) = p(1-p) = 1/6 \cdot 5/6 = 5/36 < \infty$. Therefore, the two conditions for the CLT in this specific example are satisfied. Now, I will use the programming language R again to simulate the above example and to show how the CLT works.

#We set a seed in order to produce reproducible results
set.seed(123)

#The random variable X can take two possible values: failure (0) and success (1)
x <- c(0,1)
#The succes probability p is equal to 1/6
p <- 1/6
#We throw a dice n times
n <- 2000
#We repeat this experiment of throwing n dice m times
m <- 2000

#The outcomes are given in a n x m matrix
outcomes <- replicate(m, sample(x,n,replace = TRUE, prob = c(1-p,p))) 

#We take the average of each of the m columns
averages <- apply(outcomes, 2, mean)

#We standardize the averages
standardized_averages <- (averages - p)/sqrt(p*(1-p)/n)

#Print the first 10 standardized averages
head(standardized_averages,10)

##  [1] -0.98  1.12 -1.34 -0.92 -0.86 -0.26  1.00 -1.22  0.10 -2.60

The Quantile-Quantile plot (Q-Q plot) below shows that indeed $Z_n$’s distribution lies very close to a standard normal distribution. The plot compares the quantiles of the theoretical standard normal distribution (the x-axis) with the sample quantiles of the standardized averages that we computed from our $m$ simulations (the y-axis). If the sample quantiles lie approximately on a straight line (the blue line) it means that the distributions are similar. The CLT states that $Z_n$’s distribution will ultimately converge to a standard normal distribution as $n$ approaches infinity.

For those that are not so familiar with quantiles, I will give a short introduction. Remember the definition of the cumulative distribution function

\[ F(x) = P(X \leq x) \] The cdf $F(x)$ calculates the probability that the random variable $X$ takes on a value less than or equal to $x$. But what if we would like to ask the reverse question. Given that $F(x) = p$, what is the corresponding value of $x$? This can be calculated by taking the inverse of the cdf $F(x)$ on both sides of the equation $F(x) = p$

\[ \begin{align*} F^{-1}[F(x)] &= F^{-1}(p) \quad\Leftrightarrow \\ x &= F^{-1}(p) \end{align*} \] The quantiles of the distribution of a random variable $X$ are often denoted by $q$ and can also be interpreted as that $p \cdot 100\%$ of the distribution lies below $q$. To make this idea clearer, I show the inverse of the standard normal cdf, where compared to the cdf of $N(0,1)$, the x-axis and y-axis are simply inverted.

As we know from a standard normal distribution, 50% of the distribution lies below 0, 95% of the distribution lies below 1.64, 97.5% of the distribution lies below 1.96 and almost the entire distribution lies below 3. This corresponds perfectly to the figure above.

Lastly, we will discuss an application of the CLT for the Binomial distribution, the De Moivre La Place theorem. Let $X_1 + X_2 + X_3 + \dots + X_n$ be Bernoulli random variables with success probability $p$. Then,

\[ \displaystyle \lim_{n \to \infty} P\left(\frac{X_1 + X_2 + X_3 + \dots + X_n - np}{\sqrt{np(p-1)}} < z \right) = \frac{1}{\sqrt{2\pi}} \int^z_{-\infty} e^{-z^2/2} \, dz. \]

Again, we need the two conditions that $X_1, X_2, X_3, ..., X_n$ are independently and identically distributed and $\text{Var}(X_1) < \infty$. Note that, since the sum of $n$ Bernouilli random variables each with parameter $p$ is equal to a Binomial random variable with parameters $n$ and $p$, the expression above is equivalent to

\[ \displaystyle \lim_{n \to \infty} P\left(\frac{\bar{X}_n - p}{\sqrt{\frac{p(p-1)}{n}}} < z \right) = \frac{1}{\sqrt{2\pi}} \int^z_{-\infty} e^{-z^2/2} \, dz, \]

where $\bar{X}_n = 1/n \sum^n_{i=1} X_i$, $p = E(X_1)$ and $\sqrt{p(p-1)/n} = \sigma_{X_1}$. The first expression shows the normal approximation of the binomial distribution and the latter expression shows exactly how we have applied the CLT in our simulation example. But in fact they are equivalent.

Exponential distribution

The exponential distribution is another highly important distribution in applied probability, which originates from Poisson processes. Suppose that, starting from a time point labeled $t = 0$, we begin counting the number of events. Then for each value of $t$, we obtain a number denoted by $N(t)$, which is the number of events that have occured during $[0,t]$. Clearly, for each value of $t$, $N(t)$ os a discrete random variable with the set of possible values $\{0,1,2,...\}$

Lognormal distribution

9 Bivariate Distributions

10 Multivariate Distributions

11 More Expectations and Variances

12 Sums of Independent RV’s

References

Ghahramani, Saeed. 2016. Fundamentals of Probability with Stochastic Processes. 3rd ed. Chapman & Hall/CRC, Taylor & Francis Group.

The Basics of Probability Theory

Making the foundational principles behind Data Science accessible to non-statisticians

R. B. van der Huizen
BSc in Econometrics & Operations Research (cum laude)
Vrije Universiteit Amsterdam

2025-12-13

1 Introduction

2 Three key ingredients

3 Permutations and Combinations

4 Conditional Probability (*)

5 Independence (*)

6 Random Variables (RV’s)

7 Expectations and Variances

8 Common Distributions

9 Bivariate Distributions

10 Multivariate Distributions

11 More Expectations and Variances

12 Sums of Independent RV’s

References

The Basics of Probability Theory

Making the foundational principles behind Data Science accessible to non-statisticians

R. B. van der Huizen BSc in Econometrics & Operations Research (cum laude) Vrije Universiteit Amsterdam

2025-12-13

1 Introduction

2 Three key ingredients

3 Permutations and Combinations

4 Conditional Probability (*)

5 Independence (*)

6 Random Variables (RV’s)

7 Expectations and Variances

8 Common Distributions

9 Bivariate Distributions

10 Multivariate Distributions

11 More Expectations and Variances

12 Sums of Independent RV’s

References

R. B. van der Huizen
BSc in Econometrics & Operations Research (cum laude)
Vrije Universiteit Amsterdam