Objective: This webpage aims to explain the statistical methods I use as a data scientist in a clear and simple way, so that even people without a background in statistics can understand them. It also shows how these methods can support your data science needs - this includes answering questions, solving problems and making well-informed decisions based on your data.

1 The Palmer Penguins Dataset

To illustrate the statistical methods that are important for my work as a Data Scientist I will use the Palmer Penguins dataset. This dataset contains size measurements for three penguin species observed in the Palmer Archipelago, a group of islands near the northwestern coast of Antarctica. The data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program (Horst, Hill, and Gorman 2020). Meet the Chinstrap, Gentoo and Adelie penguins!

The first step in any data science project is to load the data - in this case penguin data - into a working environment using your preferred programming language. My working environment is RStudio and my preferred, or I would say favourite, programming language is R. The R language was developed in the early to mid 1990s by statistician Ross Ihaka and statistician and bioinformaticist Robert Gentleman at the University of Auckland and is in my opinion one of the best pieces of software for applying statistical methods and performing data analysis. Because I am so enthusiastic about R I will show “chunks” or pieces of code on this webpage, so that you can follow along with what I am doing at every step. Without a basic understanding of the R programming language it might not be possible to understand every piece of code that I will present, but the most important aim here is to show the usefulness of R and how my work can help you with your data science needs.

Here is the R code that is needed to load the Palmer Penguins dataset and to show the first 5 rows of the data in three simple steps.

#Loading the palmer penguins library
library(palmerpenguins)

#Loading the palmer penguins dataset
data(penguins) 

#Showing the first 5 rows of the data
head(penguins, 5) 
## # A tibble: 5 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## # ℹ 2 more variables: sex <fct>, year <int>

If you would like to get a basic understanding of the R programming language which I will be using to find new exciting insights in your data or if you would like to know how data exploration methods can help you with your data science needs, then please click on the following link (data exploration methods). Otherwise, we will enter the fascinating realm of statistics!

2 What is Statistics?

Statistics aims to explain certain characteristics of a population. In the example of penguins, we could say that all the Adelie penguins together form a population. There are currently approximately 10 million individual Adelie penguins in Antarctica and on surrounding islands, which is comparable to the population of Sweden. There will certainly not be enough researchers to take beak or body mass measurements from the entire Adelie population, and some penguin colonies may live in places that are not easily accessible. Therefore, researchers look at a sample, a smaller yet representative group of Adelie penguins, that can explain certain characteristics of the whole population. This is exactly what the Palmer Penguins researchers were aiming for. They collected measurements from a small sample of Adelie penguins in order to learn more about the Adelie penguin in general and how it differs from other penguin species. Let’s see how many Adelie penguins there are in this dataset using the R language.

#Loading a very handy library for transforming datasets
library(dplyr) 

#Creating a new dataset containing the Adelie penguins
Adelie <- penguins %>% filter(species == "Adelie")

#Showing the number of Adelie penguins
nrow(Adelie) 
## [1] 152

Hence, we have 152 Adelie penguins in our penguin dataset. While learning about statistics we cannot avoid mathematical notation; however, I will try to keep it as simple as possible.

One of the measurements that the Palmer Penguins researchers took is body mass. Let \(X_{1}, X_{2}, X_{3}, \ldots, X_{152}\) be the body mass measurements that the researchers took from the 152 Adelie penguins. I will show the first 10 measurements, that is, \(X_{1}, X_{2}, X_{3}, \ldots, X_{10}\). The body mass measurements were taken in grams.

#Showing the body mass measurements for the first 10 penguins
head(Adelie$body_mass_g, 10)
##  [1] 3750 3800 3250   NA 3450 3650 3625 4675 3475 4250

As we can see in the output above (you can ignore the “## [1]” part before the body mass values), there is a penguin for which the body mass value is missing. In R this is denoted by NA (Not Available). How to deal with missing values in a dataset is a whole different topic, so for simplicity and for the sake of our discussion on statistical methods I will simply remove this penguin. Because we know it is the fourth penguin in the dataset, it can be removed in the following way.

#Removing the fourth penguin in the dataset
Adelie <- Adelie[-4,] 

This leaves us with 151 Adelie penguins and their corresponding measurements.
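Removing a row by its position works here because we know where the missing value sits. As a hedged alternative, the sketch below finds missing values by content with is.na(), using only the ten body mass values printed above. The names body_mass and body_mass_complete are my own and are not part of the dataset.

```r
# First ten Adelie body masses in grams, copied from the output above
body_mass <- c(3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250)

# Find the position(s) of missing values instead of hard-coding them
missing_positions <- which(is.na(body_mass))
missing_positions

# Keep only the measurements that were actually observed
body_mass_complete <- body_mass[!is.na(body_mass)]
length(body_mass_complete)
```

Applied to the full data frame, the same idea would read Adelie[!is.na(Adelie$body_mass_g), ], which should leave the same 151 penguins without relying on the row number.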

In mathematics the capital letter \(X\) (or any other capital letter that one prefers) refers to a random variable. The randomness here comes from the fact that if the researchers were to take many samples of Adelie penguins, for example a hundred samples of 151 Adelie penguins out of the total population of approximately 10 million Adelie penguins, each sample would be different. Here we assume for simplicity that the penguins were not tagged, so the researchers cannot trace back the same group or sample of penguins. Suppose we have a hundred different samples of 151 Adelie penguins, then we would also have a hundred different sets of 151 body mass measurements. Therefore, there is uncertainty in what the values of \(X_{1}, X_{2}, X_{3}, \ldots, X_{151}\) will be. In order to get a better idea of which values are more likely and which values are less likely to occur, we can count the number of times values occur in a representative sample and form a so-called distribution. A simple way to do this is to create a histogram. Let’s do exactly that!

3 Histograms

In order to explain what a histogram is, I will first simply create one. This can be done in R using the following lines of code.

#Creating a histogram with the Adelie penguin data
with(Adelie, hist(body_mass_g, 
                  breaks = 20,   
                  col = "#FF8C00", 
                  border = "white",
                  main = "Histogram of body mass (Adelie)",
                  xlab = "body mass (in grams)",
                  ylab = "frequency count"))

The histogram above contains 20 bars - or bins as they are formally called - of equal width, namely 100 grams of body mass. The first bin counts the number of Adelie penguins that have a body mass greater than 2800 and smaller than or equal to 2900 grams, that is, a body mass that is contained in the interval (2800,2900]. The second bin counts the number of penguins with a body mass in (2900,3000], the third one in (3000,3100], etc. The frequency count (how many penguins fall in each interval) is represented by the height of each bar. I will verify with the following code that this is exactly the case.

#Counting for each of the 20 bins how many penguins fall in each interval
freq_counts <- sapply(1:20, function(i) Adelie %>% filter(body_mass_g <= 2900 + 100*(i-1) & body_mass_g > 2800 + 100*(i-1)) %>% count) 

#Printing the result in a suitable format
as.numeric(freq_counts) 
##  [1]  5  4  6 10  7 11 16 12 10 15 10 10  5  5 10  4  4  2  3  2

Compare the frequency counts that I calculated above with the height of each bar in the histogram. They exactly correspond to the figure, right?
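The same kind of check can also be sketched with base R only. The example below is an illustration under my own assumptions: it uses just the nine non-missing body masses printed earlier (not the full sample), and the names x, breaks, counts_cut and counts_hist are hypothetical. It compares the counts that hist() computes with the counts obtained by cutting the values into intervals of 100 grams.

```r
# Nine observed Adelie body masses in grams, taken from the earlier output
x <- c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250)

# Break points every 100 grams between 2800 and 4800
breaks <- seq(2800, 4800, by = 100)

# Counts per interval via cut(): intervals are closed on the right,
# and the very first interval also includes its lower end point
counts_cut <- as.vector(table(cut(x, breaks = breaks,
                                  right = TRUE, include.lowest = TRUE)))

# Counts per interval as computed by hist(), without drawing a figure
counts_hist <- hist(x, breaks = breaks, plot = FALSE)$counts

# Both approaches should agree bin by bin
all(counts_cut == counts_hist)
```

Using plot = FALSE makes hist() return its bin counts directly, which is a convenient way to inspect what a histogram is actually showing.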

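For this histogram I specifically asked for 20 bins by specifying the breaks = 20 argument, but actually one can choose any number of bins. Note that R treats a single number passed to breaks as a suggestion and may adjust it to obtain round break points, so the number of bars drawn can differ from the number requested. As can be seen in the following figure, the width of each bin and the shape of the histogram naturally change with the number of bins.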

#Determining the positioning of each figure and setting the margins around each figure
par(mfrow = c(2, 2), mar = c(5, 4, 2, 1)) 

#Listing four examples of bins options
nbins_list <- c(5, 10, 20, 40) 

#Defining a function that creates a histogram with a certain number of bins
my_hist <- function(nbins) {
  with(Adelie, hist(body_mass_g,
                    breaks = nbins,
                    col = "#FF8C00",
                    border = "white",
                    main = paste(nbins, "bins"),
                    xlab = "body mass (g)",
                    ylab = "frequency count"))
}

#Creating four histograms with each a different number of bins
for(nbins in nbins_list) my_hist(nbins)

#Resetting the positioning of figures for future use
par(mfrow = c(1, 1))

Then, when looking at these histograms, a natural question arises: “How many bins represent the data best?” Statisticians have developed several rules to determine an optimal number of bins, one of the simplest and most popular being Sturges’ Rule. According to this rule, \[n_{\text{bins}} = \lceil \log_2(n) + 1 \rceil,\]

where \(n\) is the number of data sample points.

In our example, \(n = 151\) (since we have a sample of 151 Adelie penguins), which gives:

\[n_{\text{bins}} = \lceil \log_2(151) + 1 \rceil = 9\]

You might not be very familiar with logarithms, and that’s fine. The key intuition is that taking a base-2 logarithm (\(\log_2\)) of a large number (e.g., 1000) yields a much smaller number (about 10). As a result, even when the dataset grows much larger, the recommended number of bins increases only slowly, which helps keep the histogram both informative and easy to read.
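To make the slow growth concrete, here is a small sketch that evaluates Sturges' Rule for a few sample sizes. The helper name sturges_bins and the sample sizes other than 151 are my own choices for illustration; nclass.Sturges() is the helper that ships with base R (package grDevices).

```r
# Sturges' Rule as a small helper function (hypothetical name)
sturges_bins <- function(n) ceiling(log2(n) + 1)

# Sample sizes: our 151 penguins, a thousand and a million data points
n_values <- c(151, 1000, 1e6)
bins <- sturges_bins(n_values)
bins
## [1]  9 11 21

# Base R offers the same rule; it only looks at the length of the vector
grDevices::nclass.Sturges(numeric(151))
## [1] 9
```

Going from 151 to a million data points only raises the recommended number of bins from 9 to 21.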

Here is the code for the histogram that uses Sturges’ Rule.

#Calculating the outcome of Sturges' Rule
sturges <- ceiling(log2(nrow(Adelie)) + 1) 

#Creating a histogram where the number of bins is according to Sturges' Rule
with(Adelie, hist(body_mass_g, 
                  breaks = seq(2800, 4800, length.out = sturges+1),
                  col = "#FF8C00",
                  border = "white",
                  main = "Sturges' Rule",
                  xlab = "body mass (in grams)",
                  ylab = "frequency count"))

Then, finally, I will explore what is possible with histograms using a more sophisticated R visualization package called ggplot2. Here, I show the body mass distributions of all three penguin species in a single figure. To make the differences between the species more visible, the frequency counts for each species are stacked on top of each other. The count for Chinstrap (in purple) is stacked on top of the count for Gentoo (in green), and the count for Adelie (in orange) is stacked on top of the count for Chinstrap. In other words, the raw counts we saw in the simpler histograms earlier are now stacked vertically. To find the raw count for a specific species in a given bin, you simply look at that species’ segment within the stack - for example, to find the raw count for Adelie, you simply look at the orange part.

#Loading the ggplot2 library
library(ggplot2)

#Creating a ggplot histogram
g = ggplot2::ggplot(penguins, aes(x = body_mass_g, fill = species))
g = g + geom_histogram(position = "stack", binwidth = 100)
g = g + labs(title = "Histogram of body mass (Adelie, Chinstrap and Gentoo)", x = "body mass (in grams)", y = "stacked count")
g = g + scale_fill_manual(values = c("Adelie" = "#FF8C00", "Chinstrap" = "#A020F0", "Gentoo" = "#008B8B"))
g

4 Mean, Median and Percentiles

The theory of histograms develops naturally into the concepts of mean, median and percentiles. First of all, there is the mean, that is, the average value of the data, also called the center of mass. It is calculated by summing up all the values of a variable of interest and dividing the total by the number of data points. In statistics it is represented by a capital letter (in this example we chose \(X\)) with a bar, that is, a horizontal line, on top: \(\bar{X}\). In mathematical notation we get

\[\bar{X} = \frac{1}{n}\sum_{i = 1}^{n}X_i,\]

where \(n\) is the number of data points.
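As a small sketch of this formula, the code below computes the mean step by step and compares it with R's built-in mean(). It assumes only the nine non-missing body masses printed earlier, so the result differs from the mean of the full sample; the names x, n and x_bar are my own.

```r
# Nine observed Adelie body masses in grams, taken from the earlier output
x <- c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250)

# Number of data points
n <- length(x)

# The formula: sum all values and divide by the number of data points
x_bar <- sum(x) / n
x_bar

# The built-in function gives the same result
all.equal(x_bar, mean(x))
```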

Let’s calculate the mean of the Adelie penguin body mass and show it visually in the histogram we created before.

with(Adelie, hist(body_mass_g, 
                  breaks = seq(2800, 4800, length.out = sturges+1),
                  col = "#FF8C00",
                  border = "white",
                  main = expression("Calculating the sample mean (" * bar(X) * ")"),
                  xlab = "body mass (in grams)",
                  ylab = "frequency count"))

with(Adelie, abline(v = mean(body_mass_g), lwd = 2))

mtext(expression(bar(X) == 3701),
      side = 1,     
      line = 0.1,      
      cex = 0.9,
      adj = 0.45)

The black line that appeared in the center of the histogram above represents the mean or the center of mass. Perhaps the name, center of mass, makes much more sense now, since the body mass values are centered around the mean. The mean is equal to 3701 grams (rounded to the nearest whole number).

The median, on the other hand, is the point below which 50% of the data lies. If the distribution is perfectly symmetric the median and the mean will be exactly the same; however, if the distribution of a random variable is skewed towards one side, the median and the mean will differ from one another. To calculate the median we need to follow a few steps. The first step is to sort the data from the lowest to the highest value, which can easily be done in R with the sort() function. In statistics the ordered data would be represented by \(X_{(1)}, X_{(2)},X_{(3)},...,X_{(151)}\). Here \(X_{(1)}\) represents the lowest value and \(X_{(151)}\) the highest value. Because \(n = 151\) is odd, the median is simply \(X_{((n+1)/2)} = X_{((151+1)/2)} = X_{(76)}\), that is, the 76th number in the series of sorted body masses (for an even \(n\) one would take the average of the two middle values). This turns out to be 3700, which is very close to the mean that we calculated. This suggests that the body mass data is fairly symmetric, which makes it convenient to analyze.
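The sorting steps described above can be sketched in a few lines of R. Again I only assume the nine non-missing body masses printed earlier, so the value found here is the median of those nine penguins and not of the full sample; the names x_sorted and median_by_hand are hypothetical.

```r
# Nine observed Adelie body masses in grams, taken from the earlier output
x <- c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250)

# Step 1: sort the data from the lowest to the highest value
x_sorted <- sort(x)
n <- length(x_sorted)

# Step 2: pick the middle value (odd n) or average the two middle values (even n)
median_by_hand <- if (n %% 2 == 1) {
  x_sorted[(n + 1) / 2]
} else {
  mean(x_sorted[c(n / 2, n / 2 + 1)])
}
median_by_hand
## [1] 3650

# The built-in function agrees
median(x)
## [1] 3650
```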

The median is also called the 50th percentile or the 0.5 quantile. Percentiles and quantiles are essentially synonyms and describe the portion of the data that lies below its value. For the 50th percentile or 0.5 quantile it means that 50% of the sorted data points lie below it and for example for the 25th percentile or 0.25 quantile it means that 25% of the sorted data lies below its value. For our penguin data I will calculate a number of percentiles or quantiles with the R function quantile().

quantiles <- with(Adelie, quantile(body_mass_g, probs = seq(0,1,0.25)))

quantiles
##   0%  25%  50%  75% 100% 
## 2850 3350 3700 4000 4775

Naturally, the 0th percentile or 0 quantile is the minimum value of the data (0% of the data lies beneath it) and the 100th percentile or 1 quantile is the maximum value (100% of the data lies beneath it). We can see in the output above that the 50th percentile or 0.5 quantile nicely corresponds to the median that we calculated earlier on. Let’s visualize it nicely in a histogram.

with(Adelie, hist(body_mass_g, 
                  breaks = seq(2800, 4800, length.out = sturges+1),
                  col = "#FF8C00",
                  border = "white",
                  main = expression("Calculating sample percentiles or sample quantiles (" * X[alpha] * ")"),
                  xlab = "body mass (in grams)",
                  ylab = "frequency count"))

with(Adelie, abline(v = quantiles, lwd = 2))

mtext(expression(X[min] == 2850),
      side = 1,     
      line = 0.1,      
      cex = 0.8,
      adj = 0)

mtext(expression(X[0.25] == 3350),
      side = 1,     
      line = 0.1,      
      cex = 0.8,
      adj = 0.25)

mtext(expression(X[0.50] == 3700),
      side = 1,     
      line = 0.1,      
      cex = 0.8,
      adj = 0.44)

mtext(expression(X[0.75] == 4000),
      side = 1,     
      line = 0.1,      
      cex = 0.8,
      adj = 0.62)

mtext(expression(X[max] == 4775),
      side = 1,     
      line = 0.1,      
      cex = 0.8,
      adj = 1.0)

5 Variance and Standard Deviation

The variance and standard deviation measure the spread of the data. Are there a lot of penguins with a body mass around the mean value of 3701 grams? Or are there, on the contrary, more penguins with a much bigger or smaller body mass than the average value? The spread or variance is a very important characteristic of data, because it tells us a lot about the shape of a distribution. Let’s slowly work towards the variance in mathematical notation. First of all, if we want to know how much \(X_1\), the body mass of the first Adelie penguin, deviates from the mean value we would calculate

\[X_1 - \bar{X} = 49 \ \text{grams},\] (rounded to the nearest whole number).

However, for example

\[X_3 - \bar{X} = -451 \ \text{grams},\]

(rounded to the nearest whole number). Because there will be negative numbers (penguins with a body mass lower than the average) we cannot simply add up all the deviations: the positive and negative deviations would cancel each other out. First we have to make the negative numbers positive. One way to do that is to take the square, which gives us

\[(X_1 - \bar{X})^2, (X_2 - \bar{X})^2, (X_3 - \bar{X})^2 \ \text{etc.},\]

that is, the squared deviations from the sample mean. The sample variance is defined as the sum of all the squared deviations divided by the number of data points \(n\) minus one. In mathematical notation we get

\[s_{X}^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2\] This will give us a measure for the average squared deviation from the mean. Here \(\frac{1}{n-1}\) is used instead of \(\frac{1}{n}\), which perhaps does not seem so intuitive. The reason for the minus one I will explain in a later section about one of the most important laws in statistics, the Law of Large Numbers (LLN).

The sample standard deviation is simply the square root of the sample variance

\[s_X = \sqrt{s_{X}^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2}\]
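The two formulas above can be traced step by step in R. This is a minimal sketch that assumes only the nine non-missing body masses printed earlier, so the numbers differ from those of the full sample; the names deviations, s2 and s are my own.

```r
# Nine observed Adelie body masses in grams, taken from the earlier output
x <- c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250)
n <- length(x)

# Deviations from the sample mean; their plain sum is (numerically) zero
deviations <- x - mean(x)
sum(deviations)

# Sample variance: sum of squared deviations divided by n - 1
s2 <- sum(deviations^2) / (n - 1)

# Sample standard deviation: square root of the sample variance
s <- sqrt(s2)

# R's built-in var() and sd() use the same n - 1 definition
all.equal(s2, var(x))
all.equal(s, sd(x))
```

The sum of the raw deviations being zero is exactly why they are squared before being added up.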

Here I should make a clear distinction between the variance of the population and the sample variance, and the same holds for the standard deviation of a population and the sample standard deviation. The variance and standard deviation of the population are unknown, because as I wrote in the section “What is Statistics?”, in most cases we are unable to observe the entire population. That is why \(s_{X}^2\) and \(s_{X}\) refer to the variance and standard deviation of the sample that we are studying. The variance and standard deviation of the population would be \(\sigma_{X}^2\) and \(\sigma_X\), respectively. It is very important to keep this distinction in mind, which is why our mathematical notation uses different symbols for the two.

The topics of variance and percentiles (or quantiles) are actually closely related. For example, if we look again at the histogram of Adelie penguin body masses where we drew vertical lines for the 0%, 25%, 50%, 75% and 100% quantiles, we observe that the distance between the 25th percentile and 75th percentile is relatively small: 50% of the penguins have a body mass between 3350 and 4000 grams (by definition 75% of the data will lie below the 75th percentile, therefore 50% of the data will be contained in between the 25th and 75th percentile). That means that 50% of the data points lie relatively close to the mean value. Thus the deviations from the mean are relatively small and we expect the sample variance and sample standard deviation also to be relatively small. Let’s find out!

Adelie_mean <- mean(Adelie$body_mass_g)
Adelie_sd <- sd(Adelie$body_mass_g)
sd_from_mean <- Adelie_mean + c(-1,1)*Adelie_sd

with(Adelie, hist(body_mass_g, 
                  breaks = seq(2800, 4800, length.out = sturges+1),
                  col = "#FF8C00",
                  border = "white",
                  main = expression("Calculating the sample standard deviation (" * s[X] * ")"),
                  xlab = "body mass (in grams)",
                  ylab = "frequency count"))

with(Adelie, abline(v = c(Adelie_mean, sd_from_mean), lwd = 2, col = "lightblue"))

mtext(expression(bar(X) - s[X] == 3242),
      col = "lightblue",
      side = 1,     
      line = 0.20,      
      cex = 0.8,
      adj = 0.20)

mtext(expression(bar(X) == 3701),
      col = "lightblue",
      side = 1,     
      line = 0.1,      
      cex = 0.8,
      adj = 0.45)

mtext(expression(bar(X) + s[X] == 4159),
      col = "lightblue",
      side = 1,     
      line = 0.20,      
      cex = 0.8,
      adj = 0.70)

The sample standard deviation of the Adelie penguin body mass equals 459 grams (rounded to the nearest whole number). In the histogram above I only show the sample standard deviation, because \(s_{X}\) has the same unit as \(X\), namely grams, whereas the sample variance \(s_{X}^2\) has the unit squared grams, which is perhaps not so intuitive.

6 Basic Probability Theory

To be able to continue our statistics journey we need some basic understanding of probability theory. Let’s start with a very simple example, the toss of a coin. Still widely used at the start of soccer, American football and cricket matches, the coin toss really never becomes old fashioned. In this, let’s call it “experiment”, there are two possible outcomes: heads or tails.

Let’s define the outcome or sample space \(S\) as the set of all possible outcomes of an experiment. In mathematical notation we would write

\[S = \{ \text{Heads}, \text{Tails} \}\] Moreover, we assume that the coin is fair, that is, the coin is equally likely to land on heads as on tails. In other words, there is a 50% probability of the coin landing on heads and a 50% probability of the coin landing on tails. Therefore, it makes perfect sense that a coin toss is used in sports matches as a fair way to decide, for example in a soccer match, which team’s captain may choose sides.

There also exist unfair or weighted coins where the probability of the coin landing on heads is not equal to the probability of the coin landing on tails. This happens when the weight is not equally distributed over both sides of the coin, such that the probability of obtaining heads is larger than the probability of obtaining tails or vice versa. The toss of a coin and similar experiments can be generalized by a so-called Bernoulli experiment. It is named after Jakob Bernoulli, a famous Swiss mathematician from the late 17th and early 18th century who, like his brother Johann, belonged to a remarkable family of mathematicians and whose work helped lay the foundation for modern probability theory.
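A coin toss is easy to imitate in R. The sketch below simulates a fair coin and a weighted coin with the base function sample(). The number of tosses, the seed and the 70% probability of heads for the weighted coin are assumptions of mine, chosen purely for illustration.

```r
# Fixing the seed so that the simulation is reproducible (arbitrary value)
set.seed(42)

# Number of simulated tosses (an arbitrary choice)
n_tosses <- 10000

# Fair coin: heads and tails are equally likely
fair <- sample(c("Heads", "Tails"), size = n_tosses, replace = TRUE)

# Weighted coin: heads is assumed to come up 70% of the time
weighted <- sample(c("Heads", "Tails"), size = n_tosses, replace = TRUE,
                   prob = c(0.7, 0.3))

# Share of heads in each simulation
prop_fair <- mean(fair == "Heads")
prop_weighted <- mean(weighted == "Heads")
c(fair = prop_fair, weighted = prop_weighted)
```

With this many tosses the observed shares of heads should land close to 0.5 and 0.7, although they will rarely match those values exactly.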



Before getting into Bernoulli experiments we should define the concept of events. Let’s take the example of a dice roll. A dice as we all know has six possible outcomes: 1, 2, 3, 4, 5 and 6. In mathematical notation the outcome or sample space \(S\) is

\[ S = \{ 1, 2, 3, 4, 5, 6 \} \] Here, assuming again that we are dealing with a fair dice, the probability of throwing any number of eyes is equal to 1/6, since there are six possible outcomes and each one of them is equally likely. Next, we define the event space \(\mathcal{F}\) that describes all possible events that can be formed based on \(S\). Examples of events are

\[ \begin{aligned} A &= \{1,3,5\} \ \text{(the event of throwing an odd number)} \\[0.5ex] B &= \{2,4,6\} \ \text{(the event of throwing an even number)} \\[0.5ex] C &= \{1,2\} \ \text{(the event of throwing a number smaller than 3)} \\[0.5ex] D &= \emptyset \ \text{(the event of nothing happening at all)} \\[0.5ex] E &= \{1,2,3,4,5,6\} = S \ \text{(the event of throwing any of the six possible outcomes)} \end{aligned} \] All these events can be summarized by the event space \(\mathcal{F}\)

\[ \mathcal{F} = \{\emptyset,S,\{1\},\{2\},\{3\},\{4\},\{5\},\{6\},\{1,2\},\{1,3,5\}, \ ... \} \ \text{(and all other possible events)} \] \(\mathbf{P}\) assigns a probability to each and every event in the event space \(\mathcal{F}\). For example the probabilities of the events \(A\), \(B\), \(C\), \(D\) and \(E\) are respectively

\[ \begin{aligned} \mathbf{P}(A) &= \frac{\#\{1,3,5\}}{\#S} = \frac{3}{6} = 0.5 \\[0.8ex] \mathbf{P}(B) &= \frac{\#\{2,4,6\}}{\#S} = \frac{3}{6} = 0.5 \\[0.8ex] \mathbf{P}(C) &= \frac{\#\{1,2\}}{\#S} = \frac{2}{6} \approx 0.33 \\[0.8ex] \mathbf{P}(D) &= \frac{\#\emptyset}{\#S} =\frac{0}{6} = 0 \\[0.8ex] \mathbf{P}(E) &= \frac{\#S}{\#S} = \frac{6}{6} = 1 \end{aligned} \]

Here, I simply divided the number of possible outcomes of a certain event (\(\#A\) for any event \(A\), e.g. \(\#\{1,3,5\} = 3\) possible outcomes) by the total number of possible outcomes in the sample space \(S\) (\(\#S = \#\{1,2,3,4,5,6\} = 6\) possible outcomes). As you might have imagined by now, probability theory is an entire universe by itself and it operates according to certain laws. Three of them are: \[ \begin{aligned} 1.\quad & \mathbf{P}(\emptyset) = 0 && \text{(The probability of nothing happening is 0)} \\ 2.\quad & 0 \leq \mathbf{P}(A) \leq 1 && \text{for any event } A \quad \text{(The probability of any event } A \text{ happening is between 0 and 1)} \\ 3.\quad & \mathbf{P}(S) = 1 && \text{(The probability of something happening is 1)} \end{aligned} \] There are many other laws, or axioms as they are formally called in mathematics, that we will not be discussing here, since the purpose was to get a basic understanding of probability theory. To conclude, the outcome or sample space \(S\), the event space \(\mathcal{F}\) and the probability \(\mathbf{P}\) that is assigned to each and every event in \(\mathcal{F}\) together form the universe of this exciting topic of probability theory.
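The counting argument above translates almost literally into R. In the sketch below the function name prob is my own and the events are the ones defined earlier for a single fair dice; it is meant as an illustration, not as a general probability toolkit.

```r
# Sample space of a single dice roll
S <- 1:6

# Probability of an event under a fair dice:
# number of outcomes in the event divided by the number of outcomes in S
prob <- function(event) length(intersect(event, S)) / length(S)

# The events A to E from the text
events <- list(A = c(1, 3, 5),   # odd number
               B = c(2, 4, 6),   # even number
               C = c(1, 2),      # number smaller than 3
               D = integer(0),   # the empty event
               E = S)            # the whole sample space

# Probability of each event
probabilities <- sapply(events, prob)
round(probabilities, 2)
```

The empty event and the whole sample space reproduce the first and third law listed above, and every value lies between 0 and 1 as the second law requires.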



7 Random Variables

Since they play such a fundamental role in Data Science I would like to explore random variables in more depth.

Let’s define \(s\) as an element from the outcome or sample space \(S\) that we discussed in Basic Probability Theory. In mathematical notation we would write \(s \in S\). Intuitively, you may see the lower case \(s\) as a single element of the bigger uppercase set of elements \(S\). Then we would formally define a random variable \(X\) as a function of the following form

\[ X: S \to \mathbb{R} \ \text{with} \ s \to X(s)\] This expression may seem a little bit complicated, but remember the expression \(y(x) = x^2\) from high school. Formally we would write this as

\[ y: \mathbb{R} \to \mathbb{R} \ \text{with} \ x \to x^2 \] The function \(y\) actually is a mapping from a one-dimensional space of real numbers \(\mathbb{R}\) (imagine a line of numbers between \(-\infty\) and \(\infty\)) to the same one-dimensional space \(\mathbb{R}\). Another way of looking at it is that \(y(x)\) is a transformation of \(x\). For example if we let \(x=2\), \(y\) will transform \(x\) into \(y(2) = 2^2 = 4\). Therefore a random variable \(X\) is also a mapping but then from the sample space \(S\) to the real numbers \(\mathbb{R}\). It transforms an outcome \(s\) into a (random) variable of interest.

Let’s revisit our dice example, but now suppose we have two dice. Then the outcome or sample space is defined as

\[ S = \{(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(2,1), \ ... \ , (6,6)\} \] Or more compact as

\[ S = \{(x_1,x_2): x_1,x_2 \in \{1,2,3,4,5,6\} \} \] Then let’s define the random variable \(X\) as the sum of the number of eyes rolled on the two dice

\[ X := \text{The sum of the number of eyes rolled on two dice} \] Or more formally

\[ X: S \to \mathbb{R} \ \text{with} \ (x_1,x_2) \to x_1 + x_2, \] where \(s = (x_1,x_2)\) and \(X(s) = x_1 + x_2\). This translates back to the events that we discussed in Basic Probability Theory in the following way. I will give some examples

\[ \begin{aligned} A &= \{s: X(s) > 10\} = \{(5,6),(6,5),(6,6)\} \\[0.5ex] B &= \{s: X(s) = 12\} = \{(6,6)\} \\[0.5ex] C &= \{s: X(s) \leq 4 \} = \{(1,1),(1,2),(1,3),(2,1),(2,2),(3,1)\} \end{aligned} \] Here, \(A\) is the event or set of outcomes where the sum of two dice rolls is larger than 10, \(B\) the event or set of outcomes where the sum of two dice rolls is exactly equal to 12 and \(C\) the event where the sum of two dice rolls is smaller than or equal to 4. Therefore, random variables can help us to define events, and the purpose of showing this was to understand how events in probability theory and random variables are related. One of our ingredients is still missing, namely how random variables relate to probabilities:

\[ \begin{aligned} \mathbf{P}(A) &= \frac{\#\{(5,6),(6,5),(6,6)\}}{\#S} = \frac{3}{36} \approx 0.08 \\[0.8ex] \mathbf{P}(B) &= \frac{\#\{(6,6)\}}{\#S} = \frac{1}{36} \approx 0.03 \\[0.8ex] \mathbf{P}(C) &= \frac{\#\{(1,1),(1,2),(1,3),(2,1),(2,2),(3,1)\}}{\#S} = \frac{6}{36} \approx 0.17 \end{aligned} \] where the number of possible outcomes in \(S\), that is, \(\#S\) is equal to 36 (there are two dice that each have 6 possible outcomes, which gives \(6^2 = 36\) pairs of outcomes).
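These probabilities can be checked by listing the whole sample space in R. The sketch below builds all 36 pairs with expand.grid() and applies the random variable \(X\) to each of them; the column names x1 and x2 mirror the notation above, while the object names p_A, p_B and p_C are my own.

```r
# All 36 pairs (x1, x2) that two dice can show
S <- expand.grid(x1 = 1:6, x2 = 1:6)
nrow(S)
## [1] 36

# The random variable X maps each pair to the sum of the eyes
X <- S$x1 + S$x2

# Probabilities as shares of the 36 equally likely outcomes
p_A <- mean(X > 10)   # sum larger than 10
p_B <- mean(X == 12)  # sum exactly 12
p_C <- mean(X <= 4)   # sum smaller than or equal to 4
round(c(A = p_A, B = p_B, C = p_C), 2)
##    A    B    C 
## 0.08 0.03 0.17

# The outcomes that make up event A
S[X > 10, ]
```

Taking the mean of a TRUE/FALSE vector gives the share of TRUE values, which is exactly the number of favourable outcomes divided by 36.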

Coming back to our penguin example, in What is Statistics? we defined a random variable \(X\) as the body mass of Adelie penguins. Now, hopefully we can understand much better what that means. It means that

\[ X: S \to \mathbb{R} \ \text{with} \ s \to X(s) \] where \(S\) is the set of possible body mass values that an Adelie penguin can take. In our sample data this is between 2850 and 4775 grams, but remember that in statistics we are interested in a population, so \(S\) would be the range of body mass values that an Adelie penguin can take given their entire population. Further, the transformation \(X(s)\) simply takes a body mass value from this range and returns the same value, a very simple and basic transformation similar to \(y(x) = x\).

To conclude, the way I see it is that the formal definitions of events and random variables are kind of the training wheels that we need to get a good and thorough understanding of probability theory. Then later we remove the training wheels and assume that we know what is implied by a more compact and simple notation. Coming back to the rolling of two dice, we will write the following event

\[ A = \{s: X(s) > 10\} = \{(5,6),(6,5),(6,6)\} \] simply as

\[ X > 10 \] Further, the probability \(\mathbf{P}(A)\) becomes simply \(\mathbf{P}(X > 10)\). From now onwards I will use this more parsimonious and less abstract notation, but I think it is very valuable and important to keep in mind that probability theory consists of a very strong and well thought-out framework. This formal framework is essential, as it ensures clarity, prevents ambiguity, and provides a solid foundation for extending the theory to more complex or less intuitive situations.

8 Probability Distributions

Now that we have a basic understanding of probability theory and random variables, we can start to visualize a probability distribution for the possible outcomes of a random variable. Before doing so, I would like to explain the difference between discrete random variables and continuous random variables. Discrete simply means that the sample space \(S\) of a random variable is countable. In the case of summing up the eyes rolled on two dice, \(S\) consists of 36 possible outcomes and is therefore countable. Even if \(S\) consists of all positive integers (also called the natural numbers \(\mathbb{N}\)), that is,

\[S = \mathbb{N} = \{1,2,3,...\},\]

and \(X\) is, for example, defined as the number of rolls it takes until both dice show a 6 (it could take many rolls to obtain two sixes), \(X\) would still be considered a discrete random variable. The sample space is infinite, but only countably infinite: its elements can still be listed one by one. On the other hand, in the case where \(X\) is the body mass of Adelie penguins, the sample space could be any real number between the minimum and maximum body mass value of the population. When I say real number, you can for example imagine the interval [0,1]. Believe it or not, there are infinitely many numbers between 0 and 1, for example 0.1, 0.01, 0.001 and so on, and the interval contains so many numbers that they cannot even be listed one by one. In the same way, there could be a penguin with a body mass of 2850.1 grams, a penguin of 2850.01 grams, a penguin of 2850.001 grams and so on. There are no breaks or gaps between the values \(X\) can take. Therefore, if \(X\) represents the body mass of Adelie penguins, the sample space of \(X\) is uncountably infinite and \(X\) itself is called a continuous random variable.
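To make the countably infinite case more concrete, here is a small sketch in R. It assumes fair dice and independent rolls, and it reads \(X\) as the number of the roll on which the first double six appears, so \(X\) takes values 1, 2, 3 and so on. The variable names are my own.

```r
# Probability of rolling two sixes in a single roll of two fair dice
p_double_six <- 1/36

# X = number of the roll on which the first double six appears (1, 2, 3, ...)
# dgeom() counts the failures before the first success, hence k - 1
k <- 1:200
pmf <- dgeom(k - 1, prob = p_double_six)

# The same probabilities written out by hand:
# k - 1 failures followed by one success
pmf_by_hand <- (1 - p_double_six)^(k - 1) * p_double_six
print(all.equal(pmf, pmf_by_hand))

# Probability that more than 200 rolls are needed:
# small, but never exactly zero, so no finite list of outcomes is complete
p_more_than_200 <- pgeom(max(k) - 1, prob = p_double_six, lower.tail = FALSE)
print(p_more_than_200)
```

However far the list of values is extended, some probability always remains for even larger values, which is why the sample space has to be infinite. It remains countable, because the values can be numbered 1, 2, 3 and so on.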

I can imagine that it might get confusing when \(X\) can be defined as many different things. However, I think it is a way of thinking that will start to feel more natural once the brain gets used to it. \(X\) is like a soup bowl that you can fill with many different kinds of soup: sweet potato soup, courgette soup, pumpkin soup. Every time I discuss an example, I simply empty the bowl and fill it with some other kind of soup. I hope that makes sense.

In order to visualize random variables I define a probability mass function (pmf) for discrete random variables

\[ p_X: \text{Range}(X) \to [0,1] \ \text{with} \ x \mapsto \mathbf{P}(X=x)\] where \(\text{Range}(X) = \{\, x \in \mathbb{R} \mid \exists s \in S : X(s) = x \,\}\). This means that the range of \(X\) is the set of all values \(x\) for which there exists an element \(s\) in the sample space \(S\) such that \(X(s) = x\). For every such value \(x\), the pmf returns the probability that \(X\) takes exactly that value.
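As an illustration of this definition, the following sketch computes the range and the pmf of the sum of two fair dice directly from the 36 outcomes. It is only one possible way to do this in R, and the names `range_X` and `p_X` are chosen to mirror the notation above.

```r
# Sample space S: all 36 equally likely outcomes of two fair dice
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6)

# X(s) = sum of the two dice for every outcome s
X <- outcomes$die1 + outcomes$die2

# Range(X): every value x for which some outcome s has X(s) = x
range_X <- sort(unique(X))
print(range_X)

# p_X(x) = P(X = x): the share of outcomes that map to x
p_X <- sapply(range_X, function(x) mean(X == x))
names(p_X) <- range_X
print(round(p_X, 3))

# A pmf always sums to 1 over the range of X
print(sum(p_X))
```

The most likely sum is 7, which is produced by 6 of the 36 outcomes.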

9 Common Discrete Probability Distributions

9.1 Bernoulli Distribution

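p <- 0.5           # probability of success
x <- 0:1           # possible outcomes: 0 (failure) and 1 (success)

# A Bernoulli variable is a binomial variable with a single trial (size = 1)
pmf <- dbinom(x, size = 1, prob = p)

barplot(pmf, names.arg = x, col = "lightgreen",
        main = "PMF of the Bernoulli distribution",
        xlab = "x", ylab = "P(X = x)")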

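p <- 0.5           # probability of success
x <- 0:1           # possible outcomes: 0 (failure) and 1 (success)

# Cumulative probabilities P(X ≤ x)
cmf <- pbinom(x, size = 1, prob = p)

# Plot the CMF as a step function
plot(stepfun(x, c(0, cmf)),
     xlab = "Outcome",
     ylab = "P(X ≤ x)",
     main = "CMF of the Bernoulli distribution",
     verticals = FALSE, pch = 19)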
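A Bernoulli variable does not need to have a success probability of 0.5. As a small sketch, I reuse the event \(X > 10\) from the two-dice example as the "success", which under the assumption of fair dice has probability 3/36. The code compares `dbinom()` with the textbook formula \(p^x (1-p)^{1-x}\); the names `pmf_by_hand` and `cmf_by_hand` are my own.

```r
# Success = "the sum of two fair dice is larger than 10"
p <- 3/36
x <- 0:1

# PMF from R's built-in binomial function with a single trial
pmf <- dbinom(x, size = 1, prob = p)

# PMF from the formula p^x * (1 - p)^(1 - x)
pmf_by_hand <- p^x * (1 - p)^(1 - x)
print(rbind(pmf, pmf_by_hand))

# The cumulative probabilities are the running sum of the PMF
cmf_by_hand <- cumsum(pmf)
print(all.equal(cmf_by_hand, pbinom(x, size = 1, prob = p)))
```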

9.2 Binomial Distribution

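n <- 10            # number of trials
p <- 0.5           # probability of success in each trial

x <- 0:n           # possible numbers of successes

# Probability of exactly x successes in n trials
pmf <- dbinom(x, size = n, prob = p)

barplot(pmf, names.arg = x, col = "lightgreen",
        main = "PMF of the binomial distribution",
        xlab = "Number of successes", ylab = "P(X = x)")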

# Parameters of the binomial distribution
n <- 10
p <- 0.5
x <- 0:n

# Compute the cumulative probabilities P(X ≤ x)
cmf <- pbinom(x, size = n, prob = p)

# Plot the CMF as a step function
plot(stepfun(x, c(0, cmf)), 
     xlab = "Number of successes", 
     ylab = "P(X ≤ x)", 
     main = "CMF of the binomial distribution", 
     verticals = FALSE, pch = 19)
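To see what `dbinom()` and `pbinom()` compute, the sketch below rebuilds both by hand for the same parameters, using the standard binomial formula \(\binom{n}{x} p^x (1-p)^{n-x}\). The variable names ending in `_by_hand` are my own additions for this comparison.

```r
n <- 10
p <- 0.5
x <- 0:n

# PMF by hand: number of ways to place x successes among n trials,
# times the probability of any one such sequence
pmf_by_hand <- choose(n, x) * p^x * (1 - p)^(n - x)
print(all.equal(pmf_by_hand, dbinom(x, size = n, prob = p)))

# CMF by hand: running sum of the PMF, so each value is P(X <= x)
cmf_by_hand <- cumsum(pmf_by_hand)
print(all.equal(cmf_by_hand, pbinom(x, size = n, prob = p)))

# Example: probability of at most 3 successes in 10 trials
p_at_most_3 <- cmf_by_hand[x == 3]
print(p_at_most_3)
```

With \(p = 0.5\) every sequence of 10 trials has probability 1/1024, so the probability of at most 3 successes is (1 + 10 + 45 + 120)/1024.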

9.3 Geometric Distribution

9.4 Negative Binomial Distribution

9.5 Poisson Distribution

# Parameters
lambda <- 3           # average number of events
x <- 0:15             # possible outcomes (truncated at 15 for plotting)

# Compute the PMF
pmf <- dpois(x, lambda)

# Bar plot of the PMF
barplot(pmf, names.arg = x, col = "lightgreen",
        main = "PMF of the Poisson distribution",
        xlab = "Number of events", ylab = "P(X = x)")

# Parameters
lambda <- 3           # average number of events
x <- 0:15             # possible outcomes (truncated at 15 for plotting)

# Compute the cumulative probabilities P(X ≤ x)
cmf <- ppois(x, lambda)

# Step plot of the CMF
plot(stepfun(x, c(0, cmf)), 
     xlab = "Number of events", 
     ylab = "P(X ≤ x)", 
     main = "CMF of the Poisson distribution", 
     verticals = FALSE, pch = 19)
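As a quick check of what these functions return, the sketch below compares `dpois()` with the standard Poisson formula \(e^{-\lambda} \lambda^x / x!\) and shows that `ppois()` gives the probability of at most \(x\) events, so the value \(x\) itself is included. The names `pmf_by_hand`, `p_at_most_2` and `p_less_than_2` are mine.

```r
lambda <- 3
x <- 0:15

# PMF by hand versus the built-in function
pmf_by_hand <- exp(-lambda) * lambda^x / factorial(x)
print(all.equal(pmf_by_hand, dpois(x, lambda)))

# ppois(2, lambda) is P(X <= 2) = P(X = 0) + P(X = 1) + P(X = 2)
p_at_most_2 <- ppois(2, lambda)
print(all.equal(p_at_most_2, sum(dpois(0:2, lambda))))

# For a discrete variable, P(X < 2) is a different number: P(X <= 1)
p_less_than_2 <- ppois(1, lambda)
print(c(p_at_most_2 = p_at_most_2, p_less_than_2 = p_less_than_2))
```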

9.6 Hypergeometric Distribution

9.7 Discrete Uniform Distribution

p_X <- rep(1/6,6)
X <- 1:6

barplot(p_X, names.arg = X, ylab = "P(X = x)", xlab = "Number of eyes rolled on a die", main = "PMF of a fair die", col = "lightgreen")

p_X <- rep(1/6, 6)
X <- 1:6

# Cumulative probabilities P(X ≤ x) as the running sum of the PMF
F_X <- cumsum(p_X)

plot(stepfun(X, c(0, F_X)), xlab = "x", ylab = "P(X ≤ x)", main = "CMF of a fair die", verticals = FALSE, pch = 19)

9.8 Multinomial Distribution

10 Common Continuous Probability Distributions

mu <- 0      
sigma <- 1 

x <- seq(-4, 4, by = 0.1)

y <- dnorm(x, mean = mu, sd = sigma)

plot(x, y, type = "l", lwd = 2, col = "lightgreen",
     main = "PDF of the standard normal distribution",
     xlab = "x", ylab = "Density")

x <- seq(-4, 4, by = 0.1)
F_x <- pnorm(x, mean = 0, sd = 1)

plot(x, F_x, type = "l", lwd = 2,
     main = "CDF of the standard normal distribution",
     xlab = "x", ylab = "F(x)")
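For a continuous random variable, probabilities are areas under the density curve, and the CDF collects that area up to a point \(x\). The sketch below illustrates this for the standard normal distribution by integrating `dnorm()` numerically and comparing the result with `pnorm()`. The variable names are my own, and the numerical integral is only accurate up to a small error.

```r
# Area under the standard normal density from -Inf up to 1
area_up_to_1 <- integrate(dnorm, lower = -Inf, upper = 1)$value

# The CDF evaluated at 1 gives the same probability P(X <= 1)
cdf_at_1 <- pnorm(1)
print(c(area = area_up_to_1, cdf = cdf_at_1))

# Probability of an interval = difference of two CDF values
p_interval <- pnorm(1.96) - pnorm(-1.96)
print(p_interval)

# A single point has no width, so its area (probability) is zero
p_point <- pnorm(1) - pnorm(1)
print(p_point)
```

This is the main practical difference with the discrete case: for a continuous variable the height of the density is not itself a probability, only the area over an interval is.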

with(Adelie, hist(body_mass_g, 
                  breaks = seq(2800, 4800, length.out = sturges+1),
                  col = "#FF8C00",
                  border = "white",
                  freq = FALSE,
                  main = "Kernel density body mass (Adelie)",
                  xlab = "body mass (in grams)",
                  ylab = "density"))

lines(density(na.omit(Adelie$body_mass_g)),
      col = "lightgreen",
      lwd = 2)

11 The Law of Large Numbers

12 The Central Limit Theorem

13 Confidence Intervals

14 Hypothesis Testing

15 P-values and Multiple Testing

16 Bootstrapping

Your Data Science Needs

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.