This notebook has been written in a lighthearted mood. Its purpose is to deliver a solid explanation of how to apply the convolution formula in the field of Probability Theory. This topic also serves as a good motivation for the central limit theorem, so we will do a shallow dip —and by shallow I mean proofless— into it as well.

The schedule on this occasion is the following:

  1. explanation of the convolution formula
  2. obtaining the pdf of the sum of two continuous and independent random variables
  3. obtaining the pdf of the sum of \(n\) —where \(n > 2\)— continuous and independent random variables

 

 

1 The convolution formula explained

There are two versions of the convolution formula. You may want to calculate either the convolution of two discrete functions or the convolution of two continuous functions. What concerns us is the latter.

The continuous convolution formula is shown in Equation (1.2). I suggest not jumping right into it: what the formula does, and why two variables —\(\tau\) and \(t\)— are involved in it, can be hard to grasp at first.

 

For illustrative purposes, consider the following piecewise function:

\[\begin{equation} f(x) = \begin{cases} \large \frac{-x}{2} & \text{if} \hspace{2.5mm} -2 \leq x \leq 0 \\ \\ 0 & \text{otherwise} \end{cases} \quad \quad \quad \quad \quad \quad f(-\tau + t) = \begin{cases} \large \frac{-(-\tau + t)}{2} & \text{if} \hspace{4mm} t \leq \tau \leq 2 + t \\ \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

 

Both on the left and on the right we have the very same function \(\boldsymbol{f}\), but on the left we treat its argument \(\boldsymbol{x}\) as a plain independent variable while on the right we treat it as a function \(\boldsymbol{x(\tau) = -\tau + t}\). In what follows we will show that \(f\) behaves differently in each of these cases.

Please note that the interval \(t \leq \tau \leq 2 + t\) is obtained by manipulating \(-2 \leq x(\tau) \leq 0\) so that \(\tau\) is isolated in the middle of the inequality. Also note that \(\boldsymbol{t}\) becomes a parameter that we can control in order to decide on which interval \(\boldsymbol{f(-\tau + t)}\) is not equal to zero1.


Figure 1.1: The same function plotted on different planes

 

On the left panel of Figure 1.1 we have plotted the function \(f(x)\) on a plane whose horizontal axis is \(x\). The plot on the right panel was produced by assuming that \(x(\tau) = -\tau + 3\). Notice that the support of \(\boldsymbol{f(x(\tau))}\) is \(\boldsymbol{[3, 5]}\). That fact by itself should suffice for us to deduce that \(t = 3\). Take your time to assimilate it.
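If you want to reproduce something akin to Figure 1.1 yourself, the following minimal R sketch should do. The function names, the plotting ranges and the fact that \(t\) is fixed at \(3\) are my own choices; the original plotting code of this notebook is not shown.

```r
# f(x) as defined above: -x/2 on [-2, 0], zero elsewhere
f <- function(x) ifelse(x >= -2 & x <= 0, -x / 2, 0)

# f(x(tau)) with x(tau) = -tau + t, here with t fixed at 3
t <- 3
f_reflected <- function(tau) f(-tau + t)

par(mfrow = c(1, 2))
curve(f, from = -4, to = 6, n = 1000, xlab = "x", ylab = "f(x)")
curve(f_reflected, from = -4, to = 6, n = 1000,
      xlab = expression(tau), ylab = expression(f(-tau + t)))
```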

 

 

Consider another piecewise function. It could be, for example, the following:2

 

\[\begin{equation} g(\tau) = \begin{cases} 2 & \text{if} \hspace{2.5mm} -0.5 \leq \tau \leq 0.5 \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

 

Consequently, the product of \(f(-\tau + t)\) and \(g(\tau)\) is this function:

 

\[\begin{equation} f(-\tau + t) \cdot g(\tau) = \begin{cases} -(-\tau + t) & \text{if} \hspace{4mm} max\{t, -0.5\} \leq \tau \leq min\{2 + t, 0.5\} \\ 0 & \text{otherwise} \end{cases} \tag{1.1} \end{equation}\]

 

Please note that in order for the above function to have a non-empty support the parameter \(t\) is only allowed to take some real values. For instance, consider \(t=20\). Then, the support becomes \([20, 0.5]\), which is an empty interval because there is no possible value of \(\tau\) that can simultaneously be greater than or equal to \(20\) and less than or equal to \(0.5\).

 

 

The mathematical operation termed convolution is denoted and defined as follows:

\[\begin{equation} (f*g)(t) = \int_{-\infty}^{+\infty}f(-\tau + t) \cdot g(\tau)d\tau \tag{1.2} \end{equation}\]

 

Geometrically, what does it mean to calculate \((f*g)(t)\)? We will illustrate the geometry behind it by considering the convolution of the functions \(f\) and \(g\) defined above, evaluated at two different inputs: \(t = -2.25\) and \(t = -1.5\).


Figure 1.2: Geometric representation of the convolution

 

In light of all the previous discussion, we will now summarize the concept of the convolution. By considering that \(\boldsymbol{x}\) is a function \(\boldsymbol{x(\tau) = -\tau + t}\) we are essentially obtaining the reflection of the function \(\boldsymbol{f}\) across the vertical axis. Then, by fixing \(t\) at a specific value we are deciding how much the support of the reflection of \(\boldsymbol{f}\) and the support of \(\boldsymbol{g}\) overlap3. Finally, we obtain the product of these functions and compute the area below it.
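As a quick numerical sanity check of Equation (1.2), we can let R's integrate function do the work for the two inputs considered in Figure 1.2. This is only a sketch; the function names are mine, and the integration limits rely on the fact that \(g\) vanishes outside \([-0.5, 0.5]\).

```r
f <- function(x) ifelse(x >= -2 & x <= 0, -x / 2, 0)
g <- function(x) ifelse(x >= -0.5 & x <= 0.5, 2, 0)

# (f*g)(t): integrate the product of the reflected-and-translated f with g
convolution <- function(t) {
  integrate(function(tau) f(-tau + t) * g(tau),
            lower = -0.5, upper = 0.5)$value
}

convolution(-2.25)  # supports barely overlap: small area
convolution(-1.5)   # supports overlap more: larger area
```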

 

 

2 Application: the pdf of the sum of two continuous and independent random variables

Now you may be asking yourself “well, this whole convolution thing seems rather fancy and elaborate but what is it good for?”. That is a fair question. Why should we care at all about the area below the product of two overlapping functions —where one of them has been reflected and translated—?

As ethereal as it is, the convolution formula actually has many applications. In particular, it becomes relevant for us because the pdf of the sum of two continuous and independent random variables is equal to the convolution of their marginal probability density functions.

 

Proof. Consider random variables \(X\) and \(Y\), both continuous and independent. Denote by \(Z\) the sum \(X + Y\). Note that variable \(Z\) has two sources of randomness —the two addends that make up the sum are random variables—.

On the other hand, the joint density of \(X\) and \(Z\) can be written —by manipulating the well-known definition of the conditional pdf— as follows:
\[\begin{equation} f_{X,Z}(x, z) = f_{Z|X}(z|x)\cdot f_{X}(x) \tag{2.1} \end{equation}\]

 

Given \(X = x\), variable \(Z\) stops having two sources of randomness. Now, the uncertainty only comes from the fact that \(Y\) is random. This hints towards what we are about to show: that we can write \(f_{Z|X}(z|x)\) exclusively in terms of the pdf of \(Y\). \[\begin{align} P(Z \leq z |X = x) &= P(x + Y \leq z | X = x) \\ &= P(Y \leq z - x | X = x) \end{align}\]

 

Now, our assumption of \(X\) and \(Y\) being independent comes into play: \[\begin{align} \hspace{18.55mm}&= P(Y \leq z - x) \\ \\ &= \int_{-\infty}^{z - x}f_Y(y)dy \\ \end{align}\]

 

We take the derivative with respect to \(z\) in both sides of the equation above and finally obtain the following: \[\begin{align} \hspace{14mm}f_{Z|X}(z|x) &= f_Y(z - x) \cdot \frac{d}{dz}(z - x) \\ f_{Z|X}(z|x) &= f_Y(z - x) \end{align}\]

 

What we will do is substitute this into Equation (2.1).

By the way, do not forget that what we have on the left-hand side of that equation is the joint density of \(X\) and \(Z\). Is there some way of obtaining the marginal density of \(Z\) from that joint density? Well, of course. According to Probability Theory, all it takes to obtain \(\boldsymbol{f_Z(z)}\) is to take the definite integral of \(\boldsymbol{f_{X, Z}(x, z)}\) with respect to \(\boldsymbol{x}\) over the interval \(\boldsymbol{]-\infty, +\infty[}\).

Doing that, we obtain at last \[\begin{equation} f_Z(z) = \int_{-\infty}^{+\infty}f_Y(z - x) \cdot f_X(x)dx \tag{2.2} \\ \end{equation}\]

 

The applicability of the convolution formula is not constrained to the sum of two continuous, independent and identically distributed random variables. Whether the addends of the sum are identically distributed or not does not matter —it wasn’t a necessary assumption for obtaining Equation (2.2)—. Nevertheless, we will put ourselves to the test of applying the convolution formula on both the case of identically distributed random variables and the case of non-identically distributed random variables. This distinction will become relevant later on because I will show you a neat trick that —unlike the convolution formula— does require the variables being added to be identically distributed.

 

2.1 Identically distributed random variables

Consider the random variables \(X, Y \sim \text{i.i.d. Exp}(\lambda)\).

The pdf associated with the exponential distribution is

\[\begin{equation} f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if} \hspace{2.5mm} x \geq 0 \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

 

Find the pdf of \(\boldsymbol{Z = X + Y}\).

 

Fine. Challenge accepted. All we have to do is plug the density functions accordingly into Equation (2.2). Pretty straightforward stuff.

\[\begin{align} f_Z(z) &= \int_{-\infty}^{+\infty}\lambda e^{-\lambda (z - x)} \cdot \lambda e^{-\lambda x} dx \\ \\ f_Z(z) &= \int_{-\infty}^{+\infty}\lambda^2e^{-\lambda z}dx \\ \\ f_Z(z) &= \lambda^2e^{-\lambda z}\int_{-\infty}^{+\infty}dx \end{align}\]

 

Hold it right there. We will not actually integrate over the whole set \(]-\infty, +\infty[\). Similarly to what we did for Equation (1.1), we need to work out the support of the integrand.

Two conditions must be satisfied: \(x \geq 0\) and \(-x + z \geq 0\). Both are met on the interval \([0, z]\), which is therefore the support of our integrand.

 

By restricting the interval of integration to the support we get

\[\begin{align} f_Z(z) &= \lambda^2e^{-\lambda z} \cdot x\Big|_0^z \\ \\ f_Z(z) &= \lambda^2e^{-\lambda z}z \hspace{3.5mm} \forall \hspace{1.2mm} z \geq 0 \end{align}\]

 

FYI: this is the pdf of the second order Erlang distribution. How do we make sense of it? Well, the exponential distribution models the length of the time lapse until some event happens. Accordingly, the second order Erlang distribution models the length of the time lapse until some event happens for the second time.
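A quick simulation in R is a reassuring way to double-check the result. The choice \(\lambda = 3\) and the sample size below are arbitrary; note also that \(\lambda^2e^{-\lambda z}z\) coincides with R's built-in dgamma(z, shape = 2, rate = lambda).

```r
set.seed(1)
lambda <- 3
z <- rexp(1e5, rate = lambda) + rexp(1e5, rate = lambda)

hist(z, breaks = 100, freq = FALSE,
     main = "Sum of two i.i.d. Exp(3) random variables")
curve(lambda^2 * exp(-lambda * x) * x, from = 0, to = max(z),
      add = TRUE, lwd = 2)
```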

 

2.2 Non-identically distributed random variables

Consider \(X \sim \text{Uniform}(0, 5)\) and \(Y \sim \text{Uniform}(0, 12)\). Their pdfs are the following:

\[\begin{equation} f_X(x) = \begin{cases} \large \frac{1}{5} & \text{if} \hspace{2.5mm} 0 \leq x \leq 5 \\ \\ 0 & \text{otherwise} \end{cases} \quad \quad \quad \quad \quad \quad f_Y(y) = \begin{cases} \large \frac{1}{12} & \text{if} \hspace{4mm} 0 \leq y \leq 12 \\ \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

 

Find the pdf of \(\boldsymbol{Z = X + Y}\).

 

The integrand that we are going to plug into the convolution formula is

\[\begin{equation} f_Y(-x + z) \cdot f_X(x) = \begin{cases} \large \frac{1}{60} & \text{if} \hspace{2.5mm} max\{0, z-12\} \leq x \leq min\{5, z\} \\ \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

 

The support of this product depends on the value that \(z\) takes. By carefully checking how the limits of the support change when \(z\) is assumed to lie on different intervals we are able to recognize three distinct cases.

 

\[\begin{align} &\text{if} \hspace{2.5mm} 0 \leq z \leq 5 &\text{then the support is} \hspace{1mm} &\big[0, z\big] \\ \\ &\text{if} \hspace{2.5mm} 5 \leq z \leq 12 &\text{then the support is} \hspace{1mm} &\big[0, 5\big] \\ \\ &\text{if} \hspace{2.5mm} 12 \leq z \leq 17 &\text{then the support is} \hspace{1mm} &\big[z - 12 \hspace{0.4mm}, \hspace{0.4mm}5\big] \end{align}\]

 

This means that the support of the pdf of \(\boldsymbol{Z}\) is split into three different intervals. On each of these intervals the pdf of \(Z\) exhibits a different behaviour.

 

What we have to do now is obtain, for each of the three scenarios above, the corresponding piece of the pdf. In order to do that we will compute

 

\[\begin{align} \int_\text{lower limit}^\text{upper limit}\frac{1}{60} dx \end{align}\]

 

with the twist that, in each scenario, the corresponding limits of the support are plugged into the integration limits.
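For instance, in the first scenario, where \(0 \leq z \leq 5\), the computation reads

\[\begin{align} f_Z(z) = \int_{0}^{z}\frac{1}{60}\hspace{0.5mm}dx = \frac{x}{60}\bigg|_0^z = \frac{z}{60} \end{align}\]

and the remaining two scenarios are handled in exactly the same fashion.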

 

 

At last, what you should obtain is

\[\begin{equation} f_Z(z) = \begin{cases} \large \frac{z}{60} & \text{if} \hspace{2.5mm} 0 \leq z \leq 5 \\ \\ \large \frac{1}{12} & \text{if} \hspace{2.5mm} 5 \leq z \leq 12 \\ \\ \large \frac{17-z}{60} & \text{if} \hspace{2.5mm} 12 \leq z \leq 17 \\ \\ 0 & \text{otherwise} \end{cases} \end{equation}\]
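Once more, a short simulation is a handy way to check the derived density. The code below is only a sketch; the vectorised f_Z simply mirrors the piecewise definition above.

```r
set.seed(1)
z <- runif(1e5, min = 0, max = 5) + runif(1e5, min = 0, max = 12)

f_Z <- function(z) {
  ifelse(z >= 0 & z <= 5,  z / 60,
  ifelse(z > 5  & z <= 12, 1 / 12,
  ifelse(z > 12 & z <= 17, (17 - z) / 60, 0)))
}

hist(z, breaks = 100, freq = FALSE,
     main = "Uniform(0, 5) + Uniform(0, 12)")
curve(f_Z, from = 0, to = 17, n = 1000, add = TRUE, lwd = 2)
```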

 

 

3 Ideas on how to deal with the sum of \(n\) continuous and independent random variables

In Probability Theory there are —I guess you could say— notable sums of random variables. There are various sums whose distributions are well known. I will just name two:

  1. the sum of \(n\) independent —and not necessarily identically distributed— normal variables is another normally-distributed random variable
  2. the sum of \(n\) independent and identically distributed exponential random variables follows an \(n\text{th}\) order Erlang distribution.

The most practical strategy, then, is to look up on the internet or in books the distribution of the sum we are concerned with.

 

3.1 Successive convolutions

If you either come across a non-notable sum of random variables or are stubborn enough to want to derive the pdf of a notable sum by yourself, then you can perform successive convolutions.

The sum of random variables is associative. That means, for the case of the sum of three random variables \(X_1\), \(X_2\) and \(X_3\), that \(X_1 + X_2 + X_3 = (X_1 + X_2) + X_3\). This implies that we can first obtain the density \(\boldsymbol{f_{\large X_1 + X_2}}\) and then —with it— obtain the density \(\boldsymbol{f_{\large (X_1 + X_2) + X_3}}\). Of course, this same argument generalizes neatly for a sum with more addends: assuming that we have already derived \(f_{\large X_1 + X_2 + X_3}\), we are able to obtain the density \(f_{\large (X_1 + X_2 + X_3) + X_4}\).
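The idea translates almost literally into code. Below is a rough numerical sketch of successive convolutions for nonnegative random variables; the function names and the restriction of the integration interval to \([0, z]\) are my own assumptions.

```r
# Given the density of the running sum and the density of the next addend,
# return (numerically) the density of their sum, via Equation (2.2).
convolve_pdfs <- function(f_sum, f_addend) {
  function(z) {
    sapply(z, function(zi) {
      if (zi <= 0) return(0)
      integrate(function(x) f_addend(zi - x) * f_sum(x),
                lower = 0, upper = zi)$value
    })
  }
}

f_exp <- function(x) dexp(x, rate = 3)
f_12  <- convolve_pdfs(f_exp, f_exp)  # density of X1 + X2
f_123 <- convolve_pdfs(f_12, f_exp)   # density of (X1 + X2) + X3

f_123(1)  # should agree with dgamma(1, shape = 3, rate = 3)
```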

 

In order to illustrate the idea, we will now put ourselves to the task of deriving the pdf of the sum of three independent and identically distributed exponential random variables.

Fair enough. We have already accomplished one half of the job. The pdf of the sum of two such variables is already known to us; we derived it in Subsection 2.1. It is \(f_Z(z) = \lambda^2e^{-\lambda z}z\) for \(z \geq 0\). With that knowledge in the back of our minds, we set off to obtain the pdf of \(W = Z + V\), where \(V \sim \text{Exp}(\lambda)\) is independent of \(Z\).

\[\begin{align} f_W(w) &= \int_{-\infty}^{+\infty}f_V(w-z)\cdot f_Z(z)dz\\ \\ f_W(w) &= \int_{0}^{+\infty}\lambda^3e^{-\lambda w}z \hspace{0.5mm}dz \\ \\ f_W(w) &= \lambda^3e^{-\lambda w}\int_{0}^{+\infty}zdz \end{align}\]

 

Mirroring the line of thought of the previous examples, we now have to determine the actual limits of integration. The integrand is nonzero only where these two conditions hold: \(z \geq 0\) and \(-z + w \geq 0\). Hence, we will integrate over the interval \(\boldsymbol{[0, w]}\).

\[\begin{align} f_W(w) &= \lambda^3e^{-\lambda w} \cdot \frac{z^2}{2}\bigg|_0^w \\ \\ f_W(w) &= \frac{\lambda^3e^{-\lambda w}w^2}{2} \hspace{3.5mm} \forall \hspace{1.2mm} w \geq 0 \end{align}\]

 

FYI: you have probably guessed by now what \(W\) is. It is a third order Erlang-distributed random variable. It measures how long we have to wait until witnessing some event for the third time.
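As usual, you can double-check the result against R's built-in gamma density, of which the Erlang density is the integer-shape special case. The values below are arbitrary picks.

```r
lambda <- 3
w <- 1.2  # an arbitrary evaluation point

lambda^3 * exp(-lambda * w) * w^2 / 2  # the density derived above
dgamma(w, shape = 3, rate = lambda)    # built-in gamma / Erlang density
```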

 

3.2 A cool trick: the central limit theorem

The title of this Subsection is somewhat of a bait. Now that I have lured you into it I must apologize, because reducing the central limit theorem to nothing but a mere trick doesn't do it any justice and is actually even disrespectful. About the central limit theorem it has been said that the Greeks would have revered it had they been aware of its existence.

 

Though it is tedious, we can resort to the method of successive convolutions if \(n\) is a small number. We have already done it for \(n=3\). If \(n\) were equal to \(5\) the task would start to feel like a tough and repetitive ordeal, but what if —for God knows what reason— we are required to obtain the pdf of the sum of a ridiculously large number of continuous and independent random variables?

For instance, suppose that we are asked for the pdf of the sum of \(30\) independent and identically distributed exponential random variables. In this particular case we are lucky because this is a notable sum: its distribution is the \(30\text{th}\) order Erlang distribution.

The pdf of the \(n\text{th}\) order Erlang distribution is

\[\begin{equation} f_{Z}(z) = \begin{cases} \Large \frac{\lambda^ne^{-\lambda z}z^{n-1}}{(n-1)!} & \text{if} \hspace{2.5mm} z \geq 0 \\ \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

 

 

On the other hand, we will denote by \(\boldsymbol{A}\) the following random variable: \(\boldsymbol{A = \normalsize \frac{Z}{\sqrt n}}\).

 

The pdf4 of \(A\) is

\[\begin{equation} f_{A}(a) = \begin{cases} \Large \frac{\lambda^ne^{-\lambda a \sqrt{n}}\big(a \sqrt n\big)^{n-1} \cdot \sqrt n}{(n-1)!} & \text{if} \hspace{2.5mm} a \geq 0 \\ \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

 

 

Since we know the probability density functions of \(A\) and \(Z\), we can use the curve function of R's graphics package to plot them. And that is exactly what we have done in order to generate the following Figure.
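The exact plotting code is not reproduced here, so the snippet below is only a guess at how the densities themselves could be drawn with curve; the normal densities overlaid in the actual Figure are omitted. Note that f_Z coincides with dgamma(z, shape = n, rate = lambda).

```r
n      <- 30
lambda <- 3

# The 30th order Erlang density written above, and the density of A = Z / sqrt(n)
f_Z <- function(z) lambda^n * exp(-lambda * z) * z^(n - 1) / factorial(n - 1)
f_A <- function(a) f_Z(a * sqrt(n)) * sqrt(n)

par(mfrow = c(1, 2))
curve(f_Z, from = 0, to = 20, n = 1000, xlab = "z", ylab = "density")
curve(f_A, from = 0, to = 4,  n = 1000, xlab = "a", ylab = "density")
```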


Figure 3.1: The central limit theorem in action

 

There is a lot going on in the Figure above. We didn't just plot the densities of \(A\) and \(Z\); we also compared the shape of their plots with the shape of normal densities whose respective mean and variance match —in each case— those of \(A\) and \(Z\).

The mean of an Erlang-distributed random variable such as \(Z\) is \(n/\lambda\). Its variance is \(n/\lambda^2\). In the making of these plots we have considered a value of \(3\) for \(\lambda\); recalling that \(n = 30\), you can work out the numerical values of the mean and the variance. Those same values were assigned to the parameters of the normal distribution depicted on the left panel of Figure 3.1. It stands out that the density of \(Z\) is thinner and taller than the density of a normal distribution that shares its same mean and variance. Perhaps if we scaled \(Z\) down we would be able to force it to resemble a normal distribution.

We are indeed able to do that. The variable \(A\) is \(Z\) rescaled so that its pdf resembles that of a normally-distributed random variable. Its mean is \(n/(\sqrt{n}\lambda)\) and its variance is \(1/\lambda^2\). Again, you can work out the numerical values, which are the same values assigned to the parameters of the normal distribution depicted on the right panel.

 

In practical terms, what this tells us is that if we want to compute the probability of an event in which the sum of several independent and identically distributed exponential random variables is involved, we can approximate that probability with the cumulative distribution function of the normal distribution.

\[\begin{align} P(Z \leq z) &= P \bigg(\frac{Z}{\sqrt{n}} \hspace{1.7mm} \leq \hspace{1.7mm} \frac{z}{\sqrt{n}}\bigg) \\ \\ &= P \bigg(A \hspace{1mm}\leq \hspace{1mm} \frac{z}{\sqrt{n}}\bigg) \\ \\ \text{30th order Erlang cdf evaluated on} \hspace{1.5mm} z &\approx \text{Normal cdf evaluated on} \hspace{1.5mm} \frac{z}{\sqrt{n}} \end{align}\]
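In R, the comparison can be carried out with built-in cumulative distribution functions: pgamma with shape = n and rate = lambda gives the \(n\text{th}\) order Erlang cdf. The evaluation point \(z = 12\) is an arbitrary choice.

```r
n      <- 30
lambda <- 3
z      <- 12

pgamma(z, shape = n, rate = lambda)                           # exact Erlang cdf
pnorm(z / sqrt(n), mean = sqrt(n) / lambda, sd = 1 / lambda)  # normal approximation
```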

 

In the particular case of the sum of several independent and identically distributed exponential random variables all of this is not that useful, because we already have the well-known Erlang distribution to begin with, so we don't really need to make approximations. However —and we will not prove the following claim, as the proof is beyond the scope of this humble notebook— it turns out that the central limit theorem holds if \(Z\) is the sum of several independent and identically distributed random variables regardless of whether they follow an exponential distribution or any other distribution, and even regardless of whether that distribution is continuous or discrete.

 

Now it is time to express in a more precise and symbolic language all of our previous discussion.

We define the sequence of i.i.d.5 random variables \(\{X_i\}_{i = 1}^n\) with common mean \(E[X_i] = \mu\) and common variance \(Var(X_i) = \sigma^2\) . Let us denote by \(Z\) the following: \(Z = \sum_{i=1}^nX_i\). Then the central limit theorem guarantees that

\[\begin{equation} \frac{Z}{\sqrt{n}} \xrightarrow[n \rightarrow +\infty]{d}N\bigg(\lim_{n\to +\infty}\sqrt{n} \mu \hspace{2mm} , \hspace{2mm}\sigma^2\bigg) \tag{3.1} \end{equation}\]

 

In Probability Theory jargon: as \(\boldsymbol{n}\) goes to infinity, the sum of \(\boldsymbol{n}\) i.i.d. random variables converges —provided that we scale it down by a factor of \(\boldsymbol{1/\sqrt{n}}\)— in distribution to a normally-distributed random variable with the parameters indicated above.
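To see this convergence at work with a distribution other than the exponential, here is a small simulation sketch; the choice of Uniform(0, 1) addends and every constant in it are mine.

```r
set.seed(1)
n <- 1000

# 10,000 realisations of Z / sqrt(n), where Z is a sum of n i.i.d. Uniform(0, 1)
scaled_sums <- replicate(1e4, sum(runif(n))) / sqrt(n)

hist(scaled_sums, breaks = 50, freq = FALSE,
     main = "Scaled sum of 1000 i.i.d. Uniform(0, 1) variables")
# For Uniform(0, 1): mu = 0.5 and sigma^2 = 1/12
curve(dnorm(x, mean = sqrt(n) * 0.5, sd = sqrt(1 / 12)),
      add = TRUE, lwd = 2)
```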

 

 

Closing remarks

It is clear that, since \(Z\) is a sum of random variables and unless \(\mu\) is zero, if you keep adding more and more addends to that sum its mean is going to drift further and further —depending on whether \(\mu\) is positive or negative— towards either \(+\infty\) or \(-\infty\). As you can ascertain from the right panel of Figure 3.1, though, this might happen very slowly in comparison to the convergence in distribution. With only \(30\) addends we already observe almost perfect convergence in distribution while the mean of \(\normalsize \frac{Z}{\sqrt{n}}\) is not even equal to \(2\).

If for some reason there is an urge to force the mean to be equal to zero we can always resort to subtracting the mean from our random variable6.

\[\begin{equation} \frac{Z - n\mu}{\sqrt{n}} \xrightarrow[n \rightarrow +\infty]{d}N\big(0, \sigma^2\big) \end{equation}\]

 

With that out of the way, the only unresolved aspect of this notebook is the illustration —not persuasion, since a mathematical proof will not be provided— of the fact that the central limit theorem does not guarantee the asymptotic normality of the sum of \(\boldsymbol{n}\) independent but non-identically distributed random variables.

We will just have to leave it unresolved for now because this notebook has already become quite lengthy. I would have liked to introduce the Kolmogorov-Smirnov statistic in order to check how similar the distribution of the scaled-down sum of \(30\) independent, non-identically distributed uniform random variables is to a normal distribution. Yet, again, creating a single and lengthy notebook that covers three topics doesn't seem like a good idea.

The central limit theorem motivates and even sustains lots of topics in Statistics and Probability Theory. In the domain of the latter it provides us with a reason for studying in depth the normal distribution and it also invites us to keep exploring the distinct concepts of convergence. In the realm of the former it provides us with a framework for building test statistics and asymptotic confidence intervals —in other words, it gives us tools to perform inference—.


  1. The set of points on which a real-valued function is not equal to zero is termed the support↩︎

  2. The notation that we use for the argument of the function \(g\) is a placeholder. We may as well write \(g(x)\) instead of \(g(\tau)\). However, if we choose to do so, in order to avoid a notation conflict with what we have denoted as \(x(\tau)\) we would also have to change the notation that we use to refer to it. Accordingly, we would write something akin to \(y(x) = -x + t\) and we would consider \(f\) to be a function of \(y\).↩︎

  3. If the support of the reflection of \(f\) and the support of \(g\) do not overlap then the support of the product of these functions is empty.↩︎

  4. The derivation of this pdf is left as an exercise to the reader↩︎

  5. This is the habitual abbreviation for independent and identically distributed.↩︎

  6. Similarly, if we want to force the variance into being equal to one, all it takes is for us to divide our random variable by \(\sigma\).↩︎