James Scott (UT-Austin)
Reference: Bertsekas Chapters 3.1-3.3, 2.3, 3.6
Up to now we've only been dealing with discrete random variables that are characterized by a PMF.
But what about a random variable like a waiting time, or a height measured to arbitrary precision?
These outcomes cannot naturally be restricted to a finite or countable set, and they don't have PMFs. To describe these random variables, we need some more general concepts.
The cumulative distribution function, or CDF, is defined as:
\[ F_X(x) = P(X \leq x) \]
Facts:
The CDF for Binomial(N=2, p=0.5). (Let's write this on the board.) The jumps correspond to the points where the PMF has positive probability. What is \( F(1) \)? What is \( F(0.6)? \) What is \( F(17) \)?
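For a quick check on the computer, base R's pbinom() evaluates this CDF directly (a minimal sketch; the particular calls below are just illustrations of the three questions above):

```r
# CDF of a Binomial(N = 2, p = 0.5) random variable at a few points
pbinom(1,   size = 2, prob = 0.5)   # F(1)   = 0.25 + 0.50 = 0.75
pbinom(0.6, size = 2, prob = 0.5)   # F(0.6) = F(0) = 0.25 (the CDF is flat between jumps)
pbinom(17,  size = 2, prob = 0.5)   # F(17)  = 1
```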
All CDFs \( F(x) \) satisfy the following properties:
F is right-continuous, i.e.
\[ F(x) = \lim_{y \downarrow x} F(y) \]
Note: \( \lim_{y \downarrow x} \) means “limit as \( y \) approaches \( x \) from above.”
Intuitively, a continuous random variable is one that has no “jumps” in its CDF. More formally, we say that \( X \) is a continuous random variable if there exists a nonnegative function \( f \), called its probability density function (PDF), such that for every interval \( S = [a, b] \),
\[ P(X \in S) = P(a \leq X \leq b) = \int_a^b f(x) \ dx \]
If \( \delta \) is small, then \( P(x < X < x + \delta) \approx f(x) \cdot \delta \). The PDF can be interpreted as “probability per unit length” (like density in physics).
Suppose that \( X \) is a random variable with PDF \( f_X(x) = 1/2 \) for \( 0 \leq x \leq 2 \) (and \( f(x) = 0 \) otherwise). We write this as \( X \sim \) Uniform(0, 2).
More generally, we say that \( X \) has a (continuous) uniform distribution on \( (a,b) \) if its PDF takes the form:
\[ f_X(x) = \left\{ \begin{array}{ll} \frac{1}{b-a} , & a \leq x \leq b \\ 0, & \mbox{otherwise} \end{array} \right. \]
We write this as \( X \sim \) Uniform(a,b). Note: this is different than the discrete uniform distribution, which places probability \( 1/N \) on \( N \) discrete points.
Suppose that \( X \) is a random variable with PDF \( f_X(x) = \lambda e^{-\lambda x} \) for \( x \geq 0 \) (and \( f(x) = 0 \) otherwise).
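As a quick sanity check, base R's dexp() and pexp() implement this density and its CDF; here's a minimal sketch with the arbitrary choice \( \lambda = 2 \):

```r
# Exponential(rate = lambda) density and CDF at one point (lambda = 2 is arbitrary)
lambda <- 2
x <- 1.3
dexp(x, rate = lambda)      # lambda * exp(-lambda * x)
pexp(x, rate = lambda)      # P(X <= x) = 1 - exp(-lambda * x)
1 - exp(-lambda * x)        # matches pexp()
```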
The exponential distribution also has a wide range of applications. For example:
Note that, by the definition of the CDF and PDF, we have the following relationship for a continuous random variable:
\[ F_X(x) = P(X \leq x) = \int_{-\infty}^x f_X(t) \, dt \]
Remember the Fundamental Theorem of Calculus! This relationship says that the PDF is the derivative of the CDF:
\[ f(x) = F'(x) \]
at all points where \( F(x) \) is differentiable.
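Here's a small numerical illustration of that relationship, reusing the exponential example from above (a sketch; the point \( x = 0.8 \) is arbitrary): a finite-difference approximation to \( F'(x) \) should be very close to \( f(x) \).

```r
# Finite-difference check that F'(x) = f(x) for the Exponential(rate = 2) distribution
lambda <- 2
x <- 0.8
h <- 1e-6
(pexp(x + h, rate = lambda) - pexp(x, rate = lambda)) / h   # approximately F'(x)
dexp(x, rate = lambda)                                       # f(x): should be very close
```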
Recall that if \( X \sim \) Uniform(a,b), then the PDF is
\[ f_X(x) = \left\{ \begin{array}{ll} \frac{1}{b-a} , & a \leq x \leq b \\ 0, & \mbox{otherwise} \end{array} \right. \]
The corresponding CDF can then be computed as:
\[ F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt = \left\{ \begin{array}{ll} 0, & x < a \\ \frac{x-a}{b-a} , & a \leq x \leq b \\ 1, & x > b \end{array} \right. \]
We can also go the other direction. For example: suppose \( X_1, X_2, \ldots, X_{10} \) are independent Uniform(0, 100) random variables, and let \( Y = \max\{X_1, \ldots, X_{10}\} \). What is the PDF of \( Y \)?
Note that, since \( X_i \sim \) Uniform(0, 100), then for \( 0 < y < 100 \),
\[ P(X_i \leq y) = y/100 \, . \]
It is much easier to get the CDF of \( Y \) first! Notice that \( \max\{X_i\} \leq y \) if and only if \( X_i \leq y \) for all \( X_i \). So:
\[ \begin{aligned} F_Y(y) &= P(Y \leq y) \\ & = P(X_1 \leq y, X_2 \leq y, \ldots, X_{10} \leq y) \\ & = P(X_1 \leq y) \cdot P(X_2 \leq y) \cdots P(X_{10} \leq y) \\ &= \frac{y}{100} \cdot \frac{y}{100} \cdots \frac{y}{100} \\ &= \frac{y^{10}}{100^{10}} \end{aligned} \]
So the PDF is
\[ \begin{aligned} f_Y(y) &= F_Y'(y) \\ &= \frac{d}{dy} \frac{y^{10}}{100^{10}} \\ &= \frac{10y^9}{100^{10}} \end{aligned} \]
Let's dive into testmax_example.R on the class website.
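In the same spirit, here's a minimal simulation sketch (not necessarily identical to testmax_example.R): simulate the max of ten Uniform(0, 100) draws many times and compare the empirical CDF to \( y^{10}/100^{10} \).

```r
# Simulate Y = max of 10 independent Uniform(0, 100) draws
set.seed(1)
nsim <- 100000
Y <- replicate(nsim, max(runif(10, min = 0, max = 100)))

# Compare the empirical CDF of Y to the formula y^10 / 100^10 at a few points
y <- c(50, 80, 95)
empirical <- sapply(y, function(yy) mean(Y <= yy))
cbind(y, empirical, theory = y^10 / 100^10)
```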
Remember expected value and variance for discrete random variables. Suppose that \( X \) takes the values \( x_1, x_2, \ldots, x_N \). Then
\[ \mu = E(X) = \sum_{i=1}^N x_i \cdot P(X = x_i) \]
and
\[ \sigma^2 = \mbox{var}(X) = \sum_{i=1}^N (x_i - \mu)^2 \cdot P(X = x_i) \]
In the continuous case, the sum becomes an integral:
\[ \mu = E(X) = \int_{-\infty}^{\infty} x \ f(x) \ dx \]
and
\[ \sigma^2 = \mbox{var}(X) = \int_{-\infty}^{\infty} (x-\mu)^2 \ f(x) \ dx \]
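As a concrete check, we can evaluate these integrals numerically with R's integrate(); this sketch uses the Uniform(0, 2) example from earlier, where the mean is 1 and the variance is \( (2-0)^2/12 = 1/3 \).

```r
# E(X) and var(X) for X ~ Uniform(0, 2), by numerical integration
f  <- function(x) dunif(x, min = 0, max = 2)
mu <- integrate(function(x) x * f(x), lower = 0, upper = 2)$value
mu                                                            # should be 1
sigma2 <- integrate(function(x) (x - mu)^2 * f(x), lower = 0, upper = 2)$value
sigma2                                                        # should be 1/3
```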
Warning! Continuous random variables and PDFs can be confusing.
Here are some useful facts about CDFs. For a continuous random variable \( X \),
\[ P(a < X < b) = P(a \leq X < b) = P(a < X \leq b) = P(a \leq X \leq b) \]
(Including or excluding the endpoints makes no difference, because \( P(X = a) = P(X = b) = 0 \) for a continuous random variable.)
Let \( X \) be a random variable with CDF \( F_X(x) \), and let \( q \in [0,1] \) be some desired quantile (e.g. 0.9 for the 90th percentile).
If \( F(x) \) is continuous and monotonically increasing (i.e. no flat regions), then we define the inverse CDF \( F^{-1}(q) \), or quantile function, as the unique \( x \) such that \( F(x) = q \).
Example: CDF and inverse CDF of an Exponential(1) random variable.
But what if \( F(x) \) has flat regions? Then we define
\[ F_X^{-1}(q) = \inf \left\{ x: F(x) \geq q \right\} \]
If you've never seen \( \inf \) (infimum) before, just think of it as \( \min \). In words, this equation says: \( F^{-1}(q) \) is the smallest \( x \) whose CDF value is at least \( q \).
Back to Binomial(N=2, p=0.5). What is \( F^{-1}(0.3) \)? \( F^{-1}(0.75) \)? \( F^{-1}(0.7501) \)?
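These are easy to check in R, whose quantile functions implement exactly this generalized inverse (a minimal sketch using qbinom()):

```r
# Quantile function (generalized inverse CDF) for Binomial(N = 2, p = 0.5)
qbinom(0.3,    size = 2, prob = 0.5)   # smallest x with F(x) >= 0.3    -> 1
qbinom(0.75,   size = 2, prob = 0.5)   # F(1) = 0.75, so F(1) >= 0.75   -> 1
qbinom(0.7501, size = 2, prob = 0.5)   # now we need F(x) >= 0.7501     -> 2
```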
A continuous random variable \( X \) has a normal distribution if its PDF takes the form
\[ f_X(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left\{ -\frac{1}{2 \sigma^2} (x - \mu)^2 \right\} \]
for parameters \( \mu \) (the mean) and \( \sigma^2 \) (the variance).
We write this as \( X \sim N(\mu, \sigma) \), where \( \sigma \) is the standard deviation.
One of the most useful distributions in all of probability and statistics!
Three different normals: \( (\mu=3, \sigma=2) \), \( (\mu=-2, \sigma=1) \), and \( (\mu=0, \sigma=0.5) \)
The standard normal distribution is the special case \( \mu = 0, \sigma = 1 \); its CDF is written \( \Phi(x) \). Here's a picture of \( \Phi(x) \):
A couple of useful critical values: if \( Z \) is standard normal, then \( P(-1.96 \leq Z \leq 1.96) \approx 0.95 \) and \( P(-2.58 \leq Z \leq 2.58) \approx 0.99 \).
If \( X \sim N(\mu, \sigma) \), then we can obtain the CDF of \( X \) from the CDF of a standard normal:
\[ \begin{aligned} F_X(x) &= P(X \leq x) \\ &= P\left( \frac{X-\mu}{\sigma} \leq \frac{x-\mu}{\sigma} \right) \\ &= P\left( Z \leq \frac{x-\mu}{\sigma} \right) \\ &= \Phi \left( \frac{x-\mu}{\sigma} \right) \\ \end{aligned} \]
where \( Z \) is a standard normal random variable.
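A quick numerical check of this standardization trick (a sketch; the values \( \mu = 3 \), \( \sigma = 2 \), \( x = 4 \) are arbitrary, and pnorm() takes the standard deviation as its sd argument):

```r
# CDF of X ~ N(mu = 3, sigma = 2) at x = 4, computed two ways
mu <- 3; sigma <- 2; x <- 4
pnorm(x, mean = mu, sd = sigma)     # F_X(x) directly
pnorm((x - mu) / sigma)             # Phi((x - mu) / sigma): same answer

# Standard-normal critical values
pnorm(1.96) - pnorm(-1.96)          # approximately 0.95
qnorm(0.975)                        # approximately 1.96
```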
The normal is used everywhere:
Sometimes sensibly, sometimes inappropriately!
The normal distribution is also super important in statistical inference because of something called the Central Limit Theorem.
Very roughly, the central limit theorem says that averages of many independent measurements tend to look normally distributed, no matter what distribution the individual measurements have.
The CLT is one of the deepest and most useful insights in the history of mathematics!
It took over 80 years to work out properly, from de Moivre (1718) to Gauss and Laplace (early 1800s).
It's important in data science because we take averages a lot… and it's therefore important in the real world because so many decisions are made using data!
We'll cover this later.
Suppose that at age 25, you put \( W_0 = \) $10,000 in the S&P 500, expecting to withdraw it forty years later, when you're 65.
Let \( W_{40} \) be the value of your investment after 40 years. What should we expect \( W_{40} \) look like?
Classic assumption: suppose that average returns on the stock market, net of inflation, are \( r \in (0, 1) \). Here \( r \) is like an interest rate: that is, if \( r = 0.07 \) you average 7% returns a year, and so on.
Under this assumption we can write \( W_{40} \) in terms of the formula for compound interest:
\[ \begin{aligned} W_{40} &= W_0 \cdot (1 + r) \cdot (1+r) \cdots (1+r) \quad \mbox{(40 times)} \\ &= W_0 \cdot (1 + r)^{40} \end{aligned} \]
So if \( r = 0.07 \), then
\[ W_{40} = 10000 \cdot (1 + 0.07)^{40} = 149744.6 \]
What's wrong with this picture?
At least two things are wrong:
Problem 1: the returns fluctuate from year to year. Writing \( R_i \) for the (random) return in year \( i \), we have
\[ \begin{aligned} W_{40} &= W_0 \cdot (1 + R_1) \cdot (1+R_2) \cdots (1+R_{40}) \\ &= W_0 \cdot \prod_{i=1}^{40} (1 + R_i) \end{aligned} \]
Problem 2: we don't know what the returns will be!
Let's dive into normal_example.R on the class website.
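Here's a minimal simulation sketch in the same spirit (not necessarily identical to normal_example.R); the assumption that yearly returns are independent normal draws with mean 0.07 and standard deviation 0.18 is purely illustrative.

```r
# Simulate W_40 = W_0 * (1 + R_1)(1 + R_2)...(1 + R_40) with random yearly returns
set.seed(1)
W0   <- 10000
nsim <- 10000
sim_one <- function() {
  R <- rnorm(40, mean = 0.07, sd = 0.18)   # illustrative assumption about returns
  W0 * prod(1 + R)
}
W40 <- replicate(nsim, sim_one())
mean(W40)                        # compare with the deterministic answer, 149744.6
quantile(W40, c(0.1, 0.5, 0.9))  # the spread of outcomes is large
hist(W40, breaks = 50)           # strongly right-skewed, not a sure thing
```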
Suppose \( X \) is a random variable with known probability distribution.
Now we define a new random variable \( Y = g(X) \) for some known, fixed \( g \). For example:
A key question is: what can we say about \( Y \) based on what we know about \( X \)?
We'll first focus on summaries: \( E(Y) \) and \( \mbox{var}(Y) \). In general, expectations and transformations do not commute: that is, \( E(g(X)) \) need not equal \( g(E(X)) \).
Then we'll ask: how can we get a full probability distribution for \( Y \), based on the probability distribution for \( X \)?
Linear functions are the one case where the rules for expectation and variance are easy. Suppose that \( X \) is some random variable, and that \( Y = aX + b \) for constants \( a \) and \( b \).
Then
\[ \begin{aligned} E(Y) &= E(aX + b) = a E(X) + b \quad \mbox{(Linearity of expectation)} \\ \mbox{var}(Y) &= a^2 \ \mbox{var}(X) \\ \mbox{sd}(Y) &= |a| \ \mbox{sd}(X) \end{aligned} \]
Let's prove this on the board using the definition of expectation.
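Alongside the proof, here's a quick Monte Carlo sanity check (a sketch; the choices \( X \sim \) Uniform(0, 2), \( a = 3 \), \( b = 1 \) are arbitrary):

```r
# Monte Carlo check: E(aX + b) = a E(X) + b and var(aX + b) = a^2 var(X)
set.seed(1)
X <- runif(100000, min = 0, max = 2)   # E(X) = 1, var(X) = 1/3
a <- 3; b <- 1
Y <- a * X + b
mean(Y)    # theory: a * 1 + b = 4
var(Y)     # theory: a^2 * (1/3) = 3
sd(Y)      # theory: |a| * sqrt(1/3), about 1.73
```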
For nonlinear functions, things are not as nice. To calculate \( E(Y) \), we must go back to the PMF/PDF. In words: the expectation of \( g(X) \) is the weighted average outcome for \( g(X) \), weighted by the probabilities.
If \( X \) is discrete with PMF \( p_X(x) \):
\[ E(Y) = E(g(X)) = \sum_{x \in \mathcal{X}} g(x) \ p_X(x) \]
And if \( X \) is continuous with PDF \( f_X(x) \):
\[ E(Y) = E(g(X)) = \int_{x \in \mathcal{X}} g(x) \ f_X(x) \ dx \]
This rule makes sense, intuitively. Suppose you play a game in Vegas where you draw \( X \) randomly from some distribution, and the casino pays you \( Y = g(X) \).
Your expected payoff is \( g(x) \), times the chance that \( X = x \), summed or integrated over all possible values \( x \).
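As a small worked check (a sketch; the payoff \( g(x) = x^2 \) and \( X \sim \) Uniform(0, 2) are illustrative choices, not from the notes):

```r
# E(g(X)) for g(x) = x^2 and X ~ Uniform(0, 2), by integration and by simulation
set.seed(1)
g <- function(x) x^2
integrate(function(x) g(x) * dunif(x, min = 0, max = 2), lower = 0, upper = 2)$value  # 4/3
mean(g(runif(100000, min = 0, max = 2)))                                              # close to 4/3
# By contrast, g(E(X)) = 1^2 = 1: transformations and expectations do not commute.
```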
OK, what about characterizing the full distribution for \( Y = g(X) \)?
If \( X \) is discrete, things are easy. If \( Y = g(X) \), then the PMF of \( Y \) can be obtained directly from the PMF of \( X \):
\[ p_Y(y) = \sum_{x: g(x) = y} p_X(x) \]
In words: to obtain \( p_Y(y) \) we add the probabilities of all values of \( x \) such that \( g(x) = y \).
The continuous case is harder. There are three steps for finding the PDF \( f_{Y}(y) \) of a transformation \( Y = g(X) \).
First, for each \( y \), identify the set \( A_y = \{ x : g(x) \leq y \} \).
Then find the CDF:
\[ \begin{aligned} F_Y(y) = P(Y \leq y) &= P(g(X) \leq y) \\ &= P(X \in A_y) \\ &= \int_{x \in A_y} f_X(x) \ dx \end{aligned} \]
Finally, compute the PDF as \( f_Y(y) = F_Y'(y) \).
What we know: a factory produces oil drills whose diameters average 1 ft, but that have a bit of manufacturing variance.
What we really care about: oil flows through the resulting borehole at a rate proportional to its cross-sectional area: \( Y = \pi \cdot (X/2)^2 = (\pi/4) \cdot X^2 \).
What is the probability density of \( Y \)? Let's work this together on the board and in wellbore.R.
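Here's a minimal sketch of how that computation might look (not necessarily what wellbore.R does); since the notes don't pin down the diameter distribution beyond "average 1 ft with a bit of variance," the choice \( X \sim N(1, 0.02) \) below is purely an illustrative assumption.

```r
# A sketch of the wellbore calculation (not necessarily what wellbore.R does)
# Illustrative assumption: diameters X ~ Normal(mean = 1, sd = 0.02), in feet
set.seed(1)
X <- rnorm(100000, mean = 1, sd = 0.02)
Y <- (pi / 4) * X^2                    # cross-sectional area

# CDF method: for y > 0, F_Y(y) = P(X <= sqrt(4 * y / pi)), since diameters are positive here
y <- 0.80
mean(Y <= y)                                     # empirical CDF at y
pnorm(sqrt(4 * y / pi), mean = 1, sd = 0.02)     # CDF-method answer, should match closely
hist(Y, breaks = 50)                             # a look at the density of Y
```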