Reference: Bertsekas Chapter 2.1, 2.2, 2.4
Suppose we have an uncertain outcome with sample space \( \Omega \).
A random variable \( X \) is a real-valued function of the uncertain outcome. That is, it maps each element \( \omega \in \Omega \) to a real number.
Simple mantra: a random variable is a numerical summary of some uncertain outcome.
Suppose you roll two dice (an uncertain outcome). An example of a random variable is the maximum of the two rolls:
Suppose you're trying to predict the number of no-shows on a flight:
Note: different \( \omega \)'s can map to the same number.
Starting with some random outcome with sample space \( \Omega \):
A random variable is discrete if its range, i.e. the set of values it can take, is a finite or countably infinite set.
Both our examples (dice, airline no-shows) were discrete random variables: you can count the outcomes on your fingers and toes. Something that can take on a continuous range of values (e.g. temperature, speed) is called a continuous random variable.
For now, we will focus exclusively on discrete random variables. (The math and notation for continuous random variables is more fiddly, but the concepts are the same.)
Suppose that \( X \) is a discrete random variable whose possible outcomes are some set \( \mathcal{X} \). Then the probability mass function of \( X \) is
\[ p_X(x) = P(\{X = x\}) \]
We always use \( X \) for the random variable, and \( x \) for some possible outcome.
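To make this concrete, here's a short R sketch (my own, not from the notes) that builds the PMF of the dice-maximum example from earlier by brute-force enumeration of the 36 equally likely outcomes:

```r
# PMF of X = max of two dice, by enumerating the 36 equally likely outcomes.
omega <- expand.grid(die1 = 1:6, die2 = 1:6)  # the sample space Omega
x <- pmax(omega$die1, omega$die2)             # X(omega) for each outcome
table(x) / nrow(omega)                        # p_X(x): 1/36, 3/36, ..., 11/36
```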
Facts about PMFs:
Dan, my friend from grad school, has just pulled up to his new house after a long cross country drive.
But he discovers that the movers have buggered off and left all his furniture and boxes sitting in the front yard. What a mess!
He decides to ask his new neighbors for some help getting his stuff indoors.
Assuming his neighbors are the kindly type, how many pairs of hands might come to his aid?
Let's use a capital \( X \) to denote the (unknown) size of the household next door.
We'll let little \( x \) be some specific possible outcome.
Table: a probability mass function \( p_X(x) \), taken from census data.
| Size of household x | Probability, P(X = x) |
|---|---|
| 1 | 0.280 |
| 2 | 0.336 |
| 3 | 0.155 |
| 4 | 0.132 |
| 5 | 0.060 |
| 6 | 0.023 |
| 7 | 0.011 |
| 8 | 0.003 |
This is easier to visualize in a bar graph:
This probability mass function provides a complete representation of your uncertainty in this situation. It has all the key features of any probability distribution:
Most probability distributions won't be this simple, but they all have these same three features.
When you knock on the door, how many people do you “expect” to be living next door?
The expected value or expectation of a random variable \( X \) is the weighted average of the possible outcomes, where the weight on each outcome is its probability.
| Size x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Probability, P(X = x) | 0.280 | 0.336 | 0.155 | 0.132 | 0.060 | 0.023 | 0.011 | 0.003 |
\[ E(X) = (0.280) \cdot 1 + (0.336) \cdot 2 + \cdots + (0.011) \cdot 7 + (0.003) \cdot 8 \approx 2.5 \, . \]
The more likely numbers (e.g. 1 and 2) get higher weights than \( 1/8 \), while the unlikely numbers (e.g. 7 and 8) get lower weights.
This differs from the ordinary average! For our family-size example this would be:
\[ \mbox{Ordinary average} = \frac{1}{8} \cdot 1 + \frac{1}{8} \cdot 2 + \cdots + \frac{1}{8} \cdot 7 + \frac{1}{8} \cdot 8 = 4.5 \, . \]
Here, the weight on each number in the sample space is \( 1/8 = 0.125 \), since there are 8 numbers.
This is not the expected value; it gives each number in the sample space an equal weight, ignoring the fact that these numbers have different probabilities.
This example conveys something important about expected values. Even if the world is black and white, an expected value is often grey:
Even if the underlying outcomes are all whole numbers, the expected value doesn't have to be.
Suppose that the possible outcomes for a random variable \( X \) are the numbers \( x_1, \ldots, x_N \) with probabilities \( P(X = x_i) \).
The formal definition for the expected value of \( X \) is
\[ E(X) = \sum_{i=1}^N P(X = x_i) \cdot x_i = \sum_{x \in \mathcal{X}} p_X(x) \cdot x \, . \]
This measures the “center” or mean of the probability distribution.
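Here's the same calculation in R for the family-size PMF above (the vector names are mine):

```r
# Expected value of the family-size PMF, via E(X) = sum of x * p_X(x).
x <- 1:8
p <- c(0.280, 0.336, 0.155, 0.132, 0.060, 0.023, 0.011, 0.003)
sum(x * p)  # about 2.48, which rounds to the 2.5 reported above
```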
A nice interpretation here is that the expectation is like the “center of gravity” of a probability distribution. Arrange each outcome \( x_i \) along the real line, and place a mass equal to \( P(X = x_i) \) at each point. The expected value is the center of gravity, i.e. where the masses would balance like on a see-saw:
You may have heard the expected value explained intuitively in terms of a limiting long-run average:
Figure: three simulated long-run sequences of coin flips.
It goes kinda like this:
Suppose we were to observe a large number of random variables with the same probability distribution \( P \). The expected value of \( P \) is the limiting long-run average of the observed outcomes.
This is 100% true! But it is not the definition of the expected value.
A related concept is the variance, which measures the dispersion or spread of a probability distribution.
It is the expected (squared) deviation of a random variable from its mean:
\[ \mbox{var}(X) = E \big( \{X-E(X)\}^2 \big) \]
The standard deviation of a probability distribution is \( \sigma = \mbox{sd}(X) = \sqrt{\mbox{var}(X)} \). The standard deviation is more interpretable than the variance, because it has the same units (dollars, miles, etc.) as the random variable itself.
Again suppose that there are \( N \) possible outcomes \( x_1, x_2, \ldots, x_N \). Then, just applying the definition of expected value, we see that:
\[ \begin{aligned} \mbox{var}(X) &= E \big( \{X-E(X)\}^2 \big) \\ &= \sum_{i=1}^N P(X = x_i) \cdot (x_i - \mu)^2 \end{aligned} \]
where \( \mu = E(X) \) is the expected value of the random variable.
Your turn: try calculating the variance for our family-size random variable:
| Size x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Probability, P(X = x) | 0.280 | 0.336 | 0.155 | 0.132 | 0.060 | 0.023 | 0.011 | 0.003 |
Tips:
This might be the only time in your life you calculate a variance explicitly by hand!
Here's a Google Sheet where I've worked it out.
I get:
A useful identity for the variance of a random variable is
\[ \begin{aligned} \mbox{var}(X) &= E \big( \{X-E(X)\}^2 \big) \\ & = E(X^2) - E(X)^2\, . \end{aligned} \]
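A quick R check that the two formulas agree for the family-size distribution (the numerical answer here is my own arithmetic; the author's worked answer is in the Google Sheet):

```r
# Variance of the family-size distribution, computed two equivalent ways.
x  <- 1:8
p  <- c(0.280, 0.336, 0.155, 0.132, 0.060, 0.023, 0.011, 0.003)
mu <- sum(x * p)                 # E(X), about 2.48
sum(p * (x - mu)^2)              # definition: E[(X - E(X))^2], about 2.02
sum(p * x^2) - mu^2              # identity: E(X^2) - E(X)^2, same answer
```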
A probability model is a stylized description of a real-world system in terms of random variables. To build one:
Step 3—specify a probability distribution for the random variable(s) of interest—is usually the hardest one.
In fact, for most scenarios, if we had to build such a rule from scratch, we'd be in for an awful lot of careful, tedious work. Imagine trying to list, one by one, the probabilities for all possible outcomes of a soccer game.
Thus instead of building probability distributions from scratch, we will rely on a simplification called a parametric probability model.
A parametric probability distribution, or parametric model, is a probability distribution that can be completely described using a relatively small set of numbers.
These numbers are called the parameters of the distribution.
Lots of parametric models have been invented for specific purposes:
Suppose \( X \) is a discrete random variable with possible outcomes \( x \in \mathcal{X} \). In a parametric model, the PMF of \( X \) takes the form
\[ P(X = x) = f(x; \theta) \, , \]
where \( \theta \) is the parameter (or parameters) of the model.
The parameter \( \theta \) completely specifies the PMF. Contrast this with the family-size example, where we had to write out the PMF explicitly for each possible outcome.
This is all pretty abstract! Let's see a specific example: the Poisson distribution.
The Poisson distribution models the number of “events” that occur in a pre-specified interval (e.g. time or space):
For each uncertain outcome, we identify the random variable \( X \) as the total number of events that occur in the given interval. The Poisson distribution assumes that:
\[ P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda} \, , \quad k = 0, 1, 2, \ldots \]
where \( \lambda > 0 \) is the rate parameter: the expected number of events in the interval.
Suppose that Arsenal averages 1.6 goals per game, while Manchester United averages 1.3 goals per game.
To answer these questions, we need a probability model. Let's try Poisson!
Let \( X_{A} \) be the number of Arsenal goals.
Similarly, let \( X_{M} \) be the number of Man U goals.
Our model sets the rate parameter for each team's Poisson distribution to match its average scoring rate across the season: \( X_A \sim \) Poisson(1.6) and \( X_M \sim \) Poisson(1.3).
What is \( P(X_A = 1) \)? Remember that if \( X \sim \) Poisson(\( \lambda \)), then
\[ P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda} \]
So
\[ P(X_A = 1) = \frac{1.6^1}{1!} e^{-1.6} \approx 0.323 \]
What is \( P(X_M = 3) \)? You try!
\[ P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda} \]
I get
\[ P(X_M = 3) = \frac{1.3^3}{3!} e^{-1.3} \approx 0.1 \]
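In R, these are one-liners with the built-in `dpois()` function:

```r
# Poisson PMF calculations with R's built-in dpois(x, lambda).
dpois(1, lambda = 1.6)  # P(X_A = 1), about 0.323
dpois(3, lambda = 1.3)  # P(X_M = 3), about 0.100
```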
Suppose that \( X \) and \( Y \) are two discrete random variables. We say that \( X \) and \( Y \) are independent if
\[ P(X=x, Y=y) = P(X=x) \cdot P(Y=y) \]
for all possible outcomes \( x \) and \( y \).
All we're saying here is that the events \( \{X=x\} \) and \( \{Y=y\} \) are independent events.
Just as before, independence is often something we assume to make the math simpler.
How likely is it that Arsenal beats Man U by a score of 2-1?
For this, we need another assumption. Let's say that \( X_A \) and \( X_M \) are independent random variables: that is,
\[ P(X_A = j, X_M = k) = P(X_A = j) \cdot P(X_M = k) \]
for all \( j, k \).
Basically, we're saying that Arsenal's score doesn't affect Man U's score, and vice versa. (This assumption might not be exactly true, but doesn't seem crazy.)
We need \( P(X_A = 2, X_M = 1) \), where \( X_A \sim \) Poisson(1.6) and \( X_M \sim \) Poisson(1.3).
Under independence, we know that
\[ \begin{aligned} P(X_A = 2, X_M = 1) &= P(X_A = 2) \cdot P(X_M = 1) \\ &= \left[ \frac{1.6^2}{2!} e^{-1.6} \right] \cdot \left[ \frac{1.3^1}{1!} e^{-1.3} \right] \\ &\approx 0.2584 \cdot 0.354 \\ & \approx 0.0916 \end{aligned} \]
About a 9% chance of a 2-1 Arsenal victory.
Here's another Google Sheet where I've worked out all scores up to 8-8 under these assumptions.
Your turn! What is:
To modify this sheet, you can use “File > Make a Copy” or download my Google sheet as an Excel file.
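If you'd rather work in R than a spreadsheet, here's a sketch of the same calculation (my own code, not the course's sheet or script):

```r
# Joint probabilities for every scoreline up to 8-8, assuming independent
# Poisson goal counts for the two teams (same assumptions as above).
lambda_A <- 1.6                    # Arsenal's scoring rate
lambda_M <- 1.3                    # Man U's scoring rate
goals <- 0:8
score_probs <- outer(dpois(goals, lambda_A), dpois(goals, lambda_M))
# Rows index Arsenal's score (0-8), columns index Man U's score (0-8).
score_probs[3, 2]                         # P(X_A = 2, X_M = 1), about 0.092
sum(score_probs[lower.tri(score_probs)])  # P(Arsenal wins), among scores <= 8
```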
Our second example of a parametric probability model is the Bernoulli distribution.
It's really simple! Imagine some uncertain outcome with only two possibilities:
To define a Bernoulli random variable \( B \) (or “Bernoulli trial”), we associate the number 0 with the “no” outcome, and the number 1 with the “yes” outcome:
In each case, we let \( p \) denote the probability of the “yes” (1) outcome. What counts as the “yes” outcome will be context dependent.
The PMF of a Bernoulli random variable is pretty basic:
\[ p_B(b) = \left\{ \begin{array}{rr} p & \mbox{if } b=1 \\ 1-p & \mbox{if } b=0 \\ 0 & \mbox{otherwise.} \end{array} \right. \]
We write this as \( B \sim \mbox{Bern}(p) \). The only parameter is \( p \): a number between 0 and 1.
The mean and variance of a Bernoulli are:
\[ \begin{aligned} E(B) &= 1 \cdot p + 0 \cdot (1-p) \\ &= p \\ \\ \mbox{var}(B) &= E(B^2) - E(B)^2 \\ &= \left[1^2 \cdot p + 0^2 \cdot (1-p) \right] - p^2 \\ &= p - p^2 \\ &= p(1-p) \end{aligned} \]
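A quick simulation sketch in R confirms these formulas (the value \( p = 0.3 \) is an arbitrary choice of mine):

```r
# Simulate many Bernoulli(p) draws and compare to the theoretical mean/variance.
set.seed(1)
p <- 0.3
b <- rbinom(100000, size = 1, prob = p)  # 100,000 Bernoulli trials
mean(b)  # close to p = 0.3
var(b)   # close to p * (1 - p) = 0.21
```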
Pretty simple! Bernoulli random variables are much more interesting when we combine them!
Our third example of a parametric probability model is the binomial distribution. Suppose that we observe the results of \( N \) Bernoulli random variables.
We are interested in the total number of “yes” (1) outcomes.
The binomial distribution describes just this situation. It has some pretty specific assumptions:
Let \( X \) be the number of 1's in the N-trial sequence. We say that \( X \) has a binomial distribution with parameters \( N \) and \( p \): \( X \sim \mbox{Binomial}(N, p) \).
Suppose \( X \sim \mbox{Binomial}(N=2, p=0.5) \): a sequence of \( N=2 \) Bernoulli trials, with \( p=0.5 \). (Like two coin flips.) Counting the four equally likely sequences (00, 01, 10, 11) gives \( P(X=0) = 1/4 \), \( P(X=1) = 1/2 \), and \( P(X=2) = 1/4 \).
In general, the PMF of a binomial distribution looks like this:
\[ p_X(k) = \binom{N}{k} p^k (1-p)^{N-k} \; , \quad \binom{N}{k} = \frac{N!}{k! (N-k)!} \]
\( \binom{N}{k} \) is the binomial coefficient, read aloud as “N choose k.”
Facts (to be proven later):
Where does this PMF come from? By counting.
If you count carefully, you find that there are
\[ \binom{N}{k} = \frac{N!}{k! (N-k)!} \]
binary sequences of length \( N \) with exactly \( k \) 1's in them. That's how we get the leading term in the binomial PMF.
I won't bore you with the full counting argument (see Wikipedia), but here are two small examples.
The \( \binom{4}{2} = 6 \) binary sequences of length 4 with exactly two 1's:

1100, 1010, 0110, 1001, 0101, 0011

And the \( \binom{6}{4} = 15 \) binary sequences of length 6 with exactly four 1's:

111100, 111010, 110110, 101110, 011110, 111001, 110101, 101101, 011101, 110011, 101011, 011011, 100111, 010111, 001111
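R's built-in `choose()` function does this counting for us; for instance, it reproduces the two lists above:

```r
choose(4, 2)  # 6 length-4 sequences with exactly two 1's
choose(6, 4)  # 15 length-6 sequences with exactly four 1's
```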
Let's use the binomial distribution as a probability model for our earlier example on airline no-shows.
Under these assumptions, \( X \sim \) Binomial(N=140, p=0.09), with PMF
\[ p_X(k) = \binom{140}{k} \ (0.09)^k \ (1-0.09)^{140-k} \, . \]
Your turn! Suppose that:
How would we calculate this probability?
To get a seat, you need 12 people to no-show (5 oversold seats + 7 standby passengers). To calculate \( P(X \geq 12) \), we add up all the probabilities for \( X=12 \), \( X=13 \), \( X=14 \), all the way up to \( X=140 \).
This is called an “upper tail area” of the PMF.
\[ P(X \geq 12) = \sum_{k=12}^{140} p_X(k) = \sum_{k=12}^{140} \binom{140}{k} \ (0.09)^k \ (1-0.09)^{140-k} \]
You can eyeball it reasonably well from the graph of the PMF:
\[ P(X \geq 12) = \sum_{k=12}^{140} p_X(k) \approx 0.613 \]
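In R, `pbinom()` gives this tail area directly (a quick sketch under the Binomial(140, 0.09) assumption above):

```r
# Upper tail area P(X >= 12) for X ~ Binomial(140, 0.09).
1 - pbinom(11, size = 140, prob = 0.09)                   # about 0.61
pbinom(11, size = 140, prob = 0.09, lower.tail = FALSE)   # equivalent form
```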
Real airlines use much more complicated models:
The binomial model cannot incorporate these (very real) effects. This approximation trades away flexibility for simplicity:
So to draw conclusions from a parametric probability model, we should ask two follow-up questions:
This second answer will always be context dependent:
Let's dig in to poisson_binomial.R on the class website to see how we can use R to do calculations with the Poisson and binomial distributions.
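I won't reproduce poisson_binomial.R here, but the standard built-in R functions for this kind of calculation are the d/p/r families for each distribution:

```r
# Standard built-in R functions for the two distributions (not the class script).
dpois(1, lambda = 1.6)                       # Poisson PMF: P(X = 1)
ppois(3, lambda = 1.6)                       # Poisson CDF: P(X <= 3)
rpois(5, lambda = 1.6)                       # 5 simulated Poisson draws
dbinom(12, size = 140, prob = 0.09)          # binomial PMF: P(X = 12)
pbinom(11, size = 140, prob = 0.09, lower.tail = FALSE)  # P(X >= 12)
rbinom(5, size = 140, prob = 0.09)           # 5 simulated binomial draws
```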