STA 380
Knowing some basic probability is like a data-science secret weapon!
If \( A \) denotes some event, then \( P(A) \) is the probability that this event occurs.
Some probabilities are estimated from direct experience over the long run:
Others are synthesized from our best judgments about unique events:
A conditional probability is the chance that one thing happens, given that some other thing has already happened.
A great example is a weather forecast: if you look outside this morning and see gathering clouds, you might assume that rain is likely and carry an umbrella.
We express this judgment as a conditional probability: e.g. “the conditional probability of rain this afternoon, given clouds this morning, is 60%.”
In stats we write this a bit more compactly:
\( P(A \mid B) \): “the probability of A, given that B occurs.”
Conditional probabilities are how we express judgments in a way that reflects our partial knowledge.
A really important fact is that conditional probabilities are not symmetric:
\[ P(A \mid B) \neq P(B \mid A) \]
As a quick counter-example, let \( A \) be the event that a randomly chosen person can dribble a basketball, and let \( B \) be the event that this person plays in the NBA.
Clearly \( P(A \mid B) = 1 \): every NBA player can dribble a basketball.
But \( P(B \mid A) \) is nearly zero!
“Obey my rules, filthy capitalists.”
Consider an uncertain outcome, where \( \Omega \) is the set of all possible outcomes. “Probability” \( P(\cdot) \) is a set function that maps events (subsets of \( \Omega \)) to the real numbers, such that:
OK, so how do we actually calculate probabilities?
The “probability calculus” provides a set of rules for calculating probabilities. These aren't axioms: they can be derived from Kolmogorov's axioms.
\( P(A^C) = 1 - P(A) \)
(Why? Because \( A \cup A^C = \Omega \), and \( P(\Omega) = 1 \).)
If \( A \subset B \), then \( P(A) \leq P(B) \).
(Why? Write \( B \) as \( B = A \cup (B \setminus A) \) and use finite additivity.)
\( P(B \setminus A) = P(B) - P(A \cap B) \).
Addition rule: \( P(A \cup B) = P(A) + P(B) - P(A \cap B) \).
Someone deals you a five-card poker hand. What is the probability of either a straight (five cards in a row, e.g. 3-4-5-6-7) or a flush (all cards the same suit)?
Note: these aren't mutually exclusive, since you might draw a hand that is both a straight AND a flush (e.g. 5-6-7-8-9 of clubs).
If all 2,598,960 possible poker hands are equally likely, then by the addition rule:
\[ \begin{aligned} P(\mbox{straight or flush}) &= P(\mbox{straight}) + P(\mbox{flush}) - P(\mbox{straight AND flush}) \\ &= 0.00392465 + 0.00198079 - 0.0000153908 \\ &= 0.005890049 \end{aligned} \]
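A quick Monte Carlo sketch (here in Python) is a handy way to sanity-check numbers like these. The code below counts a straight as five consecutive ranks, with the ace allowed to be high or low:

```python
import random

RANKS = list(range(2, 15))   # 2 through 14, where 14 = ace
SUITS = "CDHS"
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]

def is_flush(hand):
    return len({suit for _, suit in hand}) == 1

def is_straight(hand):
    ranks = sorted(rank for rank, _ in hand)
    if len(set(ranks)) != 5:
        return False
    # five consecutive ranks, or the ace-low straight A-2-3-4-5
    return ranks[-1] - ranks[0] == 4 or ranks == [2, 3, 4, 5, 14]

def estimate_straight_or_flush(n_sims=200_000, seed=42):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        hand = rng.sample(DECK, 5)
        if is_flush(hand) or is_straight(hand):
            hits += 1
    return hits / n_sims

print(estimate_straight_or_flush())   # roughly 0.006, in line with the answer above
```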
We've met Kolmogorov's three axioms, together with several rules we can derive from these axioms.
There's one final axiom for conditional probability, often called the multiplication rule. Let \( P(A, B) = P(A \cap B) \) be the joint probability that both \( A \) and \( B \) happen. Then:
\begin{equation} \label{eqn:conditional_probability} P(A \mid B) = \frac{P(A, B)}{P(B)} \, . \end{equation}
Or equivalently:
\[ P(A, B) = P(A \mid B) \cdot P(B) \, . \]
This is an axiom: it cannot be proven from Kolmogorov's rules.
Let's see why this axiom makes sense. (Figure courtesy of David Spiegelhalter and Jenny Gage.)
Suppose a woman goes for regular screening (left branch). What is \( P(\mbox{survive} \mid \mbox{cancer}) \)?
We get the same answer using the rule for conditional probabilities:
\[ P(S \mid C) = \frac{P(S, C)}{P(C)} = \frac{12/200}{15/200} = \frac{12}{15} = 0.8 \, . \]
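The same arithmetic, written out in a few lines of Python using the counts behind those fractions (200 women in this branch, 15 of whom get cancer, 12 of whom get cancer and survive):

```python
# P(survive | cancer) from the counts behind the tree: 12/200 divided by 15/200.
n_women = 200              # women in the regular-screening branch
n_cancer = 15              # of those, how many get cancer
n_cancer_and_survive = 12  # of those, how many get cancer and survive

p_C = n_cancer / n_women                     # P(C)
p_S_and_C = n_cancer_and_survive / n_women   # P(S, C)

print(round(p_S_and_C / p_C, 3))             # P(S | C) = 0.8
```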
Suppose that you're designing the movie-recommendation algorithm for Netflix, and you have access to the entire Netflix database, showing which customers have liked which films.
Your goal is to leverage this vast data resource to make automated, personalized movie recommendations.
You decide to start with an easy case: assessing how probable it is that a user will like the film Saving Private Ryan (event \( A \)), given that the same user has liked the HBO series Band of Brothers (event \( B \)).
This is almost certainly a good bet: both are epic war dramas about the Normandy invasion and its aftermath. Therefore, you might think: job done! Recommend away.
But keep in mind that you want to be able to do this kind of thing automatically.
Key insight: frame the problem in terms of conditional probability. Suppose we learn that Linda liked film \( B \), but hasn't yet seen film \( A \).
Conditional probabilities hold the key to understanding individualized preferences. So how can we learn \( P(\mbox{likes A} \mid \mbox{likes B}) \)?
Solution: go to the data! Suppose your database of 5 million subscribers like Linda reveals the following pattern: 3.5 million of them liked Band of Brothers, and of those, 2.8 million also liked Saving Private Ryan. Then
\[ \begin{aligned} P(\mbox{liked Saving Private Ryan} \mid \mbox{liked Band of Brothers}) &= \frac{2.8 \mbox{ million}}{3.5 \mbox{ million}} \\ &= 0.8 \, . \end{aligned} \]
Result: a good recommendation with no human in the loop.
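Here's a sketch of how a calculation like this could run automatically; the data structure and the five toy users are made up, but the counting is exactly the same as above:

```python
# Estimate P(likes A | likes B) by counting, just like the calculation above.
def conditional_like_prob(likes_by_user, title_a, title_b):
    liked_b = [user for user, liked in likes_by_user.items() if title_b in liked]
    if not liked_b:
        return None
    liked_both = [user for user in liked_b if title_a in likes_by_user[user]]
    return len(liked_both) / len(liked_b)

# Made-up toy database: five users instead of five million.
likes_by_user = {
    "linda":  {"Band of Brothers"},
    "user_2": {"Band of Brothers", "Saving Private Ryan"},
    "user_3": {"Band of Brothers", "Saving Private Ryan"},
    "user_4": {"Saving Private Ryan"},
    "user_5": set(),
}

print(conditional_like_prob(likes_by_user, "Saving Private Ryan", "Band of Brothers"))
# 2 of the 3 users who liked Band of Brothers also liked Saving Private Ryan: about 0.67
```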
Many companies do the same:
The digital economy runs on conditional probability!
Consider the following data on complication rates at a maternity hospital in Cambridge, England:
| | Easier deliveries | Harder deliveries | Overall |
|---|---|---|---|
| Senior doctors | 0.052 | 0.127 | 0.076 |
| Junior doctors | 0.067 | 0.155 | 0.072 |
Would you rather have a junior or senior doctor?
Simpson's paradox. Senior doctors have lower complication rates on both easier and harder deliveries, yet a higher complication rate overall.
Let's see the table with number of deliveries performed (in parentheses):
| | Easier deliveries | Harder deliveries | Overall |
|---|---|---|---|
| Senior doctors | 0.052 (213) | 0.127 (102) | 0.076 (315) |
| Junior doctors | 0.067 (3169) | 0.155 (206) | 0.072 (3375) |
Now we see what's going on: senior doctors handle a much larger share of the harder deliveries (102 of their 315 deliveries, about 32%, versus 206 of 3375, about 6%, for junior doctors).
It turns out the math of Simpson's paradox can be understood a lot more deeply in terms of something called the rule of total probability, or the mixture rule.
This rule sounds fancy, but is actually quite simple.
It says to divide and conquer: the probability of any event is the sum of the probabilities for all the different ways in which the event can happen. Really just Kolmogorov's third rule in disguise!
Let's see this rule in action for the hospital data.
There are two types of deliveries: easy and hard. So:
\[ P(\mbox{complication}) = P(\mbox{easy and complication}) + P(\mbox{hard and complication}) \, . \]
Now apply the rule for conditional probabilities to each joint probability on the right-hand side:
\[ \begin{aligned} P(\mbox{complication}) &= P(\mbox{easy}) \cdot P(\mbox{complication} \mid \mbox{easy}) \\ & + P(\mbox{hard}) \cdot P(\mbox{complication} \mid \mbox{hard}) \, . \end{aligned} \]
The rule of total probability says that the overall probability is a weighted average (a mixture) of the two conditional probabilities.
For senior doctors we get
\[ P(\mbox{complication}) = \frac{213}{315} \cdot 0.052 + \frac{102}{315} \cdot 0.127 = 0.076 \, . \]
And for junior doctors, we get
\[ P(\mbox{complication}) = \frac{3169}{3375} \cdot 0.067 + \frac{206}{3375} \cdot 0.155 = 0.072 \, . \]
This is a lower marginal or overall probability of a complication, even though junior doctors have higher conditional probabilities of a complication in all scenarios.
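Here's the same mixture calculation as a tiny Python function, using the counts and rates from the table:

```python
# Overall complication rate as a mixture of the per-category rates.
def overall_rate(counts_and_rates):
    """counts_and_rates: list of (number of deliveries, complication rate) pairs."""
    total = sum(n for n, _ in counts_and_rates)
    return sum((n / total) * rate for n, rate in counts_and_rates)

senior = [(213, 0.052), (102, 0.127)]    # (easier, harder)
junior = [(3169, 0.067), (206, 0.155)]

print(round(overall_rate(senior), 3))    # 0.076
print(round(overall_rate(junior), 3))    # 0.072
```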
Synonyms: overall probability = total probability = marginal probability
Here's the formal statement of the rule. Let \( \Omega \) be any sample space, and let \( B_1, B_2, \ldots, B_N \) be a partition of \( \Omega \)—that is, a set of events such that:
\[ P(B_i, B_j) = 0 \mbox{ for any $i \neq j$,} \quad \mbox{and} \quad \sum_{i=1}^N P(B_i) = 1 \, . \]
Now consider any event \( A \). Then
\[ P(A) = \sum_{i=1}^N P(A, B_i) = \sum_{i=1}^N P(B_i) \cdot P(A \mid B_i) \, . \]
Virginia Delaney-Black and her colleagues at Wayne State University gave an anonymous survey to teenagers in Detroit, asking about their drug use, and compared the teens' self-reports with biological measures of actual drug use.
Citation: V. Delaney-Black et al. “Just Say I Don't: Lack of Concordance Between Teen Report and Biological Measures of Drug Use.” Pediatrics 165:5, pp. 887-93 (2010)
The two sets of results were strikingly different.
And the parents lied, too:
Remember: the survey was anonymous, so the teens had no obvious reason to lie.
Yet a big fraction lied about their drug use anyway.
Drug surveys are really important:
Delaney-Black's study asks: can we trust any of it?
Here are some other things that, according to the research, people lie about in surveys.
Upshot: people lie in predictable ways for predictable reasons. This opens the door for survey designers to use a bit of probability, and a bit of psychology, to get at the truth.
Suppose that you want to learn about the prevalence of drug use among college students. Here's a cute trick that uses probability theory to mitigate someone's incentive to lie.
Suppose that, instead of asking people point-blank, you tell them:
Key fact here: only the respondent knows which question he or she is answering.
This reduces the incentive to lie.
Let's run the survey! Flip a coin, keep the result private, and then answer the question in public :-)
Notation: let \( Y \) be the event that a respondent answers “yes,” and let \( Q_1 \) and \( Q_2 \) be the events that the respondent (privately) ends up answering Question 1 or Question 2, respectively.
By the rule of total probability:
\[ \begin{aligned} P(Y) &= P(Y, Q_1) + P(Y, Q_2) \\ &= P(Q_1) \cdot P(Y \mid Q_1) + P(Q_2) \cdot P(Y \mid Q_2) \end{aligned} \]
\( P(Y) \) is a weighted average of the two conditional probabilities \( P(Y \mid Q_1) \) and \( P(Y \mid Q_2) \), with weights \( P(Q_1) \) and \( P(Q_2) \).
This equation has five probabilities in it. Four of them are known or directly observable: \( P(Y) \) is the observed fraction of “yes” answers, \( P(Q_1) \) and \( P(Q_2) \) are set by the coin flip, and \( P(Y \mid Q_1) \) is known from the design of Question 1. The one unknown is \( P(Y \mid Q_2) \), which is exactly what we want.
Let's solve for \( P(Y \mid Q_2) \):
\[ P(Y) = P(Q_1) \cdot P(Y \mid Q_1) + P(Q_2) \cdot \mathbf{P(Y \mid Q_2) } \]
So
\[ P(Y \mid Q_2) = \frac{P(Y) - P(Q_1) \cdot P(Y \mid Q_1)}{ P(Q_2) } \]
Let's plug in our numbers on the right-hand side and get an answer!
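The exact numbers depend on the survey, so for illustration assume a fair coin (so \( P(Q_1) = P(Q_2) = 0.5 \)), a Question 1 designed so that \( P(Y \mid Q_1) = 0.5 \), and an observed overall “yes” rate of 40%. Then the calculation is a one-liner in Python:

```python
# Solving the weighted-average equation for P(Y | Q2), with illustrative numbers.
p_q1 = 0.5              # coin lands heads: answer Question 1
p_q2 = 0.5              # coin lands tails: answer Question 2 (the sensitive one)
p_yes_given_q1 = 0.5    # known by the design of Question 1 (assumed here)
p_yes = 0.40            # assumed overall fraction of "yes" answers in the survey

p_yes_given_q2 = (p_yes - p_q1 * p_yes_given_q1) / p_q2
print(round(p_yes_given_q2, 2))   # 0.3
```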
Two events \( A \) and \( B \) are independent if
\[ P(A \mid B) = P(A \mid \mbox{not } B) = P(A) \]
In words: \( A \) and \( B \) convey no information about each other. Knowing whether \( B \) happened doesn't change the probability of \( A \), and vice versa.
So if \( A \) and \( B \) are independent, then \( P(A, B) = P(A) \cdot P(B) \).
Two events \( A \) and \( B \) are conditionally independent, given C, if
\[ P(A, B \mid C) = P(A \mid C) \cdot P(B \mid C) \]
\( A \) and \( B \) convey no information about each other, once we know C: \( P(A \mid B , C) = P(A \mid C) \).
Neither independence nor conditional independence implies the other.
Let's see an example. Alice and Brianna live next door to each other and both commute to work on the same metro line. Let \( A \) be the event that Alice is late for work, and \( B \) the event that Brianna is late for work.
A and B are dependent: if Brianna is late for work, we might infer that the metro line was delayed or that their neighborhood had bad weather. This means Alice is more likely to be late for work:
\[ P(A \mid B) > P(A) \]
Now let's add some additional information: let \( C \) be the event that the metro is running on time and the weather is clear.
A and B are conditionally independent, given C. If Brianna is late for work but we know that the metro is running on time and the weather is clear, then we don't really learn anything about Alice's commute:
\[ P(A \mid B, C) = P(A \mid C) \]
Same characters, different story: now let \( A \) be the event that Alice has blue eyes, and \( B \) the event that Brianna has blue eyes.
A and B are independent: Alice's eye color can't give us information about Brianna's.
Again, let's add some additional information: let \( C \) be the event that Alice and Brianna are sisters.
A and B are conditionally dependent, given C: if Alice has blue eyes, and we know that Brianna is her sister, then we know something about Brianna's genes. It is now more likely that Brianna has blue eyes.
Independence (or conditional independence) is often something we choose to assume for the purpose of making calculations easier. For example:
This works for more than two events. For example, Joe DiMaggio had a 56-game hitting streak in the 1941 baseball season. This was pretty unlikely! DiMaggio got a hit in about 80% of his games; if we treat the games as independent, then:
\[ \begin{aligned} &\phantom{=} P(\mbox{hit game 1, hit game 2, hit game 3, $\ldots$, hit game 56}) \\ &= P(\mbox{hit game 1}) \cdot P(\mbox{hit game 2}) \cdot P(\mbox{hit game 3}) \cdots P(\mbox{hit game 56}) \\ &= 0.8 \cdot 0.8 \cdot 0.8 \cdots 0.8 \\ &= 0.8^{56} \\ &\approx \frac{1}{250,000} \end{aligned} \]
I like to call this the “compounding rule.”
Let's compare this with the corresponding probability for Pete Rose, a player who got a hit in 76% of his games. He's only slightly less skillful than DiMaggio! But:
\[ \begin{aligned} &\phantom{=} P(\mbox{hit game 1, hit game 2, hit game 3, $\ldots$, hit game 56}) \\ &= 0.76^{56} \\ &\approx \frac{1}{\mbox{5 million}} \end{aligned} \]
Small difference in one game, but a big difference over the long run.
What about an average MLB player who gets a hit in 68% of his games:
\[ \begin{aligned} &\phantom{=} P(\mbox{hit game 1, hit game 2, hit game 3, $\ldots$, hit game 56}) \\ &= 0.68^{56} \\ &\approx \frac{1}{\mbox{2.5 billion}} \end{aligned} \]
Never gonna happen!
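The three compounding calculations above, in a few lines of Python (the fractions quoted above are rounded; the code gives the exact powers):

```python
# Probability of a 56-game hitting streak under independence, for three hit rates.
for name, p_hit in [("DiMaggio", 0.80), ("Rose", 0.76), ("average player", 0.68)]:
    p_streak = p_hit ** 56
    print(f"{name}: {p_streak:.2e} (about 1 in {1 / p_streak:,.0f})")
```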
Summary:
A small difference in probabilities becomes an enormous gulf over the long term.
Lesson: probability compounds multiplicatively, like the interest on your credit cards.
Gerald Ford falls down the steps of Air Force One. (He survived.)
Gerald Ford falls while skiing. (He again survived.)
Gerald Ford falls at a summit in Salzburg. (He once again survived.)
Suppose that on any given day, your probability of making it through the day without a deadly fall is 0.9999997 (roughly the average daily risk). Assuming each day is independent:
\[ \begin{aligned} P(\mbox{30-year streak without a deadly fall}) &= (0.9999997)^{365 \times 30} \\ &\approx 0.997 \end{aligned} \]
So if you have an average daily risk, then you have a 0.3% chance of dying in a fall at some point over the next 30 years—hardly negligible, but still small.
Let's change the numbers a tiny bit.
What if your daily survivorship probability was a bit smaller than average?
To invoke the DiMaggio/Rose example: what if you became only slightly less skillful at not falling?
For some specific numbers, let's make a diet analogy:
You've just reduced your average daily calorie consumption by about 1/100th of a percent. Will you lose weight over the long run? Probably not.
But what if you reduced your daily fall-survivorship probability by 1/100 of a percent? (From 99.99997% to “merely” 99.99%.)
Tiny change in the short run, big change in the long run:
\[ \begin{aligned} P(\mbox{30-year streak without a deadly fall}) &= (0.9999)^{365 \times 30} \\ &\approx 0.33 \end{aligned} \]
Again: probability compounds multiplicatively (like interest), not additively (like calories).
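The same two calculations in Python, to make the contrast explicit:

```python
# Thirty years of daily compounding, for the two survivorship probabilities above.
days = 365 * 30    # 10,950 days
for p_daily in (0.9999997, 0.9999):
    print(p_daily, "->", round(p_daily ** days, 3))
# prints about 0.997 for the first rate and roughly 1/3 for the second
```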
Suppose we have two random outcomes \( A \) and \( B \) and we want to know if they're independent or not.
Solution: use data to estimate and compare the conditional probabilities. If \( \hat{P}(A \mid B) \approx \hat{P}(A \mid \mbox{not } B) \), the data are consistent with independence; a large gap suggests dependence.
NBA Jam c. 1993
The “hot hand hypothesis” says that if a player makes their previous shot, they're more likely to make their next shot (“He's on fire!”):
\[ P(\mbox{makes next} \mid \mbox{makes previous}) > P(\mbox{makes next} \mid \mbox{misses previous}) \]
On the other hand, the “independence hypothesis” says that
\[ P(\mbox{makes next} \mid \mbox{makes previous}) = P(\mbox{makes next} \mid \mbox{misses previous}) \]
The next slide shows some data on shooting percentages for Dr. J's 1980–81 Philadelphia 76ers.
Key question: do players shoot better, worse, or about the same after they've just made a basket, versus how they do after they've just missed a basket?
Let's look at the data…
Shooting percentages after:
| Player | 3 misses | 2 misses | 1 miss | overall | 1 hit | 2 hits | 3 hits |
|---|---|---|---|---|---|---|---|
| Julius Erving | 0.52 | 0.51 | 0.51 | 0.52 | 0.52 | 0.53 | 0.48 |
| Caldwell Jones | 0.50 | 0.48 | 0.47 | 0.43 | 0.47 | 0.45 | 0.27 |
| Maurice Cheeks | 0.77 | 0.60 | 0.60 | 0.54 | 0.56 | 0.55 | 0.59 |
| Daryl Dawkins | 0.88 | 0.73 | 0.71 | 0.58 | 0.62 | 0.57 | 0.51 |
| Lionel Hollins | 0.50 | 0.49 | 0.46 | 0.46 | 0.46 | 0.46 | 0.32 |
| Bobby Jones | 0.61 | 0.58 | 0.58 | 0.47 | 0.54 | 0.53 | 0.53 |
| Andrew Toney | 0.52 | 0.53 | 0.51 | 0.40 | 0.46 | 0.43 | 0.34 |
| Clint Richardson | 0.50 | 0.47 | 0.56 | 0.50 | 0.50 | 0.49 | 0.48 |
| Steve Mix | 0.70 | 0.56 | 0.52 | 0.48 | 0.52 | 0.51 | 0.36 |
Which hypothesis looks right: hot hand or independence? (Remember small-sample fluctuations.)
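If you want to run this kind of check yourself, here's a sketch that takes a shot-by-shot record (the 20-shot sequence below is made up) and compares the make rate after a make with the make rate after a miss:

```python
def after_rates(shots):
    """Compare the make rate after a made shot with the make rate after a miss."""
    def rate(values):
        return sum(values) / len(values) if values else float("nan")
    after_make = [curr for prev, curr in zip(shots, shots[1:]) if prev == 1]
    after_miss = [curr for prev, curr in zip(shots, shots[1:]) if prev == 0]
    return rate(after_make), rate(after_miss)

# A made-up sequence of 20 shots (1 = make, 0 = miss).
shots = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(after_rates(shots))
```

With real data you'd want far more than 20 shots; small samples fluctuate a lot, which is exactly the caveat above.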
Suppose we pick a random US family with four male children. What is the probability \( P \) that all four will be colorblind?
The probability that a randomly sampled US male is colorblind is about 8%. So the naive answer involves just compounding up this probability:
\[ P = 0.08^4 \approx 0.00004 \]
What's wrong here?
Colorblindness runs in families (it's an X-linked trait, so males only need one copy on their X chromosome to express the phenotype). So it may be true that
\[ P(\mbox{brother 1 colorblind}) = 0.08 \]
But
\[ P(\mbox{brother 2 colorblind} \mid \mbox{brother 1 colorblind}) = 0.5 \neq 0.08 \]
And the same is true for all subsequent brothers: if brother 1 is colorblind, you know that mom is a carrier, and so all her male children have a 50/50 chance of colorblindness (conditional independence, given mom's genes!)
The correct overall probability has to be built up piece by piece using the multiplication rule:
\[ \begin{aligned} P(\mbox{brothers 1-4 colorblind}) &= P(\mbox{brother 1 colorblind}) \\ &\times P(\mbox{brother 2 colorblind} \mid \mbox{brother 1 colorblind}) \\ &\times P(\mbox{brother 3 colorblind} \mid \mbox{brothers 1-2 colorblind}) \\ &\times P(\mbox{brother 4 colorblind} \mid \mbox{brothers 1-3 colorblind}) \\ \end{aligned} \]
So:
\[ P(\mbox{brothers 1-4 colorblind}) = 0.08 \times 0.5^3 = 0.01 \]
Seems silly, right?
But you'd be surprised at how often people make this mistake! We might call this the “fallacy of mistaken compounding”: assuming events are independent and naively multiplying their probabilities.
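To make the contrast concrete, here's the colorblindness arithmetic in a few lines of Python:

```python
# Fallacy of mistaken compounding vs. the multiplication rule.
p_colorblind = 0.08          # marginal probability for a random US male
p_given_carrier_mom = 0.5    # each later brother, once brother 1 tells us mom is a carrier

naive = p_colorblind ** 4                          # wrongly assumes independence
correct = p_colorblind * p_given_carrier_mom ** 3  # builds in the conditioning

print(round(naive, 6))     # 4.1e-05, i.e. about 0.00004
print(round(correct, 4))   # 0.01
```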
Out of class, I'm asking you to read two short pieces that illustrate this unfortunate reality:
Key fact: all probabilities are contingent on what we know.
When our knowledge changes, our probabilities must change, too.
Bayes' rule tells us how to change them. Suppose A is some event we're interested in and B is some new relevant information. Bayes' rule tells us how to move from a prior probability, \( P(A) \), to a posterior probability \( P(A \mid B) \) that incorporates our knowledge of B.
\[ P(A \mid B) = P(A) \cdot \frac{P(B \mid A)}{P(B)} \]
Calculating \( P(B) \) in the denominator: use the rule of total probability, \( P(B) = P(A) \cdot P(B \mid A) + P(A^C) \cdot P(B \mid A^C) \).
Imagine a jar with 1024 normal quarters. Into this jar, a friend places a single two-headed quarter (i.e. with heads on both sides). Your friend shakes the jar to mix up the coins. You draw a single coin at random from the jar, and without examining it closely, flip the coin ten times.
The coin comes up heads all ten times.
Are you holding the two-headed quarter, or an ordinary quarter?
Let's see how a posterior probability is calculated using Bayes' rule. Let \( T \) be the event that you drew the two-headed quarter, and let \( D \) be the data: ten heads in a row. Then:
\[ P(T \mid D) = \frac{P(T) \cdot P(D \mid T)}{P(D)} \, . \]
We'll take this equation one piece at a time.
\( P(T) \) is the prior probability that you are holding the two-headed quarter. The jar contains 1025 coins, only one of which is two-headed, so \( P(T) = 1/1025 \).
Next, what about \( P(D \mid T) \), the likelihood of flipping ten heads in a row, given that you chose the two-headed quarter? Since that coin has heads on both sides, \( P(D \mid T) = 1 \).
Finally, what about \( P(D) \), the marginal probability of flipping ten heads in a row? Use the rule of total probability:
\[ P(D) = P(T) \cdot P(D \mid T) + P(\mbox{not $T$}) \cdot P(D \mid \mbox{not $T$}) \, . \]
For an ordinary quarter, the ten flips are independent and each comes up heads with probability 1/2, so
\[ P(D \mid \mbox{not $T$}) = \left(\frac{1}{2}\right)^{10} = \frac{1}{1024} \, . \]
We can now put all these pieces together:
\[ \begin{aligned} P(T \mid D) &= \frac{P(T) \cdot P(D \mid T) } { P(T) \cdot P(D \mid T) + P(\mbox{not $T$}) \cdot P(D \mid \mbox{not $T$}) } \\ &= \frac{ \frac{1}{1025} \cdot 1} {\frac{1}{1025} \cdot 1 + \frac{1024}{1025} \cdot \frac{1}{1024} } = \frac{1/1025}{2/1025} \\ &= \frac{1}{2} \, . \end{aligned} \]
There is only a 50% chance that you are holding the two-headed coin. Yes, flipping ten heads in a row with a normal coin is very unlikely (low likelihood). But so is drawing the one two-headed coin from a jar of 1024 normal coins! (Low prior probability.)
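The whole calculation fits in a few lines of Python:

```python
# Bayes' rule for the two-headed-quarter example.
p_T = 1 / 1025               # prior: one two-headed coin among 1025 coins in the jar
p_D_given_T = 1.0            # the two-headed coin always comes up heads
p_D_given_notT = 0.5 ** 10   # ten heads in a row with an ordinary coin

p_D = p_T * p_D_given_T + (1 - p_T) * p_D_given_notT   # rule of total probability
posterior = p_T * p_D_given_T / p_D

print(round(posterior, 3))   # 0.5
```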