The goal of this document is to understand the logic of “Surprised by the Gambler’s and Hot Hand Fallacies? A Truth in the Law of Small Numbers” by Miller and Sanjurjo (2015).
The main message of that manuscript is that if you flip a fair coin a finite number of times, the empirical proportion of heads immediately following a heads, computed within each sequence and then averaged across sequences, is expected to be lower than the true probability of flipping a heads. The goal here is to work through their argument and the arithmetic behind it.
I’m not going to retrace the full logic they present in their paper, as you can download that yourself; the goal here is to present some code that reproduces their calculations so that people can see what they’re saying.
Their logic is demonstrated in the scenario of their Table 1, in which Jack flips a fair coin 4 times; they then walk through every possible sequence that can result from those four flips. To work through this example, the following code produces all possible sequences of N flips.
This is done the following way: each integer from 0 to 2^N - 1 is converted into its bits, where 1s (or TRUEs) correspond to a heads outcome.
One note: this relies on the “little-endian” bit ordering returned by intToBits(). I haven’t explored this on a big-endian system; you’d just have to change the indexing used in myIntToBits().
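To see what myIntToBits() is relying on, here is a quick illustration (my own aside, not part of the original walkthrough): intToBits() returns the 32 bits of an integer with the least-significant bit first, so taking the first N bits of the integers 0 through 2^N - 1 enumerates every possible heads/tails pattern.
as.logical(intToBits(5))[1:4] ## 5 is binary 0101, read least-significant bit first
## [1]  TRUE FALSE  TRUE FALSE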
n_flips <- 4 ## number of flips
generate_all_sequences <- function(N) {
  ## intToBits() always converts to 32 bits;
  ## this just takes the first nbits from an integer
  myIntToBits <- function(int, nbits) {
    out <- as.logical(intToBits(int))[1:nbits]
    return(out)
  }
  ## each column is a sequence -- all possible sequences
  all_sequences_mat <- vapply(0:(2^N - 1),
                              FUN = myIntToBits,
                              nbits = N,
                              FUN.VALUE = logical(N))
  return(all_sequences_mat)
}
all_flips <- generate_all_sequences(n_flips)
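As a quick sanity check (my own verification, not part of the original walkthrough), the matrix should have one row per flip position and one column per sequence, i.e. 4 rows and 2^4 = 16 columns:
dim(all_flips) ## flip positions (rows) by sequences (columns)
## [1]  4 16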
image(x = 1:n_flips,
      y = 0:(2^n_flips - 1),
      z = all_flips,
      xlab = "Sequence Position",
      ylab = "Sequence ID",
      las = 1)
The plot is a representation of all possible sequences of flips. Red squares are “Tails”, whereas white squares are “Heads”.
Once we have created all possible sequences, we can work through calculating the values that are relevant.
We will use \(p(H_t|H_{t-1})\) to represent the probability of a heads at time \(t\) given a heads on the prior flip \(t - 1\). This is easy to calculate by first computing, for sequence positions 2 through N, whether that position is a heads and the prior position is also a heads:
heads_and_prior_heads <- all_flips[-n_flips,] & all_flips[-1,]
The standard approach to calculating \(p(H_t|H_{t-1})\) is to count the number of TRUEs in the matrix above, and to divide that by the number of times there was a heads on flip \(t-1\). Unsurprisingly, this value is 0.5, because we are assuming a fair coin.
n_heads_given_prior_heads <- sum(heads_and_prior_heads) ## count the heads that follow a heads
num_heads_prior <- sum(all_flips[-n_flips,]) ## count the opportunities (heads on flips 1 through N-1)
p_HtgHtm1 <- n_heads_given_prior_heads / num_heads_prior ## calculate p
p_HtgHtm1
## [1] 0.5
The approach that is taken by Miller and Sanjurjo is to instead calculate \(p(H_t|H_{t-1}, n_H)\), where \(n_H\) corresponds to the number of heads in the sequence.
Fundamentally, they calculate \(p(H_t|H_{t-1})\) separately within each sequence and then average across sequences. We start by calculating, for each sequence, how many times \(H_{t-1}\) is true (the number of “opportunities”).
## n_heads_per_seq <- apply(all_flips, 2, sum) ## total heads per sequence (not needed here)
## count the "opportunities": heads on flips 1 through N-1 of each sequence
n_opportunities_by_seq <- apply(all_flips[-n_flips, ], 2, sum)
Then, we calculate, for each sequence, how many times the joint event (a heads that follows a heads) occurs.
n_heads_per_seq_given_opp <- colSums(heads_and_prior_heads)
Once we have this, we can calculate the probability in each sequence.
good_inds <- n_opportunities_by_seq != 0 ## sequences with at least one opportunity
p_HgH_by_seq <- n_heads_per_seq_given_opp / n_opportunities_by_seq ## NaN for sequences with no opportunities
We can compare this to their Table 1 (the four unnamed middle columns of the output are the flips themselves).
n_H <- apply(all_flips,2,sum)
cbind(n_H,n_opportunities_by_seq, t(all_flips), p_HgH_by_seq)[order(n_H, na.last = FALSE),]
## n_H n_opportunities_by_seq p_HgH_by_seq
## [1,] 0 0 0 0 0 0 NaN
## [2,] 1 1 1 0 0 0 0.0000000
## [3,] 1 1 0 1 0 0 0.0000000
## [4,] 1 1 0 0 1 0 0.0000000
## [5,] 1 0 0 0 0 1 NaN
## [6,] 2 2 1 1 0 0 0.5000000
## [7,] 2 2 1 0 1 0 0.0000000
## [8,] 2 2 0 1 1 0 0.5000000
## [9,] 2 1 1 0 0 1 0.0000000
## [10,] 2 1 0 1 0 1 0.0000000
## [11,] 2 1 0 0 1 1 1.0000000
## [12,] 3 3 1 1 1 0 0.6666667
## [13,] 3 2 1 1 0 1 0.5000000
## [14,] 3 2 1 0 1 1 0.5000000
## [15,] 3 2 0 1 1 1 1.0000000
## [16,] 4 3 1 1 1 1 1.0000000
Once we have these probabilities for each sequence, we can calculate their expected value by computing the mean of the probabilities across sequences, dropping the sequences for which the probability is undefined.
## comparison:
mean(p_HgH_by_seq, na.rm =TRUE) ## their probability
## [1] 0.4047619
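That 0.4047619 is exactly 17/42: the fourteen sequences with a defined probability have per-sequence probabilities summing to 17/3, and (17/3)/14 = 17/42. As a further check (my own sketch, not from the paper; the function and variable names below are mine), the same number shows up if we simulate many 4-flip sequences of a fair coin, compute the proportion of heads following a heads within each simulated sequence, and average those proportions, discarding sequences where the proportion is undefined:
set.seed(1)
sim_p_h_after_h <- function(n_flips) {
  flips <- runif(n_flips) < 0.5 ## TRUE = heads
  opportunities <- flips[-n_flips] ## was flip t-1 a heads?
  if (sum(opportunities) == 0) return(NA) ## no heads to condition on
  sum(flips[-1] & opportunities) / sum(opportunities)
}
sim_vals <- replicate(1e5, sim_p_h_after_h(4))
mean(sim_vals, na.rm = TRUE) ## converges to roughly 0.405, not 0.5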
The key difference when they aggregate across all sequences is that every sequence counts equally: the per-sequence probability is not weighted by the number of opportunities (heads on flips 1 through N-1) that the sequence contains. We can see this by instead summing the counts across all sequences and dividing at the end. This value is still 0.5, as expected.
sum(n_heads_per_seq_given_opp) / sum(n_opportunities_by_seq) ## overall probability
## [1] 0.5
You can also see this by computing a weighted mean:
weighted.mean(x = p_HgH_by_seq, w = n_opportunities_by_seq, na.rm = TRUE)
## [1] 0.5
A specific example: the sequence of 4 heads is counted as only a single sequence in the unweighted mean, even though it contains 3 opportunities for a heads following a heads.
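We can pull that sequence out directly (this just restates the last row of the table above):
p_HgH_by_seq[n_H == 4] ## its per-sequence probability is 1
## [1] 1
n_opportunities_by_seq[n_H == 4] ## but it carries 3 opportunities
## [1] 3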
Their approach is not necessarily wrong; it is just calculating something different from what is typically calculated.
Specifically, in a context where you explicitly know how many heads there are, the baseline probability you are really interested in is \(p(H_t|H_{t-1}, n_H)\). For example, if you know that someone took 10 shots but only made 6 of them, then the baseline probability you care about might actually be this kind of conditional probability.
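As a sketch of that shooting example using the code above (my own extension, not a calculation from the paper, and it treats every ordering of 6 makes in 10 shots as equally likely): enumerate all 2^10 sequences of 10 shots, keep the ones with exactly 6 makes, and average the within-sequence proportion of a make following a make.
n_shots <- 10
all_shots <- generate_all_sequences(n_shots) ## 2^10 = 1024 possible sequences
n_makes <- colSums(all_shots)
keep <- n_makes == 6 ## condition on exactly 6 makes
make_after_make <- all_shots[-n_shots, keep] & all_shots[-1, keep]
opportunities <- colSums(all_shots[-n_shots, keep])
p_by_seq <- colSums(make_after_make) / opportunities
mean(p_by_seq) ## the expected p(H_t|H_{t-1}) conditional on 6 makes in 10 shots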
If we look back at our 4-flip example, we can do exactly that. Let’s estimate \(p(H_t|H_{t-1}, n_H)\), where \(n_H\) is \(2\). The probability is constrained to be less than \(\frac{n_H}{N}\) as long as \(n_H\) is less than \(N\), and it ends up being the same irrespective of whether we compute the probability for each sequence and average, or sum the counts across sequences and divide at the end.
mean(p_HgH_by_seq[n_H == 2])
## [1] 0.3333333
sum(n_heads_per_seq_given_opp[n_H == 2]) / sum(n_opportunities_by_seq[n_H == 2])
## [1] 0.3333333
We can do the same for \(n_H = 3\).
mean(p_HgH_by_seq[n_H == 3])
## [1] 0.6666667
sum(n_heads_per_seq_given_opp[n_H == 3]) / sum(n_opportunities_by_seq[n_H == 3])
## [1] 0.6666667