The different probability distribution functions all start with one of the following 4 letters:
d \(\rightarrow\) density: Find the probability for a specific value: \(P(Y=a)\)
p \(\rightarrow\) Find the probability for the specific value and all values less than it (aka, cumulative probability): \(P(Y \le a)\)
q \(\rightarrow\) quantile: Finds the smallest value of the random variable, \(a\), so that \(P(Y \le a) \ge p\)
Like a binomial variable, a negative binomial has only 2 outcomes (aka, binary).
Unlike the binomial distribution, we don’t know how many trials we are going to have (N is random).
Instead, the number of one result is known before the trials (\(k\) for known) and the other result is random (\(r\) for random).
Let’s have our parameters be
\(r\) = the number of correct questions answered = 5
\(k\) = number of incorrect questions before stopping = 2,
\(\pi_k\) = 0.25 and \(\pi_r = 1 - \pi_k = 0.25\)
r = 5 # Number of correct questions needed to win $5000
k = 2 # number of incorrect questions required to stop
pi_r = 0.75 # Probability of a correct answer
pi_k = 0.25 # Probability of an incorrect answer
The equation to calculate the probability using a Negative Binomial Distribution is:
\[P(Y = r | k, \pi) = \frac{(r + k - 1)!}{r!(k - 1)!} \pi_r^r (1 - \pi_r)^k \]
dnbinom()
dnbinom()
has the same argument names as
dbinom()
where
x =
the result of the random group \(Y_r\)
size =
the number of the known group \(Y_k\)
prob =
the probability of the known group: \(\pi_k\)
# P(Y_r = 5)
dnbinom(
x = r, # successes
size = k, # needed failures
prob = pi_k # prob of failure
)
## [1] 0.08898926
The contestant has about a 8.9% chance of answering exactly 5 questions correctly before answering 2 incorrectly
But if they want to win $5000 or more, what can we do?
Let’s find the probabilities of all possible values of \(Y_r\) by creating a table of the outcomes and the corresponding probabilities of \(Y\)
Hypothetically, the contestant can keep answering questions indefinitely, so we’ll just look at the probability they answer 0 to 50 questions correctly.
nbin_df <-
tibble(
Y_r = 0:50,
P_Yr = dnbinom(x = Y_r, # Different possible correct answers
size = 2, # Always 2 incorrect answers
prob = pi_k) # Probability of an incorrect answer = 0.25
)
# Rounding the results
nbin_df |>
round(digits = 4)
## # A tibble: 51 × 2
## Y_r P_Yr
## <dbl> <dbl>
## 1 0 0.0625
## 2 1 0.0938
## 3 2 0.106
## 4 3 0.106
## 5 4 0.0989
## 6 5 0.089
## 7 6 0.0779
## 8 7 0.0667
## 9 8 0.0563
## 10 9 0.0469
## # ℹ 41 more rows
### Graph representing the probabilities
nbin_df |>
# Let's remove any outcomes that have less than 0.01% chance of occurring
filter(P_Yr > 0.0001) |>
ggplot(
mapping = aes(x = factor(Y_r), y = P_Yr)
) +
# Adding the bars
geom_col(fill = "steelblue") +
# Adding a vertical line at the 'goal' of 5 correct questions
geom_vline(
xintercept = 5,
color = "orange",
size = 1
) +
labs(
x = "r",
y = "P(Y = r)"
) +
scale_y_continuous(expand = c(0, 0, 0.05, 0)) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, size = 16)) +
labs(title = 'Negative binomial with k = 2 and pi_k = 0.25')
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Now we can add up the probabilities for \(Y_r \ge 5\)
nbin_df |>
filter(Y_r >= 5) |>
summarize(`P(At Least 5 Correct)` = sum(P_Yr)) |>
round(digits = 4)
## # A tibble: 1 × 1
## `P(At Least 5 Correct)`
## <dbl>
## 1 0.534
So they have a little more than 50% probability of winning at least $5000
pnbinom()
- Cumulative ProbabilityWe could have also used pnbinom()
with
lower = F
and some adjustments:
# P(Y <= r) = pnbinom(r, k, pi_k)
pnbinom(q = r, size = k, prob = pi_k)
## [1] 0.5550537
# P(Y >= r) = pnbinom(r-1, k, pi_k)
pnbinom(q = r-1, size = k, prob = pi_k, lower = F)
## [1] 0.5339355
If you round to enough decimal places, the two methods won’t be exactly the same since the probability they answer more than 50 correct is technically not 0, but they should be very similar since it is close to it!
It’s not hard to have R calculate more of the probabilities, you could let r go up to 1000 (or more!) instead of just 50, but the difference is very, very small
qnbinom()
From the show’s perspective, you want to make sure you don’t give away too much money. What’s the maximum amount of money 95% of contestants will take home? That is, how many questions will 95% or fewer contestants answer correctly?
# P(Y_r <= r) >= p
qnbinom(p = 0.95,
size = 2,
prob = 0.25)
## [1] 16
# What's the probability that Y_r <= 16?
pnbinom(q = 16,
size = 2,
prob = 0.25)
## [1] 0.960536
# About 96%. That is the probability that Y_r <= 15?
pnbinom(q = 15,
size = 2,
prob = 0.25)
## [1] 0.949887
# "only" 94.99%, so about 95% of contestants will win 15k or less.
rnbinom()
- Generating negative binomial
samples:If we want to generate \(n\)
different negative binomial random variables, that’s what
rnbinom()
is for!
Let’s simulate 20 contestants playing on the game show:
rnbinom(n = 20, size = 2,prob = 0.25)
## [1] 17 4 4 6 9 4 6 6 1 3 0 4 7 5 9 2 8 1 5 6
Now let’s simulate 10,000 contestants and create a graph of the results:
# Number of contestants
N = 1e5
# Simulating and plotting the results of 10,000 contestants
tibble(
Y = rnbinom(n = N, size = 2, prob = 0.25)
) |>
# Counting how many contests get 1 question right, 2 questions, ...
count(Y) |>
## Plotting the results
ggplot(
mapping = aes(x = factor(Y), y = n/N) # y = Probability instead of count
) +
geom_col(
fill = "steelblue"
) +
# Calculating the expected value and displaying it on the line
geom_vline(
mapping = aes(xintercept = sum(Y*n/N)),
color = "orange",
size = 1
) +
labs(
x = "r",
y = "Estimated P(Y = r)"
) +
scale_y_continuous(expand = c(0, 0, 0.05, 0)) +
theme_bw()