Initial Description: Four Distribution Letters

The different probability distribution functions all start with one of the following 4 letters:

  1. d \(\rightarrow\) density: Find the probability for a specific value: \(P(Y=a)\)

  2. p \(\rightarrow\) Find the probability for the specific value and all values less than it (aka, cumulative probability): \(P(Y \le a)\)

  3. q \(\rightarrow\) quantile: Finds the smallest value of the random variable, \(a\), so that \(P(Y \le a) \ge p\)

  1. r \(\rightarrow\) generate a value of the random variable Y given the parameters

Negative Binomial Distribution

Let’s have our parameters be

r = 5         # Number of correct questions needed to win $5000
k = 2         # number of incorrect questions required to stop
pi_r = 0.75   # Probability of a correct answer
pi_k = 0.25   # Probability of an incorrect answer

The equation to calculate the probability using a Negative Binomial Distribution is:

\[P(Y = r | k, \pi) = \frac{(r + k - 1)!}{r!(k - 1)!} \pi_r^r (1 - \pi_r)^k \]

dnbinom()

dnbinom() has the same argument names as dbinom() where

  • x = the result of the random group \(Y_r\)
    • The number of correct questions
  • size = the number of the known group \(Y_k\)
    • number of incorrect questions
  • prob = the probability of the known group: \(\pi_k\)
    • Probability of getting the question incorrect
# P(Y_r = 5) 
dnbinom(
  x = r,       # successes
  size = k,    # needed failures
  prob = pi_k  # prob of failure
)
## [1] 0.08898926

The contestant has about a 8.9% chance of answering exactly 5 questions correctly before answering 2 incorrectly

But if they want to win $5000 or more, what can we do?

Let’s find the probabilities of all possible values of \(Y_r\) by creating a table of the outcomes and the corresponding probabilities of \(Y\)

Hypothetically, the contestant can keep answering questions indefinitely, so we’ll just look at the probability they answer 0 to 50 questions correctly.

nbin_df <- 
  tibble(
    Y_r = 0:50,
    P_Yr = dnbinom(x = Y_r,       # Different possible correct answers
                   size = 2,      # Always 2 incorrect answers
                   prob = pi_k)   # Probability of an incorrect answer = 0.25
  )

# Rounding the results
nbin_df |> 
  round(digits = 4) 
## # A tibble: 51 × 2
##      Y_r   P_Yr
##    <dbl>  <dbl>
##  1     0 0.0625
##  2     1 0.0938
##  3     2 0.106 
##  4     3 0.106 
##  5     4 0.0989
##  6     5 0.089 
##  7     6 0.0779
##  8     7 0.0667
##  9     8 0.0563
## 10     9 0.0469
## # ℹ 41 more rows
### Graph representing the probabilities
nbin_df |> 
  
  # Let's remove any outcomes that have less than 0.01% chance of occurring
  filter(P_Yr > 0.0001) |> 
  
  ggplot(
    mapping = aes(x = factor(Y_r), y = P_Yr)
  ) + 
  
  # Adding the bars  
  geom_col(fill = "steelblue") +
  
  # Adding a vertical line at the 'goal' of 5 correct questions
  geom_vline(
    xintercept = 5,
    color = "orange",
    size = 1
  ) +
  
  labs(
    x = "r",
    y = "P(Y = r)"
  ) + 
  
  scale_y_continuous(expand = c(0, 0, 0.05, 0)) + 
  
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5, size = 16)) +
  
  labs(title = 'Negative binomial with k = 2 and pi_k = 0.25')
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Now we can add up the probabilities for \(Y_r \ge 5\)

nbin_df |> 
  filter(Y_r >= 5) |> 
  summarize(`P(At Least 5 Correct)` = sum(P_Yr)) |> 
  round(digits = 4)
## # A tibble: 1 × 1
##   `P(At Least 5 Correct)`
##                     <dbl>
## 1                   0.534

So they have a little more than 50% probability of winning at least $5000

2) pnbinom() - Cumulative Probability

We could have also used pnbinom() with lower = F and some adjustments:

# P(Y <= r) = pnbinom(r, k, pi_k) 
pnbinom(q = r, size = k, prob = pi_k)
## [1] 0.5550537
# P(Y >= r) = pnbinom(r-1, k, pi_k) 
pnbinom(q = r-1, size = k, prob = pi_k, lower = F)
## [1] 0.5339355

If you round to enough decimal places, the two methods won’t be exactly the same since the probability they answer more than 50 correct is technically not 0, but they should be very similar since it is close to it!

It’s not hard to have R calculate more of the probabilities, you could let r go up to 1000 (or more!) instead of just 50, but the difference is very, very small

3) qnbinom()

From the show’s perspective, you want to make sure you don’t give away too much money. What’s the maximum amount of money 95% of contestants will take home? That is, how many questions will 95% or fewer contestants answer correctly?

# P(Y_r <= r) >= p
qnbinom(p = 0.95,
        size = 2,
        prob = 0.25)
## [1] 16
# What's the probability that Y_r <= 16?
pnbinom(q = 16,
        size = 2,
        prob = 0.25)
## [1] 0.960536
# About 96%. That is the probability that Y_r <= 15?
pnbinom(q = 15,
        size = 2,
        prob = 0.25)
## [1] 0.949887
# "only" 94.99%, so about 95% of contestants will win 15k or less. 

4) rnbinom() - Generating negative binomial samples:

If we want to generate \(n\) different negative binomial random variables, that’s what rnbinom() is for!

Let’s simulate 20 contestants playing on the game show:

rnbinom(n = 20, size = 2,prob = 0.25) 
##  [1] 17  4  4  6  9  4  6  6  1  3  0  4  7  5  9  2  8  1  5  6

Now let’s simulate 10,000 contestants and create a graph of the results:

# Number of contestants
N = 1e5

# Simulating and plotting the results of 10,000 contestants
tibble(
  Y = rnbinom(n = N, size = 2, prob = 0.25)
) |>  
  
  # Counting how many contests get 1 question right, 2 questions, ...
  count(Y) |> 
  
  
  ## Plotting the results
  ggplot(
    mapping = aes(x = factor(Y), y = n/N) # y = Probability instead of count
  ) + 
  
  geom_col(
    fill = "steelblue"
  ) + 
  
  # Calculating the expected value and displaying it on the line
  geom_vline(
    mapping = aes(xintercept = sum(Y*n/N)),
    color = "orange",
    size = 1
  ) +
  
  labs(
    x = "r",
    y = "Estimated P(Y = r)"
  ) + 
  
  scale_y_continuous(expand = c(0, 0, 0.05, 0)) + 
  
  theme_bw()