Binomial Distribution in R

Binomial Distribution:

Parameters:

# Total number of trials
n = 16 

# Probability of success
pi = 0.2

Probability of a single value: dbinom

dbinom() has 3 arguments:

x = the value of the random variable we want to find the probability for
size = the number of trials
prob = the probability of a single success

# Probabilities of a single value: P(Y = a)
# ?dbinom

# dbinom(x = value, size = # of trials, prob = probability of a success)
dbinom(x = 4, size = n, prob = pi)

## [1] 0.2001111

# We can also find the probabilities of all possible values of Y by giving x = a vector of outcomes
tibble(Y = 0:16, 
       P_Y = dbinom(x = Y, 
                    size = n, 
                    prob = pi)) |> 
  # Round the results 
  round(digits = 4) ->
  
  binom_df

binom_df

## # A tibble: 17 × 2
##        Y    P_Y
##    <dbl>  <dbl>
##  1     0 0.0281
##  2     1 0.113 
##  3     2 0.211 
##  4     3 0.246 
##  5     4 0.200 
##  6     5 0.120 
##  7     6 0.055 
##  8     7 0.0197
##  9     8 0.0055
## 10     9 0.0012
## 11    10 0.0002
## 12    11 0     
## 13    12 0     
## 14    13 0     
## 15    14 0     
## 16    15 0     
## 17    16 0

Let’s use ggplot() to create a graphical representation of the binomial distribution with varying parameter choices

# Start by creating the data set
tibble(
  # Outcome Range
  Y = 0:n,
  # P(Y = y)
  P_Y = dbinom(x = Y, size = n, prob = pi)
) |> 
  
  # Graph of probabilities
  ggplot(
    mapping = aes(x = Y, y = P_Y)
  ) +   
  
  # Using geom_col() to create the bars
  geom_col(
    fill = "steelblue",
    color = "black"
  ) + 
  
  # Adding the line for the mean of the binomial:
  geom_vline(
    mapping = aes(xintercept = sum(Y * P_Y)),
    color = "darkorange",
    linewidth = 1
  ) +
  
  labs(
    x = "a",
    y = "P(Y = a)",
    title = paste0("Y ~ Bin(", n, ", ", pi, ")")
  ) + 
  
  # Changing the theme
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5, size = 16)) + 
  
  # Changing the tick marks on the x-axis
  scale_x_continuous(
    breaks = 0:n,
    minor_breaks = NULL
  ) + 
  
  # Having the bars sit on the x-axis
  scale_y_continuous(expand = c(0, 0, 0.05, 0))

Cumulative probability: `pbinom`

If you want to find the cumulative probability, \(P(Y <= a)\), you can use pbinom(), which has the same arguments as dbinom with one change:

x = is replaced by q =

## P(Y <= 4)
# pbinom(q = 4, size = n, prob = pi) 

# Adding all the cumulative probabilities to binom_df
binom_df <- 
  binom_df |> 
  mutate(
    cumulative_prob = pbinom(q = Y, size = n, prob = pi)
  ) |> 
  round(digits = 4)

binom_df

## # A tibble: 17 × 3
##        Y    P_Y cumulative_prob
##    <dbl>  <dbl>           <dbl>
##  1     0 0.0281          0.0281
##  2     1 0.113           0.141 
##  3     2 0.211           0.352 
##  4     3 0.246           0.598 
##  5     4 0.200           0.798 
##  6     5 0.120           0.918 
##  7     6 0.055           0.973 
##  8     7 0.0197          0.993 
##  9     8 0.0055          0.998 
## 10     9 0.0012          1.00  
## 11    10 0.0002          1     
## 12    11 0               1     
## 13    12 0               1     
## 14    13 0               1     
## 15    14 0               1     
## 16    15 0               1     
## 17    16 0               1

pbinom() also has an additional optional argument:

lower = F will let you find the probability of greater than \(a\): \(P(Y > a)\)

# P(Y > a) 
pbinom(q = 4, size = n, prob = pi, lower = F)

## [1] 0.2017546

# We can check to see if the probability of pbinom(lower = T) and pbinom(lower = F) add up to 1
binom_df |> 
  mutate(
    ge_prob = pbinom(q = Y, size = n, prob = pi, lower = F),
    total_prob = cumulative_prob + ge_prob
  ) |> 
  
  round(digits = 4)

## # A tibble: 17 × 5
##        Y    P_Y cumulative_prob ge_prob total_prob
##    <dbl>  <dbl>           <dbl>   <dbl>      <dbl>
##  1     0 0.0281          0.0281  0.972           1
##  2     1 0.113           0.141   0.859           1
##  3     2 0.211           0.352   0.648           1
##  4     3 0.246           0.598   0.402           1
##  5     4 0.200           0.798   0.202           1
##  6     5 0.120           0.918   0.0817          1
##  7     6 0.055           0.973   0.0267          1
##  8     7 0.0197          0.993   0.007           1
##  9     8 0.0055          0.998   0.0015          1
## 10     9 0.0012          1.00    0.0002          1
## 11    10 0.0002          1       0               1
## 12    11 0               1       0               1
## 13    12 0               1       0               1
## 14    13 0               1       0               1
## 15    14 0               1       0               1
## 16    15 0               1       0               1
## 17    16 0               1       0               1

Now let’s create a graph showing the cumulative probabilities of a binomial:

# Start by creating the data set
tibble(
  Y = 0:n,
  P_Y = pbinom(q = Y, size = n, prob = pi)) |> 
  
  ggplot(
    mapping = aes(x = Y, y = P_Y)
  ) + 
  
  # Using geom_col() to create the bars
  geom_col(
    fill = "steelblue",
    color = "black"
  ) + 
  
  labs(
    x = "a",
    y = "P(Y <= a)",
    title = paste0("Y ~ Bin(", n, ", ", pi, ")")
  ) + 
  
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5, size = 16)) +
  
  scale_x_continuous(
    breaks = 0:n,
    minor_breaks = NULL
  ) + 
  
  scale_y_continuous(expand = c(0, 0, 0.05, 0))

Finding quantiles with `qbinom()`

What do we do if we want to find the quantile, \(a\), associated with a probability? Aka, what value of \(a\) has a 3% probability?

You can use qbinom()! It has the same arguments as pbinom() but replaces q = (quantile) with p = (probability).

Since we are working with discrete random variables, often there isn’t a \(a\) value that has the exact probability that is given to p =. Instead, what qbinom() and other q functions do is find the lowest \(a\) such that \(P(Y \le a) > p\)

binom_df

## # A tibble: 17 × 3
##        Y    P_Y cumulative_prob
##    <dbl>  <dbl>           <dbl>
##  1     0 0.0281          0.0281
##  2     1 0.113           0.141 
##  3     2 0.211           0.352 
##  4     3 0.246           0.598 
##  5     4 0.200           0.798 
##  6     5 0.120           0.918 
##  7     6 0.055           0.973 
##  8     7 0.0197          0.993 
##  9     8 0.0055          0.998 
## 10     9 0.0012          1.00  
## 11    10 0.0002          1     
## 12    11 0               1     
## 13    12 0               1     
## 14    13 0               1     
## 15    14 0               1     
## 16    15 0               1     
## 17    16 0               1

# If p = 0.02, should return 0 since P(Y <= 0) = 0.028
qbinom(p = 0.02, size = n, prob = pi)

## [1] 0

# If p = 0.03, should return 1 since P(Y <= 1) = 0.1126
qbinom(p = 0.03, size = n, prob = pi)

## [1] 1

# What should p = 0.75 return?



# You can also give p a vector to find the corresponding values:
tibble(
  cumul_probs = (0:20)/20, # Vector of 5% probability increments
  `Y <= a` = qbinom(p = cumul_probs, size = n, prob = pi)
)

## # A tibble: 21 × 2
##    cumul_probs `Y <= a`
##          <dbl>    <dbl>
##  1        0           0
##  2        0.05        1
##  3        0.1         1
##  4        0.15        2
##  5        0.2         2
##  6        0.25        2
##  7        0.3         2
##  8        0.35        2
##  9        0.4         3
## 10        0.45        3
## # ℹ 11 more rows

Generativing random binomial values: `rbinom()`

rbinom() has 3 arguments

n = the number of random binomials to be generated
size = the number of trials
prob = the probability of an individual success

# Generating 10 random binomial outcomes for Y ~ Bin(16, 0.2)
rbinom(n = 10, size = n, prob = pi)

##  [1] 2 4 2 3 2 2 4 1 5 1

Binomial Distribution in R

Module 1: Probability Distributions

STAT 5350

Initial Description: Four Distribution Letters