In this module, we explore one of the most fundamental discrete probability distributions: the Binomial Distribution.
The Binomial distribution describes the probability of obtaining a specific number of “successes” in a fixed number of independent trials, where each trial has only two possible outcomes (Binary).
To understand the Binomial distribution, we first define a Bernoulli Trial. This is a random experiment with exactly two possible outcomes:

1. Success (usually denoted as 1)
2. Failure (usually denoted as 0)
Let \(p\) be the probability of success. Then \(1-p\) (often denoted as \(q\)) is the probability of failure.
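A single Bernoulli trial can be simulated in R as a binomial draw with `size = 1`. The sketch below is illustrative only: the value `p = 0.3`, the seed, and the number of repetitions are assumptions chosen for the example, not values from this module.

```r
# Simulate 1000 Bernoulli trials: with size = 1, rbinom() returns
# 0 (failure) or 1 (success) for each trial.
set.seed(42)   # arbitrary seed, used only for reproducibility
p <- 0.3       # assumed probability of success (illustrative)
trials <- rbinom(n = 1000, size = 1, prob = p)

table(trials)  # counts of failures (0) and successes (1)
mean(trials)   # observed share of successes; should be close to p
```

The observed share of successes will not equal \(p\) exactly, but it moves closer to \(p\) as the number of trials grows.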
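For a random variable \(X\) to follow a Binomial distribution, the following four conditions (often remembered by the acronym BINS) must be met:

* Binary: each trial has only two possible outcomes (success or failure).
* Independent: the outcome of one trial does not affect any other trial.
* Number: the number of trials \(n\) is fixed in advance.
* Same: the probability of success \(p\) is the same for every trial.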
Notation: \(X \sim B(n, p)\)
The probability of getting exactly \(k\) successes in \(n\) trials is given by the formula:
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]
Where:

* \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) is the binomial coefficient (number of ways to choose \(k\) successes from \(n\) trials).
* \(p^k\) is the probability of \(k\) successes.
* \((1-p)^{n-k}\) is the probability of \(n-k\) failures.
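The formula can be checked term by term in R. The following is a minimal sketch with illustrative values (\(n = 5\), \(p = 0.5\), \(k = 2\), chosen here only to keep the arithmetic easy to follow); it builds the probability from `choose()` and compares it with the built-in `dbinom()`.

```r
# Illustrative values (assumptions for this example)
n <- 5
p <- 0.5
k <- 2

# Build P(X = k) directly from the formula
manual <- choose(n, k) * p^k * (1 - p)^(n - k)

# Same probability from R's built-in PMF
builtin <- dbinom(k, size = n, prob = p)

manual                       # 10 * 0.25 * 0.125 = 0.3125
all.equal(manual, builtin)   # TRUE when the two agree
```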
R provides four standard functions for the binomial distribution:
| Function | Description | Usage |
|---|---|---|
| `dbinom(x, size, prob)` | Probability Mass Function. Calculates \(P(X=x)\). | Finding exact probability. |
| `pbinom(q, size, prob)` | Cumulative Distribution Function. Calculates \(P(X \le q)\). | Finding probability of “at most” \(q\). |
| `qbinom(p, size, prob)` | Quantile Function. Inverse of `pbinom`. | Finding cut-off values. |
| `rbinom(n, size, prob)` | Random Number Generation. | Simulating experiments. |
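As a brief illustration of the two functions that the worked examples do not use, the sketch below calls `qbinom()` and `rbinom()`. The parameter values (\(n = 20\), \(p = 0.5\), the 0.95 level) and the seed are assumptions made for this example.

```r
# Illustrative parameters (assumptions for this example)
n <- 20
p <- 0.5

# qbinom(): smallest k such that P(X <= k) is at least 0.95
cutoff <- qbinom(0.95, size = n, prob = p)
pbinom(cutoff, size = n, prob = p)       # at least 0.95
pbinom(cutoff - 1, size = n, prob = p)   # below 0.95

# rbinom(): simulate 5 experiments of n trials each;
# every value is the number of successes in one experiment
set.seed(1)   # arbitrary seed, used only for reproducibility
draws <- rbinom(5, size = n, prob = p)
draws
```

Because the distribution is discrete, `qbinom()` returns the smallest count whose cumulative probability reaches the requested level, so the achieved probability is usually somewhat above that level.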
Visualizations help us understand the shape of the distribution based on the parameters \(n\) and \(p\).
If \(p = 0.5\), the distribution is symmetric. If \(p < 0.5\), it is right-skewed. If \(p > 0.5\), it is left-skewed.
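The direction of the skew can also be checked numerically. The skewness of a Binomial distribution is \((1-2p)/\sqrt{np(1-p)}\), which is positive for \(p < 0.5\), zero for \(p = 0.5\), and negative for \(p > 0.5\). The helper name `binom_skewness` below is hypothetical, introduced only for this sketch.

```r
# Hypothetical helper: skewness of a Binomial(n, p) distribution
binom_skewness <- function(n, p) {
  (1 - 2 * p) / sqrt(n * p * (1 - p))
}

n <- 20
skew <- binom_skewness(n, c(0.1, 0.5, 0.9))
skew   # positive (right-skewed), zero (symmetric), negative (left-skewed)
```

The denominator grows with \(n\), so for a fixed \(p\) the skew shrinks as the number of trials increases.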
Let’s visualize a scenario where we flip a coin 20 times (\(n=20\)) with varying probabilities of success.
# Load the plotting library
library(ggplot2)

# Parameters
n <- 20
k <- 0:n
# Create data frames for plotting
df_sym <- data.frame(Successes = k, Probability = dbinom(k, n, 0.5), Type = "p = 0.5 (Symmetric)")
df_skew <- data.frame(Successes = k, Probability = dbinom(k, n, 0.1), Type = "p = 0.1 (Right Skewed)")
df_comb <- rbind(df_sym, df_skew)
# Plot using ggplot2
ggplot(df_comb, aes(x = Successes, y = Probability, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  labs(title = "Binomial Distribution Shapes",
       subtitle = "Comparison of p=0.5 vs p=0.1 with n=20",
       y = "Probability P(X=k)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")

Scenario: A factory produces light bulbs. Historically, 5% of the bulbs produced are defective. A quality control engineer randomly selects a batch of 10 bulbs for testing.
Question: What is the probability that exactly 2 bulbs in the batch are defective?
n <- 10
p <- 0.05
k <- 2
prob_2_defective <- dbinom(x = k, size = n, prob = p)
print(paste("Probability of exactly 2 defective bulbs:", round(prob_2_defective, 4)))
## [1] "Probability of exactly 2 defective bulbs: 0.0746"
Scenario: A new drug claims to cure 80% of patients with a specific infection. A doctor administers the drug to 15 patients.
Question: What is the probability that at least 13 patients are cured? \[P(X \ge 13) = P(X=13) + P(X=14) + P(X=15)\]
In R, we can use sum(dbinom(...)) or the cumulative
function pbinom. Note that pbinom(q) gives
\(P(X \le q)\). Therefore, \(P(X \ge 13) = 1 - P(X \le 12)\).
n <- 15
p <- 0.80
# Method 1: Summing specific probabilities
prob_at_least_13 <- sum(dbinom(13:15, size = n, prob = p))
# Method 2: Using Cumulative Distribution Function (1 - lower tail)
prob_at_least_13_cdf <- 1 - pbinom(12, size = n, prob = p)
print(paste("Probability at least 13 are cured:", round(prob_at_least_13, 4)))
## [1] "Probability at least 13 are cured: 0.398"
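A third way to get the same upper-tail probability is the `lower.tail` argument of `pbinom()`. With `lower.tail = FALSE` the function returns \(P(X > q)\), so \(P(X \ge 13)\) corresponds to `q = 12`. The sketch below repeats the drug example's parameters and checks the result against the direct sum.

```r
n <- 15
p <- 0.80

# P(X > 12) = P(X >= 13), taken directly from the upper tail
upper_tail <- pbinom(12, size = n, prob = p, lower.tail = FALSE)

# Direct sum of the three exact probabilities, for comparison
direct_sum <- sum(dbinom(13:15, size = n, prob = p))

round(upper_tail, 4)
all.equal(upper_tail, direct_sum)   # TRUE when the two agree
```

Requesting the upper tail directly avoids the subtraction `1 - pbinom(...)`, which can lose precision when the tail probability is very small.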
Problem: A marketing email has a click-through rate (CTR) of 20%. You send the email to 50 random customers.
n <- 50
p <- 0.20
# Expected Value
expected_clicks <- n * p
# Variance and Standard Deviation
variance <- n * p * (1 - p)
std_dev <- sqrt(variance)
cat("Expected Clicks:", expected_clicks, "\n")
## Expected Clicks: 10
cat("Standard Deviation:", round(std_dev, 2), "\n")
## Standard Deviation: 2.83
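The shortcuts \(np\) and \(np(1-p)\) can be confirmed from the definitions of expected value and variance, \(E[X] = \sum_k k \, P(X=k)\) and \(\mathrm{Var}(X) = \sum_k (k - E[X])^2 \, P(X=k)\). The following sketch reuses the email example's parameters; the variable names are chosen here for illustration.

```r
n <- 50
p <- 0.20
k <- 0:n

# Full probability mass function over all possible click counts
pmf <- dbinom(k, size = n, prob = p)

# Mean and variance computed from their definitions
mean_from_pmf <- sum(k * pmf)
var_from_pmf  <- sum((k - mean_from_pmf)^2 * pmf)

c(mean = mean_from_pmf, variance = var_from_pmf)   # matches n*p and n*p*(1-p)
```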
We will highlight the area representing “more than 15 clicks” to visualize tail probabilities.
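Before plotting, the size of that tail can be computed directly. Note that “more than 15” is a strict inequality, so the cumulative function is evaluated at 15: \(P(X > 15) = 1 - P(X \le 15)\). This is a small sketch using the same parameters as the email example.

```r
n <- 50
p <- 0.20

# P(X > 15) via the complement of the CDF
tail_prob <- 1 - pbinom(15, size = n, prob = p)

# Same probability as a sum over the highlighted bars (16 to 50 clicks)
tail_prob_sum <- sum(dbinom(16:n, size = n, prob = p))

tail_prob
all.equal(tail_prob, tail_prob_sum)   # TRUE when the two agree
```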
# Load libraries: dplyr provides %>% and mutate(), ggplot2 provides the plot
library(dplyr)
library(ggplot2)

# Generate Data
clicks <- 0:50
probs <- dbinom(clicks, size = n, prob = p)
plot_data <- data.frame(clicks, probs)
# Determine colors based on condition (More than 15)
plot_data <- plot_data %>%
  mutate(Highlight = ifelse(clicks > 15, "Tail (>15)", "Normal"))
# Plot (named colors keep the mapping explicit)
ggplot(plot_data, aes(x = clicks, y = probs, fill = Highlight)) +
  geom_col(width = 0.7) +
  scale_fill_manual(values = c("Normal" = "gray70", "Tail (>15)" = "red")) +
  geom_vline(xintercept = expected_clicks, linetype = "dashed", color = "blue", size = 1) +
  annotate("text", x = expected_clicks + 2, y = 0.10, label = "Mean = 10", color = "blue") +
  labs(title = "Binomial Distribution of Email Clicks",
       subtitle = "n=50, p=0.20. Red bars indicate >15 clicks.",
       x = "Number of Clicks",
       y = "Probability") +
  theme_classic()

We calculated exact probabilities (dbinom) and cumulative probabilities (pbinom).

```R
install.packages(c("ggplot2", "dplyr", "rmarkdown"))
```

File -> New File -> R Markdown.