Background

A random variable \(X\) has a hypergeometric distribution if \(X = x\) is the number of successes in a sample of size \(k\) drawn without replacement from a population containing \(m\) successes and \(n\) non-successes.

The probability mass function of \(x\) successes is \[f(x|m,n,k) = \frac{{{m}\choose{x}}{{n}\choose{k-x}}}{{m+n}\choose{k}}.\] The expected value of the number of successes is \(E(X) = k\frac{m}{m+n}\) with variance \(Var(X) = k\frac{m}{m+n}\cdot\frac{m+n-k}{m+n}\cdot\frac{n}{m+n-1}\).
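As a quick check of these formulas, the sketch below evaluates the PMF directly with choose() and compares it with dhyper, then recovers the mean and variance by summing over the support. The values m = 5, n = 3, k = 4 are hypothetical and chosen only to keep the numbers small.

```r
# Hypothetical small population: 5 successes, 3 non-successes, sample of 4
m <- 5
n <- 3
k <- 4
x <- 0:k

# PMF written out from the formula; choose() returns 0 for impossible counts
pmf_manual <- choose(m, x) * choose(n, k - x) / choose(m + n, k)
pmf_builtin <- dhyper(x = x, m = m, n = n, k = k)
all.equal(pmf_manual, pmf_builtin)

# Mean and variance computed from the PMF
mean_x <- sum(x * pmf_manual)                  # 2.5
var_x <- sum((x - mean_x)^2 * pmf_manual)

# Closed-form expressions from the text
mean_formula <- k * m / (m + n)                # 2.5
var_formula <- k * m / (m + n) * (m + n - k) / (m + n) * n / (m + n - 1)
c(mean_x, mean_formula)
c(var_x, var_formula)
```

Both pairs should agree up to floating-point rounding.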

dhyper returns the probability mass at a given value x. phyper returns the cumulative probability p = P(X ≤ q) at the specified quantile q. qhyper returns the quantile q at the specified cumulative probability p.
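A minimal sketch of how the two functions relate, using the urn values from the example below (m = 70, n = 30, k = 20). The cutoff of 14 and the probability 0.5 are arbitrary illustrative choices.

```r
m <- 70
n <- 30
k <- 20

# P(X <= 14): the cumulative probability equals the summed PMF up to 14
p_le_14 <- phyper(q = 14, m = m, n = n, k = k)
all.equal(p_le_14, sum(dhyper(x = 0:14, m = m, n = n, k = k)))

# P(X > 14): the upper tail
p_gt_14 <- phyper(q = 14, m = m, n = n, k = k, lower.tail = FALSE)
p_le_14 + p_gt_14

# qhyper returns the smallest x with P(X <= x) >= p
q_med <- qhyper(p = 0.5, m = m, n = n, k = k)
phyper(q = q_med, m = m, n = n, k = k) >= 0.5
phyper(q = q_med - 1, m = m, n = n, k = k) < 0.5
```

Because the distribution is discrete, phyper(qhyper(p, ...), ...) generally returns a value at or above p rather than p itself.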

Example

What is the probability of selecting exactly x = 14 red marbles in a sample of k = 20 drawn without replacement from an urn containing m = 70 red marbles and n = 30 green marbles?

# probability
x = 14
m = 70
n = 30
k = 20
dhyper(x = x, m = m, n = n, k = k)
## [1] 0.2140911
# expected number of red marbles
k * m / (m + n)
## [1] 14
# variance
k * m / (m + n) * (m + n - k) / (m + n) * n / (m + n - 1)
## [1] 3.393939
library(ggplot2)
library(dplyr)
options(scipen = 999, digits = 2) # sig digits

density = dhyper(x = 1:20, m = m, n = n, k = k)
data.frame(red = 1:20, density) %>%
  mutate(red14 = ifelse(red == 14, "x = 14", "other")) %>%
ggplot(aes(x = factor(red), y = density, fill = red14)) +
  geom_col() +
  geom_text(
    aes(label = round(density,2), y = density + 0.01),
    position = position_dodge(0.9),
    size = 3,
    vjust = 0
  ) +
  labs(title = "PMF of X = x Red Marbles",
       subtitle = "Hypergeometric(k = 20, m = 70, n = 30)",
       x = "Number of red marbles (x)",
       y = "Density")

The hypergeometric random variable is similar to the binomial random variable, except that it applies to sampling without replacement from a finite population. As the population size increases relative to the sample size, sampling without replacement behaves more and more like sampling with replacement, and the hypergeometric distribution converges to the binomial.
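One way to see the convergence, as a sketch: write \(p = \frac{m}{m+n}\) for the proportion of successes in the population (a symbol introduced here for convenience). The variance formula given in the Background then rearranges to \[Var(X) = k\,p\,(1-p)\cdot\frac{m+n-k}{m+n-1},\] which is the binomial variance \(k\,p\,(1-p)\) multiplied by a finite population correction. With \(k\) and \(p\) held fixed, the correction factor tends to 1 as \(m+n\) grows. For the marble example the factor is \(\frac{80}{99} \approx 0.81\), which scales the binomial variance of 4.2 down to about 3.39.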

Continuing the prior example, what if the population size (m + n) is 250, 500, or 1000 while the proportion of red marbles stays at 70%?

library(tidyr)
library(ggplot2)
library(dplyr)
options(scipen = 999, digits = 2) # sig digits

# m and n are used here only to set the binomial success probability, m / (m + n) = 0.7
m = 7000
n = 3000
k = 20

d_binom <- dbinom(x = 1:20, size = k, prob = m / (m + n))
df_binom <- data.frame(x = 1:20, Binomial = d_binom)
p <- ggplot(df_binom, aes(x = x, y = Binomial)) +
  geom_col()

d_hyper_100 <- dhyper(x = 1:20, m = 70, n = 30, k = k)
d_hyper_250 <- dhyper(x = 1:20, m = 175, n = 75, k = k)
d_hyper_500 <- dhyper(x = 1:20, m = 350, n = 150, k = k)
d_hyper_1000 <- dhyper(x = 1:20, m = 700, n = 300, k = k)
df_hyper = data.frame(x = 1:20, 
                Hyper_100 = d_hyper_100, 
                Hyper_250 = d_hyper_250, 
                Hyper_500 = d_hyper_500, 
                Hyper_1000 = d_hyper_1000)
df_hyper_tidy <- gather(df_hyper, key = "dist", value = "density", -c(x))
p + 
  geom_line(data = df_hyper_tidy, aes(x = x, y = density, color = dist)) +
  labs(title = "Hypergeometric Distribution Approximation to Binomial",
       subtitle = "Hypergeometric approaches Binomial as population size increases.",
       x = "Number of successful observations (x)",
       y = "Density")
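To put a number on what the plot shows, the sketch below computes the largest absolute gap between each hypergeometric PMF and the binomial PMF. The names prob, pop_sizes and max_abs_diff are introduced here for illustration, and the population sizes are the ones plotted above.

```r
k <- 20
x <- 0:k
prob <- 0.7                        # share of successes in the population
pop_sizes <- c(100, 250, 500, 1000)

binom <- dbinom(x = x, size = k, prob = prob)

# Largest pointwise difference between hypergeometric and binomial PMFs
max_abs_diff <- sapply(pop_sizes, function(pop) {
  m <- round(prob * pop)           # successes in the population
  n <- pop - m                     # non-successes in the population
  hyper <- dhyper(x = x, m = m, n = n, k = k)
  max(abs(hyper - binom))
})

data.frame(population = pop_sizes, max_abs_diff)
```

The gap should shrink as the population grows, consistent with the lines in the plot moving toward the binomial bars.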