Background

The applications of the hypergeometric distribution are very similar to those of the binomial distribution. However, there is a fundamental difference: in the binomial, the probability of a success is the same from trial to trial (sampling with replacement), while in the hypergeometric, the probability of a success changes from trial to trial (sampling without replacement). Because the population shrinks with each draw, the probability of a success on a given draw depends on the outcomes of the earlier draws.

The hypergeometric distribution is widely used wherever sampling without replacement is necessary, for example in destructive testing, where the item being tested is destroyed and cannot be returned to the population.
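The difference between sampling with and without replacement can be seen numerically. The following is a minimal sketch with illustrative values (a population of 5 successes and 5 failures, and a much larger population with the same proportion, neither taken from the text): in the small population the two distributions differ noticeably, while in the large one the hypergeometric probabilities are close to the binomial ones.

```r
# Small population: 5 successes and 5 failures, sample of 3 (illustrative values)
x <- 0:3
without_replacement <- dhyper(x, m = 5, n = 5, k = 3)       # hypergeometric
with_replacement    <- dbinom(x, size = 3, prob = 5 / 10)   # binomial

# Large population with the same proportion of successes (illustrative values)
large_population <- dhyper(x, m = 5000, n = 5000, k = 3)

# Side-by-side comparison of the three sets of probabilities
round(data.frame(x, without_replacement, with_replacement, large_population), 4)
```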

The expected value and variance of the hypergeometric distribution \(H (x\mid m, n, k)\) are \[E(X) = k \frac{m}{m + n}\] and \[Var(X) = k \frac{m}{m + n} \cdot \frac{m + n - k}{m + n} \cdot \frac{n}{m + n -1}\]

where:

\(k\) is the size of the sample.

\(m\) is the number of successes in the population.

\(n\) is the number of failures in the population.
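As a quick check, the two formulas can be compared with the mean and variance computed directly from the probability mass function. This is a sketch with illustrative parameter values; the variable names are chosen here only for the example.

```r
m <- 4; n <- 5; k <- 6                 # illustrative parameter values
x <- 0:k                               # dhyper() returns 0 outside the support
p <- dhyper(x, m = m, n = n, k = k)

# Mean and variance computed directly from the PMF
mean_from_pmf <- sum(x * p)
var_from_pmf  <- sum((x - mean_from_pmf)^2 * p)

# Mean and variance from the closed-form expressions
mean_formula <- k * m / (m + n)
var_formula  <- k * m / (m + n) * (m + n - k) / (m + n) * n / (m + n - 1)

all.equal(mean_from_pmf, mean_formula)
all.equal(var_from_pmf, var_formula)
```

Both comparisons should return TRUE, up to floating-point tolerance.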

Probability Mass Function

The “standard” formula for the probability mass function of \(x\) successes is:

\[P(X = x) = H (x\mid m,M, k) = \frac{\dbinom{m}{x} \dbinom{M - m}{k - x}}{\dbinom{M}{k}}\] where:

\(x\) is the number of successes in the sample.

\(M\) is the size of the population.

\(k\) is the size of the sample or the number of trials.

\(m\) is the number of successes in the population.

Since the population of size \(M\) contains \(m\) successes and \(n\) failures, \(M = m + n\). We can therefore rewrite the formula in terms of the number of successes and failures:

\[P(X = x) = H (x\mid m, n, k) = \frac{\dbinom{m}{x} \dbinom{n}{k - x}}{\dbinom{m+n}{k}}\]

In R, the probability mass function is computed by dhyper:

dhyper(x, m, n, k, log = FALSE)

where:

x is the number of successes in the sample.

m is the number of successes in the population.

n is the number of failures in the population.

k is the size of the sample or the number of trials.
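To connect the formula with dhyper, the probability mass function can also be written by hand with choose. The helper hyper_pmf below is a hypothetical name introduced only for this sketch; it is not part of R.

```r
# Hypothetical helper: the PMF written directly from the formula
hyper_pmf <- function(x, m, n, k) {
  choose(m, x) * choose(n, k - x) / choose(m + n, k)
}

# Compare with the built-in function over a range of x (illustrative parameters)
x <- 0:6
all.equal(hyper_pmf(x, m = 4, n = 5, k = 6),
          dhyper(x, m = 4, n = 5, k = 6))
```

The comparison relies on choose returning 0 when the lower index is negative or larger than an integer upper index, so values of x outside the support get probability 0 in both versions.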

Example:

A homeowner plants 6 bulbs selected at random from a box containing 5 tulip bulbs and 4 daffodil bulbs. What is the probability that he planted 2 daffodil bulbs and 4 tulip bulbs?

Counting daffodils as the successes: \(x = 2\), \(m = 4\), \(n = 5\), \(k = 6\).

\[P(X = 2) = H (2\mid 4, 5, 6) = \frac{\dbinom{4}{2} \dbinom{5}{6-2}}{\dbinom{4 + 5}{6}} = \frac{5}{14} \approx 0.36\]

dhyper(x = 2, m = 4, n = 5, k = 6)
[1] 0.3571429

library(ggplot2)
library(dplyr)

# PMF for x = 1, ..., 6 daffodils; values outside the support (x > 4) are 0
density = dhyper(x = 1:6, m = 4, n = 5, k = 6)
data.frame(daffodil = 1:6, density) %>%
  # flag the bar for x = 2 so it gets its own fill colour
  mutate(daffodil2 = ifelse(daffodil == 2, "x = 2", "other")) %>%
  ggplot(aes(x = factor(daffodil), y = density, fill = daffodil2)) +
  geom_col() +
  # print each probability just above its bar
  geom_text(
    aes(label = round(density, 2), y = density + 0.01),
    position = position_dodge(0.9),
    size = 3,
    vjust = 0
  ) +
  labs(title = "PMF of X = x",
       subtitle = "Hypergeometric(m = 4, n = 5, k = 6)",
       x = "Number of daffodil bulbs (x)",
       y = "Density")

The same result is obtained by counting tulips as the successes instead:

\(x = 4\), \(m = 5\), \(n = 4\), \(k = 6\).

\[H (4\mid 5, 4, 6) = \frac{\dbinom{5}{4} \dbinom{4}{6-4}}{\dbinom{4+5}{6}} = \frac {5}{14} \approx 0.36\]

dhyper(x = 4, m = 5, n = 4, k = 6)
[1] 0.3571429

# PMF for x = 1, ..., 6 tulips; values outside the support (x < 2 or x > 5) are 0
density = dhyper(x = 1:6, m = 5, n = 4, k = 6)
data.frame(tulip = 1:6, density) %>%
  # flag the bar for x = 4 so it gets its own fill colour
  mutate(tulip4 = ifelse(tulip == 4, "x = 4", "other")) %>%
  ggplot(aes(x = factor(tulip), y = density, fill = tulip4)) +
  geom_col() +
  # print each probability just above its bar
  geom_text(
    aes(label = round(density, 2), y = density + 0.01),
    position = position_dodge(0.9),
    size = 3,
    vjust = 0
  ) +
  labs(title = "PMF of X = x",
       subtitle = "Hypergeometric(m = 5, n = 4, k = 6)",
       x = "Number of tulip bulbs (x)",
       y = "Density")
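The two calculations agree because drawing \(x\) daffodils in a sample of \(k\) bulbs is the same event as drawing \(k - x\) tulips, so \(H(x\mid m, n, k) = H(k - x\mid n, m, k)\). A short sketch that checks this for every possible count in the bulb example:

```r
k <- 6
daffodils <- 0:k

# Probability of each daffodil count, with daffodils as the successes
p_daffodils <- dhyper(daffodils, m = 4, n = 5, k = k)

# Probability of the complementary tulip count, with tulips as the successes
p_tulips <- dhyper(k - daffodils, m = 5, n = 4, k = k)

all.equal(p_daffodils, p_tulips)
```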

The Cumulative Distribution Function

The cumulative distribution function or the probability that \(X\) will take a value less than or equal to \(x\) is:

\[P[X\leq x] =\sum_{x_i \leq x} H(x_i\mid m, n, k)\]

In R, the distribution function is computed by phyper:

phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
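Here q is the value at which the distribution function is evaluated. With lower.tail = TRUE (the default) phyper returns \(P[X \leq q]\), and with lower.tail = FALSE it returns \(P[X > q]\). The sketch below, with illustrative parameter values, checks that phyper matches the running sum of dhyper and that the two tails add up to 1.

```r
q <- 0:4   # every value the variable can take for these illustrative parameters

# The CDF is the running sum of the PMF
cdf_from_pmf <- cumsum(dhyper(q, m = 5, n = 8, k = 4))
lower <- phyper(q, m = 5, n = 8, k = 4)                      # P[X <= q]
upper <- phyper(q, m = 5, n = 8, k = 4, lower.tail = FALSE)  # P[X > q]

all.equal(cdf_from_pmf, lower)
all.equal(lower + upper, rep(1, length(q)))
```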

Example:

A storeroom just received a shipment of 13 paper graters. Shortly after they were received, the manufacturer called to report that he had inadvertently shipped 5 defective units. The owner of the storeroom decided to test 4 of the 13 paper graters she received. Assume the samples are drawn without replacement.

What is the probability that 3 or fewer units are defective?

phyper(q = 3, m = 5, n = 8, k = 4)
[1] 0.993007

The same result can be obtained with the probability mass function:

sum(dhyper(x = c(0:3), m = 5, n = 8, k = 4))
[1] 0.993007

What is the probability that more than 2 units are defective?

phyper(q = 2, m = 5, n = 8, k = 4, lower.tail = FALSE)
[1] 0.1188811

or, equivalently, by subtracting the area in the left tail of the distribution from 1:

1 - phyper(q = 2, m = 5, n = 8, k = 4, lower.tail = TRUE)
[1] 0.1188811

The same result can be obtained with the probability mass function:

sum(dhyper(x = 3:4, m = 5, n = 8, k = 4))
[1] 0.1188811

Quantile Function

The \(p^{th}\) quantile is the smallest value \(x\) of the hypergeometric random variable \(X\) such that \(P(X \leq x) \geq p\).

In R, the quantiles of the hypergeometric distribution are computed with qhyper(p, m, n, k).

The function qhyper returns the \(100 \times p^{th}\) percentile of the hypergeometric distribution with the given parameters.

A company produces and ships 16 personal computers knowing that 5 of them have defective wiring. The company that purchased the computers is going to thoroughly test four of the computers. The purchasing company can detect the defective wiring.

What is the value of \(c\), if \(P(X \leq c) \geq 0.90\)?

We need to find the smallest value of \(c\) such that \(P(X\leq c) \geq 0.90\). That is, we need to find the 90th percentile of the given hypergeometric distribution.

qhyper(p = 0.90, m = 5, n = 11, k = 4)
[1] 2
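To see where this value comes from, the distribution function can be tabulated over all possible counts and searched for the first value that reaches 0.90. This is a sketch of the definition of the quantile, not of how qhyper is implemented internally.

```r
values <- 0:4                                  # possible numbers of defective computers
cdf <- phyper(values, m = 5, n = 11, k = 4)    # P(X <= c) for each candidate c

# Cumulative probabilities, rounded for display
round(cdf, 4)

# Smallest c whose cumulative probability is at least 0.90
min(values[cdf >= 0.90])
```

The last line returns 2, in agreement with qhyper, because \(P(X \leq 1)\) is below 0.90 while \(P(X \leq 2)\) is above it.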

Random Variates

To generate 100 random observations from a hypergeometric distribution with m = 5, n = 11 and k = 4, use rhyper:

rhyper(nn = 100, m = 5, n = 11, k = 4) 
  [1] 1 2 2 3 2 0 0 1 1 1 0 2 1 2 1 1 0 1 1 2 2 1 1 0 2 1 2 1 2 1 0 1 3 1 1 1 2 2
 [39] 0 1 1 0 2 2 1 2 3 0 2 2 1 3 1 2 0 0 1 2 1 1 2 2 2 2 0 0 1 2 2 1 1 0 2 1 3 1
 [77] 2 2 1 3 1 2 1 0 1 2 1 0 1 0 1 2 0 2 2 1 1 3 0 2
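Because the draws are random, the output changes from run to run. A minimal sketch of two common follow-ups: fixing the seed so the draws are reproducible, and comparing the relative frequencies of a larger sample with the probabilities from dhyper. The seed and the sample size of 10,000 are arbitrary choices made for this example.

```r
set.seed(123)   # arbitrary seed, used only to make the draws reproducible
draws <- rhyper(nn = 10000, m = 5, n = 11, k = 4)

# Relative frequency of each value 0, ..., 4 in the simulated sample
empirical <- as.numeric(table(factor(draws, levels = 0:4))) / length(draws)

# Exact probabilities from the probability mass function
theoretical <- dhyper(0:4, m = 5, n = 11, k = 4)

round(data.frame(x = 0:4, empirical, theoretical), 3)
```

With a sample this large, the empirical column should be close to the theoretical one.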

Miscellaneous exercises

1. A CD contains 10 songs; 6 are classical and 4 are rock and roll. In a sample of 3 songs, what is the probability that exactly 2 are classical? Assume the samples are drawn without replacement.

dhyper(x = 2, m = 6, n = 4, k = 3)
[1] 0.5

2. Keith’s Florists has 15 delivery trucks, used mainly to deliver flowers and flower arrangements in the Greenville, South Carolina, area. Of these 15 trucks, 6 have brake problems. A sample of 5 trucks is randomly selected. What is the probability that 2 of those tested have defective brakes?

dhyper(x = 2, m = 6, n = 9, k = 5)
[1] 0.4195804

3. If 7 cards are dealt from an ordinary deck of 52 playing cards, what is the probability that

a) exactly 2 of them will be face cards?

\[ P(X = 2) = H (2\mid 12, 40, 7) = \frac{\dbinom{12}{2} \dbinom{40}{7 - 2}}{\dbinom{12+40}{7}} \approx 0.32\]

Solving this probability directly with the choose function:

(choose(n = 12, k = 2) * choose(n = 40, k = 5)) / choose(n = 52, k = 7)
[1] 0.3246154

The same result can be obtained using dhyper:

dhyper(x = 2, m = 12, n = 40, k = 7)
[1] 0.3246154

b) at least 1 of them will be a queen?

\[P(X \geq 1) = 1 - P(X = 0) = 1 - H (0\mid 4, 48, 7) = 1-\frac{\dbinom{4}{0} \dbinom{48}{7 - 0}}{\dbinom{48+4}{7}} \approx 0.45\]

1 - (choose(n = 4, k = 0) * choose(n = 48, k = 7)) / choose(n = 52, k = 7)
[1] 0.4496445

Using the built-in Hypergeometric {stats} functions of R:

sum(dhyper(x = 1:4, m = 4, n = 48, k = 7))
[1] 0.4496445
phyper(q = 0, m = 4, n = 48, k = 7, lower.tail = FALSE)
[1] 0.4496445

4. A company is interested in evaluating its current inspection procedure for shipments of 50 identical items. The procedure is to take a sample of 5 and pass the shipment if no more than 2 are found to be defective. What proportion of shipments with 20% defectives will be accepted?

The number of defectives in a shipment is \(50 \times 20\% = 10\). Therefore, \(m = 10\), \(n = 40\), \(k = 5\).

The probability that a shipment with 20% defective items will be accepted is \(P[X\leq2]\):

\[P[X\leq2] =\sum_{x=0}^{2} H(x\mid 10, 40, 5)\]

sum(dhyper(x = 0:2, m = 10, n = 40, k = 5)) # density function
[1] 0.9517397
phyper(q = 2, m = 10, n = 40, k = 5) # distribution function
[1] 0.9517397

5. An annexation suit against a county subdivision of 1200 residences is being considered by a neighboring city. If the occupants of half the residences object to being annexed, what is the probability that in a random sample of 10 at least 3 favor the annexation suit?

Taking the other half of the residences to favor annexation, \(m = 600\), \(n = 600\), and \(k = 10\). At least 3 in favor means \(P(X \geq 3) = 1 - P(X \leq 2)\), which is the upper tail beyond \(q = 2\):

phyper(q = 2, m = 600, n = 600, k = 10, lower.tail = FALSE)
[1] 0.9460455
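Since the sample of 10 is very small relative to the 1200 residences, removing a residence barely changes the proportion that remains, and the binomial distribution with \(p = 0.5\) should give a similar answer. A sketch of that comparison (the variable names are chosen only for this example):

```r
# Exact answer: sampling without replacement from 1200 residences
exact <- phyper(q = 2, m = 600, n = 600, k = 10, lower.tail = FALSE)

# Binomial approximation: treats the 10 draws as independent with p = 0.5
binomial_approx <- pbinom(q = 2, size = 10, prob = 0.5, lower.tail = FALSE)

c(hypergeometric = exact, binomial = binomial_approx)
```

The binomial value is \(1 - 56/1024 = 0.9453125\), which differs from the hypergeometric result by less than 0.001.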

6. A nationwide survey of 17,000 college seniors by the University of Michigan revealed that almost 70% disapprove of daily pot smoking. If 18 of these seniors are selected at random and asked their opinion, what is the probability that more than 9 but fewer than 14 disapprove of smoking pot daily?

“More than 9 but fewer than 14” means \(10 \leq X \leq 13\), so we can sum the probabilities from 10 to 13:

sum(dhyper(x = 10:13, m = 17000 * 0.7, n = 17000 * 0.3, k = 18))
[1] 0.6079669

Another approach gives the same probability: evaluate the distribution function at 9 and 13 and take the difference with the diff function, which calculates the difference between successive elements of a vector. This yields \(P(X \leq 13) - P(X \leq 9)\):

diff(phyper(q = c(9, 13), m = 17000 * 0.7, n = 17000 * 0.3, k = 18, lower.tail = TRUE))
[1] 0.6079669

This is equivalent to:

phyper(q = 13, m = 17000 * 0.7, n = 17000 * 0.3, k = 18, lower.tail = TRUE) - phyper(q = 9, m = 17000 * 0.7, n = 17000 * 0.3, k = 18, lower.tail = TRUE)
[1] 0.6079669

Also equivalent to:

diff(phyper(q = c(13, 9), m = 17000 * 0.7, n = 17000 * 0.3, k = 18, lower.tail = FALSE))
[1] 0.6079669

7. A foreign student club lists as its members 2 Canadians, 3 Japanese, 5 Italians, and 2 Germans. If a committee of 4 is selected at random, find the probability that all nationalities are represented.

(choose(2, 1) * choose(3, 1) * choose(5, 1) * choose(2, 1)) / choose(12, 4)
[1] 0.1212121
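This problem has four categories rather than two, so it is solved by counting: one member from each nationality, divided by the number of possible committees. As a check, the committees can also be enumerated with combn. This is a brute-force sketch that is practical only because there are just 495 committees.

```r
# One label per club member
members <- rep(c("Canadian", "Japanese", "Italian", "German"),
               times = c(2, 3, 5, 2))

# Every possible committee of 4, one per column (choose(12, 4) = 495 columns)
committees <- combn(length(members), 4)

# TRUE for committees in which all four nationalities appear
all_represented <- apply(committees, 2,
                         function(idx) length(unique(members[idx])) == 4)

mean(all_represented)
```

The proportion of committees with all four nationalities is \(60/495 \approx 0.1212\), the same value as the formula.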

8. Biologists doing studies in a particular environment often tag and release subjects in order to estimate the size of a population or the prevalence of certain features in the population. Ten animals of a certain population thought to be extinct (or near extinction) are caught, tagged, and released in a certain region. After a period of time, a random sample of 15 of this type of animal is selected in the region. What is the probability that 5 of those selected are tagged if there are 25 animals of this type in the region?

\[P(X=5) = H(5\mid 10, 15, 15) = \frac{\dbinom{10}{5} \dbinom{15}{15 - 5}}{\dbinom{10+15}{15}}\]

dhyper(x = 5, m = 10 , n = 15 , k = 15)
[1] 0.2315116

9. A government task force suspects that some manufacturing companies are in violation of federal pollution regulations with regard to dumping a certain type of product. Twenty firms are under suspicion but not all can be inspected. Suppose that 3 of the firms are in violation.

a) What is the probability that inspection of 5 firms will find no violations?

\[P(X=0) = H(0\mid 3, 17, 5) = \frac{\dbinom{3}{0} \dbinom{17}{5-0}}{\dbinom{3 + 17}{5}}\]

dhyper(x = 0, m = 3, n = 17, k = 5)
[1] 0.3991228

b) What is the probability that the plan above will find two violations?

\[P(X=2) = H(2\mid 3, 17, 5) = \frac{\dbinom{3}{2} \dbinom{17}{5-2}}{\dbinom{3 + 17}{5}}\]

dhyper(x = 2, m = 3, n = 17, k = 5)
[1] 0.1315789

10. To avoid detection at customs, a traveler places 6 narcotic tablets in a bottle containing 9 vitamin tablets that are similar in appearance. If the customs official selects 3 of the tablets at random for analysis, what is the probability that the traveler will be arrested for illegal possession of narcotics?

\[P(X \geq 1) = \sum_{x=1}^{3} H(x\mid 6, 9, 3)\]

sum(dhyper(x = 1:3, m = 6, n = 9, k = 3))
[1] 0.8153846

Same as:

phyper(q = 0, m = 6, n = 9, k = 3, lower.tail = FALSE) 
[1] 0.8153846

Another approach to the problem is: \[P(X \geq 1) = 1 - P(X = 0) = 1 - H(0 \mid 6, 9, 3) \]

1 - dhyper(x = 0, m = 6, n = 9, k = 3)
[1] 0.8153846

11. From a lot of 10 missiles, 4 are selected at random and fired. If the lot contains 3 defective missiles that will not fire, what is the probability that

a) all 4 will fire?

\[P(X=4) = H(4\mid 7, 3, 4) = \frac{\dbinom{7}{4} \dbinom{3}{4-4}}{\dbinom{7+3}{4}}\]

dhyper(x = 4, m = 7, n = 3, k = 4)
[1] 0.1666667

b) at most 2 will not fire?

\[P(X \leq 2) = \sum_{x=0}^{2} H(x\mid 3, 7, 4)\]

sum(dhyper(x = 0:2, m = 3, n = 7, k = 4))
[1] 0.9666667

Or, equivalently:

phyper(q = 2, m = 3, n = 7, k = 4)
[1] 0.9666667
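Note that the first part of this exercise counts the missiles that fire as successes (\(m = 7\)), while the second part counts the defective ones (\(m = 3\)). Either labeling works as long as \(x\) counts the same kind of item as \(m\). A short sketch showing the first part with both labelings:

```r
# All 4 sampled missiles fire: successes are the 7 working missiles
p_working <- dhyper(x = 4, m = 7, n = 3, k = 4)

# The same event: 0 of the 4 sampled missiles are defective
p_defective <- dhyper(x = 0, m = 3, n = 7, k = 4)

all.equal(p_working, p_defective)
```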

12. A manufacturing company uses an acceptance scheme on items from a production line before they are shipped. The plan is a two-stage one. Boxes of 25 items are readied for shipment, and a sample of 3 items is tested for defectives. If any defectives are found, the entire box is sent back for 100% screening. If no defectives are found, the box is shipped.

a) What is the probability that a box containing 3 defectives will be shipped?

dhyper(x = 0, m = 3, n = 22, k = 3)
[1] 0.6695652

b) What is the probability that a box containing only 1 defective will be sent back for screening?

dhyper(x = 1, m = 1, n = 24, k = 3)
[1] 0.12

Because the box holds a single defective item, finding at least one defective in the sample is the same event as \(X = 1\).

References

Lind, Douglas A, William G Marchal, and Samuel Adam Wathen. 2012. Statistical Techniques in Business & Economics. New York, NY: McGraw-Hill/Irwin.

Myers, R, S Myers, R Walpole, and K Ye. 2012. Probability & Statistics for Engineers and Scientists, Ninth Edition. Pearson.

