Introduction
In this blog, I will demonstrate how to use the hypergeometric
distribution for hypothesis testing. The hypergeometric distribution is a
probability distribution that describes the probability of obtaining k
successes in n draws, without replacement, from a finite population of
size N that contains K successes.
The probability mass function below gives the probability of exactly k
successes in a sample of n drawn without replacement from this population: \[
P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}
\] where \(\binom{a}{b}\) is the binomial coefficient: \[
\binom{a}{b} = \frac{a!}{b!(a-b)!}
\]
Hypothesis
Suppose we receive a shipment of 100 items. Based on prior shipments, we
know 30 out of 100 items received are usually defective. We want to
determine if this new shipment of 100 items contains exactly 30
defective items. To test this, we randomly draw a sample of 10 items
without replacement from the new shipment to inspect for defects.
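The hypotheses, with K denoting the number of defective items in the new shipment: \[
H_0: K = 30 \qquad H_1: K \neq 30
\]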
Now we will use R to test the hypothesis.
Define Parameters
Below we set:
1. N=100 which represents the total shipment/population size
2. K=30 which represents the number of defective items in prior
shipments
3. n=10 which represents the sample size we are using to test the
hypothesis
4. k_observed = 5 which represents the number of observed defective items
in the sample of 10
By setting k_observed to 5, we ask whether drawing exactly 5 defective items in a random sample of 10 is consistent with the null hypothesis that the shipment contains 30 defective items.
N <- 100 # Total population
K <- 30 # Total number of defective items in population
n <- 10 # Sample size
k_observed <- 5 # Observed number of defective items in the sample
Calculate the Hypergeometric Probability
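Below we use dhyper() to calculate the probability of observing exactly
k_observed = 5 defective items under the hypergeometric distribution. Its
arguments are, in order, the observed count, the number of defective
items (K), the number of non-defective items (N - K), and the sample
size (n). The result: there is a 9.96% chance of observing exactly 5
defective items in a sample of 10 when the null hypothesis is true.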
# probability of 5 defective items from the sample of 10
p_k <- dhyper(k_observed, K, N - K, n)
print(paste("P(X =", k_observed, ") =", round(p_k, 4)))
## [1] "P(X = 5 ) = 0.0996"
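As a sanity check, the same probability can be reproduced directly from the probability mass function using choose() for the binomial coefficients. This is a minimal sketch; the names p_manual and p_builtin are my own, and the parameters are repeated so the snippet runs on its own.

```r
# Parameters repeated so the snippet is self-contained
N <- 100          # Total population
K <- 30           # Defective items under the null hypothesis
n <- 10           # Sample size
k_observed <- 5   # Observed defective items in the sample

# PMF written out with binomial coefficients:
# choose(K, k) * choose(N - K, n - k) / choose(N, n)
p_manual <- choose(K, k_observed) * choose(N - K, n - k_observed) / choose(N, n)

# Built-in hypergeometric density for comparison
p_builtin <- dhyper(k_observed, K, N - K, n)

# The two values should agree up to floating-point error
print(all.equal(p_manual, p_builtin))
```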
Calculate Cumulative Probability
Next we will calculate the tail probabilities that serve as p-values.
They tell us whether observing k_observed = 5 is consistent with the null
hypothesis or whether 5 defective items is an unusually extreme result.
Since k_observed is set to 5, we calculate three quantities: the
probability that k is less than or equal to 5, the probability that k is
greater than or equal to 5, and the probability that k is exactly 5.
Lower tail
The lower tail represents the probability of observing 5 or fewer
defective items in the sample under the null hypothesis. The lower-tail
cumulative probability is 96.12%, meaning there is a 96.12% chance of
observing five or fewer defective items if the null hypothesis is true.
Upper tail
The upper tail represents the probability of observing 5 or more
defective items in the sample under the null hypothesis. The upper-tail
cumulative probability was determined to be 13.84%, meaning there is a
13.84% chance of observing five or more defective items if the null
hypothesis is true.
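Both tails can also be obtained by summing point probabilities, which makes it clearer what each tail covers. The sketch below also shows the lower.tail = FALSE argument of phyper(), which returns P(X > q); that is why the upper tail for "5 or more" is evaluated at k_observed - 1. The variable names are illustrative only.

```r
# Parameters repeated so the snippet is self-contained
N <- 100; K <- 30; n <- 10; k_observed <- 5

# Lower tail: P(X <= 5) as a sum of point probabilities for k = 0..5
p_lower_sum <- sum(dhyper(0:k_observed, K, N - K, n))

# Upper tail: P(X >= 5) as a sum of point probabilities for k = 5..10
p_upper_sum <- sum(dhyper(k_observed:n, K, N - K, n))

# Same upper tail from the CDF: lower.tail = FALSE gives P(X > q),
# so q = k_observed - 1 yields P(X >= k_observed)
p_upper_cdf <- phyper(k_observed - 1, K, N - K, n, lower.tail = FALSE)

print(round(c(lower = p_lower_sum, upper = p_upper_sum, upper_cdf = p_upper_cdf), 4))
```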
Exactly 5 Defects
The probability of observing exactly 5 defective items in the sample
under the null hypothesis is 9.96%. This means there is a 9.96% chance
that a sample of 10 will contain exactly 5 defective items assuming the
null hypothesis is true.
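One way to build intuition for this number is a quick simulation. The sketch below draws many samples with rhyper() and counts how often exactly 5 defective items appear; the seed and the number of simulated samples are arbitrary choices of mine, and the simulated share will only approximate the exact probability.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# Parameters repeated so the snippet is self-contained
N <- 100; K <- 30; n <- 10; k_observed <- 5

# rhyper(nn, m, n, k): nn simulated samples, m defective items,
# n non-defective items, k items drawn per sample
draws <- rhyper(100000, K, N - K, n)

# Share of simulated samples containing exactly 5 defective items
p_sim <- mean(draws == k_observed)
print(p_sim)
```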
Two-tailed p-value
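A two-tailed p-value should measure how extreme the observation is in
either direction. The quantity computed below, P(X <= 5) + P(X >= 5) -
P(X = 5), does not do that: the two tails together cover every possible
outcome and overlap only at k = 5, so the result is 1 for any observed
value. A common alternative is to double the smaller tail, which here
gives 2 x 0.1384, or about 0.28. That is well above the usual 0.05
threshold, so observing 5 defective items in the 10 item sample is not
unusual under the null hypothesis. The null hypothesis is not rejected.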
# less than or equal to 5 defects
p_cumulative <- phyper(k_observed, K, N - K, n)
print(paste("P(X <= ", k_observed, ") =", round(p_cumulative, 4)))
## [1] "P(X <= 5 ) = 0.9612"
# greater than or equal to 5 defects
p_upper <- 1 - phyper(k_observed - 1, K, N - K, n)
print(paste("P(X >= ", k_observed, ") =", round(p_upper, 4)))
## [1] "P(X >= 5 ) = 0.1384"
# exactly 5
p_exact <- dhyper(k_observed, K, N - K, n)
p_exact
## [1] 0.09963728
# Both tails minus the double-counted P(X = 5); this equals 1 for any k_observed
p_two_tailed <- p_cumulative + p_upper - p_exact
print(p_two_tailed)
## [1] 1
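Because the combined quantity above is 1 for every possible k_observed, it may help to compare it with two widely used two-sided conventions for discrete distributions. The sketch below is my own illustration, not part of the original analysis: the first convention doubles the smaller tail, and the second adds up the probabilities of all outcomes that are no more likely than the observed one. The small relative tolerance guards against floating-point ties and is an assumption on my part.

```r
# Parameters repeated so the snippet is self-contained
N <- 100; K <- 30; n <- 10; k_observed <- 5

p_lower <- phyper(k_observed, K, N - K, n)          # P(X <= 5)
p_upper <- 1 - phyper(k_observed - 1, K, N - K, n)  # P(X >= 5)

# Convention 1: double the smaller tail, capped at 1
p_two_doubling <- min(1, 2 * min(p_lower, p_upper))

# Convention 2: sum every outcome that is no more likely than the observed one
probs <- dhyper(0:n, K, N - K, n)         # P(X = k) for k = 0..10
p_obs <- dhyper(k_observed, K, N - K, n)  # P(X = 5)
p_two_likelihood <- sum(probs[probs <= p_obs * (1 + 1e-7)])

print(round(c(doubling = p_two_doubling, likelihood = p_two_likelihood), 4))
```

The two conventions give different values because the distribution is not symmetric, but with these parameters both stay well above 0.05, so the decision not to reject the null hypothesis does not change.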
Conclusion
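The analysis indicates that observing 5 defective items in a random
sample of 10 drawn without replacement is consistent with the null
hypothesis: the upper-tail probability of 13.84% is well above the
conventional 5% significance level. There is no evidence to reject the
null hypothesis, although failing to reject it does not prove that the
shipment contains exactly 30 defective items.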