DATA 621: Blog 4

Introduction
In this blog post, I demonstrate how to use the hypergeometric distribution for hypothesis testing. The hypergeometric distribution describes the probability of obtaining exactly k successes in n draws, made without replacement, from a finite population of size N that contains K successes.

The probability mass function gives the probability of exactly k successes in the n draws: \[ P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \] where \(\binom{a}{b}\) is the binomial coefficient: \[ \binom{a}{b} = \frac{a!}{b!(a-b)!} \]

Hypothesis
Suppose we receive a shipment of 100 items. Based on prior shipments, we know 30 out of 100 items received are usually defective. We want to determine if this new shipment of 100 items contains exactly 30 defective items. To test this, we randomly draw a sample of 10 items without replacement from the new shipment to inspect for defects.

The hypothesis:

Null Hypothesis: The shipment contains 30 defective items.
Alternative Hypothesis: The shipment does not contain 30 defective items; it contains either fewer or more.
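
Written in terms of the number of defective items K in the shipment, the two hypotheses can be stated compactly as follows. The symbols H_0 and H_1 are notation introduced here for convenience: \[ H_0: K = 30 \qquad \text{versus} \qquad H_1: K \neq 30 \] Under H_0, the number of defective items X in the sample follows the hypergeometric distribution with N = 100, K = 30, and n = 10.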

Now we will use R to test the hypothesis.

Define Parameters
Below we set:
1. N = 100, the total shipment (population) size
2. K = 30, the number of defective items in the shipment under the null hypothesis, based on prior shipments
3. n = 10, the sample size used to test the hypothesis
4. k_observed = 5, the number of defective items observed in the sample of 10

By setting k_observed to 5, we ask whether finding 5 defective items in a random sample of 10 is consistent with the null hypothesis that the shipment contains 30 defective items.

N <- 100        # Total population 
K <- 30         # Total number of defective items in population
n <- 10         # Sample size
k_observed <- 5 # Observed number of defective items in the sample
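
Before computing any probabilities, it can help to check what these parameters imply. The sketch below works out the smallest and largest possible number of defective items in the sample and the expected number under the null hypothesis. The names k_min, k_max, and expected_defects are introduced here for illustration only.

```r
# Parameters as defined in the post
N <- 100; K <- 30; n <- 10

# Smallest and largest possible number of defective items in the sample
k_min <- max(0, n - (N - K))
k_max <- min(n, K)

# Expected number of defective items in the sample under the null hypothesis
expected_defects <- n * K / N

print(c(k_min = k_min, k_max = k_max, expected = expected_defects))
```

With these values, the sample can contain anywhere from 0 to 10 defective items, and 3 are expected on average, so an observed count of 5 sits above the expectation.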

Calculate the Hypergeometric Probability
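Below we use dhyper() to calculate the probability of observing exactly k_observed = 5 defective items under the hypergeometric distribution. Note that dhyper(x, m, n, k) takes the number of successes in the population (here K), the number of failures in the population (here N - K), and the sample size (here n). The result: there is a 9.96% chance of observing exactly 5 defective items in a sample of 10 when the null hypothesis is true.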

# probability of 5 defective items from the sample of 10
p_k <- dhyper(k_observed, K, N - K, n)
print(paste("P(X =", k_observed, ") =", round(p_k, 4)))

## [1] "P(X = 5 ) = 0.0996"
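
As a sanity check, the same probability can be computed directly from the probability mass function in the introduction using choose(). This is a minimal sketch; p_manual and p_builtin are names introduced here for the comparison.

```r
N <- 100; K <- 30; n <- 10; k_observed <- 5

# Probability mass function written out with binomial coefficients
p_manual <- choose(K, k_observed) * choose(N - K, n - k_observed) / choose(N, n)

# Built-in hypergeometric density for comparison
p_builtin <- dhyper(k_observed, K, N - K, n)

print(round(c(manual = p_manual, builtin = p_builtin), 4))
print(isTRUE(all.equal(p_manual, p_builtin)))
```

Both approaches should agree up to floating-point error, which confirms that the arguments passed to dhyper() match the K, N - K, and n of the formula.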

Calculate Cumulative Probability
Next we calculate the p-values, which tell us whether observing k_observed = 5 is consistent with the null hypothesis or unusually extreme. Since k_observed is 5, we calculate the probability that k is less than or equal to 5, the probability that k is greater than or equal to 5, and the probability that k is exactly 5.

Lower tail
The lower tail represents the probability of observing 5 or fewer defective items in the sample under the null hypothesis. The lower-tail cumulative probability was determined to be 96.12%, meaning there is a 96.12% chance of observing 5 or fewer defective items if the null hypothesis is true.

Upper tail
The upper tail represents the probability of observing 5 or more defective items in the sample under the null hypothesis. The upper-tail cumulative probability was determined to be 13.84%, meaning there is a 13.84% chance of observing five or more defective items if the null hypothesis is true.

Exactly 5 Defects
The probability of observing exactly 5 defective items in the sample under the null hypothesis is 9.96%. This means there is a 9.96% chance that a sample of 10 will contain exactly 5 defective items assuming the null hypothesis is true.
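
The three probabilities above can also be approximated by simulation, which offers an independent check on the exact calculations. The sketch below uses rhyper() to draw many samples under the null hypothesis. The seed and the number of simulations are arbitrary choices made here, not part of the original analysis.

```r
N <- 100; K <- 30; n <- 10; k_observed <- 5

set.seed(621)        # arbitrary seed, chosen only for reproducibility
n_sims <- 100000     # number of simulated samples (an assumption)

# Each value is the number of defective items in one simulated sample of n
sims <- rhyper(n_sims, K, N - K, n)

sim_lower <- mean(sims <= k_observed)   # estimates P(X <= 5)
sim_upper <- mean(sims >= k_observed)   # estimates P(X >= 5)
sim_exact <- mean(sims == k_observed)   # estimates P(X = 5)

print(round(c(lower = sim_lower, upper = sim_upper, exact = sim_exact), 3))
```

The simulated proportions should land close to 96.12%, 13.84%, and 9.96%, with small differences due to sampling noise.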

Two-tailed p-value
The two-tailed p-value measures how extreme the observed count is in either direction. Simply adding the lower and upper tails does not work, because together they cover every possible outcome and overlap at k = 5. A common convention instead doubles the smaller of the two tail probabilities and caps the result at 1. Here the smaller tail is the upper tail, which gives a two-tailed p-value of about 0.277. Since this is well above the conventional 0.05 significance level, observing 5 defective items in the 10-item sample is not unusual under the null hypothesis. The null hypothesis is not rejected.

# less than or equal to 5 defects
p_cumulative <- phyper(k_observed, K, N - K, n)
print(paste("P(X <= ", k_observed, ") =", round(p_cumulative, 4)))

## [1] "P(X <=  5 ) = 0.9612"

# greater than or equal to 5 defects
p_upper <- 1 - phyper(k_observed - 1, K, N - K, n)
print(paste("P(X >= ", k_observed, ") =", round(p_upper, 4)))

## [1] "P(X >=  5 ) = 0.1384"

# exactly 5
p_exact <- dhyper(k_observed, K, N - K, n)
p_exact

## [1] 0.09963728

# Two-tailed p-value: double the smaller tail, capped at 1
p_two_tailed <- min(1, 2 * min(p_cumulative, p_upper))
print(round(p_two_tailed, 3))

## [1] 0.277
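
There is more than one way to define a two-sided p-value for a discrete distribution. One alternative sums the probabilities of all outcomes that are no more likely than the observed one. The sketch below illustrates that idea; the tolerance factor and the variable names are choices made here, not part of the original analysis.

```r
N <- 100; K <- 30; n <- 10; k_observed <- 5

# Null distribution over every possible number of defective items in the sample
k_values <- 0:n
probs <- dhyper(k_values, K, N - K, n)

# Sum the probabilities of outcomes no more likely than the observed one.
# The small relative tolerance guards against floating-point ties.
p_obs <- dhyper(k_observed, K, N - K, n)
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])

print(round(p_two_sided, 4))
```

Under this definition the p-value comes out at roughly 0.16, which is still well above 0.05, so the choice of convention does not change the decision in this example.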

Conclusion
The analysis indicates that observing 5 defective items in a random sample of 10 drawn without replacement is consistent with the null hypothesis that the shipment contains 30 defective items. The sample does not provide sufficient evidence to reject the null hypothesis.
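
To put this conclusion in context, the sketch below lists which sample counts would have led to rejection. It assumes a significance level of 0.05 and a two-sided p-value obtained by doubling the smaller tail; both are assumptions made here for illustration.

```r
N <- 100; K <- 30; n <- 10
alpha <- 0.05   # assumed significance level, not stated in the post

k_values <- 0:n
lower <- phyper(k_values, K, N - K, n)          # P(X <= k)
upper <- 1 - phyper(k_values - 1, K, N - K, n)  # P(X >= k)

# Two-sided p-value for each k: double the smaller tail, capped at 1
p_two_sided <- pmin(1, 2 * pmin(lower, upper))

# Sample counts that would lead to rejecting the null hypothesis
reject_k <- k_values[p_two_sided < alpha]
print(reject_k)
```

Under these assumptions, only a sample with 0 defective items or with 7 or more would lead to rejection, so an observed count of 5 falls inside the non-rejection range.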

Gregg Maloy