We conduct epidemiological surveys to understand, calculate, or estimate the prevalence of diseases and specific behavioral patterns such as smoking, helmet use, and headphone use while working. When using surveys to determine these prevalence values, we typically select a random sample that represents the entire population.

With this background, let’s consider an example. Imagine you’ve conducted a survey in a village with a population of 2000, using a sample size of 385 people to determine the prevalence of diabetes. Blood samples were collected from each participant after a 12-hour fasting period, and there were no non-responses. At the end of the survey, 37 blood samples exhibited higher blood glucose levels than expected, classifying them as diabetic cases. We assume that the test kits used had 100% sensitivity and specificity. This is what we refer to as our “observed data.” Based on this sample, we can calculate the prevalence of diabetes in the village, which is approximately 9.6%.

From a statistical perspective, you may recognize that the sample size and the number of positive samples for diabetes represent count or discrete data. In probability terms, you can think of the sample size as the number of “trials” conducted and the number of positive blood samples for diabetes as the “successes” achieved.

If the same study were conducted with the same sample size, it’s likely that the number of positive samples for diabetes would remain close to the previous number. However, there’s a range of possibilities, from 0 (none of the samples turning out positive for diabetes) to 385 (all samples becoming positive for diabetes).

To model and summarize the data generation process, we use a binomial Probability Distribution Function (PDF). The choice of the binomial PDF is based on two main characteristics:

  1. Type of data (count/discrete data).
  2. Data distribution (successes out of total trials/tests).

Mathematically, we can express the binomial model as follows:

\(P(x) = \left( \begin{array}{c} n \\ x \end{array} \right)\pi^x (1-\pi)^{n-x}\)

Where n, x and \(\pi\) denote number of trials, number of successes and underlying probability of success.

When applying our data to this model, it looks like this:

\(P(getting\ 37\ out\ of\ 385\ trials) = \left( \begin{array}{c} 385 \\ 37 \end{array} \right)\pi^{37}(1-\pi)^{385-37}\)

What is \(\pi\)? What does this equation tells us?

\(\pi\) represents the population-level prevalence (a parameter) that led to the result of 37 positive blood samples for diabetes. We refer to this as the “likelihood” value which generated our observed data. As you know, prevalence can range from 0 to 100 when expressed as a percentage (or 0 to 1 when expressed as a proportion). This equation helps us find the probabilities of producing 37 as the result using different values for \(\pi\), ranging from 0 to 1. Our goal is to find the \(\pi\) value that maximizes the probability of producing 37 as the result. This value for \(\pi\) formally receives the name “Maximum Likelihood Estimate (MLE).”

We can rewrite the equation with this new understanding as:

\(l(\pi) = \left( \begin{array}{c} 385 \\ 37 \end{array} \right)\pi^{37}(1-\pi)^{385-37}\)

What would be the population-level diabetic prevalence given the observed data? In other words, what is the population diabetic prevalence, given this data?

# Creating a vector of pi values ranging 0 to 1 with steps of 0.001
pi <- seq(from = 0, to = 1, by = 0.001)

# Number of trials/tests
n <- 385

# Observed positivity
x <- 37

# Calculate probability of getting 37 in different pi values
pd <- pi^x * (1-pi)^(n-x)

data <- data.frame(pi = pi, probability = pd)
head(data)
##      pi   probability
## 1 0.000  0.000000e+00
## 2 0.001 7.059759e-112
## 3 0.002 6.847597e-101
## 4 0.003  1.582711e-94
## 5 0.004  4.682410e-90
## 6 0.005  1.271517e-86
# Plotting the results
plot(x = pi, y = pd, type = "l", ylab = "Probability", xaxt = "n")
axis(side = 1, at = seq(0,1,0.1))
grid()

# Finding pi value with maximum probability value

data[data$probability == max(data$probability),]
##       pi probability
## 97 0.096 1.23211e-53

Therefore, we can conclude that, within the range of values for \(\pi\) the most probable population estimate is 9.6%, based on the observed data.