In medical statistics and health data science, the ability to quantify disease occurrence is the foundation of all subsequent analyses. We do not merely want to know if a disease is present; we need to measure how much is present, how fast it is spreading, and who is most affected.
Broadly, measures of disease frequency fall into two categories: Prevalence (the burden of existing disease) and Incidence (the flow of new disease).
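Before diving into disease-specific measures, we must distinguish between three mathematical constructs: a ratio (one quantity divided by another), a proportion (a ratio whose numerator is part of its denominator), and a rate (events per unit of person-time).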
Prevalence measures the proportion of a population that has a condition at a specific time. It is a “snapshot” of the disease burden.
Point prevalence is the proportion of the population that has the disease at a single point in time.
\[Prevalence = \frac{\text{Number of existing cases at time } t}{\text{Total population at time } t}\]
Period prevalence is the proportion of the population that has the disease at any time during a specified interval (e.g., one year).
\[Period Prevalence = \frac{\text{Existing cases at start} + \text{New cases during interval}}{\text{Average population during interval}}\]
Suppose a survey in a city of 100,000 adults found 8,500 individuals with Type 2 Diabetes on January 1st, 2023.
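The point prevalence for this survey follows directly from the formula above. A minimal sketch in R, using only the two figures quoted in the example (the variable names are illustrative):

```r
# Figures from the survey example: adults with Type 2 Diabetes on 1 January 2023
existing_cases <- 8500
population     <- 100000

# Point prevalence = existing cases / total population at that time
point_prevalence <- existing_cases / population

cat("Point prevalence:", round(point_prevalence * 100, 1), "%\n")
cat("Equivalent to", round(point_prevalence * 1000), "cases per 1,000 adults\n")
```

The result, 8.5%, describes the burden on that single date only; it says nothing about how quickly new cases are arising.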
Incidence measures the occurrence of new cases in a population initially at risk.
Cumulative incidence (CI), also called risk, is the proportion of a closed, initially disease-free population that develops the disease during a specified time period.
\[CI = \frac{\text{Number of new cases during period}}{\text{Number of disease-free individuals at start of period}}\]
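The denominator counts only people who were disease-free at the start, so anyone who already has the disease at baseline must be excluded before dividing. A small sketch with entirely hypothetical numbers (none of them come from the text):

```r
# Hypothetical closed cohort; all three numbers are illustrative assumptions
cohort_size       <- 1000  # people enrolled at baseline
prevalent_at_base <- 50    # already diseased at baseline, so not at risk
new_cases         <- 38    # first diagnoses during the follow-up period

# Only disease-free individuals can become new cases
at_risk <- cohort_size - prevalent_at_base

cumulative_incidence <- new_cases / at_risk
cat("Cumulative incidence:", round(cumulative_incidence * 100, 1), "%\n")
```

Dividing by the full cohort of 1,000 instead of the 950 at risk would understate the risk.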
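In dynamic populations, where people enter and leave observation at different times, the incidence rate (IR) uses Person-Time at risk in the denominator. Each subject contributes only the time during which they were actually observed and disease-free, which is why this measure is preferred for cohorts with unequal follow-up.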
\[IR = \frac{\text{Number of new cases during period}}{\sum \text{Person-time at risk}}\]
In an MSc program, understanding person-time is critical. We often visualize this using a Lexis Diagram or a follow-up plot.
# Creating a dummy dataset for person-time visualization
library(ggplot2)

cohort_data <- data.frame(
  id = factor(1:5),
  start = c(0, 0, 1, 2, 0),       # year of entry into follow-up
  end = c(5, 3.5, 5, 4.2, 2.8),   # year of exit (event or censoring)
  event = c(0, 1, 0, 1, 0)        # 1 = Case, 0 = Censored
)

# One horizontal segment per subject; a red cross marks disease onset
ggplot(cohort_data, aes(x = start, xend = end, y = id, yend = id)) +
  # linewidth needs ggplot2 >= 3.4.0; use size = 2 on older versions
  geom_segment(linewidth = 2, color = "steelblue") +
  geom_point(data = subset(cohort_data, event == 1), aes(x = end, y = id),
             shape = 4, size = 5, stroke = 2, color = "red") +
  theme_minimal() +
  labs(title = "Cohort Follow-up and Person-Time",
       x = "Years of Follow-up", y = "Subject ID") +
  scale_x_continuous(breaks = 0:5)
Figure 1: Follow-up of 5 subjects in a cohort study. A red ‘X’ denotes disease onset; lines that end without an ‘X’ denote censoring (loss to follow-up or end of study).
Calculation from Figure 1:

* Total Person-Years = \(5 + 3.5 + 4 + 2.2 + 2.8 = 17.5\) person-years.
* New Cases = 2.
* Incidence Rate = \(2 / 17.5 = 0.114\) cases per person-year (or 11.4 per 100 person-years).
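The same arithmetic can be reproduced from the follow-up data rather than read off the plot. The sketch below rebuilds the five-subject data frame so that it runs on its own; the column `person_years` is a name introduced here for illustration.

```r
# Same five subjects as in the follow-up plot
cohort_data <- data.frame(
  id    = factor(1:5),
  start = c(0, 0, 1, 2, 0),
  end   = c(5, 3.5, 5, 4.2, 2.8),
  event = c(0, 1, 0, 1, 0)  # 1 = case, 0 = censored
)

# Each subject contributes the time between entry and exit
cohort_data$person_years <- cohort_data$end - cohort_data$start

total_py  <- sum(cohort_data$person_years)
new_cases <- sum(cohort_data$event)
inc_rate  <- new_cases / total_py

cat("Total person-years:", total_py, "\n")
cat("New cases:", new_cases, "\n")
cat("Incidence rate:", round(inc_rate * 100, 1), "per 100 person-years\n")
```

Note that subjects 3 and 4 enter late, so their contribution is `end - start`, not simply `end`.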
In a steady-state population where the disease is relatively rare, prevalence is a function of how many people get the disease (Incidence) and how long they stay sick (Duration).
\[Prevalence \approx Incidence \times \text{Average Duration of Disease} (D)\]
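To see how the approximation behaves, the sketch below plugs in hypothetical values for incidence and duration (both are assumptions made for illustration) and compares the result with the steady-state relationship \(P / (1 - P) = I \times D\), of which the formula above is the low-prevalence shortcut.

```r
# Hypothetical inputs, not taken from the text
incidence_rate <- 5 / 1000  # 5 new cases per 1,000 person-years
mean_duration  <- 4         # average of 4 years spent with the disease

# Approximation: reasonable when prevalence is low
prev_approx <- incidence_rate * mean_duration

# Steady-state relationship without the shortcut:
# P / (1 - P) = I * D, so P = (I * D) / (1 + I * D)
prev_exact <- prev_approx / (1 + prev_approx)

cat("Approximate prevalence:", round(prev_approx * 100, 2), "%\n")
cat("Steady-state prevalence:", round(prev_exact * 100, 2), "%\n")
```

With these inputs the two values differ only in the second decimal place of the percentage; the gap widens as the product of incidence and duration grows.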
When comparing two populations (e.g., Norway vs. India), a “Crude Rate” can be misleading if the age structures differ. Older populations naturally have higher mortality rates. We use Standardization to adjust for this.
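In direct standardization, we apply the observed age-specific rates of each study population to a single “Standard Population” (e.g., the WHO World Standard), so that the resulting rates differ only because of the age-specific rates and not because of age structure.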
\[Rate_{adjusted} = \frac{\sum (Rate_{age\_i} \times Population_{standard\_i})}{\sum Population_{standard\_i}}\]
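The effect of the formula is easiest to see when two populations share identical age-specific rates but differ in age structure. The sketch below uses three age bands with entirely hypothetical counts and rates (the standard population here is made up and is not the WHO standard); `crude_rate()` and `direct_rate()` are helper names introduced for this example.

```r
# All numbers are hypothetical and chosen only to illustrate the formula
age_band <- c("0-39", "40-64", "65+")
std_pop  <- c(60000, 30000, 10000)  # made-up standard population per band

# Identical age-specific death rates in both populations
rates <- c(0.001, 0.005, 0.040)

# Different age structures: A is older, B is younger
pop_a <- c(40000, 35000, 25000)
pop_b <- c(70000, 25000, 5000)

# Crude rate: weights are the population's own age structure
crude_rate <- function(rate, pop) sum(rate * pop) / sum(pop)

# Directly standardized rate: weights are the standard population
direct_rate <- function(rate, standard) sum(rate * standard) / sum(standard)

crude_a <- crude_rate(rates, pop_a)
crude_b <- crude_rate(rates, pop_b)
adj_a   <- direct_rate(rates, std_pop)
adj_b   <- direct_rate(rates, std_pop)

cat("Crude rate A:", crude_a * 1000, "per 1,000\n")
cat("Crude rate B:", crude_b * 1000, "per 1,000\n")
cat("Adjusted rate (A and B):", adj_a * 1000, "per 1,000\n")
```

The crude rates differ roughly threefold purely because population A is older, while the standardized rates coincide, which is the behaviour the adjustment is meant to deliver.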
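When age-specific rates for the study population are unknown or unstable (small numbers), we use indirect standardization: the age-specific rates of a reference population are applied to the study population's age structure to obtain the expected number of deaths, which is then compared with the observed number through the Standardized Mortality Ratio (SMR).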
\[SMR = \frac{\text{Observed Deaths}}{\text{Expected Deaths}}\]
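The expected deaths in the denominator are obtained by multiplying each reference age-specific rate by the number of study subjects in that age band and summing. A sketch with hypothetical reference rates, study counts and observed deaths (none of these figures come from the text):

```r
# Hypothetical inputs for three age bands
ref_rate  <- c(0.001, 0.005, 0.040)  # reference population death rates
study_pop <- c(2000, 1500, 500)      # study population size per age band
observed  <- 40                      # total deaths observed in the study population

# Expected deaths if the study population experienced the reference rates
expected <- sum(ref_rate * study_pop)

smr <- observed / expected

cat("Expected deaths:", expected, "\n")
cat("SMR:", round(smr, 2), "\n")
```

An SMR above 1 means more deaths were observed than the reference rates would predict for a population of this age structure; an SMR below 1 means fewer.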
As health data scientists, we must report the precision of our estimates.
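For a proportion \(p\) (prevalence or cumulative incidence) estimated from a large sample of size \(n\), the normal-approximation (Wald) confidence interval is: \[\text{95\% confidence interval} = p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}\]
For an incidence rate based on \(D\) events in \(PT\) person-time, the normal approximation to the Poisson distribution gives: \[\text{95\% confidence interval} = \frac{D \pm 1.96\sqrt{D}}{PT}\]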
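Both intervals are straightforward to compute by hand. The sketch below applies the first formula to the diabetes survey from earlier (8,500 cases among 100,000 adults) and the second to the 45 events in 10,500 person-years used in the cardiovascular example later in this section; the variable names are illustrative.

```r
z <- qnorm(0.975)  # standard normal quantile, approximately 1.96

# Proportion: prevalence from the diabetes survey
n    <- 100000
p    <- 8500 / n
se_p <- sqrt(p * (1 - p) / n)
ci_p <- p + c(-1, 1) * z * se_p

# Rate: events and person-time from the cardiovascular cohort
d       <- 45
pt      <- 10500
ci_rate <- (d + c(-1, 1) * z * sqrt(d)) / pt

cat("Prevalence 95% CI:", round(ci_p * 100, 2), "%\n")
cat("Rate 95% CI:", round(ci_rate * 1000, 2), "per 1,000 person-years\n")
```

Because the rate interval is symmetric around the estimate, it will not match an exact Poisson interval, which is asymmetric; the difference is most visible when the number of events is small.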
The example below calculates these measures in base R. `poisson.test()` is part of the built-in stats package, so the Epi package is not required here.
# Real-life scenario: 10-year follow up of a cohort for cardiovascular disease
cases <- 45
total_n <- 1200
person_years <- 10500
# 1. Cumulative Incidence (Risk)
risk <- cases / total_n
cat("10-year Risk:", round(risk * 100, 2), "%\n")
## 10-year Risk: 3.75 %
# 2. Incidence Rate
ir <- cases / person_years
cat("Incidence Rate:", round(ir * 1000, 2), "per 1,000 person-years\n")
## Incidence Rate: 4.29 per 1,000 person-years
# 3. Confidence Interval for Rate (Exact Poisson)
# Using poisson.test for exact CI
res <- poisson.test(cases, person_years)
cat("95% CI for IR:", round(res$conf.int[1]*1000, 2), "-", round(res$conf.int[2]*1000, 2), "\n")
## 95% CI for IR: 3.13 - 5.73
| Measure | Numerator | Denominator | Interpretation |
|---|---|---|---|
| Prevalence | All cases | Total population | Burden of disease |
| Cum. Incidence | New cases | Population at risk (start) | Probability of developing disease |
| Incidence Rate | New cases | Total Person-Time | Velocity of disease onset |