1. Introduction

In medical statistics and health data science, the ability to quantify disease occurrence is the foundation of all subsequent analyses. We do not merely want to know if a disease is present; we need to measure how much is present, how fast it is spreading, and who is most affected.

Broadly, measures of disease frequency fall into two categories: Prevalence (the burden of existing disease) and Incidence (the flow of new disease).

2. Basic Mathematical Tools: Ratios, Proportions, and Rates

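Before diving into disease-specific measures, we must distinguish between three mathematical constructs:

  • Ratio: one quantity divided by another, where the numerator need not be part of the denominator (e.g., male cases to female cases).
  • Proportion: a ratio in which the numerator is a subset of the denominator, so it lies between 0 and 1 (e.g., cases among all people surveyed).
  • Rate: the number of events divided by the person-time at risk, so it carries units of 1/time (e.g., cases per 1,000 person-years).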
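
A small numeric sketch of the three constructs (ratio, proportion, rate) may help fix the distinction. All counts below are made up for illustration and do not come from any study.

```r
# Hypothetical counts, chosen only to illustrate the three constructs
male_cases   <- 30
female_cases <- 20
population   <- 1000   # people observed
person_years <- 4000   # total follow-up time contributed by those people

# Ratio: the numerator is NOT part of the denominator
sex_ratio <- male_cases / female_cases

# Proportion: the numerator is a subset of the denominator (between 0 and 1)
prop_cases <- (male_cases + female_cases) / population

# Rate: events per unit of person-time (units of 1/time)
rate <- (male_cases + female_cases) / person_years

cat("Male:female case ratio:", sex_ratio, "\n")
cat("Proportion affected:", prop_cases, "\n")
cat("Rate:", rate * 1000, "per 1,000 person-years\n")
```

Only the rate changes if the same 50 cases arise over a longer or shorter follow-up; the ratio and the proportion ignore time entirely.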

3. Prevalence

Prevalence measures the proportion of a population that has a condition at a specific time. It is a “snapshot” of the disease burden.

3.1 Point Prevalence

The proportion of the population that has the disease at a single point in time.

\[Prevalence = \frac{\text{Number of existing cases at time } t}{\text{Total population at time } t}\]

3.2 Period Prevalence

The proportion of the population that has the disease at any time during a specified interval (e.g., one year).

\[Period Prevalence = \frac{\text{Existing cases at start} + \text{New cases during interval}}{\text{Average population during interval}}\]

3.3 Real-Life Example: Diabetes in the UK

Suppose a survey in a city of 100,000 adults found 8,500 individuals with Type 2 Diabetes on January 1st, 2023.

  • Point Prevalence: \(8,500 / 100,000 = 0.085\) or 8.5%.
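
The same arithmetic can be sketched in R. The point prevalence uses the survey figures above; the period prevalence additionally needs the number of new cases during 2023 and the average population, which the example does not give, so the values used for them here are assumptions for illustration only.

```r
# Point prevalence, using the figures from the example above
existing_cases <- 8500
population     <- 100000
point_prev <- existing_cases / population

# Period prevalence for 2023.
# ASSUMED values (not from the survey): new cases and average population.
new_cases_2023 <- 600
avg_population <- 100000
period_prev <- (existing_cases + new_cases_2023) / avg_population

cat("Point prevalence:", round(point_prev * 100, 1), "%\n")
cat("Period prevalence:", round(period_prev * 100, 1), "%\n")
```

Period prevalence can never be smaller than the point prevalence at the start of the interval, because it counts the same existing cases plus any that arise later.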

4. Incidence

Incidence measures the occurrence of new cases in a population initially at risk.

4.1 Cumulative Incidence (Risk)

The proportion of a closed population that develops the disease during a specified time period.

\[CI = \frac{\text{Number of new cases during period}}{\text{Number of disease-free individuals at start of period}}\]

4.2 Incidence Rate (Incidence Density)

In dynamic populations where people are followed for different lengths of time, we use person-time in the denominator. Because it accounts for unequal follow-up, the incidence rate is usually the preferred measure in cohort analyses.

\[IR = \frac{\text{Number of new cases during period}}{\sum \text{Person-time at risk}}\]


5. Visualizing Person-Time

In an MSc program, understanding person-time is critical. We often visualize this using a Lexis Diagram or a follow-up plot.

library(ggplot2)

# Creating a dummy dataset for person-time visualization
cohort_data <- data.frame(
  id = factor(1:5),
  start = c(0, 0, 1, 2, 0),          # year of entry into follow-up
  end = c(5, 3.5, 5, 4.2, 2.8),      # year of exit (event or censoring)
  event = c(0, 1, 0, 1, 0) # 1 = Case, 0 = Censored
)

# One horizontal segment per subject; a red 'X' marks disease onset.
# `linewidth` replaces `size` for lines from ggplot2 3.4.0 onwards.
ggplot(cohort_data, aes(x = start, xend = end, y = id, yend = id)) +
  geom_segment(linewidth = 2, color = "steelblue") +
  geom_point(data = subset(cohort_data, event == 1), aes(x = end, y = id), 
             shape = 4, size = 5, stroke = 2, color = "red") +
  theme_minimal() +
  labs(title = "Cohort Follow-up and Person-Time",
       x = "Years of Follow-up", y = "Subject ID") +
  scale_x_continuous(breaks = 0:5)

Figure 1: Follow-up of 5 subjects in a cohort study. ‘X’ denotes disease onset; lines ending without an ‘X’ denote censoring (loss to follow-up or end of study).

Calculation from Figure 1:

  • Total Person-Years = \(5 + 3.5 + 4 + 2.2 + 2.8 = 17.5\) person-years.
  • New Cases = 2.
  • Incidence Rate = \(2 / 17.5 = 0.114\) cases per person-year (or 11.4 per 100 person-years).
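
The calculation from Figure 1 can be reproduced directly from the follow-up data. This is a minimal sketch that rebuilds the same five-subject data frame so it runs on its own; the column name `person_years` is introduced here for illustration.

```r
# Same five subjects as in the follow-up plot
cohort_data <- data.frame(
  id    = factor(1:5),
  start = c(0, 0, 1, 2, 0),
  end   = c(5, 3.5, 5, 4.2, 2.8),
  event = c(0, 1, 0, 1, 0)   # 1 = Case, 0 = Censored
)

# Each subject contributes the time between entry and exit
cohort_data$person_years <- cohort_data$end - cohort_data$start

total_py  <- sum(cohort_data$person_years)  # denominator: person-time at risk
new_cases <- sum(cohort_data$event)         # numerator: new cases
ir        <- new_cases / total_py

cat("Total person-years:", total_py, "\n")
cat("New cases:", new_cases, "\n")
cat("Incidence rate:", round(ir * 100, 1), "per 100 person-years\n")
```

Note that subjects 3 and 4 enter late, so their contribution is `end - start`, not simply `end`.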


6. Relationship Between Prevalence and Incidence

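In a steady-state population, prevalence is a function of how many people get the disease (Incidence) and how long they stay sick (Duration). The simple product below is a good approximation when prevalence is low (roughly below 10%); in general, it is the prevalence odds \(P/(1-P)\) that equal incidence times duration.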

\[Prevalence \approx Incidence \times \text{Average Duration of Disease} (D)\]
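
To see how well the approximation behaves, the sketch below compares it with the steady-state identity \(P/(1-P) = Incidence \times D\), solved for \(P\). The two scenarios are hypothetical: a rare condition of short duration and a common, long-lasting one.

```r
# Hypothetical scenarios (values are assumptions for illustration)
incidence <- c(rare = 0.005, common = 0.05)  # new cases per person-year
duration  <- c(rare = 4,     common = 10)    # mean duration of disease, years

# Approximation: Prevalence ~ Incidence x Duration
approx_prev <- incidence * duration

# Steady-state identity: P / (1 - P) = Incidence x Duration
exact_prev <- approx_prev / (1 + approx_prev)

print(round(data.frame(incidence, duration, approx_prev, exact_prev), 4))
```

In the rare scenario the two values agree closely (about 2%), while in the common scenario the product overstates prevalence considerably (0.5 against roughly 0.33).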


7. Standardization of Rates

When comparing two populations (e.g., Norway vs. India), a “Crude Rate” can be misleading if the age structures differ: a population with a larger share of older people will have a higher crude mortality rate even if its age-specific rates are no worse. We use Standardization to adjust for this.

7.1 Direct Standardization

We apply the observed age-specific rates of our study populations to a single “Standard Population” (e.g., the WHO World Standard).

\[Rate_{adjusted} = \frac{\sum (Rate_{age\_i} \times Population_{standard\_i})}{\sum Population_{standard\_i}}\]
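
A minimal sketch of direct standardization follows. The age bands, rates, population sizes and standard population are all hypothetical (they are not the WHO World Standard); they are chosen so that population A is older than population B while B has slightly higher rates in every age band.

```r
# Hypothetical inputs for three age bands: 0-39, 40-64, 65+
std_pop <- c(60000, 30000, 10000)        # assumed standard population

rate_a <- c(1.0, 5.0, 40.0) / 1000       # age-specific death rates, population A
rate_b <- c(1.2, 6.0, 45.0) / 1000       # age-specific death rates, population B
pop_a  <- c(30000, 40000, 30000)         # A: older age structure
pop_b  <- c(70000, 25000, 5000)          # B: younger age structure

# Crude rate: age-specific rates weighted by the population's OWN age structure
crude_rate <- function(rate, pop) sum(rate * pop) / sum(pop)

# Directly standardized rate: same rates weighted by the STANDARD age structure
direct_std_rate <- function(rate, std) sum(rate * std) / sum(std)

crude    <- c(A = crude_rate(rate_a, pop_a), B = crude_rate(rate_b, pop_b))
adjusted <- c(A = direct_std_rate(rate_a, std_pop),
              B = direct_std_rate(rate_b, std_pop))

print(round(rbind(crude, adjusted) * 1000, 2))  # per 1,000 person-years
```

With these numbers the crude rate is far higher in A, yet after standardization B has the higher rate, which is what the age-specific rates imply.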

7.2 Indirect Standardization (SMR)

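When age-specific rates for the study population are unknown or unstable (small numbers), we use the Standardized Mortality Ratio (SMR). Expected deaths are obtained by applying the age-specific rates of a reference population to the age structure (person-time) of the study population.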

\[SMR = \frac{\text{Observed Deaths}}{\text{Expected Deaths}}\]
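
The SMR can be sketched as follows, with expected deaths computed by applying reference rates to the study population's person-years in each age band. All numbers are hypothetical. The confidence interval treats the observed count as Poisson with the expected count as a fixed denominator, which is one common approach rather than the only one.

```r
# Hypothetical inputs for three age bands: 0-39, 40-64, 65+
ref_rate <- c(1, 5, 40) / 1000     # reference population death rates per person-year
study_py <- c(5000, 3000, 500)     # study population person-years by age band
observed <- 50                     # total deaths observed in the study population

# Expected deaths if the study population had the reference rates
expected <- sum(ref_rate * study_py)

smr <- observed / expected

# Exact Poisson interval for observed / expected (expected treated as fixed)
smr_ci <- as.numeric(poisson.test(observed, expected)$conf.int)

cat("Expected deaths:", expected, "\n")
cat("SMR:", round(smr, 2), "\n")
cat("95% CI:", round(smr_ci[1], 2), "-", round(smr_ci[2], 2), "\n")
```

An SMR above 1 means more deaths were observed than the reference rates would predict; only the total observed count is needed, not the study population's own age-specific rates.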


8. Statistical Inference: Confidence Intervals

As health data scientists, we must report the precision of our estimates.

95% CI for Prevalence (Normal Approximation)

For a large sample size \(n\): \[95\% CI = p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}\]
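
As a worked sketch, the interval can be computed for the diabetes example from Section 3.3 (8,500 cases among 100,000 adults). The variable names are illustrative.

```r
# Figures from the diabetes example in Section 3.3
cases <- 8500
n     <- 100000

p  <- cases / n                      # point prevalence
se <- sqrt(p * (1 - p) / n)          # standard error of a proportion
z  <- qnorm(0.975)                   # approximately 1.96

ci <- p + c(-1, 1) * z * se

cat("Prevalence:", round(p * 100, 2), "%\n")
cat("95% CI:", round(ci[1] * 100, 2), "% to", round(ci[2] * 100, 2), "%\n")
```

With a sample this large the interval is narrow (roughly 8.3% to 8.7%); for small samples or proportions near 0 or 1, the normal approximation becomes unreliable.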

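95% CI for Incidence Rate (Normal Approximation to the Poisson)

For \(D\) events in \(PT\) person-time: \[95\% CI = \frac{D \pm 1.96\sqrt{D}}{PT}\]

This approximation is adequate only when \(D\) is reasonably large; with few events, an exact Poisson interval is preferable.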
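
The sketch below compares the normal approximation with the exact Poisson interval from `poisson.test()` for two cases taken from this document: the 2 events in 17.5 person-years of Figure 1, and the 45 events in 10,500 person-years used in Section 9. The helper `rate_ci_normal()` is a name introduced here for illustration.

```r
# Normal-approximation CI for a rate: (D +/- z * sqrt(D)) / PT
rate_ci_normal <- function(d, pt, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  (d + c(-1, 1) * z * sqrt(d)) / pt
}

# Exact Poisson CI for a rate, via base R's poisson.test()
rate_ci_exact <- function(d, pt, level = 0.95) {
  as.numeric(poisson.test(d, pt, conf.level = level)$conf.int)
}

small_normal <- rate_ci_normal(2, 17.5)     # few events (Figure 1)
small_exact  <- rate_ci_exact(2, 17.5)
large_normal <- rate_ci_normal(45, 10500)   # more events (Section 9)
large_exact  <- rate_ci_exact(45, 10500)

print(round(rbind(small_normal, small_exact), 3))          # per person-year
print(round(rbind(large_normal, large_exact) * 1000, 2))   # per 1,000 person-years
```

With only 2 events the normal approximation gives a negative lower limit, which is impossible for a rate, whereas the exact interval stays positive. With 45 events the two intervals are much closer.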


9. Practical Exercises in R

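Below is an example of how to calculate these measures in base R; poisson.test() from the built-in stats package provides the exact Poisson interval. Lines beginning with ## show the printed output.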

# Real-life scenario: 10-year follow up of a cohort for cardiovascular disease
cases <- 45
total_n <- 1200
person_years <- 10500

# 1. Cumulative Incidence (Risk)
risk <- cases / total_n
cat("10-year Risk:", round(risk * 100, 2), "%\n")
## 10-year Risk: 3.75 %

# 2. Incidence Rate
ir <- cases / person_years
cat("Incidence Rate:", round(ir * 1000, 2), "per 1,000 person-years\n")
## Incidence Rate: 4.29 per 1,000 person-years

# 3. Confidence Interval for Rate (Exact Poisson)
# Using poisson.test for exact CI
res <- poisson.test(cases, person_years)
cat("95% CI for IR:", round(res$conf.int[1]*1000, 2), "-", round(res$conf.int[2]*1000, 2), "\n")
## 95% CI for IR: 3.13 - 5.73

10. Summary Table

| Measure | Numerator | Denominator | Interpretation |
|---|---|---|---|
| Prevalence | All existing cases | Total population | Burden of disease |
| Cum. Incidence | New cases | Population at risk (start) | Probability of developing disease |
| Incidence Rate | New cases | Total Person-Time | Velocity of disease onset |

Further Reading

  1. Rothman, K. J. (2012). Epidemiology: An Introduction. Oxford University Press.
  2. Kirkwood, B. R., & Sterne, J. A. (2003). Essential Medical Statistics. Blackwell Science.