In this example, we use the Mark–Recapture method to estimate the true size of a smoking population, which is essential for planning effective public health interventions, allocating resources, and understanding the burden of tobacco use within a community.
The Mark–Recapture method, originally developed in ecology to estimate wildlife populations, offers a statistically grounded way to estimate the size of a partially observed human population when a direct count is not feasible.
The example below applies the method to a smoking population using simulated data.
The code first loads {tidyverse}, a collection of R packages for data wrangling, visualization, and analysis; only {ggplot2} is used later, for the plot. It then defines the three values the Mark–Recapture method needs. M is the number of smokers identified in the first sample, who serve as the "marked" individuals. C is the total number of smokers captured in the second, independent sample. R is the number of smokers who appear in both samples, that is, the recaptures. Both the Lincoln–Petersen and Chapman estimators are computed from these three values.
# Load library
library(tidyverse)
# Add the values for the first sample, second sample and recaptured smokers
M <- 200 # First sample: smokers identified (“marked”)
C <- 180 # Second sample: total smokers captured
R <- 45 # Recaptured smokers (appear in both samples)
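The Lincoln–Petersen estimator is the simplest and most widely used mark–recapture formula for estimating the size of a partially observed population. It uses three values: the number of individuals identified in the first sample (M), the number identified in the second sample (C), and the number who appear in both samples (R). The reasoning is that the share of marked individuals in the second sample (R / C) should approximate the share of marked individuals in the whole population (M / N), which gives N ≈ (M × C) / R.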
# Lincoln–Petersen estimator formula
N_LP <- (M * C) / R
N_LP
## [1] 800
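To see how strongly the estimate depends on the recapture count, the short sketch below recomputes the Lincoln–Petersen estimate for a few values of R while holding M and C fixed. The alternative recapture counts (30 and 60) and the names R_values and N_LP_values are illustrative assumptions, not part of the example data.

```r
# Same first- and second-sample sizes as in the example above
M <- 200
C <- 180

# Hypothetical recapture counts; only 45 comes from the example
R_values <- c(30, 45, 60)

# Lincoln–Petersen estimate for each recapture count
N_LP_values <- (M * C) / R_values
N_LP_values
## [1] 1200  800  600
```

A shift of 15 recaptures in either direction moves the estimate by several hundred smokers, which is why the uncertainty around the estimate deserves as much attention as the point value.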
The following section of the code calculates the uncertainty around the Lincoln–Petersen population estimate by computing its approximate variance, standard error, and 95% confidence interval.
# Variance (approximate)
var_LP <- (M^2 * C * (C - R)) / (R^3)
# Standard error
se_LP <- sqrt(var_LP)
# 95% CI
LP_lower <- N_LP - 1.96 * se_LP
LP_upper <- N_LP + 1.96 * se_LP
c(
LP_Estimate = N_LP,
LP_Lower95 = LP_lower,
LP_Upper95 = LP_upper
)
## LP_Estimate LP_Lower95 LP_Upper95
## 800.0000 597.5721 1002.4279
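One quick sanity check on an interval like this, sketched here with the example's values, is that the lower bound should not fall below the number of distinct smokers actually observed across the two samples, M + C − R. The name n_observed is introduced only for illustration.

```r
# Values from the example above
M <- 200
C <- 180
R <- 45

# Distinct smokers seen at least once: a hard floor for the true population size
n_observed <- M + C - R
n_observed
## [1] 335

# Recompute the Lincoln–Petersen lower bound with the same formulas as above
N_LP <- (M * C) / R
se_LP <- sqrt((M^2 * C * (C - R)) / R^3)
LP_lower <- N_LP - 1.96 * se_LP

# The interval is plausible only if its lower bound exceeds that floor
LP_lower > n_observed
## [1] TRUE
```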
This line of code calculates the Chapman estimator, an improved version of the Lincoln–Petersen method that reduces bias, particularly when the number of recaptured individuals (R) is small. The formula adds 1 to each of the three input values M, C, and R to stabilize the estimate and prevent inflation that can occur when recapture numbers are low. After multiplying the adjusted first and second sample sizes and dividing by the adjusted number of recaptures, the code subtracts 1 to return the final Chapman population estimate.
# Chapman Adjusted Estimate formula
N_Chapman <- ((M + 1) * (C + 1) / (R + 1)) - 1
N_Chapman
## [1] 789.8913
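The bias reduction described above can be explored with a small simulation. The sketch below assumes a hypothetical true population of 800 smokers (N_true), keeps M and C as in the example, and draws the recapture count from a hypergeometric distribution with base R's rhyper(). The seed, the number of simulations, and all variable names ending in _sim are arbitrary choices made for illustration.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Assumed "true" population for the simulation (hypothetical)
N_true <- 800
M <- 200        # marked smokers in the first sample
C <- 180        # size of the second sample
n_sims <- 20000 # number of simulated second samples

# Marked smokers in a random second sample of size C:
# hypergeometric draw with M marked and N_true - M unmarked smokers
R_sim <- rhyper(n_sims, m = M, n = N_true - M, k = C)

# Drop any zero-recapture draws (extremely unlikely here),
# where the Lincoln–Petersen estimate is undefined
R_sim <- R_sim[R_sim > 0]

# Apply both estimators to every simulated recapture count
lp_sim <- (M * C) / R_sim
chapman_sim <- ((M + 1) * (C + 1) / (R_sim + 1)) - 1

# Average estimate per method, to compare against N_true
c(
  LP_mean = mean(lp_sim),
  Chapman_mean = mean(chapman_sim)
)
```

If the bias reduction holds under these assumptions, the Chapman mean should sit closer to 800 than the Lincoln–Petersen mean, which tends to overshoot.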
The following section of the code computes the statistical uncertainty around the Chapman-adjusted population estimate by calculating its variance, standard error, and 95% confidence interval.
# Variance of Chapman estimator
var_chapman <- ((M + 1) * (C + 1) * (M - R) * (C - R)) /
((R + 1)^2 * (R + 2))
# Standard error
se_chapman <- sqrt(var_chapman)
# 95% CI
chap_lower <- N_Chapman - 1.96 * se_chapman
chap_upper <- N_Chapman + 1.96 * se_chapman
c(
Chapman_Estimate = N_Chapman,
Chapman_Lower95 = chap_lower,
Chapman_Upper95 = chap_upper
)
## Chapman_Estimate Chapman_Lower95 Chapman_Upper95
## 789.8913 618.4090 961.3736
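Because both estimators follow the same steps (point estimate, variance, standard error, interval), the calculations above can be wrapped in one reusable function. The sketch below is a hypothetical helper, mark_recapture_summary(), not part of any package; it uses the same formulas as above but takes the critical value from qnorm() rather than the rounded 1.96, so the bounds may differ from the printed ones in the third decimal place.

```r
# Hypothetical helper: estimate, standard error, and normal-approximation
# confidence interval for both estimators, using the formulas shown above
mark_recapture_summary <- function(M, C, R, level = 0.95) {
  # Recaptures must be positive and cannot exceed either sample size
  stopifnot(R > 0, R <= M, R <= C)

  # Two-sided critical value (about 1.96 for level = 0.95)
  z <- qnorm(1 - (1 - level) / 2)

  # Point estimates: Lincoln–Petersen, then Chapman
  est <- c(
    (M * C) / R,
    ((M + 1) * (C + 1) / (R + 1)) - 1
  )

  # Standard errors from the corresponding variance formulas
  se <- sqrt(c(
    (M^2 * C * (C - R)) / R^3,
    ((M + 1) * (C + 1) * (M - R) * (C - R)) / ((R + 1)^2 * (R + 2))
  ))

  data.frame(
    Method = c("Lincoln-Petersen", "Chapman"),
    Estimate = est,
    SE = se,
    Lower = est - z * se,
    Upper = est + z * se
  )
}

res <- mark_recapture_summary(M = 200, C = 180, R = 45)
res
```

Calling the function with other values of M, C, and R, or with a different level, makes it easy to rerun the analysis without repeating each step by hand.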
This code consolidates the results from both estimation methods, Lincoln–Petersen and Chapman, into a single dataframe that can be easily viewed, analyzed, or plotted. The dataframe has four columns: the method name, the point estimate, and the lower and upper bounds of the 95% confidence interval.
# Create new dataframe
df_plot <- data.frame(
Method = c("Lincoln-Petersen", "Chapman"),
Estimate = c(N_LP, N_Chapman),
Lower95 = c(LP_lower, chap_lower),
Upper95 = c(LP_upper, chap_upper)
)
df_plot
## Method Estimate Lower95 Upper95
## 1 Lincoln-Petersen 800.0000 597.5721 1002.4279
## 2 Chapman 789.8913 618.4090 961.3736
This block of code uses the {ggplot2} package to compare the population estimates produced by the Lincoln–Petersen and Chapman methods. The plot places each method on the x-axis, shows its point estimate on the y-axis, and draws error bars for the 95% confidence interval.
# Plot each method's estimate with its 95% confidence interval
ggplot(df_plot, aes(x = Method, y = Estimate)) +
geom_point(size = 4, color = "black") +
geom_errorbar(aes(ymin = Lower95, ymax = Upper95),
width = 0.15, linewidth = 1) +
labs(
title = "Population Size Estimates Using Mark–Recapture Methods",
subtitle = "Lincoln–Petersen vs Chapman Adjusted",
y = "Estimated Population Size",
x = ""
) +
theme_minimal(base_size = 14)
Using the Mark–Recapture approach to estimate the smoking population provides a practical and statistically grounded method for assessing the true size of a partially observed population. The two estimators produced similar results: Lincoln–Petersen gave 800 (95% CI roughly 598 to 1,002) and Chapman gave about 790 (95% CI roughly 618 to 961). The Chapman method offers a slightly less biased estimate and, in this example, a narrower confidence interval, which matters most when recapture numbers are modest.
Overall, this analysis demonstrates how the Mark–Recapture method can be applied to public-health surveillance problems, such as estimating the number of smokers, particularly when traditional survey approaches may underestimate the true size of the smoking population.
Disclaimer: The author of this tutorial, along with any associated organizations, assumes no responsibility for the use or misuse of the code and methods presented. This content is intended for educational purposes only and is not a substitute for professional advice.