Estimating the Size of a Smoking Population Using the Mark–Recapture Method in R

In this example, we use the Mark–Recapture method to estimate the true size of a smoking population, which is essential for planning effective public health interventions, allocating resources, and understanding the burden of tobacco use within a community.

The Mark–Recapture method, originally developed in ecology to estimate wildlife populations, offers a statistically grounded way to estimate the size of a partially observed human population from two overlapping samples, without having to count every member directly.

The example below applies the Mark–Recapture method to illustrative, simulated counts of smokers.

The code first loads the {tidyverse} package, a collection of R packages commonly used for data wrangling, visualization, and analysis. It then defines the three counts the Mark–Recapture method needs. M is the number of smokers identified in the first sample, who serve as the “marked” individuals. C is the total number of smokers captured in the second, independent sample. R is the number of smokers who appear in both samples, meaning they were recaptured. These three values are the inputs to both the Lincoln–Petersen and Chapman estimators.

# Load library
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Add the values for the first sample, second sample and recaptured smokers
M <- 200   # First sample: smokers identified (“marked”)
C <- 180   # Second sample: total smokers captured
R <- 45    # Recaptured smokers (appear in both samples)
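In practice, M, C, and R are usually counted from two lists of identifiers rather than typed in by hand. The sketch below shows one hypothetical way to derive them. The population of 800 IDs, the seed, and the names `population_ids`, `sample_1`, `sample_2`, `M_sim`, `C_sim`, and `R_sim` are all assumptions made for illustration and are not part of the original example.

```r
# Illustrative only: simulate two independent samples from a hypothetical
# population of 800 smoker IDs and count the overlap.
set.seed(2025)                       # arbitrary seed, for reproducibility
population_ids <- 1:800              # assumed true population (unknown in practice)

sample_1 <- sample(population_ids, 200)   # first sample ("marked")
sample_2 <- sample(population_ids, 180)   # second, independent sample

M_sim <- length(sample_1)                        # smokers in the first sample
C_sim <- length(sample_2)                        # smokers in the second sample
R_sim <- length(intersect(sample_1, sample_2))   # smokers found in both samples

c(M = M_sim, C = C_sim, R = R_sim)
```

The `_sim` suffix keeps these values separate from the fixed M, C, and R used in the rest of the tutorial.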

The Lincoln–Petersen estimator is the simplest and most widely used mark–recapture formula for estimating the size of a partially observed population. It uses three key values: the number of individuals identified in the first sample (M), the number identified in the second sample (C), and the number who appear in both samples (R).

# Lincoln–Petersen estimator formula
N_LP <- (M * C) / R
N_LP

## [1] 800
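The formula follows from a simple proportion: if the second sample is representative, the share of marked smokers in it (R/C) should be close to the share of marked smokers in the whole population (M/N). The symbol \hat{N}_{LP} is notation introduced here to match `N_LP` in the code.

```latex
\frac{R}{C} \approx \frac{M}{N}
\quad\Longrightarrow\quad
\hat{N}_{LP} = \frac{M \, C}{R} = \frac{200 \times 180}{45} = 800
```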

The following section of the code calculates the uncertainty around the Lincoln–Petersen population estimate by computing its approximate variance, standard error, and 95% confidence interval.

# Variance (approximate)
var_LP <- (M^2 * C * (C - R)) / (R^3)

# Standard error
se_LP <- sqrt(var_LP)

# 95% CI
LP_lower <- N_LP - 1.96 * se_LP
LP_upper <- N_LP + 1.96 * se_LP

c(
  LP_Estimate = N_LP,
  LP_Lower95 = LP_lower,
  LP_Upper95 = LP_upper
)

## LP_Estimate  LP_Lower95  LP_Upper95 
##    800.0000    597.5721   1002.4279
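Written out, the approximate variance used in the code and the resulting interval are as follows. The notation is introduced here for illustration; the numbers are those of the example, rounded to two decimals.

```latex
\widehat{\operatorname{Var}}\!\left(\hat{N}_{LP}\right)
  = \frac{M^{2}\, C\, (C - R)}{R^{3}}
  = \frac{200^{2} \times 180 \times 135}{45^{3}}
  \approx 10666.67

\operatorname{SE}\!\left(\hat{N}_{LP}\right) = \sqrt{10666.67} \approx 103.28,
\qquad
800 \pm 1.96 \times 103.28 \approx (597.57,\ 1002.43)
```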

This code calculates the Chapman estimator, a modification of the Lincoln–Petersen formula that reduces bias, particularly when the number of recaptured individuals (R) is small. It adds 1 to each of M, C, and R, which keeps the estimate finite even when R is 0 and tempers the upward bias that low recapture counts introduce. After multiplying the adjusted first and second sample sizes and dividing by the adjusted number of recaptures, the code subtracts 1 to give the final Chapman population estimate.

# Chapman Adjusted Estimate formula
N_Chapman <- ((M + 1) * (C + 1) / (R + 1)) - 1
N_Chapman

## [1] 789.8913
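To see why the adjustment matters mainly when recaptures are few, the sketch below evaluates both formulas for several hypothetical values of R, holding M = 200 and C = 180 fixed. The helper functions `lp_estimate()` and `chapman_estimate()`, the vector `R_values`, and the data frame `comparison` are names introduced here for illustration.

```r
# Hypothetical helpers wrapping the two formulas used in this tutorial
lp_estimate <- function(M, C, R) (M * C) / R
chapman_estimate <- function(M, C, R) ((M + 1) * (C + 1)) / (R + 1) - 1

# Illustrative recapture counts, from none at all up to the tutorial's value
R_values <- c(0, 1, 5, 45)

comparison <- data.frame(
  R = R_values,
  Lincoln_Petersen = lp_estimate(200, 180, R_values),
  Chapman = chapman_estimate(200, 180, R_values)
)

comparison
```

With R = 0 the Lincoln–Petersen formula divides by zero and returns Inf in R, while the Chapman formula still returns a finite value. At R = 45 the two estimates are 800 and about 789.89, as in the main example.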

The following section of the code computes the statistical uncertainty around the Chapman-adjusted population estimate by calculating its variance, standard error, and 95% confidence interval.

# Variance of Chapman estimator
var_chapman <- ((M + 1) * (C + 1) * (M - R) * (C - R)) /
               ((R + 1)^2 * (R + 2))

# Standard error
se_chapman <- sqrt(var_chapman)

# 95% CI
chap_lower <- N_Chapman - 1.96 * se_chapman
chap_upper <- N_Chapman + 1.96 * se_chapman

c(
  Chapman_Estimate = N_Chapman,
  Chapman_Lower95 = chap_lower,
  Chapman_Upper95 = chap_upper
)

## Chapman_Estimate  Chapman_Lower95  Chapman_Upper95 
##         789.8913         618.4090         961.3736
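The Chapman steps above can also be bundled into one reusable function. This is a minimal sketch, not part of the original tutorial: the name `chapman_summary()` and its `level` argument are hypothetical, and it uses `qnorm()` to obtain the normal critical value instead of the rounded 1.96, so the bounds can differ from those above in the third decimal place.

```r
# Hypothetical helper: Chapman estimate with a normal-approximation interval
chapman_summary <- function(M, C, R, level = 0.95) {
  # Recaptures cannot exceed either sample size
  stopifnot(R >= 0, M >= R, C >= R)

  estimate <- ((M + 1) * (C + 1)) / (R + 1) - 1
  variance <- ((M + 1) * (C + 1) * (M - R) * (C - R)) /
    ((R + 1)^2 * (R + 2))
  se <- sqrt(variance)

  # Two-sided critical value, about 1.96 when level = 0.95
  z <- qnorm(1 - (1 - level) / 2)

  c(
    Estimate = estimate,
    SE = se,
    Lower = estimate - z * se,
    Upper = estimate + z * se
  )
}

chapman_summary(M = 200, C = 180, R = 45)
```

Passing a different `level`, such as 0.90, gives a narrower interval from the same inputs.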

This code creates a data frame that gathers the key results from both estimation methods, Lincoln–Petersen and Chapman, into a single structure that can be easily viewed, analyzed, or plotted. The data frame has four columns: the method name, the point estimate, and the lower and upper bounds of the 95% confidence interval.

# Create new dataframe
df_plot <- data.frame(
  Method = c("Lincoln-Petersen", "Chapman"),
  Estimate = c(N_LP, N_Chapman),
  Lower95 = c(LP_lower, chap_lower),
  Upper95 = c(LP_upper, chap_upper)
)

df_plot

##             Method Estimate  Lower95   Upper95
## 1 Lincoln-Petersen 800.0000 597.5721 1002.4279
## 2          Chapman 789.8913 618.4090  961.3736

This block of code uses {ggplot2}, loaded as part of the tidyverse, to compare the population estimates produced by the Lincoln–Petersen and Chapman methods. The plot places each method on the x-axis and its population estimate on the y-axis, with error bars marking the 95% confidence interval.

# Create a plot comparing the estimates from each method
ggplot(df_plot, aes(x = Method, y = Estimate)) +
  geom_point(size = 4, color = "black") +
  geom_errorbar(aes(ymin = Lower95, ymax = Upper95),
                width = 0.15, linewidth = 1) +
  labs(
    title = "Population Size Estimates Using Mark–Recapture Methods",
    subtitle = "Lincoln–Petersen vs Chapman Adjusted",
    y = "Estimated Population Size",
    x = ""
  ) +
  theme_minimal(base_size = 14)

Using the Mark–Recapture approach to estimate the smoking population provides a practical and statistically grounded method for assessing the true size of a partially observed population. The Lincoln–Petersen and Chapman estimators produced consistent results (800 and about 790 smokers, respectively). The Chapman method offers a slightly less biased estimate and, here, a narrower 95% confidence interval (roughly 618 to 961, versus 598 to 1,002 for Lincoln–Petersen), which matters most when recapture numbers are modest.

Overall, this analysis demonstrates how the Mark–Recapture method can be applied to public-health surveillance problems, such as estimating the number of smokers, especially when traditional survey approaches may underestimate the true size of the smoking population.

Disclaimer: The author of this tutorial, along with any associated organizations, assumes no responsibility for the use or misuse of the code and methods presented. This content is intended for educational purposes only and is not a substitute for professional advice.

Ramon Rodriguez-Santana, MBA, MPH

2025-11-30

A.M.D.G.