Affiliations
¹ Charles Sturt University, Sydney, Australia
² Gulabali Research Centre, Charles Sturt University, Wagga Wagga, Australia
Corresponding Author: Dr Hom Nath Dhungana
Email: homnath1988@gmail.com
By the end of these notes, you should be able to:
Infectious disease modelling translates biological, epidemiological and behavioural knowledge into mathematical or statistical structures. These models help researchers and public health decision-makers answer questions such as:
Models are not perfect predictions. They are structured ways of reasoning under uncertainty. Their usefulness depends on the quality of data, assumptions, parameter values, model structure and interpretation.
An infectious disease system usually involves:
Incidence is the number of new cases occurring during a specified time interval.
Prevalence is the number or proportion of existing cases at a particular point or period.
For modelling acute infections, incidence is often more informative because it describes the timing of new infections. For chronic infections such as tuberculosis or HIV, both incidence and prevalence may be important.
The force of infection, usually denoted by \(\lambda(t)\), is the instantaneous rate at which susceptible individuals become infected.
In a basic frequency-dependent model:
\[ \lambda(t) = \beta \frac{I(t)}{N}, \]
where:
The generation interval is biologically central but often unobserved. The serial interval is easier to observe and is widely used in reproduction-number estimation.
Deterministic models produce the same output every time for the same initial conditions and parameter values. They are usually written as ordinary differential equations.
Advantages:
Limitations:
Stochastic models include random variation. They are useful when transmission is highly variable or when the number of infections is small.
Advantages:
Limitations:
Compartmental models divide the population into epidemiological states, for example:
Common models include SI, SIS, SIR, SEIR and SEIRD.
The SI model is the simplest infection model. Individuals move from susceptible to infected.
\[ \frac{dS}{dt} = -\beta \frac{SI}{N}, \]
\[ \frac{dI}{dt} = \beta \frac{SI}{N}. \]
This model may be useful for early growth or infections without recovery over the period of interest, but it is usually too simple for most real epidemics.
N <- 1000
beta <- 0.35
days <- 100
S <- I <- numeric(days + 1)
S[1] <- 999
I[1] <- 1
dt <- 1
for (t in 1:days) {
new_inf <- beta * S[t] * I[t] / N * dt
S[t + 1] <- S[t] - new_inf
I[t + 1] <- I[t] + new_inf
}
plot(0:days, S, type = "l", lwd = 2, ylim = c(0, N), xlab = "Day", ylab = "Number of people")
lines(0:days, I, lwd = 2, lty = 2)
legend("right", legend = c("Susceptible", "Infected"), lwd = 2, lty = c(1, 2), bty = "n")
The SIR model is the classic model for acute immunising infections.
\[ \frac{dS}{dt} = -\beta \frac{SI}{N}, \]
\[ \frac{dI}{dt} = \beta \frac{SI}{N} - \gamma I, \]
\[ \frac{dR}{dt} = \gamma I. \]
where:
The basic reproduction number is:
\[ R_0 = \frac{\beta}{\gamma}. \]
If \(R_0 > 1\), the epidemic can grow in a fully susceptible population. If \(R_0 < 1\), sustained spread is unlikely.
simulate_sir <- function(N = 1000, I0 = 1, R0_init = 0, beta = 0.3, gamma = 0.1, days = 160, dt = 1) {
steps <- days / dt
S <- I <- R <- numeric(steps + 1)
S[1] <- N - I0 - R0_init
I[1] <- I0
R[1] <- R0_init
for (t in 1:steps) {
new_inf <- beta * S[t] * I[t] / N * dt
new_rec <- gamma * I[t] * dt
S[t + 1] <- S[t] - new_inf
I[t + 1] <- I[t] + new_inf - new_rec
R[t + 1] <- R[t] + new_rec
}
data.frame(day = seq(0, days, by = dt), S = S, I = I, R = R)
}
sir_out <- simulate_sir(beta = 0.30, gamma = 0.10)
head(sir_out)
plot(sir_out$day, sir_out$S, type = "l", lwd = 2, ylim = c(0, 1000), xlab = "Day", ylab = "Number of people")
lines(sir_out$day, sir_out$I, lwd = 2, lty = 2)
lines(sir_out$day, sir_out$R, lwd = 2, lty = 3)
legend("right", legend = c("S", "I", "R"), lwd = 2, lty = c(1, 2, 3), bty = "n")
The SEIR model includes an exposed compartment.
\[ \frac{dS}{dt} = -\beta \frac{SI}{N}, \]
\[ \frac{dE}{dt} = \beta \frac{SI}{N} - \sigma E, \]
\[ \frac{dI}{dt} = \sigma E - \gamma I, \]
\[ \frac{dR}{dt} = \gamma I. \]
where:
simulate_seir <- function(N = 1000, E0 = 0, I0 = 1, R0_init = 0, beta = 0.4, sigma = 1/4, gamma = 1/6, days = 160, dt = 1) {
steps <- days / dt
S <- E <- I <- R <- numeric(steps + 1)
S[1] <- N - E0 - I0 - R0_init
E[1] <- E0
I[1] <- I0
R[1] <- R0_init
for (t in 1:steps) {
new_exp <- beta * S[t] * I[t] / N * dt
new_inf <- sigma * E[t] * dt
new_rec <- gamma * I[t] * dt
S[t + 1] <- S[t] - new_exp
E[t + 1] <- E[t] + new_exp - new_inf
I[t + 1] <- I[t] + new_inf - new_rec
R[t + 1] <- R[t] + new_rec
}
data.frame(day = seq(0, days, by = dt), S = S, E = E, I = I, R = R)
}
seir_out <- simulate_seir()
plot(seir_out$day, seir_out$S, type = "l", lwd = 2, ylim = c(0, 1000), xlab = "Day", ylab = "Number of people")
lines(seir_out$day, seir_out$E, lwd = 2, lty = 2)
lines(seir_out$day, seir_out$I, lwd = 2, lty = 3)
lines(seir_out$day, seir_out$R, lwd = 2, lty = 4)
legend("right", legend = c("S", "E", "I", "R"), lwd = 2, lty = c(1, 2, 3, 4), bty = "n")
The basic reproduction number \(R_0\) is the expected number of secondary infections generated by one infectious individual in a fully susceptible population.
For a simple SIR model:
\[ R_0 = \frac{\beta}{\gamma}. \]
The effective reproduction number \(R_t\) accounts for depletion of susceptible individuals and interventions.
For a homogeneous SIR model:
\[ R_t = R_0 \frac{S(t)}{N}. \]
sir_out$Rt <- (0.30 / 0.10) * sir_out$S / 1000
plot(sir_out$day, sir_out$Rt, type = "l", lwd = 2, xlab = "Day", ylab = "Effective reproduction number")
abline(h = 1, lty = 2)
In practice, \(R_t\) can be estimated from incidence data and a serial interval distribution. A common renewal-equation approximation is:
\[ I_t \sim \text{Poisson}(R_t \Lambda_t), \]
where:
\[ \Lambda_t = \sum_{s=1}^{t} I_{t-s} w_s, \]
and \(w_s\) is the serial interval distribution.
The Cori et al. framework estimates the instantaneous reproduction number using incidence and serial interval assumptions. The EpiEstim R package implements this approach, but below we use a simple base-R version for teaching.
The following small dataset contains daily confirmed COVID-19 cases for Australia during March 2020. It is embedded directly so that the R Markdown file knits without downloading data. In applied work, always use the most recent official data source, check definitions carefully and document reporting delays.
covid_aus <- data.frame(
date = as.Date(c(
"2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05",
"2020-03-06", "2020-03-07", "2020-03-08", "2020-03-09", "2020-03-10",
"2020-03-11", "2020-03-12", "2020-03-13", "2020-03-14", "2020-03-15",
"2020-03-16", "2020-03-17", "2020-03-18", "2020-03-19", "2020-03-20",
"2020-03-21"
)),
cases = c(3, 4, 8, 12, 9, 13, 12, 15, 20, 16, 27, 32, 46, 37, 52, 57, 77, 111, 144, 147, 224)
)
covid_aus$cumulative <- cumsum(covid_aus$cases)
covid_aus
plot(covid_aus$date, covid_aus$cases, type = "h", lwd = 5, xlab = "Date", ylab = "Daily confirmed cases")
points(covid_aus$date, covid_aus$cases, pch = 16)
During the early phase of an epidemic, susceptible depletion is small. Incidence may grow approximately exponentially:
\[ I_t = I_0 e^{rt}, \]
where \(r\) is the exponential growth rate.
Taking logs:
\[ \log(I_t) = \log(I_0) + rt. \]
dat <- covid_aus
dat$t <- seq_len(nrow(dat)) - 1
fit_growth <- lm(log(cases) ~ t, data = dat)
summary(fit_growth)
##
## Call:
## lm(formula = log(cases) ~ t, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34294 -0.10887 -0.01426 0.14636 0.50288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.415279 0.090882 15.57 2.84e-12 ***
## t 0.188917 0.007774 24.30 8.99e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2157 on 19 degrees of freedom
## Multiple R-squared: 0.9688, Adjusted R-squared: 0.9672
## F-statistic: 590.5 on 1 and 19 DF, p-value: 8.989e-16
r_hat <- coef(fit_growth)["t"]
doubling_time <- log(2) / r_hat
r_hat
## t
## 0.1889167
doubling_time
## t
## 3.669063
pred_cases <- exp(predict(fit_growth))
plot(dat$date, dat$cases, pch = 16, xlab = "Date", ylab = "Daily cases")
lines(dat$date, pred_cases, lwd = 2)
If the infectious period is exponentially distributed with recovery rate \(\gamma\), a simple approximation is:
\[ R_0 \approx 1 + \frac{r}{\gamma}. \]
For a fixed generation interval \(T_g\), another approximation is:
\[ R_0 \approx e^{rT_g}. \]
gamma <- 1/5
Tg <- 5
R0_exp_infectious <- 1 + r_hat / gamma
R0_fixed_generation <- exp(r_hat * Tg)
c(
growth_rate = r_hat,
doubling_time_days = doubling_time,
R0_exponential_infectious_period = R0_exp_infectious,
R0_fixed_generation_interval = R0_fixed_generation
)
## growth_rate.t doubling_time_days.t
## 0.1889167 3.6690632
## R0_exponential_infectious_period.t R0_fixed_generation_interval.t
## 1.9445833 2.5717416
Interpretation: these are rough estimates. They depend strongly on the assumed infectious period or generation interval. Early incidence data are often affected by testing, importation, reporting delays and surveillance changes.
We now estimate a simple instantaneous reproduction number using a discretised serial interval distribution.
incidence <- covid_aus$cases
n <- length(incidence)
# Discrete serial interval: mean approximately 5 days, sd approximately 2 days
si_days <- 1:15
si_weights <- dgamma(si_days, shape = (5/2)^2, scale = 2^2/5)
si_weights <- si_weights / sum(si_weights)
infectiousness <- rep(NA_real_, n)
Rt_simple <- rep(NA_real_, n)
for (t in 2:n) {
max_lag <- min(t - 1, length(si_weights))
infectiousness[t] <- sum(incidence[(t - max_lag):(t - 1)] * si_weights[1:max_lag])
if (!is.na(infectiousness[t]) && infectiousness[t] > 0) {
Rt_simple[t] <- incidence[t] / infectiousness[t]
}
}
rt_dat <- data.frame(date = covid_aus$date, incidence = incidence, infectiousness = infectiousness, Rt = Rt_simple)
rt_dat
plot(rt_dat$date, rt_dat$Rt, type = "b", pch = 16, xlab = "Date", ylab = "Simple estimated Rt", ylim = c(0, max(rt_dat$Rt, na.rm = TRUE)))
abline(h = 1, lty = 2)
This simple estimator is intentionally transparent. For publication-quality analysis, use a formal framework with uncertainty intervals, reporting-delay adjustment, imported/local case distinction and sensitivity analysis.
Suppose we observe daily incidence and want to estimate \(\beta\) and \(\gamma\). We can fit a deterministic model by minimising the squared difference between observed and predicted incidence.
The predicted new infections in the SIR model are:
\[ \text{new infections}_t = \beta \frac{S_t I_t}{N}. \]
true_beta <- 0.32
true_gamma <- 0.10
N <- 10000
sir_true <- simulate_sir(N = N, I0 = 10, beta = true_beta, gamma = true_gamma, days = 80)
observed <- data.frame(
day = sir_true$day[-1],
cases = pmax(0, round(diff(N - sir_true$S) + rnorm(80, mean = 0, sd = 8)))
)
plot(observed$day, observed$cases, pch = 16, xlab = "Day", ylab = "Observed incidence")
sir_incidence <- function(beta, gamma, N = 10000, I0 = 10, days = 80) {
out <- simulate_sir(N = N, I0 = I0, beta = beta, gamma = gamma, days = days)
diff(N - out$S)
}
objective <- function(par) {
beta <- exp(par[1])
gamma <- exp(par[2])
pred <- sir_incidence(beta, gamma, N = N, I0 = 10, days = 80)
sum((observed$cases - pred)^2)
}
fit <- optim(par = log(c(beta = 0.2, gamma = 0.2)), fn = objective, method = "Nelder-Mead")
est <- exp(fit$par)
est
## beta gamma
## 0.31730649 0.09727082
est["beta"] / est["gamma"]
## beta
## 3.262093
fitted_inc <- sir_incidence(est["beta"], est["gamma"], N = N, I0 = 10, days = 80)
plot(observed$day, observed$cases, pch = 16, xlab = "Day", ylab = "Daily cases")
lines(observed$day, fitted_inc, lwd = 2)
legend("topright", legend = c("Observed", "Fitted deterministic SIR"), pch = c(16, NA), lty = c(NA, 1), lwd = c(NA, 2), bty = "n")
Squared-error fitting is simple but not always statistically appropriate. Infectious disease case counts are non-negative integers, so Poisson or negative binomial likelihoods are common.
For a Poisson observation model:
\[ Y_t \sim \text{Poisson}(\mu_t), \]
where \(\mu_t\) is predicted incidence from the model.
neg_log_lik <- function(par) {
beta <- exp(par[1])
gamma <- exp(par[2])
mu <- sir_incidence(beta, gamma, N = N, I0 = 10, days = 80)
mu <- pmax(mu, 1e-8)
-sum(dpois(observed$cases, lambda = mu, log = TRUE))
}
fit_pois <- optim(par = log(c(beta = 0.25, gamma = 0.12)), fn = neg_log_lik, method = "Nelder-Mead")
est_pois <- exp(fit_pois$par)
est_pois
## beta gamma
## 0.3196818 0.1001534
est_pois["beta"] / est_pois["gamma"]
## beta
## 3.191923
A simple stochastic SIR model can be simulated using binomial transitions:
\[ \text{new infections}_t \sim \text{Binomial}\left(S_t, 1 - e^{-\beta I_t/N}\right), \]
\[ \text{new recoveries}_t \sim \text{Binomial}\left(I_t, 1 - e^{-\gamma}\right). \]
simulate_stochastic_sir <- function(N = 1000, I0 = 1, beta = 0.3, gamma = 0.1, days = 160) {
S <- I <- R <- numeric(days + 1)
S[1] <- N - I0
I[1] <- I0
R[1] <- 0
for (t in 1:days) {
p_inf <- 1 - exp(-beta * I[t] / N)
p_rec <- 1 - exp(-gamma)
new_inf <- rbinom(1, S[t], p_inf)
new_rec <- rbinom(1, I[t], p_rec)
S[t + 1] <- S[t] - new_inf
I[t + 1] <- I[t] + new_inf - new_rec
R[t + 1] <- R[t] + new_rec
}
data.frame(day = 0:days, S = S, I = I, R = R)
}
stoch1 <- simulate_stochastic_sir(beta = 0.3, gamma = 0.1)
plot(stoch1$day, stoch1$I, type = "l", lwd = 2, xlab = "Day", ylab = "Infectious individuals")
nsim <- 50
mat_I <- matrix(NA_real_, nrow = 161, ncol = nsim)
for (i in 1:nsim) {
mat_I[, i] <- simulate_stochastic_sir(beta = 0.3, gamma = 0.1)$I
}
plot(0:160, mat_I[, 1], type = "l", ylim = range(mat_I), xlab = "Day", ylab = "Infectious individuals")
for (i in 2:nsim) lines(0:160, mat_I[, i])
Sensitivity analysis asks: how much do model outputs change when parameters change?
Common outputs include:
beta_values <- seq(0.15, 0.50, by = 0.025)
sens <- data.frame(beta = beta_values, peak_I = NA_real_, peak_day = NA_real_, final_size = NA_real_)
for (i in seq_along(beta_values)) {
out <- simulate_sir(N = 1000, I0 = 1, beta = beta_values[i], gamma = 0.1, days = 160)
sens$peak_I[i] <- max(out$I)
sens$peak_day[i] <- out$day[which.max(out$I)]
sens$final_size[i] <- max(out$R)
}
sens
plot(sens$beta, sens$peak_I, type = "b", pch = 16, xlab = "Transmission rate beta", ylab = "Peak infectious population")
beta_grid <- seq(0.15, 0.45, by = 0.05)
gamma_grid <- seq(0.05, 0.20, by = 0.025)
res <- expand.grid(beta = beta_grid, gamma = gamma_grid)
res$R0 <- res$beta / res$gamma
res$final_size <- NA_real_
for (i in seq_len(nrow(res))) {
out <- simulate_sir(N = 1000, I0 = 1, beta = res$beta[i], gamma = res$gamma[i], days = 160)
res$final_size[i] <- max(out$R)
}
head(res)
plot(res$R0, res$final_size, pch = 16, xlab = "Basic reproduction number", ylab = "Final epidemic size")
Interventions can be represented by reducing transmission:
\[ \beta_{new} = \beta(1 - c), \]
where \(c\) is intervention effectiveness.
compare_intervention <- function(c_effect) {
beta0 <- 0.30
beta_new <- beta0 * (1 - c_effect)
out <- simulate_sir(N = 1000, I0 = 5, beta = beta_new, gamma = 0.10, days = 160)
out$intervention <- paste0(round(c_effect * 100), "% reduction")
out
}
int0 <- compare_intervention(0)
int30 <- compare_intervention(0.30)
int60 <- compare_intervention(0.60)
plot(int0$day, int0$I, type = "l", lwd = 2, ylim = c(0, max(int0$I)), xlab = "Day", ylab = "Infectious individuals")
lines(int30$day, int30$I, lwd = 2, lty = 2)
lines(int60$day, int60$I, lwd = 2, lty = 3)
legend("topright", legend = c("No reduction", "30% reduction", "60% reduction"), lwd = 2, lty = c(1, 2, 3), bty = "n")
For a perfect vaccine in a homogeneous population, the critical vaccination threshold is:
\[ p_c = 1 - \frac{1}{R_0}. \]
If vaccine effectiveness is \(VE\), the required coverage is approximately:
\[ p_c = \frac{1 - 1/R_0}{VE}. \]
R0_values <- seq(1.1, 5, by = 0.1)
VE <- 0.90
threshold <- (1 - 1 / R0_values) / VE
threshold <- pmin(threshold, 1)
plot(R0_values, threshold, type = "l", lwd = 2, xlab = "R0", ylab = "Required vaccine coverage")
Many diseases show overdispersion: a small proportion of cases generate a large proportion of secondary infections. This can be represented by a negative binomial offspring distribution.
If the mean number of secondary infections is \(R\) and the dispersion parameter is \(k\), then smaller \(k\) means stronger heterogeneity.
R_mean <- 2.5
k_values <- c(0.1, 0.5, 2, 10)
x <- 0:20
plot(x, dnbinom(x, size = k_values[1], mu = R_mean), type = "h", lwd = 4, ylim = c(0, 0.75), xlab = "Secondary cases", ylab = "Probability")
for (k in k_values[-1]) {
lines(x + runif(1, -0.05, 0.05), dnbinom(x, size = k, mu = R_mean), type = "h", lwd = 2)
}
legend("topright", legend = paste("k =", k_values), lwd = 2, bty = "n")
Reported cases are not always equal to true infections. If \(\rho\) is reporting probability:
\[ Y_t \sim \text{Binomial}(I_t, \rho). \]
Under-reporting affects estimated incidence, growth rates, severity ratios and reproduction numbers, especially if reporting changes over time.
Observed reports may lag behind infection or symptom onset:
\[ Y_t = \sum_{d=0}^{D} X_{t-d} q_d, \]
where \(q_d\) is the delay distribution.
Delay correction is important for real-time estimation because recent days are usually incomplete.
Spatial models include location-specific risk:
\[ \log(\mu_{it}) = \alpha + u_i + v_i + \gamma_t + \delta_{it} + \log(Pop_i), \]
where:
These models are useful for dengue, tuberculosis, COVID-19, influenza, Ebola and vector-borne diseases.
Good modelling practice includes:
sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.7.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## time zone: Australia/Sydney
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.52
## [5] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
## [9] lifecycle_1.0.4 cli_3.6.4 vctrs_0.6.5 sass_0.4.9
## [13] jquerylib_0.1.4 compiler_4.4.3 rstudioapi_0.17.1 tools_4.4.3
## [17] evaluate_1.0.3 bslib_0.9.0 yaml_2.3.10 rlang_1.1.5
## [21] jsonlite_1.9.1