1. SIMULATING THE SEIR MODEL IN R
We start by coding the model in R and plotting the model results. The
model is written directly in R and is defined in the seir_model() function.
Libraries
library(deSolve) #Solving differential equations
library(ggplot2) # For plotting
deSolve: Used to solve the system of differential equations
numerically (required for SEIR model).
ggplot2: Used for plotting the results of the simulation in a
visually appealing way.
Define the SEIR model function
seir_model <- function(time, state, parameters) {
S <- state[1]
E <- state[2]
I <- state[3]
R <- state[4]
N <- S + E + I + R
beta <- parameters["beta"]
sigma <- parameters["sigma"]
gamma <- parameters["gamma"]
dSdt <- -beta * S * I / N
dEdt <- beta * S * I / N - sigma * E
dIdt <- sigma * E - gamma * I
dRdt <- gamma * I
list(c(dSdt, dEdt, dIdt, dRdt))}
This function defines the SEIR model as a system of differential
equations (ODEs):
Variables: S: Susceptible individuals (not yet infected).
E: Exposed individuals (infected but not yet infectious).
I: Infected individuals (can spread the disease).
R: Recovered individuals (immune or removed from the population).
The total population N is the sum of S + E + I + R.
Parameters: beta: The transmission rate (how easily the disease
spreads).
sigma: The rate at which exposed individuals become infectious (the
reciprocal of the incubation period).
gamma: The recovery rate (rate at which infected individuals
recover).
Differential Equations: dS/dt: Change in susceptible population due
to infection.
dE/dt: Change in exposed population due to exposure (contact with
infected individuals).
dI/dt: Change in infected population (exposed individuals becoming
infectious minus infected individuals recovering).
dR/dt: Change in recovered population (from infected individuals
recovering).
The model returns a list with the derivatives (rates of change) for
each compartment.
Initial conditions
N <- 995 # Total population
E0 <- 10 # Initial exposed
I0 <- 5 # Initial infected
R0 <- 0 # Initial recovered
S0 <- N - E0 - I0 - R0 # Initial susceptible
initial_state <- c(S = S0, E = E0, I = I0, R = R0)
N: Total population (995).
E0, I0, R0: Initial numbers of exposed, infected, and recovered
individuals.
S0: Initial number of susceptible individuals (calculated by
subtracting the others from N).
initial_state: A vector with the initial values for S, E, I, and
R.
Parameters
parameters <- c(beta = 0.8, sigma = 0.0347, gamma = 0.5)
beta = 0.8: Transmission rate.
sigma = 0.0347: Rate at which exposed individuals become infectious
(1/incubation period).
gamma = 0.5: Recovery rate (probability of recovery per time
unit).
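As a quick check that is not part of the original script, the basic reproduction number for this closed SEIR model (no births or deaths) is R0 = beta / gamma; values above 1 mean the infection can invade the population. A minimal sketch using the parameters vector defined above:
# Sketch: basic reproduction number implied by the chosen parameters
R0_value <- parameters["beta"] / parameters["gamma"]
cat("Approximate R0:", round(unname(R0_value), 2), "\n") # 0.8 / 0.5 = 1.6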
Time span for simulation
times <- seq(0, 200, length.out = 500)
times: Creates a sequence of time points (from 0 to 200 days), with
500 points in total. These are the time intervals at which the model
will be solved.
Q: Check the behaviour for a smaller time horizon (e.g., t = 10) and a
larger one (e.g., t = 300); a short sketch of this appears below, after the
solution is converted to a data frame.
Solve the system of ODEs.
solution <- ode(y = initial_state, times = times, func = seir_model, parms = parameters)
ode(): The core function from deSolve that numerically solves the
system of differential equations (seir_model).
y = initial_state: Initial conditions for S, E, I, and R.
times = times: The time intervals at which the solution is
computed.
func = seir_model: The SEIR model function that defines the
differential equations.
parms = parameters: The model parameters (beta, sigma, gamma).
This returns a matrix of results for each time step, with values for
S, E, I, and R.
Convert solution to a data frame.
solution_df <- as.data.frame(solution)
Converts the output of the ode() function (a matrix) into a data
frame for easier manipulation and plotting.
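Following the earlier question about smaller and larger horizons, a minimal sketch of re-running the solver over t = 10 and t = 300 using the same objects defined above (not part of the original script):
# Sketch: re-solve the same model over a shorter and a longer horizon
times_short <- seq(0, 10, length.out = 50)
times_long <- seq(0, 300, length.out = 750)
sol_short <- as.data.frame(ode(y = initial_state, times = times_short, func = seir_model, parms = parameters))
sol_long <- as.data.frame(ode(y = initial_state, times = times_long, func = seir_model, parms = parameters))
head(sol_short) # early dynamics
tail(sol_long) # long-run behaviour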
Plot results
ggplot(solution_df, aes(x = time)) +
geom_line(aes(y = S, color = "Susceptible")) +
geom_line(aes(y = I, color = "Infected")) +
geom_line(aes(y = R, color = "Recovered")) +
#geom_line(aes(y = E, color = "Exposed")) +
labs(x = "Time (days)", y = "Population", title = "SEIR Model Dynamics") +
scale_color_manual(values = c("Susceptible" = "blue", "Infected" = "red", "Exposed" = "yellow", "Recovered" = "green")) +
theme_minimal()

Plot Components: - aes(x = time): Time on the x-axis.
geom_line(): Plots a line for each compartment (Susceptible,
Infected, and Recovered).
labs(): Labels for the plot (x-axis, y-axis, title).
scale_color_manual(): Customizes the line colors for each
compartment (blue for Susceptible, red for Infected, green for
Recovered).
theme_minimal(): Applies a minimalist theme to the plot.
Summary
This code models the spread of a disease in a population using the
SEIR compartmental model. - It defines the dynamics of susceptible,
exposed, infected, and recovered individuals - Solves the system of
differential equations - Plots the dynamics over time. The plot helps
visualize how the number of susceptible, infected, and recovered
individuals change over 200 days.
Key Concepts:
SEIR model: A simple mathematical model of disease spread.
ODE solver: The ode() function numerically solves the system of
equations.
ggplot2: Used for visualizing the results.
1.2. Comparing the model results with real or observed data.
Since real data were not available, we used synthetic data in CSV file
format. The malaria data set was named malaria5.
We can first look at the descriptive statistics of the malaria5 data set
using the code below. Always check the file before running to make sure
the data set is OK.
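A minimal sketch of such a pre-check (these helper lines are an illustration, not part of the original script; the file name follows the text above):
# Sketch: confirm the CSV is present and readable before the analysis
if (!file.exists("malaria5.csv")) {
  stop("malaria5.csv not found in the working directory")
}
peek <- readr::read_csv("malaria5.csv", n_max = 5, show_col_types = FALSE)
print(peek) # quick look at the first rows and column names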
Loading necessary libraries
library(readr) # For reading CSV files
library(dplyr) # For data manipulation
Load the CSV file
data <- read_csv("malaria5.csv")
Display the first few rows
cat("First 5 rows of the dataset:\n")
First 5 rows of the dataset:
print(head(data, 5))
# A tibble: 5 × 7
day S E I R Sv Iv
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 980 10 5 0 990 10
2 1 969 28 5 0 983 15
3 2 949 54 7 0 976 20
4 3 920 86 12 0 966 27
5 4 879 127 19 0 950 40
Get basic info about the dataset
cat("\nDataset Summary:\n")
Dataset Summary:
print(str(data)) # Structure of the dataset
spc_tbl_ [100 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ day: num [1:100] 0 1 2 3 4 5 6 7 8 9 ...
$ S : num [1:100] 980 969 949 920 879 818 727 601 442 269 ...
$ E : num [1:100] 10 28 54 86 127 184 265 373 503 634 ...
$ I : num [1:100] 5 5 7 12 19 30 46 70 103 148 ...
$ R : num [1:100] 0 0 0 0 0 0 1 3 6 11 ...
$ Sv : num [1:100] 990 983 976 966 950 925 887 831 752 648 ...
$ Iv : num [1:100] 10 15 20 27 40 61 93 140 206 291 ...
- attr(*, "spec")=
.. cols(
.. day = col_double(),
.. S = col_double(),
.. E = col_double(),
.. I = col_double(),
.. R = col_double(),
.. Sv = col_double(),
.. Iv = col_double()
.. )
- attr(*, "problems")=<externalptr>
NULL
Check for missing values
cat("\nMissing values per column:\n")
Missing values per column:
print(colSums(is.na(data)))
day S E I R Sv Iv
0 0 0 0 0 0 0
Describe numerical data
cat("\nStatistical Summary:\n")
Statistical Summary:
print(summary(data))
day S E I
Min. : 0.00 Min. : 8.0 Min. : 10.0 Min. : 5.0
1st Qu.:24.75 1st Qu.: 24.0 1st Qu.:164.8 1st Qu.:346.0
Median :49.50 Median : 30.5 Median :180.0 Median :358.0
Mean :49.50 Mean :100.3 Mean :229.1 Mean :373.7
3rd Qu.:74.25 3rd Qu.: 33.0 3rd Qu.:194.8 3rd Qu.:459.8
Max. :99.00 Max. :980.0 Max. :748.0 Max. :572.0
R Sv Iv
Min. : 0.0 Min. : 55.00 Min. : 10.0
1st Qu.: 329.8 1st Qu.: 67.75 1st Qu.:297.0
Median : 773.0 Median : 85.00 Median :298.5
Mean : 653.5 Mean :167.98 Mean :312.3
3rd Qu.: 982.5 3rd Qu.: 87.00 3rd Qu.:316.8
Max. :1131.0 Max. :990.00 Max. :617.0
We can plot the model values against the observed values to check
whether the model parameters fit the real data well. Here the comparison
is done for the Infected and the Recovered classes.
Load necessary libraries
library(deSolve) # For solving differential equations
library(ggplot2) # For plotting results
library(readr) # For reading/writing CSV files
library(dplyr) # For data manipulation
SEIR Model Function
seir_model <- function(time, state, parameters) {
S <- state["S"]
E <- state["E"]
I <- state["I"]
R <- state["R"]
N <- S + E + I + R
beta <- parameters["beta"]
sigma <- parameters["sigma"]
gamma <- parameters["gamma"]
dSdt <- -beta * S * I / N
dEdt <- beta * S * I / N - sigma * E
dIdt <- sigma * E - gamma * I
dRdt <- gamma * I
list(c(dSdt, dEdt, dIdt, dRdt))
}
Initial conditions
N <- 995 # Total population
E0 <- 10 # Initial exposed
I0 <- 5 # Initial infected
R0 <- 0 # Initial recovered
S0 <- N - E0 - I0 - R0 # Initial susceptible
initial_state <- c(S = S0, E = E0, I = I0, R = R0)
- N = 995: Total population.
- E0, I0, R0: Initial values for exposed, infected, and recovered
individuals.
- S0: Initial susceptible individuals, calculated as the remainder of
the population not in any other compartment
Parameters
parameters <- c(beta = 0.8, sigma = 0.0347, gamma = 0.5)
- beta = 0.8: Transmission rate (probability of infection per contact
between susceptible and infected individuals).
- sigma = 0.0347: Rate at which exposed individuals become
infectious.
- gamma = 0.5: Recovery rate (probability of recovery per unit
time).
Time grid
times <- seq(0, 100, length.out = 500)
times: Generates a sequence of time points (from 0 to 100 days) with
500 equally spaced points. These are the time steps for which the model
will be solved.
Solve SEIR model
solution <- ode(y = initial_state, times = times, func = seir_model, parms = parameters)
- The ode() function from the deSolve package is used to numerically
solve the system of differential equations defined in seir_model().
- y = initial_state: Initial conditions for S, E, I, and R.
- times = times: The time points at which the solution is
computed.
- func = seir_model: The function defining the SEIR model.
- parms = parameters: The parameters (beta, sigma, gamma).
- The solution is a matrix containing the values of S, E, I, and R at
each time point.
Convert to a dataframe
solution_df <- as.data.frame(solution)
- Converts the output from the ode() function (a matrix) to a data
frame for easier manipulation and visualization.
Synthetic data generation
#malaria_data <- solution_df %>%
# mutate(
# Infected = I + rnorm(n(), mean = 0, sd = 10),
# Recovered = R + rnorm(n(), mean = 0, sd = 10),
#Susceptible = S + rnorm(n(), mean = 0, sd = 10)
# )
- This (commented-out) part generated the synthetic malaria data set
malaria5 by adding random noise to the model output (I, R, S).
- rnorm(): Generates random numbers from a normal distribution with a
specified mean and standard deviation (sd = 10).
- This noise is used to simulate real-world data where values are
typically not perfect and are subject to random fluctuations.
- We don't need this chunk when using real data.
Save synthetic malaria data
#write_csv(malaria_data, "malaria5.csv")
Saves the synthetic malaria data to a CSV file called
malaria5.csv.
Load malaria data for comparison
malaria_observed <- read_csv("malaria5.csv")
Reads the synthetic malaria data back from the CSV file for
comparison with the model output.
Add time column if missing for plotting
# Add a 'time' column if missing
if (!"time" %in% colnames(malaria_observed)) {
malaria_observed$time <- seq_len(nrow(malaria_observed)) - 1
}
##Combine model and observed data for plotting
library(dplyr)
library(tidyverse)
# Combine model and observed data
model_long <- solution_df %>%
select(time, I, R) %>%
pivot_longer(cols = c(I, R), names_to = "Compartment", values_to = "Count") %>%
mutate(Source = "Model")
model_long
# A tibble: 1,000 × 4
time Compartment Count Source
<dbl> <chr> <dbl> <chr>
1 0 I 5 Model
2 0 R 0 Model
3 0.200 I 4.59 Model
4 0.200 R 0.480 Model
5 0.401 I 4.23 Model
6 0.401 R 0.922 Model
7 0.601 I 3.90 Model
8 0.601 R 1.33 Model
9 0.802 I 3.61 Model
10 0.802 R 1.70 Model
# ℹ 990 more rows
observed_long <- malaria_observed %>%
select(time, I, R) %>%
pivot_longer(cols = c(I, R), names_to = "Compartment", values_to = "Count") %>%
mutate(Source = "Observed")
plot_data <- bind_rows(model_long, observed_long)
Create a label for color
plot_data <- plot_data %>%
mutate(Group = paste(Source, Compartment))
Now plot using ggplot
ggplot(plot_data, aes(x = time, y = Count, color = Group, linetype = Source)) +
geom_line() +
geom_point(data = filter(plot_data, Source == "Observed"), size = 1.5) + # Only points for observed
labs(x = "Days", y = "Population", title = "SEIR Model vs. Observed Malaria Data", color = "Legend") +
scale_color_manual(values = c(
"Model I" = "red",
"Observed I" = "darkred",
"Model R" = "blue",
"Observed R" = "darkblue"
)) +
theme_minimal()

- ggplot(): Starts a new plot.
- geom_line(): Adds dashed lines representing the model’s predicted
infected and recovered populations (I and R).
- geom_point(): Adds points representing the observed (synthetic)
infected and recovered populations.
- labs(): Adds labels and a title to the plot.
- scale_color_manual(): Customizes the colors for the model and
observed data.
- theme_minimal(): Applies a minimalist theme to the plot.
- Plot Interpretation: Dashed lines represent the model’s predictions
for the infected and recovered populations.
- Points represent the synthetic (observed) data for infected and
recovered populations.
This comparison allows us to visually check how well the SEIR model
matches the observed malaria data. From the graphs, the parameters used
to simulate the model give a very poor fit to the data and are therefore
unreliable.
The parameters beta = 0.8, sigma = 0.0347, gamma = 0.5 are therefore not
appropriate for these data.
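One way to make this mismatch quantitative (a sketch, not in the original code) is a root-mean-square error on the Infected class, re-solving the model at the observed time points:
# Sketch: RMSE between observed and model-predicted Infected counts
sol_at_obs <- as.data.frame(ode(y = initial_state, times = malaria_observed$time,
                                func = seir_model, parms = parameters))
rmse_I <- sqrt(mean((malaria_observed$I - sol_at_obs$I)^2))
cat("RMSE for Infected:", round(rmse_I, 1), "\n") # a large value confirms the poor fit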
Summary
This code simulates a SEIR model for disease spread, generates
synthetic malaria data with noise, saves it, and compares it visually
with the model predictions. The comparison is shown in a plot that
overlays the model’s predictions (dashed lines) and the synthetic
observed data (points) for the infected and recovered populations.
1.3 PARAMETER ESTIMATION (Curve Fitting Using the Nonlinear Least
Squares Method)
In some cases (as in the case above) the model does not compare well
with the real data from the field, which prompts the need to estimate the
parameters from the data collected. Note that some parameters are fixed
because of biological understanding, e.g., the rate of progression from
exposed to infectious humans is 1 / incubation period; such a parameter
must remain fixed.
Here We use the Nonlinear Least Squares method. The function performs
nonlinear least squares fitting using the Levenberg–Marquardt (LM)
algorithm which estimates model parameters by minimizing the sum of
squared residuals between the model-predicted and the observed data.
We start by estimating all the parameters with none fixed.
Load required libraries
library(deSolve) #For solving differential equations (SEIR model)
library(readr) # For reading CSV files
library(minpack.lm) # For nonlinear least squares optimization
library(dplyr) # For data manipulation
SEIR Model Function
seir_model <- function(t, y, params) {
with(as.list(c(y, params)), {
N <- S + E + I + R
dSdt <- -beta * S * I / N
dEdt <- beta * S * I / N - sigma * E
dIdt <- sigma * E - gamma * I
dRdt <- gamma * I
return(list(c(dSdt, dEdt, dIdt, dRdt)))
})
}
Parameter Estimation Function
## Load and prepare data (suppress column type message)
estimate_parameters <- function(data_path) {
data <- read_csv(data_path, show_col_types = FALSE)
# (the function body continues in the chunks below and is closed after the optimization step)
- This function performs the parameter estimation by comparing the
model’s predictions to observed data.
- Reads the observed malaria data from a CSV file (data_path).
- show_col_types = FALSE: suppresses warnings about column types.
Optionally remove ‘Date’ column if it exists
if ("Date" %in% colnames(data)) {
data <- select(data, -Date)
}
- Removes the Date column if it exists, as it’s not necessary for the
model
If ‘day’ column doesn’t exist, create it
if (!"day" %in% colnames(data)) {
data$day <- seq_len(nrow(data)) - 1 # 0-based index
}
- Adds a day column if it doesn’t exist, which represents the time
steps for the model.
Initial conditions
N <- 995
E0 <- 10
I0 <- 5
R0 <- 0
S0 <- N - E0 - I0 - R0
y0 <- c(S = S0, E = E0, I = I0, R = R0)
t <- data$day
observed_I <- data$I
- Defines the initial conditions: total population (N), initial values
for E0 (exposed), I0 (infected), R0 (recovered), and calculates the
initial number of susceptible individuals (S0)
- y0: Initial state vector for S, E, I, and R.
- t: Time vector (days) from the observed data.
- observed_I: Observed infected data from the CSV file.
Model output function
model_output <- function(params, t) {
out <- ode(y = y0, times = t, func = seir_model,
parms = list(beta = params[1], sigma = params[2], gamma = params[3]),
method = "lsoda", atol = 1e-6, rtol = 1e-6)
return(out[, "I"])
}
- model_output(): Solves the SEIR model and returns the predicted
infected population (I) at each time point.
- ode() from the deSolve package is used to solve the SEIR
system.
- The parameters beta, sigma, and gamma are passed as a vector
params.
- method = "lsoda" stands for Livermore Solver for Ordinary Differential
Equations with Automatic switching. It automatically switches between a
non-stiff method (Adams) when the problem is easy and a stiff method
(Backward Differentiation Formula) when needed. This makes "lsoda" a
robust and efficient choice, especially when unsure about the stiffness
of the system. Stiff equations are those where some components change
much faster than others, which is common in biological models like SEIR.
- atol = 1e-6 (absolute tolerance): atol controls the absolute error
tolerance. It defines how precisely small values (e.g., near-zero states
such as early infection counts) are calculated; 1e-6 means the solution
should be accurate to within about ±0.000001 for small values.
- rtol = 1e-6 (relative tolerance): rtol controls the relative error
tolerance. It defines the acceptable error relative to the size of the
value.
Residuals function
residuals <- function(params) {
model_I <- model_output(params, t)
return(observed_I - model_I)
}
- residuals(): Calculates the difference (residuals) between the
observed infected population (observed_I) and the model’s predicted
infected population (model_I) for a given set of parameters.
Initial guess
initial_guess <- c(0.4, 0.02, 0.05)
- initial_guess: Provides initial estimates for the parameters (beta,
sigma, and gamma).
Run optimization
fit <- nls.lm(par = initial_guess, fn = residuals,
              control = nls.lm.control(maxiter = 1024))
return(fit$par)
} # closes estimate_parameters()
- nls.lm(): Performs nonlinear least squares optimization using the
Levenberg-Marquardt algorithm (from minpack.lm).
- fn = residuals: The function to minimize (i.e., residuals between
observed and model predictions).
- maxiter = 1024: Maximum number of iterations.
- After fitting, fit$par contains the estimated parameters
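Beyond fit$par, the nls.lm fit object also carries convergence diagnostics (field names as documented in minpack.lm); a small sketch of printing them inside estimate_parameters(), just before the return:
# Sketch (inside estimate_parameters, before return(fit$par)): convergence diagnostics
cat("Residual sum of squares:", fit$deviance, "\n")
cat("Iterations used:", fit$niter, "\n")
cat("Stop reason:", fit$message, "\n")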
Main execution
tryCatch({
params <- estimate_parameters("malaria5.csv")
cat("\n=== Estimated Parameters ===\n")
cat(sprintf("Gamma: %.4f\n", params[3]))
cat(sprintf("Sigma: %.4f\n", params[2]))
cat(sprintf("Beta : %.4f\n", params[1]), "\n")
}, error = function(e) {
cat("Error in parameter estimation:", e$message, "\n")
})
=== Estimated Parameters ===
Gamma: 0.0205
Sigma: 0.0489
Beta : 15.3513
- The tryCatch block is used to handle any potential errors that may
occur during the estimation process (e.g., file reading errors or
optimization issues).
- params <- estimate_parameters("malaria5.csv") attempts to estimate the
model parameters using the data from "malaria5.csv".
- cat() prints text to the console.
- sprintf() formats strings like “Gamma: 0.0312” with 4 decimal places
(%.4f).
- error = function(e) {…}: This block runs only if there is an error;
e$message extracts the error message, which is then printed.
Comparing with the Python values:
=== Estimated Parameters ===
          Python    R
Gamma:    0.0212    0.0205
Sigma:    0.0604    0.0489
Beta :    5.7807    15.3513
Comparing the original parameters and the estimated parameters in
RStudio:
NOTE: Original parameters: beta = 0.8, sigma = 0.0347, gamma = 0.5
Estimated parameters: beta = 15.3513, sigma = 0.0489, gamma = 0.0205
Parameter estimation where some parameters are fixed due to
biological knowledge
Some parameters are fixed due to biological understanding, e.g., the
rate of progression from exposed humans to infectious humans is
(1 / incubation period); such a parameter must remain fixed.
For the original parameter values the incubation period is about 29
days (sigma = 0.0347); for the estimated parameters the incubation
period is about 20 days (sigma = 0.0489).
Which is more accurate biologically?
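For reference, a small sketch (not in the original code) of how a biologically fixed rate is derived from an assumed incubation period, using the value quoted above:
# Sketch: converting an assumed incubation period into a fixed sigma
incubation_days <- 29 # roughly the value implied by the original sigma = 0.0347
sigma_fixed <- 1 / incubation_days
cat("Fixed sigma:", round(sigma_fixed, 4), "\n") # about 0.0345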
Libraries
Here we fix sigma and gamma and estimate beta (the transmission rate,
i.e., the probability of being infected from a single mosquito bite).
library(deSolve)
library(readr)
library(minpack.lm)
library(dplyr)
# SEIR model function
seir_model <- function(t, y, params) {
with(as.list(c(y, params)), {
N <- S + E + I + R
dSdt <- -beta * S * I / N
dEdt <- beta * S * I / N - sigma * E
dIdt <- sigma * E - gamma * I
dRdt <- gamma * I
return(list(c(dSdt, dEdt, dIdt, dRdt)))
})
}
# Parameter estimation function
estimate_parameters <- function(data_path) {
data <- read_csv(data_path, show_col_types = FALSE)
if ("Date" %in% colnames(data)) {
data <- select(data, -Date)
}
if (!"day" %in% colnames(data)) {
data$day <- seq_len(nrow(data)) - 1
}
# Initial conditions
N <- 995
E0 <- 10
I0 <- 5
R0 <- 0
S0 <- N - E0 - I0 - R0
y0 <- c(S = S0, E = E0, I = I0, R = R0)
t <- data$day
observed_I <- data$I
# === Fixed parameters from biological knowledge ===
fixed_sigma <- 1/5.2 # e.g., incubation ~5.2 days
fixed_gamma <- 1/7 # e.g., recovery ~7 days
# Model output using beta as the only free parameter
model_output <- function(params, t) {
beta <- params[1]
out <- ode(
y = y0,
times = t,
func = seir_model,
parms = list(beta = beta, sigma = fixed_sigma, gamma = fixed_gamma),
method = "lsoda", atol = 1e-6, rtol = 1e-6
)
return(out[, "I"])
}
# Residual function for optimization
residuals <- function(params) {
model_I <- model_output(params, t)
return(observed_I - model_I)
}
# Initial guess (only for beta)
initial_guess <- c(0.4)
# Optimization
fit <- nls.lm(par = initial_guess, fn = residuals,
control = nls.lm.control(maxiter = 1024))
return(c(beta = fit$par, sigma = fixed_sigma, gamma = fixed_gamma))
}
# Main execution
tryCatch({
params <- estimate_parameters("malaria5.csv")
cat("\n=== Estimated and Fixed Parameters ===\n")
cat(sprintf("Beta : %.4f (estimated)\n", params["beta"]))
cat(sprintf("Sigma: %.4f (fixed)\n", params["sigma"]))
cat(sprintf("Gamma: %.4f (fixed)\n", params["gamma"]), "\n")
}, error = function(e) {
cat("Error in parameter estimation:", e$message, "\n")
})
=== Estimated and Fixed Parameters ===
Beta : 0.7852 (estimated)
Sigma: 0.1923 (fixed)
Gamma: 0.1429 (fixed)
Python values: === Estimated and Fixed Parameters === Beta: 0.7852
(estimated), Sigma: 0.1923 (fixed), Gamma: 0.1429 (fixed)
R values: === Estimated and Fixed Parameters === Beta: 0.7852
(estimated), Sigma: 0.1923 (fixed), Gamma: 0.1429 (fixed)
When some parameters are fixed, both Python and RStudio give the same
values.
2.3 Parameter estimation
This is important when the parameter values from the literature do not
fit the CSV data well. The code below prints the parameter values
obtained by curve fitting for the SEIR-SvIv model. The code estimates
the parameters and gives a graphical comparison of the observed vs model
predictions. Here we only compare one compartment (the Infected
compartment) for clarity of the graphics.
Load required libraries
library(deSolve) # solving odes
library(readr) #Reading CSV files
library(minpack.lm) # For curve fitting
library(ggplot2) #Plotting (ggplot2)
# Suppress warnings
options(warn = -1)
Load CSV data
data <- read_csv("malaria5.csv")#Loads the data
# Ensure 'day' column exists (0-based if missing)
if (!"day" %in% colnames(data)) {
data$day <- seq_len(nrow(data)) - 1
}
- !"day" %in% colnames(data): if the column 'day' is not in the data,
it creates one.
- seq_len(nrow(data)): generates a sequence from 1 to the number of
rows (i.e., 1, 2, …, N). Subtracting 1 turns it into 0-based indexing:
0, 1, 2, …, N-1. The resulting vector is assigned to data$day, adding a
new day column.
Summary: this code loads data from malaria5.csv and ensures that there
is a day column, even if the original data doesn't include one,
defaulting to a sequence starting from 0.
Malaria Model Equations (SEIRV)
malaria_model <- function(t, y, params) {
with(as.list(c(y, params)), {
dS <- Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
dI <- sigma_h * E - (gamma_h + mu_h) * I
dR <- gamma_h * I - (delta_h + mu_h) * R
dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv
dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
return(list(c(dS, dE, dI, dR, dSv, dIv)))
})
}
Function to estimate parameters
estimate_parameters <- function(data) {
# Initial conditions from first row
y0 <- c(
S = data$S[1],
E = data$E[1],
I = data$I[1],
R = data$R[1],
Sv = data$Sv[1],
Iv = data$Iv[1]
)
t <- data$day
observed_infected <- data$I
# Function that returns model predictions for infected
fit_func <- function(params, t) {
out <- ode(
y = y0,
times = t,
func = malaria_model,
parms = list(
Lambda_h = params[1],
Lambda_v = params[2],
beta_h = params[3],
beta_v = params[4],
sigma_h = params[5],
gamma_h = params[6],
delta_h = params[7],
mu_h = params[8],
mu_v = params[9],
nu_v = params[10]
),
method = "lsoda"
)
return(out[, "I"]) # Return only infected individuals
}
# Objective function for least squares
objective <- function(params) {
fit <- fit_func(params, t)
return(fit - observed_infected)
}
# Initial parameter guesses
initial_guess <- c(10, 10, 0.3, 0.3, 0.2, 0.1, 0.05, 0.01, 0.01, 0.02)
# Perform curve fitting
fit <- nls.lm(
par = initial_guess,
fn = objective,
control = nls.lm.control(maxiter = 1024)
)
return(fit$par)
}
Function Definition - estimate_parameters <- function(data) { :
Takes in a data frame from malaria5.csv which includes columns for S, E,
I, R, Sv, Iv and day.
Initial Conditions - y0: sets the initial state of the model using
the first row of the dataset. These represent the initial values for
human and mosquito compartments at day 0.
Time and Observed Data - t: is the vector of time points (days). -
observed_infected: is the vector of observed infected humans from the
data (what the model should fit to).
Model Simulation Function - fit_func(): uses the ode() solver (from
the deSolve package) to simulate the malaria model over time. - It takes
a parameter vector params and time vector t. The parameters are unpacked
and passed to malaria_model (a separate function defining the ODE
system). The solver returns a matrix of output values over time; we
extract just the “I” (infected humans) column to compare to data.
Objective Function for Optimization - objective <-
function(params) { : Defines the error between model predictions and
observed data. Returns the difference vector (residuals), which will be
minimized using least squares.
Initial Guesses -initial_guess <- c(10, 10, 0.3, 0.3, 0.2, 0.1,
0.05, 0.01, 0.01, 0.02): Initial parameter values for optimization.
These should be close enough to the true values for the optimizer to
converge.
Least Squares Fitting -fit <- nls.lm(…): nls.lm() is a nonlinear
least squares solver from the minpack.lm package. It adjusts parameters
to minimize the residuals from objective fn. maxiter = 1024 sets the
maximum number of iterations.
Return Estimated Parameters - return(fit$par): Returns the optimized
parameter values that best fit the model to the infected data.
Summary: Load initial conditions and time/data from the input. Define a
simulator using the ODE malaria model. Build an objective function
comparing model output to observed data. Use nonlinear least squares
(nls.lm) to find the best parameters. Return the estimated parameter
values.
Estimate parameters
estimated_params <- estimate_parameters(data)
# Parameter names
param_names <- c("Lambda_h", "Lambda_v", "beta_h", "beta_v", "sigma_h",
"gamma_h", "delta_h", "mu_h", "mu_v", "nu_v")
# Print results
cat("\nFinal Estimated Parameters:\n")
Final Estimated Parameters:
for (i in seq_along(param_names)) {
cat(sprintf("%s: %.4f\n", param_names[i], estimated_params[i]))
}
Lambda_h: 11.0570
Lambda_v: 9.9891
beta_h: 0.3039
beta_v: 0.2991
sigma_h: 0.1792
gamma_h: 0.0999
delta_h: 0.0501
mu_h: 0.0100
mu_v: 0.0099
nu_v: 0.0204
Final Estimated Parameters vs Original Parameters:
             Estimated   Original
Lambda_h:    11.0570     10       # Human recruitment rate
Lambda_v:    9.9891      100      # Mosquito recruitment rate
beta_h:      0.3039      0.002    # Transmission rate from mosquitoes to humans
beta_v:      0.2991      0.001    # Transmission rate from humans to mosquitoes
sigma_h:     0.1792      0.1      # Rate at which exposed humans become infectious
gamma_h:     0.0999      0.05     # Recovery rate of infected humans
delta_h:     0.0501      0.01     # Rate of loss of immunity
mu_h:        0.0100      0.0001   # Natural death rate of humans
mu_v:        0.0099      0.1      # Natural death rate of mosquitoes
nu_v:        0.0204      0.1      # Extrinsic incubation period rate

Python estimated parameters vs RStudio estimated parameters:
             Python      R
Lambda_h:    9.8367      11.0570
Lambda_v:    9.8458      9.9891
beta_h:      0.2977      0.3039
beta_v:      0.2976      0.2991
sigma_h:     0.1439      0.1792
gamma_h:     0.1431      0.0999
delta_h:     0.1073      0.0501
mu_h:        0.0052      0.0100
mu_v:        0.0064      0.0099
nu_v:        0.0175      0.0204
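The graphical comparison mentioned at the start of this section is not reproduced above; a minimal sketch of it, re-solving the SEIR-SvIv model with the estimated parameters and overlaying the observed Infected class (object names follow the code above):
# Sketch: observed vs fitted Infected humans using the estimated parameters
names(estimated_params) <- param_names
y0 <- c(S = data$S[1], E = data$E[1], I = data$I[1], R = data$R[1],
        Sv = data$Sv[1], Iv = data$Iv[1])
fitted_out <- as.data.frame(ode(y = y0, times = data$day, func = malaria_model,
                                parms = as.list(estimated_params), method = "lsoda"))
comparison_df <- data.frame(day = data$day, Observed = data$I, Fitted = fitted_out$I)
ggplot(comparison_df, aes(x = day)) +
  geom_point(aes(y = Observed, color = "Observed I"), size = 1.5) +
  geom_line(aes(y = Fitted, color = "Fitted I")) +
  labs(x = "Days", y = "Infected humans", title = "Observed vs Fitted (Infected)") +
  theme_minimal()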
Parameter estimation when some parameters are fixed due to
biological knowledge
library(deSolve)
library(readr)
library(minpack.lm)
library(ggplot2)
options(warn = -1) # Suppress warnings
# Load data
data <- read_csv("malaria5.csv")
if (!"day" %in% colnames(data)) {
data$day <- seq_len(nrow(data)) - 1
}
# Malaria model equations
malaria_model <- function(t, y, params) {
with(as.list(c(y, params)), {
dS <- Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
dI <- sigma_h * E - (gamma_h + mu_h) * I
dR <- gamma_h * I - (delta_h + mu_h) * R
dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv
dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
return(list(c(dS, dE, dI, dR, dSv, dIv)))
})
}
# Function to estimate parameters
estimate_parameters <- function(data) {
# Initial conditions
y0 <- c(
S = data$S[1],
E = data$E[1],
I = data$I[1],
R = data$R[1],
Sv = data$Sv[1],
Iv = data$Iv[1]
)
t <- data$day
observed_I <- data$I
# === Fixed parameters based on biological knowledge ===
fixed_params <- list(
mu_h = 0.01,
mu_v = 0.01,
delta_h = 0.05
)
# Parameters to estimate and their initial guesses
param_names_to_estimate <- c("Lambda_h", "Lambda_v", "beta_h", "beta_v",
"sigma_h", "gamma_h", "nu_v")
initial_guess <- c(10, 10, 0.3, 0.3, 0.2, 0.1, 0.02)
names(initial_guess) <- param_names_to_estimate
# Helper to build full parameter list for the ODE solver
assemble_full_params <- function(estimated_values) {
full_params <- fixed_params
for (name in param_names_to_estimate) {
full_params[[name]] <- estimated_values[name]
}
return(full_params)
}
# Model prediction function
model_I_output <- function(params_vec, t) {
full_params <- assemble_full_params(params_vec)
out <- ode(y = y0, times = t, func = malaria_model,
parms = full_params, method = "lsoda", atol = 1e-6, rtol = 1e-6)
return(out[, "I"])
}
# Objective function
residuals_fn <- function(params_vec) {
model_I <- tryCatch(model_I_output(params_vec, t),
error = function(e) rep(NA, length(t)))
if (any(is.na(model_I)) || any(!is.finite(model_I))) {
return(rep(1e6, length(t))) # Penalize invalid solutions
}
return(model_I - observed_I)
}
# Fit parameters
fit <- nls.lm(
par = initial_guess,
fn = residuals_fn,
control = nls.lm.control(maxiter = 1024)
)
# Combine estimated and fixed parameters
estimated_params <- as.list(fit$par)
all_params <- c(estimated_params, fixed_params)
all_params <- all_params[sort(names(all_params))] # Sort for readability
return(all_params)
}
# Run estimation
estimated <- estimate_parameters(data)
# Print results
cat("\nFinal Parameter Estimates (Estimated + Fixed):\n")
Final Parameter Estimates (Estimated + Fixed):
for (name in names(estimated)) {
cat(sprintf("%s: %.4f\n", name, estimated[[name]]))
}
beta_h: 0.2962
beta_v: 0.2991
delta_h: 0.0500
gamma_h: 0.1003
Lambda_h: 11.3203
Lambda_v: 9.9663
mu_h: 0.0100
mu_v: 0.0100
nu_v: 0.0213
sigma_h: 0.1682
- The fixed parameters are: delta_h = 0.0500, mu_h = 0.0100, mu_v = 0.0100.
Python estimated parameters vs R estimated parameters (Estimated + Fixed):
             Python      R
Lambda_h:    10.0124     11.3203
Lambda_v:    10.1119     9.9663
beta_h:      0.2898      0.2962
beta_v:      0.3256      0.2991
delta_h:     0.0500      0.0500
gamma_h:     0.0768      0.1003
mu_h:        0.0100      0.0100
mu_v:        0.0100      0.0100
nu_v:        0.0902      0.0213
sigma_h:     0.1870      0.1682
The fixed parameters are mu_h, mu_v and delta_h. Q: Which parameters do
we need to fix in the model above?
- cat(“Estimated Parameters:”): Prints a title/header to the
console.
- for (i in seq_along(param_names)) {…}: Loops through each parameter
using seq_along(param_names) (which safely handles the correct index
range).
- sprintf() is used to format the output string:
- %s: inserts the parameter name.
- %.4f: inserts the numeric estimate with 4 decimal places.
- cat(): prints that formatted string to the console, line by
line.
2.4. The code below can also be used to print the estimated
parameters
# After parameter estimation (right after estimated_params <- estimate_parameters(data))
# Extract and display numerical values only
cat("\nEstimated Parameter Values:\n")
Estimated Parameter Values:
cat("-------------------------\n")
-------------------------
cat(sprintf("Lambda_h = %.6f\n", estimated_params[1]))
Lambda_h = 11.056969
cat(sprintf("Lambda_v = %.6f\n", estimated_params[2]))
Lambda_v = 9.989087
cat(sprintf("beta_h = %.6f\n", estimated_params[3]))
beta_h = 0.303945
cat(sprintf("beta_v = %.6f\n", estimated_params[4]))
beta_v = 0.299107
cat(sprintf("sigma_h = %.6f\n", estimated_params[5]))
sigma_h = 0.179235
cat(sprintf("gamma_h = %.6f\n", estimated_params[6]))
gamma_h = 0.099887
cat(sprintf("delta_h = %.6f\n", estimated_params[7]))
delta_h = 0.050102
cat(sprintf("mu_h = %.6f\n", estimated_params[8]))
mu_h = 0.009984
cat(sprintf("mu_v = %.6f\n", estimated_params[9]))
mu_v = 0.009877
cat(sprintf("nu_v = %.6f\n", estimated_params[10]))
nu_v = 0.020384
- cat(): Prints text or values without extra formatting
- sprintf(): Formats numeric values with precision (e.g., 6 decimal
places)
- estimated_params[n]: Accesses the nth estimated parameter value
2.6. MALARIA MODELS WITH MORE INTERVENTIONS/COMPARTMENTS USING THE
ESTIMATED PARAMETERS
Simulating the SEIRV-SvIv model with code written out explicitly rather
than relying on a built-in model. This allows one to use any number of
equations, variables and parameters.
Below is a malaria model consisting of a system of seven ODEs in R, with
vaccination included as an intervention strategy. The vaccinated
proportion is assumed to be fully protected in this formulation. The
malaria_model function is used.
library(deSolve) # Solves differential equations
library(ggplot2) # For plotting
Parameters
params <- list(
Lambda_h = 11.0570, # Human recruitment rate
Lambda_v = 9.9891, # Mosquito recruitment rate
beta_h = 0.3039, # Transmission rate from mosquitoes to humans
beta_v = 0.2991, # Transmission rate from humans to mosquitoes
sigma_h = 0.0999, # Rate at which exposed humans become infectious
gamma_h = 0.099887, # Recovery rate of infected humans
delta_h = 0.0501, # Rate of loss of immunity
mu_h = 0.0100, # Natural death rate of humans
mu_v = 0.0099, # Natural death rate of mosquitoes
nu_v = 0.0204, # Extrinsic incubation period rate
a = 0.71 # Proportion of humans vaccinated (vaccine is 100% effective)
)
a = 0.71 means that 71% of newly recruited humans are vaccinated. The
vaccine is assumed to be 100% effective, so vaccinated individuals
cannot become infected.
Initial conditions
initial_state <- c(
S = 980, # Susceptible humans
E = 10, # Exposed humans
I = 5, # Infected humans
R = 0, # Recovered humans
V = 0, # Vaccinated humans
Sv = 990, # Susceptible mosquitoes
Iv = 50 # Infected mosquitoes
)
Define the system of ODEs
malaria_model <- function(t, state, params) {
with(as.list(c(state, params)), {
dS <- (1 - a) * Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
dI <- sigma_h * E - (gamma_h + mu_h) * I
dR <- gamma_h * I - (delta_h + mu_h) * R
dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv
dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
dV <- a * Lambda_h - mu_h * V
return(list(c(dS, dE, dI, dR, dV, dSv, dIv)))
})
}
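As a side check that is not in the original code, the vaccinated class obeys dV/dt = a*Lambda_h - mu_h*V, so in the long run it settles at V* = a*Lambda_h / mu_h. A minimal sketch using the params list defined above:
# Sketch: long-run equilibrium of the vaccinated compartment implied by the parameters
V_star <- params$a * params$Lambda_h / params$mu_h
cat("Equilibrium vaccinated population:", round(V_star, 1), "\n") # 0.71 * 11.057 / 0.01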
Time span
t_span <- c(0, 200)
t_eval <- seq(t_span[1], t_span[2], length.out = 1000)
Simulates over 200 days with 1000 time points.
Solve the ODEs
solution <- ode(
y = initial_state,
times = t_eval,
func = malaria_model,
parms = params
)
Convert to data frame
results <- as.data.frame(solution)
Plot results
ggplot(results, aes(x = time)) +
geom_line(aes(y = (S + V), color = "Susceptible and Vaccinated Humans")) +
#geom_line(aes(y = E, color = "Exposed Humans")) +
#geom_line(aes(y = I, color = "Infected Humans")) +
geom_line(aes(y = (I + E + R), color = "Total Infected Humans")) +
#geom_line(aes(y = V, color = "Vaccinated")) +
#geom_line(aes(y = Sv, color = "Susceptible Mosquitoes"), linetype = "dashed") +
labs(x = "Time (days)", y = "Population",
title = "Malaria Transmission Dynamics with Vaccination") +
scale_color_manual(values = c(
"Susceptible and Vaccinated Humans" = "blue",
"Exposed Humans" = "orange",
"Total Infected Humans" = "red",
"Vaccinated" = "green",
"Susceptible Mosquitoes" = "purple"
)) +
theme_minimal() +
theme(legend.position = "bottom", legend.title = element_blank()) +
guides(color = guide_legend(nrow = 2, byrow = TRUE))
Q: Try reducing the time span to about seven days to visualize what
happens in the first week (see the sketch after the plot explanation
below).
- ggplot(results, aes(x = time)) +: Starts building a ggplot object
using the results data frame. Sets the x-axis to the time column (likely
days).
- Plotting Lines for Each Compartment, Each line is given a color
label (which is not a column name, but a string that ggplot will treat
as a legend entry).
- geom_line(aes(y = Sv, color = “Susceptible Mosquitoes”), linetype =
“dashed”) + : This line would plot susceptible mosquitoes (Sv) if
uncommented. It’s styled as dashed, likely to differentiate mosquito
populations from human ones.
Axis Labels and Title - labs(x = “Time (days)”, :Adds labels for the
x-axis, y-axis, and main title of the plot.
Custom Colors = scale_color_manual(values = c(…: Manually assigns
specific colors to the legend labels defined.
Clean Theme & Legend Styling - theme_minimal() +: Applies a
minimal theme for a cleaner look (no background gridlines or boxes).
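Following the question above about the first week, a minimal sketch of a seven-day run, re-using the objects defined in this section (not part of the original code):
# Sketch: simulate only the first seven days of the vaccination scenario
t_week <- seq(0, 7, length.out = 200)
sol_week <- as.data.frame(ode(y = initial_state, times = t_week,
                              func = malaria_model, parms = params))
ggplot(sol_week, aes(x = time, y = E + I + R)) +
  geom_line(color = "red") +
  labs(x = "Time (days)", y = "Total infected humans",
       title = "First week of the vaccination scenario") +
  theme_minimal()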
2.7. Parameter estimation using Neural Networks
We use the SEIR model with known (true) parameters.
Load required libraries
# Install packages if not already installed
if (!requireNamespace("nnet", quietly = TRUE)) install.packages("nnet")
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
library(nnet) #for neural networks.
library(readr) #for reading/writing CSV files easily.
This checks whether the nnet and readr packages are installed. If they
are not, it installs them, and then loads them into the R session.
Generate synthetic data
# --------------------------------
# Step 1: Generate synthetic malaria data
# --------------------------------
set.seed(42) # for reproducibility
# True parameters
beta_true <- 0.8 # transmission rate
sigma_true <- 0.2 # progression rate
gamma_true <- 0.1 # recovery rate
# True parameters
#beta_true <- 0.58 # transmission rate
#sigma_true <- 0.12 # progression rate
#gamma_true <- 0.11 # recovery rate
# Simulate SEIR model
time <- 1:100
S <- numeric(length(time))
E <- numeric(length(time))
I <- numeric(length(time))
R <- numeric(length(time))
# Initial conditions
S[1] <- 9909
E[1] <- 49
I[1] <- 42
R[1] <- 0
N <- S[1] + E[1] + I[1] + R[1] # Total population
dt <- 1
for (t in 1:(length(time) - 1)) {
dS <- -beta_true * S[t] * I[t] / N * dt
dE <- (beta_true * S[t] * I[t] / N - sigma_true * E[t]) * dt
dI <- (sigma_true * E[t] - gamma_true * I[t]) * dt
dR <- gamma_true * I[t] * dt
S[t+1] <- S[t] + dS
E[t+1] <- E[t] + dE
I[t+1] <- I[t] + dI
R[t+1] <- R[t] + dR
}
# Add some random noise (small)
noise_level <- 5
S_noisy <- pmax(S + rnorm(length(S), 0, noise_level), 0)
E_noisy <- pmax(E + rnorm(length(E), 0, noise_level), 0)
I_noisy <- pmax(I + rnorm(length(I), 0, noise_level), 0)
R_noisy <- pmax(R + rnorm(length(R), 0, noise_level), 0)
# Save synthetic data
synthetic_data <- data.frame(time = time, S = round(S_noisy), E = round(E_noisy), I = round(I_noisy), R = round(R_noisy))
write_csv(synthetic_data, "malaria_synthetic_data.csv")
cat("Synthetic data saved to 'malaria_synthetic_data.csv'\n")
Synthetic data saved to 'malaria_synthetic_data.csv'
- set.seed(42): sets the random number generator seed for
reproducibility, so every run of the code gives the same random results.
- True malaria parameters: beta_true = 0.8 (transmission rate),
sigma_true = 0.2 (progression rate), gamma_true = 0.1 (recovery rate).
These are the "true" SEIR parameters that the neural network will later
try to estimate.
- SEIR model variables: time <- 1:100 covers 100 days, and S, E, I, R
are empty numeric vectors that store how many people are Susceptible,
Exposed, Infectious, and Recovered on each day.
- Initial values: S[1] = 9909, E[1] = 49, I[1] = 42, R[1] = 0 set the
starting population on day 1; N is the total population and dt = 1 is
the time step (one day).
- SEIR dynamics: the for loop steps the differential equations forward
one day at a time, moving people from Susceptible to Exposed, from
Exposed to Infectious, and from Infectious to Recovered. These are the
SEIR differential equations discretized with a simple time step.
- Random noise: rnorm() adds normally distributed noise (sd = 5) to each
compartment to simulate real-world observation errors, and pmax(..., 0)
ensures no negative population values.
- Saving the data: the noisy trajectories are rounded (people are whole
numbers), combined with time into a data frame, and saved to
malaria_synthetic_data.csv.
Load the synthetic data
# --------------------------------
# Step 2: Load the data
# --------------------------------
# Load the same file you just saved
data <- read_csv("malaria_synthetic_data.csv")
#data <- read_csv("malaria5.csv")
# Inputs: (S, E, I, R)
inputs <- as.data.frame(data[, c("S", "E", "I", "R")])
# Targets: true parameters
targets <- data.frame(beta = rep(beta_true, nrow(inputs)),
sigma = rep(sigma_true, nrow(inputs)),
gamma = rep(gamma_true, nrow(inputs)))
- Step 2: Load the data. read_csv("malaria_synthetic_data.csv") loads
the synthetic data from the CSV, and data[, c("S", "E", "I", "R")] picks
the inputs (S, E, I, R) that will be fed into the neural network.
- Define target outputs. Since we are working with synthetic data (and
know the true parameters), we repeat beta_true, sigma_true and
gamma_true for every row to create the target outputs for training.
These are the values we want the neural network to learn.
Train the neural network
# --------------------------------
# Step 3: Train a neural network
# --------------------------------
# Inputs
X <- as.matrix(inputs)
# Outputs (targets)
Y <- as.matrix(targets)
# Train neural network
model <- nnet(X, Y,
size = 10, # hidden neurons
linout = TRUE,
maxit = 1000,
decay = 0.01)
# weights: 83
initial value 67.883456
iter 10 value 10.857013
iter 20 value 0.360371
iter 30 value 0.083634
iter 40 value 0.004486
iter 50 value 0.003779
iter 60 value 0.003511
iter 70 value 0.003162
iter 80 value 0.002928
iter 90 value 0.002828
iter 100 value 0.002812
iter 110 value 0.002785
iter 120 value 0.002721
iter 130 value 0.002694
iter 140 value 0.002658
iter 150 value 0.002582
iter 160 value 0.002423
iter 170 value 0.002272
iter 180 value 0.002248
iter 190 value 0.002227
iter 200 value 0.002219
iter 210 value 0.002214
iter 220 value 0.002211
iter 230 value 0.002209
iter 240 value 0.002203
iter 250 value 0.002191
iter 260 value 0.002172
iter 270 value 0.002149
iter 280 value 0.002108
iter 290 value 0.002076
iter 300 value 0.002068
iter 310 value 0.002053
iter 320 value 0.002037
iter 330 value 0.001950
iter 340 value 0.001850
iter 350 value 0.001733
iter 360 value 0.001633
iter 370 value 0.001552
iter 380 value 0.001515
iter 390 value 0.001462
iter 400 value 0.001396
iter 410 value 0.001351
iter 420 value 0.001307
iter 430 value 0.001287
iter 440 value 0.001268
iter 450 value 0.001252
iter 460 value 0.001230
iter 470 value 0.001224
iter 480 value 0.001211
iter 490 value 0.001174
iter 500 value 0.001126
iter 510 value 0.001117
iter 520 value 0.001112
iter 530 value 0.001103
iter 540 value 0.001084
iter 550 value 0.001059
iter 560 value 0.001035
iter 570 value 0.001031
iter 580 value 0.001025
iter 590 value 0.001007
iter 600 value 0.000987
iter 610 value 0.000963
iter 620 value 0.000939
iter 630 value 0.000924
iter 640 value 0.000917
iter 650 value 0.000908
iter 660 value 0.000902
iter 670 value 0.000894
iter 680 value 0.000883
iter 690 value 0.000871
iter 700 value 0.000861
iter 710 value 0.000846
iter 720 value 0.000828
iter 730 value 0.000813
iter 740 value 0.000810
iter 750 value 0.000806
iter 760 value 0.000792
iter 770 value 0.000773
iter 780 value 0.000764
iter 790 value 0.000761
iter 800 value 0.000757
iter 810 value 0.000751
iter 820 value 0.000745
iter 830 value 0.000722
iter 840 value 0.000711
iter 850 value 0.000707
iter 860 value 0.000706
iter 870 value 0.000705
iter 880 value 0.000704
iter 890 value 0.000703
iter 900 value 0.000702
iter 910 value 0.000701
iter 920 value 0.000693
iter 930 value 0.000691
iter 940 value 0.000690
iter 950 value 0.000690
iter 960 value 0.000690
iter 970 value 0.000689
iter 980 value 0.000685
iter 990 value 0.000681
iter1000 value 0.000675
final value 0.000675
stopped after 1000 iterations
cat("Neural network trained.\n")
Neural network trained.
- Step 3: Train a neural network. The inputs and targets are converted
to matrices (X and Y), and nnet() trains a small feed-forward neural
network with one hidden layer of 10 neurons. linout = TRUE means the
output is continuous (appropriate for regression, since beta, sigma and
gamma are real numbers); maxit = 1000 sets the maximum number of
training iterations and decay = 0.01 adds weight decay.
Predicted parameters
# --------------------------------
# Step 4: Prediction
# --------------------------------
# Predict parameters
predicted_params <- predict(model, X)
# Now it will have 3 columns already
predicted_params <- as.data.frame(predicted_params)
colnames(predicted_params) <- c("beta_pred", "sigma_pred", "gamma_pred")
# Combine results
results <- cbind(inputs, targets, predicted_params)
print(head(results))
S E I R beta sigma gamma beta_pred sigma_pred gamma_pred
1 9916 55 32 0 0.8 0.2 0.1 0.7999052 0.1999679 0.09997992
2 9873 78 49 8 0.8 0.2 0.1 0.7999330 0.1999749 0.09998339
3 9840 91 63 9 0.8 0.2 0.1 0.7999382 0.1999762 0.09998403
4 9796 131 81 18 0.8 0.2 0.1 0.7999592 0.1999815 0.09998666
5 9740 149 81 21 0.8 0.2 0.1 0.7999566 0.1999809 0.09998634
6 9668 191 104 30 0.8 0.2 0.1 0.7999689 0.1999840 0.09998787
predict(model, X) predicts beta, sigma and gamma for every row of inputs
using the trained network. The predictions are stored in a data frame
with columns beta_pred, sigma_pred and gamma_pred, and then combined
with the inputs and the true values (via cbind) so we can see how well
the model performed.
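A small sketch (not in the original code) of summarizing how close the network's predictions are to the true values, using the results data frame built above:
# Sketch: mean absolute error of the predicted parameters
mae <- c(
  beta  = mean(abs(results$beta  - results$beta_pred)),
  sigma = mean(abs(results$sigma - results$sigma_pred)),
  gamma = mean(abs(results$gamma - results$gamma_pred))
)
print(round(mae, 5))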
Step 5: Plotting
par(mfrow = c(1, 3))
plot(results$beta, results$beta_pred,
main = "Beta: True vs Predicted", xlab = "True beta", ylab = "Predicted beta", pch = 19)
abline(0, 1, col = "red")
plot(results$sigma, results$sigma_pred,
main = "Sigma: True vs Predicted", xlab = "True sigma", ylab = "Predicted sigma", pch = 19)
abline(0, 1, col = "red")
plot(results$gamma, results$gamma_pred,
main = "Gamma: True vs Predicted", xlab = "True gamma", ylab = "Predicted gamma", pch = 19)
abline(0, 1, col = "red")
- Step 5: Plotting. par(mfrow = c(1, 3)) places three plots side by
side: true vs predicted beta, true vs predicted sigma, and true vs
predicted gamma. In each panel, abline(0, 1, col = "red") draws the
ideal line y = x.
Perfect predictions would lie along the red line. This estimation helps
one check whether the assumed true parameters are consistent with the
data set.
Why do prediction when we already have the true values?
We use them as targets to train a neural network, so that the network
learns to predict those parameters from the SEIR data (the population
states over time). This is a supervised learning task.
So, even though we know the true parameters, we are training the model
to learn how to infer those parameters from observational data, which is
useful in real-world settings where the true parameters of another data
set are unknown. It also confirms that the prediction pipeline works
correctly.
Parameter estimation using neural networks
Where we dont have true values but only estimated parameters
# Install required packages if not already installed
if (!requireNamespace("nnet", quietly = TRUE)) install.packages("nnet")
if (!requireNamespace("deSolve", quietly = TRUE)) install.packages("deSolve")
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(nnet)
library(deSolve)
library(readr)
library(ggplot2)
# --------------------------------
# Step 1: Define the SEIR model
# --------------------------------
seir_model <- function(t, state, params) {
with(as.list(c(state, params)), {
beta <- params[1]
sigma <- params[2]
gamma <- params[3]
dS <- -beta * S * I / (S + E + I + R)
dE <- beta * S * I / (S + E + I + R) - sigma * E
dI <- sigma * E - gamma * I
dR <- gamma * I
list(c(dS, dE, dI, dR))
})
}
# --------------------------------
# Step 2: Load observed data (no true parameters known)
# --------------------------------
# Replace this with your real or synthetic dataset path
data <- read_csv("malaria_synthetic_data.csv") # assuming it's in your working directory
times <- data$time
observed_SEIR <- data[, c("S", "E", "I", "R")]
# Initial conditions
init_state <- c(S = data$S[1], E = data$E[1], I = data$I[1], R = data$R[1])
# --------------------------------
# Step 3: Estimate parameters via ODE + optimization
# --------------------------------
cat("Starting parameter estimation...\n")
Starting parameter estimation...
objective_function <- function(params, times, observed_I, init_state) {
out <- ode(y = init_state, times = times, func = seir_model, parms = params)
pred_I <- out[, "I"]
mse <- mean((pred_I - observed_I)^2, na.rm = TRUE)
return(mse)
}
initial_guess <- c(beta = 0.6, sigma = 0.15, gamma = 0.08)
fit <- optim(par = initial_guess,
fn = objective_function,
times = times,
observed_I = data$I,
init_state = init_state,
method = "L-BFGS-B",
lower = c(beta = 0.01, sigma = 0.01, gamma = 0.01),
upper = c(beta = 2, sigma = 1, gamma = 1))
estimated_params <- fit$par
cat("✅ Estimated parameters:\n")
✅ Estimated parameters:
print(estimated_params)
beta sigma gamma
0.59404315 0.28795939 0.09771527
# --------------------------------
# Step 4: Generate synthetic data using estimated parameters
# This simulates pseudo-labeled data for training a neural network
# --------------------------------
set.seed(123)
num_datasets <- 50 # Number of pseudo-datasets to generate
time_steps <- nrow(data)
dataset_list <- list()
cat("Generating pseudo-labeled datasets using estimated parameters...\n")
Generating pseudo-labeled datasets using estimated parameters...
for (i in 1:num_datasets) {
# Perturb the estimated parameters slightly to simulate variation
beta_perturbed <- estimated_params[1] * runif(1, 0.9, 1.1)
sigma_perturbed <- estimated_params[2] * runif(1, 0.9, 1.1)
gamma_perturbed <- estimated_params[3] * runif(1, 0.9, 1.1)
# Randomized initial conditions
S0 <- round(runif(1, 9000, 10000))
E0 <- round(runif(1, 20, 100))
I0 <- round(runif(1, 20, 100))
R0 <- 0
init_state_sim <- c(S = S0, E = E0, I = I0, R = R0)
times_sim <- 1:time_steps
# Solve ODE
out_sim <- ode(y = init_state_sim, times = times_sim, func = seir_model, parms = c(beta_perturbed, sigma_perturbed, gamma_perturbed))
# Add noise
noise_level <- 10
noisy_sim <- out_sim[, 2:5] + matrix(rnorm(4 * nrow(out_sim), 0, noise_level), ncol = 4)
noisy_sim <- pmax(noisy_sim, 0)
# Store input (SEIR) and output (parameters)
dataset_list[[i]] <- list(
input = noisy_sim,
output = c(beta = beta_perturbed, sigma = sigma_perturbed, gamma = gamma_perturbed)
)
}
# --------------------------------
# Step 5: Prepare training data
# --------------------------------
X_train <- do.call(rbind, lapply(dataset_list, function(x) x$input))
Y_train <- do.call(rbind, lapply(dataset_list, function(x) matrix(x$output, nrow = nrow(x$input), ncol = 3, byrow = TRUE)))
# Normalize inputs
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
X_train_scaled <- apply(X_train, 2, normalize)
# --------------------------------
# Step 6: Train Neural Network
# --------------------------------
cat("Training neural network...\n")
Training neural network...
nn_model <- nnet(
X_train_scaled, Y_train,
size = 10,
linout = TRUE,
maxit = 1000,
decay = 0.01,
trace = FALSE
)
cat(" Neural network trained.\n")
Neural network trained.
# --------------------------------
# Step 7: Predict on original observed data
# --------------------------------
# Normalize observed data
observed_scaled <- t(apply(observed_SEIR, 1, function(row) (row - min(X_train)) / (max(X_train) - min(X_train))))
predicted_params <- predict(nn_model, observed_scaled)
avg_prediction <- colMeans(predicted_params)
# Print predicted vs estimated
results <- rbind(
"Estimated (from optimization)" = estimated_params,
"Predicted (from neural network)" = avg_prediction
)
print(results)
beta sigma gamma
Estimated (from optimization) 0.5940431 0.2879594 0.09771527
Predicted (from neural network) 0.5893953 0.2870263 0.09727007
# --------------------------------
# Step 8: Plot prediction
# --------------------------------
plot_data <- data.frame(
time = rep(times, 2),
Infected = c(data$I, observed_SEIR$I), # dummy plot since we don't know true
type = rep(c("Observed", "Neural Net Prediction"), each = length(times))
)
# Just visualize predicted parameters
cat("Plotting infected curve using predicted parameters...\n")
Plotting infected curve using predicted parameters...
simulated_with_pred <- ode(y = init_state, times = times, func = seir_model, parms = avg_prediction)
plot_data_nn <- data.frame(
time = rep(times, 2),
Infected = c(data$I, simulated_with_pred[, "I"]),
type = rep(c("Observed", "Simulated w/ Pred Params"), each = length(times))
)
ggplot(plot_data_nn, aes(x = time, y = Infected, color = type)) +
geom_line(size = 1) +
labs(title = "Observed vs Simulated Infections Using Predicted Parameters",
x = "Time", y = "Infected Individuals") +
theme_minimal()

Parameter estimation using TensorFlow
- If your data set comes from real or observed data, and you don’t
know the true parameters (β, σ, γ), then the goal shifts from supervised
learning to optimizing parameters by minimizing prediction error. -
Instead of a neural network trained on labeled targets, we’ll use
TensorFlow/Keras to build a regression model that learns the
best-fitting parameters by minimizing the difference between simulated
and observed infections over time.
Libraries
# Required Libraries
library(deSolve)
library(readr)
library(tensorflow)
library(keras)
# Load real data
#data <- read_csv("malaria5.csv")
#if (!"day" %in% colnames(data)) {
#data$day <- seq_len(nrow(data)) - 1}
# Extract time and observed infections
#t <- data$day
#observed_I <- data$I
# Initial Conditions
#N <- 995
#E0 <- 10; I0 <- 5; R0 <- 0
#S0 <- N - E0 - I0 - R0
#y0 <- c(S = S0, E = E0, I = I0, R = R0)
# SEIR ODE Model
#seir_model <- function(t, y, params) {
#with(as.list(c(y, params)), {
# N <- S + E + I + R
# dS <- -beta * S * I / N
# dE <- beta * S * I / N - sigma * E
# dI <- sigma * E - gamma * I
# dR <- gamma * I
# list(c(dS, dE, dI, dR)) })}
# Wrapper to simulate infections
#simulate_infections <- function(beta, sigma, gamma) {
# out <- ode(y = y0, times = t, func = seir_model, parms = c(beta=beta, sigma=sigma, gamma=gamma))
# return(out[, "I"])}
# TensorFlow Parameter Variables
#beta <- tf$Variable(0.4, dtype = "float32")
#sigma <- tf$Variable(0.02, dtype = "float32")
#gamma <- tf$Variable(0.05, dtype = "float32")
# Optimizer
#optimizer <- optimizer_adam(learning_rate = 0.01)
# Observed values as Tensor
#observed_I_tensor <- tf$constant(observed_I, dtype = "float32")
# Loss function (Mean Squared Error)
#loss_fn <- function() {
# pred_I <- simulate_infections(beta$numpy(), sigma$numpy(), gamma$numpy())
# pred_tensor <- tf$constant(pred_I, dtype = "float32")
# tf$reduce_mean(tf$square(pred_tensor - observed_I_tensor))}
# Training Loop
#for (epoch in 1:300) {
# with(tf$GradientTape() %as% tape, {
# loss <- loss_fn() })
#grads <- tape$gradient(loss, list(beta, sigma, gamma))
#optimizer$apply_gradients(purrr::transpose(list(grads, list(beta, sigma, gamma))))
# if (epoch %% 50 == 0) {
# cat(sprintf("Epoch %d | Loss = %.4f | beta = %.4f | sigma = %.4f | gamma = %.4f\n",
# epoch, loss$numpy(), beta$numpy(), sigma$numpy(), gamma$numpy())) }}
# Final Estimated Parameters
#cat(sprintf("\nEstimated Parameters from Real Data:\nBeta = %.4f\nSigma = %.4f\nGamma = %.4f\n",
# beta$numpy(), sigma$numpy(), gamma$numpy()))
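Note that, as written, the commented-out loss function computes the simulation from detached values (via $numpy()), so tape$gradient() would return NULL and the Adam update cannot actually run. A minimal workaround sketch (assuming the commented-out objects above — simulate_infections(), observed_I, the tf$Variable parameters and the optimizer — have been defined) replaces automatic differentiation with finite-difference gradients:
# Finite-difference gradients as a stand-in for automatic differentiation,
# since deSolve runs outside TensorFlow's computation graph.
mse_loss <- function(b, s, g) {
  pred <- simulate_infections(b, s, g)
  mean((pred - observed_I)^2)
}
numeric_grad <- function(b, s, g, eps = 1e-4) {
  c((mse_loss(b + eps, s, g) - mse_loss(b - eps, s, g)) / (2 * eps),
    (mse_loss(b, s + eps, g) - mse_loss(b, s - eps, g)) / (2 * eps),
    (mse_loss(b, s, g + eps) - mse_loss(b, s, g - eps)) / (2 * eps))
}
for (epoch in 1:200) {
  vals  <- c(beta$numpy(), sigma$numpy(), gamma$numpy())
  grads <- lapply(numeric_grad(vals[1], vals[2], vals[3]),
                  function(x) tf$constant(x, dtype = "float32"))
  optimizer$apply_gradients(purrr::transpose(list(grads, list(beta, sigma, gamma))))
  if (epoch %% 50 == 0) {
    cat(sprintf("Epoch %d | MSE = %.4f\n", epoch, mse_loss(vals[1], vals[2], vals[3])))
  }
}
A gradient-free optimizer such as optim() (used in the next sections) is usually the simpler route; this sketch only shows how the TensorFlow loop could be kept.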
3.1. PARAMETER ESTIMATION BY SPLITTING THE DATA INTO TRAINING AND
TESTING SET
- Splitting the dataset into training and testing sets can be useful
for compartmental models like SEIR, especially in parameter
estimation.
- Generalization Check: By training on one portion of the data and
testing on another, we can evaluate whether the estimated parameters
generalize well beyond the training period.
- Model Validation: The testing set acts as an independent evaluation,
helping to identify overfitting, where the model performs well on
training data but poorly on unseen data.
- Robustness of Parameter Estimates: Ensuring that parameters work for
future projections rather than just fitting past data is crucial for
forecasting.
- Tuning and Comparison: Different model variations (e.g., different
parameter constraints, alternative assumptions) can be compared using
training-test performance.
We will use the CSV file malaria6.csv, fixing some of the parameters, especially those already known from biological knowledge.
First, check the descriptive statistics of the CSV file.
Load libraries
library(readr)      # For reading CSV files
library(dplyr)      # For data manipulation
library(deSolve)    # For solving ODEs
library(minpack.lm) # For nonlinear curve fitting
library(tidyr)      # For reshaping and dropping missing values
Load the CSV file
data <- read_csv("malaria6.csv")
Loads the CSV file containing malaria data into a dataframe.
Display the first few rows
cat("First 5 rows of the dataset:\n")
First 5 rows of the dataset:
print(head(data, 5))
# A tibble: 5 × 7
day s E i r Sv Iv
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 980 10 5 0 990 10
2 1 969 28 5 0 983 15
3 2 949 54 7 0 976 20
4 3 920 86 12 0 966 27
5 4 879 127 19 0 950 40
Get basic info about the dataset
cat("\nDataset Summary:\n")
Dataset Summary:
print(str(data)) # Structure of the dataset
spc_tbl_ [100 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ day: num [1:100] 0 1 2 3 4 5 6 7 8 9 ...
$ s : num [1:100] 980 969 949 920 879 818 727 601 442 269 ...
$ E : num [1:100] 10 28 54 86 127 184 265 373 503 634 ...
$ i : num [1:100] 5 5 7 12 19 30 46 70 103 148 ...
$ r : num [1:100] 0 0 0 0 0 0 1 3 6 11 ...
$ Sv : num [1:100] 990 983 976 966 950 925 887 831 752 648 ...
$ Iv : num [1:100] 10 15 20 27 40 61 93 140 206 291 ...
- attr(*, "spec")=
.. cols(
.. day = col_double(),
.. s = col_double(),
.. E = col_double(),
.. i = col_double(),
.. r = col_double(),
.. Sv = col_double(),
.. Iv = col_double()
.. )
- attr(*, "problems")=<externalptr>
NULL
Check for missing values
cat("\nMissing values per column:\n")
Missing values per column:
print(colSums(is.na(data)))
day s E i r Sv Iv
0 0 0 0 0 0 0
Describe numerical data
cat("\nStatistical Summary:\n")
Statistical Summary:
print(summary(data))
day s E i
Min. : 0.00 Min. : 8.0 Min. : 10.0 Min. : 5.0
1st Qu.:24.75 1st Qu.: 24.0 1st Qu.:164.8 1st Qu.:346.0
Median :49.50 Median : 30.5 Median :180.0 Median :358.0
Mean :49.50 Mean :100.3 Mean :229.1 Mean :373.7
3rd Qu.:74.25 3rd Qu.: 33.0 3rd Qu.:194.8 3rd Qu.:459.8
Max. :99.00 Max. :980.0 Max. :748.0 Max. :572.0
r Sv Iv
Min. : 0.0 Min. : 55.00 Min. : 10.0
1st Qu.: 329.8 1st Qu.: 67.75 1st Qu.:297.0
Median : 773.0 Median : 85.00 Median :298.5
Mean : 653.5 Mean :167.98 Mean :312.3
3rd Qu.: 982.5 3rd Qu.: 87.00 3rd Qu.:316.8
Max. :1131.0 Max. :990.00 Max. :617.0
This helps you understand what the dataset looks like and whether
it’s clean.
Load and preprocess data
Data Preprocessing
data <- read.csv("malaria6.csv") %>%
rename_with(tolower) %>%
mutate(day = seq_len(n()) - 1) %>% # Start day count at 0
rename(infected = i) %>% # Rename to avoid conflicts
drop_na()
- Load the CSV
- data <- read.csv(“malaria6.csv”) %>%: Reads the CSV file
“malaria6.csv” using base R’s read.csv().
- Convert Column Names to Lowercase
- rename_with(tolower) %>%: converts all column names in the data
frame to lowercase. This helps avoid issues with inconsistent naming
like “I” vs “i” or “Time” vs “time”.
- Add a time Column
- mutate(day = seq_len(n()) - 1) %>%: mutate() adds a new column
called day.
- seq_len(n()): creates a sequence from 1 to the number of rows (n()
is the number of rows in the current data). Subtracting 1 makes the
sequence start at 0: 0, 1, 2, …, N-1. This is often used for time steps
in simulations.
- Rename the Infected Column
- rename(infected = i) %>%: Renames the column i to infected. This makes the name clearer and avoids naming conflicts with variables or functions.
- Drop Any Missing Data
- drop_na(): Removes any rows in the dataset that contain NA (missing
values). Ensures the data is clean for modeling or plotting.
The final result has all-lowercase column names, includes a day column starting at 0, uses infected instead of i, and has no missing values.
Check required column exists
if(!"infected" %in% colnames(data)) {
stop("Required column 'infected' missing. Check your CSV file.")
}
cat("Available columns in CSV:", paste(colnames(data), collapse = ", "), "\n")
Available columns in CSV: day, s, e, infected, r, sv, iv
- if(!“infected” %in%: This checks whether the column “infected”
exists in the data frame.
- colnames(data): gets a vector of all column names.
- “infected” %in% colnames(data): returns TRUE if “infected” is among
them.
- The ! negates it: if “infected” is not present, then stop() halts
the execution of the program and prints the given error message.
Print column names:
- cat("Available columns in CSV:", paste(colnames(data), collapse = ", "), "\n") prints all available column names, separated by commas.
- paste(..., collapse = ", ") combines them into one string, e.g. "day, s, e, infected, r, sv, iv".
- cat() prints that message without quotes, adding a newline (\n) at the end.
Summary: This ensures the "infected" column is in the dataset, stops the script with a clear error message if it is missing, and prints all current column names so the user can quickly verify or debug.
Split data into training and testing sets (80% training)
split_index <- floor(nrow(data) * 0.8)
train_data <- data[1:split_index, ]
test_data <- data[(split_index+1):nrow(data), ]
80% of the data is used for fitting the model and 20% is kept as test data for evaluation (used later).
- nrow(data): gives the total number of rows in the dataset.
- Multiplying by 0.8 takes 80% of the data for training.
- floor() ensures the index is a whole number (it rounds down).
train_data <- data[1:split_index, ] takes the first 80% of the rows (from row 1 to split_index) and stores them in train_data. These rows are used to fit/train the model.
Create Testing Set - test_data <- data[(split_index+1):nrow(data),
]: This takes the remaining 20% of the rows (from split_index + 1 to the
end). Stores them in test_data. These are used to test/evaluate the
model’s performance.
Why Split Like This: Helps avoid overfitting (where a model memorizes
the training data) and allows you to assess how well the model
generalizes to new, unseen data.
80/20 splits are a common rule of thumb, though you might also use
70/30.
SEIR Model with Vector Dynamics
malaria_model <- function(t, state, params) {
with(as.list(c(state, params)), {
dS <- Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * Rec   # susceptible humans
dE <- beta_h * S * Iv - (sigma_h + mu_h) * E                  # exposed humans
dI <- sigma_h * E - (gamma_h + mu_h) * I                      # infectious humans
dRec <- gamma_h * I - (delta_h + mu_h) * Rec                  # recovered humans
dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv                 # susceptible mosquitoes
dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv                   # infected mosquitoes
return(list(c(dS, dE, dI, dRec, dSv, dIv)))
})
}
Simplified wrapper function for parameter estimation
Some parameters are fixed based on biological knowledge.
model_fit <- function(t, beta_h, beta_v, gamma_h, nu_v) {
# Fixed parameters based on biological knowledge
fixed_params <- c(
Lambda_h = 10, Lambda_v = 100,
sigma_h = 0.1, delta_h = 0.01,
mu_h = 0.0001, mu_v = 0.1
)
initial_conditions <- c(S = 990, E = 5, I = 5, Rec = 0, Sv = 500, Iv = 10)
params <- c(fixed_params, beta_h = beta_h, beta_v = beta_v,
gamma_h = gamma_h, nu_v = nu_v)
sol <- ode(y = initial_conditions, times = t, func = malaria_model,
parms = params, method = "lsoda")
return(sol[, "I"]) # Return only infected counts
}
- model_fit <- function(t, beta_h, beta_v, gamma_h, nu_v) { :This
defines a function model_fit which takes four parameters as input.
- t: A vector of time points at which the model should be
evaluated.
- beta_h: Transmission rate from mosquitoes to humans; beta_v: transmission rate from humans to mosquitoes; gamma_h: recovery rate of humans; nu_v: removal rate of infectious mosquitoes (often taken as the reciprocal of the extrinsic incubation period).
- Fixed Parameters
- fixed_params <- c(Lambda_h = 10, Lambda_v = 100, sigma_h = 0.1, delta_h = 0.01, mu_h = 0.0001, mu_v = 0.1): This sets up a vector fixed_params containing fixed values for the biological parameters assumed constant across the simulation: Lambda_h: human recruitment (birth) rate; Lambda_v: mosquito recruitment rate; sigma_h: rate at which exposed humans become infectious; delta_h: rate at which recovered humans lose immunity and return to the susceptible class; mu_h: natural death rate of humans; mu_v: natural death rate of mosquitoes.
- Initial Conditions
- initial_conditions <- c(S = 990, E = 5, I = 5, Rec = 0, Sv = 500,
Iv = 10) : This sets up the initial conditions for the model:
- Parameters Combination
- params <- c(fixed_params, beta_h = beta_h, beta_v = beta_v,
gamma_h = gamma_h, nu_v = nu_v): This combines the fixed parameters and
the parameters passed into the function (beta_h, beta_v, gamma_h, and
nu_v) into a single vector called params. This vector will be passed to
the ODE solver.
- ODE Solver Call
- sol <- ode(y = initial_conditions, times = t, func =
malaria_model, parms = params, method = “lsoda”) :This calls the ode
function (from the deSolve package) to solve the system of differential
equations:
- y = initial_conditions: Specifies the initial values of the
compartments (S, E, I, Rec, Sv, Iv).
- times = t: A vector of time points at which the solution is to be
computed.
- func = malaria_model: The function malaria_model contains the system
of differential equations for the malaria spread.
- parms = params: The parameters (both fixed and dynamic) that will be
used in the model.
- method = “lsoda”: Specifies the numerical integration method (lsoda
is a robust method for solving stiff ODEs).
- Return Infected Population
- return(sol[,“I”]) # Return only infected counts. :The ode function
returns a matrix where each row corresponds to a time point, and each
column corresponds to a compartment (S, E, I, Rec, Sv, Iv). This line
extracts only the infected population (I) from the solution and returns
it.
Summary: Purpose: The model_fit function is used to simulate the
number of infected humans (I) over time for a malaria model.
Inputs: - Time points t at which the model is evaluated. - Parameters
for transmission rates (beta_h, beta_v), recovery rate (gamma_h), and
extrinsic incubation period (nu_v). - Outputs: The function returns a
vector of infected individuals (I) at each time step in t.
This function is useful for fitting the model to real data, as it
computes the predicted infected population over time given the
parameters and initial conditions.
Parameter estimation function
estimate_parameters <- function(data) {
t <- data$day
I_data <- data$infected # Use actual infected counts
# Starting values based on typical malaria parameters
start_vals <- c(
beta_h = 0.002, # Transmission rate from mosquitoes to humans
beta_v = 0.001, # Transmission rate from humans to mosquitoes
gamma_h = 0.05, # Recovery rate
nu_v = 0.1 # Extrinsic incubation period
)
# Set reasonable bounds
lower_bounds <- c(1e-4, 1e-4, 0.01, 0.01)
upper_bounds <- c(0.1, 0.1, 1, 1)
# Perform fitting with error handling
tryCatch({
fit <- nlsLM(
I_data ~ model_fit(t, beta_h, beta_v, gamma_h, nu_v),
start = start_vals,
lower = lower_bounds,
upper = upper_bounds,
control = nls.lm.control(maxiter = 500, ftol = 1e-6)
)
return(coef(fit))
}, error = function(e) {
message("Parameter estimation failed with error: ", e$message)
message("Returning starting values as fallback")
return(start_vals)
})
}
Step 1: estimate_parameters <- function(data) { defines a function called estimate_parameters. It takes data as input; data must have at least two columns: day (time points) and infected (observed infected human counts).
Step 2: Extract day and infected from the data
- t <- data$day
- I_data <- data$infected
t becomes the vector of time points; I_data becomes the real-world data for the number of infected people.
Step 3: Set starting values for parameters - start_vals <- c( -
beta_h = 0.002, # mosquito → human transmission rate - beta_v = 0.001, #
human → mosquito transmission rate - gamma_h = 0.05, # human recovery
rate - nu_v = 0.1 # mosquito incubation rate )
Starting guesses for the four parameters you want to estimate. Good
starting points help the algorithm converge faster.
Step 4: Set bounds for the parameters -lower_bounds <- c(1e-4,
1e-4, 0.01, 0.01) -upper_bounds <- c(0.1, 0.1, 1, 1) -Lower and upper
limits for each parameter.
Makes sure the fitted parameters stay realistic and biologically plausible.
Step 5: Fit the model to the data using nlsLM fit <- nlsLM(
-I_data ~ model_fit(t, beta_h, beta_v, gamma_h, nu_v), -start =
start_vals, -lower = lower_bounds, -upper = upper_bounds, -control =
nls.lm.control(maxiter = 500, ftol = 1e-6) )
nlsLM = non-linear least squares fitting (from the minpack.lm
package). It tries to find values of beta_h, beta_v, gamma_h, and nu_v
that make model_fit(t, …) match I_data.
- maxiter = 500 = allow up to 500 iterations.
- ftol = 1e-6 = very small tolerance for convergence (tight
fitting).
Step 6: Return the fitted parameters -return(coef(fit)) If fitting
succeeds, coef(fit) gives the estimated values of the parameters.
Step 7: Handle errors gracefully -}, error = function(e) {
-message(“Parameter estimation failed with error:”, e$message)
-message(“Returning starting values as fallback”) -return(start_vals)
})
If fitting fails (for example, the model doesn’t converge), the
function: Prints an error message. Returns the starting values instead
of crashing.
Estimate parameters using training data
params_opt <- estimate_parameters(train_data)
param_names <- c("Beta_h", "Beta_v", "Gamma_h", "Nu_v")
cat("\nEstimated Parameters:\n")
Estimated Parameters:
for(i in seq_along(params_opt)) {
cat(sprintf("%-8s: %.6f\n", param_names[i], params_opt[i]))
}
Beta_h : 0.002000
Beta_v : 0.001000
Gamma_h : 0.050000
Nu_v : 0.100000
The above parameters are the ones being estimated; the others were fixed. Note that they equal the starting values exactly, which suggests the optimizer either converged immediately or fell back to the starting values after a failed fit.
params_opt <- estimate_parameters(train_data) Calls the
function with your train_data dataset.Stores the output (the estimated
parameters) into params_opt.
param_names <- c(“Beta_h”, “Beta_v”, “Gamma_h”, “Nu_v”)
-cat(“Parameters:”) -for(i in seq_along(params_opt)) {
-cat(sprintf(“%-8s: %.6f”, param_names[i], params_opt[i])) }
Prepares nice names for printing (param_names). Loops through the
estimated values (params_opt) and prints them one by one neatly.
In summary: This code takes your real malaria data and tries to find
the best parameters (β_h, β_v, γ_h, ν_v) so that your model matches the
data.
In short: set up the problem, fit the model, handle errors, and print the results.
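To act on the generalization check described at the start of this section, a minimal sketch (assuming model_fit(), params_opt, data, train_data and test_data from above) is to simulate the full time span with the training-set estimates and compare the error on the training rows against the held-out test rows:
# Simulate with the training-set estimates over all days, then split the error
full_times  <- data$day
pred_I_full <- model_fit(full_times,
                         beta_h  = params_opt[["beta_h"]],
                         beta_v  = params_opt[["beta_v"]],
                         gamma_h = params_opt[["gamma_h"]],
                         nu_v    = params_opt[["nu_v"]])
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2, na.rm = TRUE))
cat("Train RMSE:", rmse(train_data$infected, pred_I_full[1:nrow(train_data)]), "\n")
cat("Test RMSE :", rmse(test_data$infected,  pred_I_full[(nrow(train_data) + 1):nrow(data)]), "\n")
A test RMSE much larger than the training RMSE would indicate that the estimated parameters do not generalize beyond the training period.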
Return estimated parameters as a named list
final_params <- list(
beta_h = params_opt["beta_h"],
beta_v = params_opt["beta_v"],
gamma_h = params_opt["gamma_h"],
nu_v = params_opt["nu_v"],
# Fixed parameters
Lambda_h = 10,
Lambda_v = 100,
sigma_h = 0.1,
delta_h = 0.01,
mu_h = 0.0001,
mu_v = 0.1
)
The code creates a list of all parameters needed to run the malaria model. This list combines the estimated parameters you just fit (beta_h, beta_v, gamma_h, nu_v) with the fixed biological parameters you already knew (Lambda_h, Lambda_v, etc.).
Steps:
1. final_params <- list(...): create a new list called final_params.
2. beta_h = params_opt["beta_h"]: pull the estimated beta_h value from the fitted parameters (params_opt).
3. beta_v = params_opt["beta_v"]: pull the estimated beta_v value.
4. gamma_h = params_opt["gamma_h"]: pull the estimated human recovery rate.
5. nu_v = params_opt["nu_v"]: pull the estimated mosquito removal rate (related to the extrinsic incubation period).
6. Lambda_h = 10: human recruitment (birth) rate, fixed at 10.
7. Lambda_v = 100: mosquito recruitment rate, fixed at 100.
8. sigma_h = 0.1: human progression rate (E → I), fixed at 0.1.
9. delta_h = 0.01: rate at which recovered humans lose immunity and return to the susceptible class, fixed at 0.01.
10. mu_h = 0.0001: natural human death rate, fixed at 0.0001.
11. mu_v = 0.1: natural mosquito death rate, fixed at 0.1.
Why build final_params? Because when you run your ODE model (using
ode() again), it needs all the parameters — both estimated and
fixed.
We then have a clean list final_params we can easily plug into our
model.
Print all parameters being used
cat("\nComplete Parameter Set for Model:\n")
Complete Parameter Set for Model:
print(as.data.frame(final_params))
beta_h beta_v gamma_h nu_v Lambda_h Lambda_v sigma_h delta_h mu_h mu_v
beta_h 0.002 0.001 0.05 0.1 10 100 0.1 0.01 1e-04 0.1
# Return the final parameter list
final_params
$beta_h
beta_h
0.002
$beta_v
beta_v
0.001
$gamma_h
gamma_h
0.05
$nu_v
nu_v
0.1
$Lambda_h
[1] 10
$Lambda_v
[1] 100
$sigma_h
[1] 0.1
$delta_h
[1] 0.01
$mu_h
[1] 1e-04
$mu_v
[1] 0.1
- The above is a full list of the fixed parameters and the estimated
parameters
-cat(“Parameter Set for Model:”) -print(as.data.frame(final_params))
-final_params
Steps:
1. cat("\nComplete Parameter Set for Model:\n"): prints a simple title message to the console to organize the output (\n means new line).
2. print(as.data.frame(final_params)): converts the final_params list into a data frame and prints it nicely.
3. final_params: typing the name final_params at the end makes R automatically display its contents in the console when the script runs.
- cat(): is just for a cleaner, readable console message.
- as.data.frame(): formats the list as a table (columns/rows) instead
of messy list output.
- Returning final_params: means the last thing your script outputs is
the actual list of parameter values ready to use in the model or another
function.
3.2. Estimating the parameters with no parameter fixed, by splitting the data sets
The code below estimates all the parameters from the training set. No
parameter is fixed.
library(deSolve)
library(ggplot2)
library(dplyr)
library(tidyr)
Load and preprocess data
data <- read.csv("malaria6.csv") %>%
rename_all(tolower) %>%
rename_all(~ gsub(" ", "", .)) %>%
drop_na() %>%
mutate(day = seq_len(n()) - 1) # Ensure day starts at 0
- data <- read.csv("malaria6.csv") %>% rename_all(tolower) %>% rename_all(~ gsub(" ", "", .)) %>% drop_na() %>% mutate(day = seq_len(n()) - 1) # Ensure day starts at 0
- read.csv(“malaria6.csv”): Read the CSV file into R. Loads your
malaria data into memory.
- %>% rename_all(tolower): Make all column names lowercase. Avoids
issues like Infected vs. infected.
- %>% rename_all(~ gsub(" ", "", .)): Remove any spaces from column names. Prevents errors if columns were named like "Infected Cases".
- %>% drop_na(): Drop any rows with missing values (NA). So your
model won’t break from missing data.
- %>% mutate(day = seq_len(n()) - 1): Create a new day column: 0,
1, 2, …, (n-1). Sets up a time variable for model fitting, starting at
day 0. The data is cleaned — lowercase, no spaces, no missing values,
the data gets a new day column for time-series modeling, All this
happens in one smooth pipeline using the dplyr package (because of
%>% pipes).
Check for required column
if (!"i" %in% colnames(data)) {
stop("The column 'i' is missing from the CSV file. Please check the column names.")
}
-if (!“i” %in% colnames(data)) { - stop(“The column ‘i’ is missing
from the CSV file. Please check the column names.”) }
Steps:
1. colnames(data): gets the list of all column names in the data, so you can check what columns are available.
2. "i" %in% colnames(data): logical test, TRUE if "i" is found among the column names, FALSE if missing.
3. !: negates the result, so the if triggers only if "i" is missing.
4. stop("..."): if "i" is missing, stops the program immediately and prints the error message, preventing the model from running with wrong or missing data.
Why is this important? The data must have a column called “i” for
infected humans for the model to work. If it’s missing (maybe a typo in
the CSV), you want to catch the problem early rather than get strange
errors later in the simulation. Stop() completely halts the script, so
you don’t waste time running broken code.
In short: Check if your important column “i” exists. If not,
immediately stop with a clear error message.
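On recent R versions (4.0+), the same guard can be written more compactly with a named stopifnot() condition (a minor stylistic alternative, not what the script above uses):
# Equivalent one-line check; the name becomes the error message if it fails
stopifnot("The column 'i' is missing from the CSV file." = "i" %in% colnames(data))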
Split data into training and testing
split_index <- floor(nrow(data) * 0.8)
train_data <- data[1:split_index, ]
test_data <- data[(split_index + 1):nrow(data), ]
-split_index <- floor(nrow(data) * 0.8) -train_data <-
data[1:split_index, ] -test_data <- data[(split_index +
1):nrow(data), ]
Steps:
1. nrow(data): gets the number of rows in the dataset (how many time points you have).
2. nrow(data) * 0.8: multiplies by 0.8 to take 80% of the data for training.
3. floor(...): rounds down to the nearest whole number, since you cannot have partial rows.
4. split_index <- ...: stores the index where the data is split, marking where training ends and testing begins.
5. train_data <- data[1:split_index, ]: selects the first 80% of rows, used for training the model (fitting parameters).
6. test_data <- data[(split_index + 1):nrow(data), ]: selects the remaining 20% of rows, used for testing how well the model predicts new data.
Example: Suppose you have nrow(data) = 100 rows. split_index =
floor(100 × 0.8) = 80. Then train_data = rows 1 to 80, test_data = rows
81 to 100
Why do this? We need training data to estimate the model’s parameters and testing data to check whether the model generalizes well (doesn’t just “memorize” the training data). Note that the split is chronological (the first 80% of days), which is what you want for time-series data; a random split would leak future information into the training set. This is standard practice in machine learning and time-series modeling.
SEIR with vector dynamics
seir_model <- function(t, state, parameters) {
with(as.list(c(state, parameters)), {
dS <- lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
dI <- sigma_h * E - (gamma_h + mu_h) * I
dR <- gamma_h * I - (delta_h + mu_h) * R
dSv <- lambda_v - beta_v * Sv * I - mu_v * Sv
dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
list(c(dS, dE, dI, dR, dSv, dIv))
})
}
Objective function for optimization
objective_function <- function(params, time, observed_I) {
initial_conditions <- c(S = 990, E = 5, I = 5, R = 0, Sv = 500, Iv = 10)
parameters <- list(
lambda_h = params[1], lambda_v = params[2],
beta_h = params[3], beta_v = params[4],
sigma_h = params[5], gamma_h = params[6],
delta_h = params[7], mu_h = params[8],
mu_v = params[9], nu_v = params[10]
)
times <- seq(min(time), max(time), by = 1)
sol <- tryCatch({
ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)
}, error = function(e) {
return(matrix(NA, nrow = length(times), ncol = 7))
})
pred_I <- sol[, "I"]
pred_log_I <- log(pred_I + 1)
return(pred_log_I)
}
Steps:
1. objective_function <- function(params, time, observed_I): defines a function that takes params (the values being estimated, such as the transmission and recovery rates), time, and observed_I.
2. initial_conditions <- c(...): sets the starting populations for humans and mosquitoes; the ODE solver needs initial values.
3. parameters <- list(...): maps the incoming params vector into named model parameters, since the ODE solver needs named parameters (beta_h, mu_h, etc.).
4. times <- seq(min(time), max(time), by = 1): creates a daily time sequence from the minimum to the maximum day, so the model is simulated in 1-day steps.
5. sol <- tryCatch({...}): solves the system of ODEs with seir_model, giving predicted values over time.
6. Inside tryCatch: if solving fails (for bad parameter values), a matrix full of NAs is returned, so the fitting does not crash.
7. pred_I <- sol[, "I"]: extracts only the predicted number of infected humans, which is what we match against the data.
8. pred_log_I <- log(pred_I + 1): takes the log of the predicted infections (+1 to avoid log(0)), which stabilizes the variance and avoids huge errors.
9. return(pred_log_I): returns the log-transformed predictions, later compared against the actual data.
Note that, despite its name, objective_function returns the log-transformed predictions; the squared-error objective itself is assembled inside optim() in the next step (observed_I is passed in but not used inside this function).
Why log-transform the predictions? - smooths large spikes in
infection numbers. - reduces the effect of very large numbers
overpowering the fitting. - makes loss functions behave nicely.
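A tiny numeric illustration of the effect (a sketch, not part of the fitting code): on the raw scale a gap of 900 cases dwarfs everything else, while on the log(x + 1) scale the same gap is only about 2.3.
raw_gap <- 1000 - 100                      # 900 on the raw scale
log_gap <- log(1000 + 1) - log(100 + 1)    # ~2.29 on the log(x + 1) scale
c(raw = raw_gap, log = round(log_gap, 2))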
Parameter estimation function
estimate_parameters <- function(train_data) {
time <- train_data$day
observed_I <- train_data$i
initial_guess <- rep(0.1, 10)
lower_bounds <- rep(0, 10)
upper_bounds <- rep(1, 10)
result <- optim(
par = initial_guess,
fn = function(params) {
pred_log_I <- objective_function(params, time, observed_I)
sum((pred_log_I - log(observed_I + 1))^2, na.rm = TRUE)
},
method = "L-BFGS-B",
lower = lower_bounds,
upper = upper_bounds,
control = list(maxit = 1000)
)
return(result$par)
}
Step 1: Extract the training inputs
- estimate_parameters <- function(train_data) { time <- train_data$day; observed_I <- train_data$i
This function pulls out what is needed from the training set: time = the days (0, 1, 2, ...) and observed_I = the number of infected individuals (the "i" column).
Step 2: Set starting points and boundaries for the parameters
- initial_guess <- rep(0.1, 10); lower_bounds <- rep(0, 10); upper_bounds <- rep(1, 10)
- initial_guess = rep(0.1, 10) → start by guessing that all 10 parameters are 0.1.
- lower_bounds = rep(0, 10) → minimum values for the 10 parameters (all ≥ 0).
- upper_bounds = rep(1, 10) → maximum values for the 10 parameters (all ≤ 1).
- result <- optim( par = initial_guess, fn = function(params) {
pred_log_I <- objective_function(params, time, observed_I)
sum((pred_log_I - log(observed_I + 1))^2, na.rm = TRUE) }, method =
“L-BFGS-B”, lower = lower_bounds, upper = upper_bounds, control =
list(maxit = 1000) )
Step 3: Optimize the parameters using optim()
par = initial_guess: Start optimization from initial guesses. fn =
function(params): Define what we are minimizing: the sum of squared
differences between predicted and observed (both log-transformed).
method = “L-BFGS-B”: A good optimization algorithm that allows lower and
upper bounds. lower, upper Limits for parameters. control = list(maxit =
1000): Allow up to 1000 iterations for better convergence.
This code finds the set of parameters that makes your model match the
real infection data as closely as possible.
Step 4: Return optimized parameter values result$par contains the
best set of parameter estimates after fitting.
params_opt <- estimate_parameters(train_data): Run the
estimate_parameters function to fit the model on your training data and
Store the best parameters into params_opt.
param_names <- c(“Lambda_h”, “Lambda_v”, “Beta_h”, “Beta_v”,
“Sigma_h”, “Gamma_h”, “Delta_h”, “Mu_h”, “Mu_v”, “Nu_v”) List the names
of the parameters you’re estimating.
cat(“Estimated Parameters:”) for (i in seq_along(params_opt)) {
cat(sprintf(“%-10s: %.4f”, param_names[i], params_opt[i])) }: Nicely
print the results For each parameter, print its name and its estimated
value (formatted to 4 decimals).
In short: Define a function to find the best parameters.
Use optim() to minimize prediction error.
Log-transform infections to smooth variability.
Print out your best estimates nicely.
Quick Summary Table:

Component                     Purpose
optim()                       Fits the model by minimizing prediction errors.
initial_guess                 Where the optimizer starts looking.
lower_bounds, upper_bounds    Constrain parameters to be realistic (e.g., [0, 1]).
log(I + 1)                    Stabilizes the error by reducing the influence of big values.
cat(sprintf(...))             Prints the estimated values cleanly for inspection.
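If you also want to know whether the optimizer converged, and not just the parameter values, a small variation of the function above (a sketch, reusing objective_function and train_data) returns the full optim() result so that its convergence code and final loss can be inspected:
# Same fitting call as in estimate_parameters(), but returning the whole optim object
estimate_parameters_verbose <- function(train_data) {
  time <- train_data$day
  observed_I <- train_data$i
  optim(
    par = rep(0.1, 10),
    fn = function(params) {
      pred_log_I <- objective_function(params, time, observed_I)
      sum((pred_log_I - log(observed_I + 1))^2, na.rm = TRUE)
    },
    method = "L-BFGS-B",
    lower = rep(0, 10), upper = rep(1, 10),
    control = list(maxit = 1000)
  )
}
# fit <- estimate_parameters_verbose(train_data)
# fit$convergence  # 0 means the optimizer reports successful convergence
# fit$value        # final loss (sum of squared log-errors)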
Solve with estimated parameters
initial_conditions <- c(S = 990, E = 5, I = 5, R = 0, Sv = 500, Iv = 10)
parameters <- list(
lambda_h = params_opt[1], lambda_v = params_opt[2],
beta_h = params_opt[3], beta_v = params_opt[4],
sigma_h = params_opt[5], gamma_h = params_opt[6],
delta_h = params_opt[7], mu_h = params_opt[8],
mu_v = params_opt[9], nu_v = params_opt[10]
)
times <- seq(0, max(data$day), by = 1)
solution <- ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)
DLSODA- At T (=R1), too much accuracy requested
for precision of machine.. See TOLSF (=R2)
In above message, R1 = 1, R2 = nan
- times <- seq(0, max(data$day), by = 1); solution <- ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)
Note: the DLSODA message above is a warning from the lsoda solver that it could not achieve the requested accuracy, which typically happens when the solution grows very fast or becomes numerically unstable under the estimated parameter values.
Step 1: Define Initial Conditions initial_conditions <- c(S = 990,
E = 5, I = 5, R = 0, Sv = 500, Iv = 10) What happens:
You define the starting values for each compartment in your SEIR
model:
S = 990 → Susceptible humans (how many are initially at risk of
infection).
E = 5 → Exposed humans (those who are infected but not yet
infectious).
I = 5 → Infected humans (currently infectious individuals).
R = 0 → Recovered humans (those who have recovered and are now
immune).
Sv = 500 → Susceptible mosquitoes.
Iv = 10 → Infected mosquitoes.
Why it’s needed: These initial values serve as the starting point for
your differential equation (ODE) solver. They tell the model how to
begin, and the solver will evolve these values over time based on the
disease dynamics.
Estimate parameters
params_opt <- estimate_parameters(train_data)
param_names <- c("Lambda_h", "Lambda_v", "Beta_h", "Beta_v", "Sigma_h",
"Gamma_h", "Delta_h", "Mu_h", "Mu_v", "Nu_v")
cat("Estimated Parameters:\n")
Estimated Parameters:
for (i in seq_along(params_opt)) {
cat(sprintf("%-10s: %.4f\n", param_names[i], params_opt[i]))
}
Lambda_h : 1.0000
Lambda_v : 0.9258
Beta_h : 0.9228
Beta_v : 0.9210
Sigma_h : 0.0220
Gamma_h : 0.0000
Delta_h : 1.0000
Mu_h : 0.0100
Mu_v : 0.0000
Nu_v : 0.0000
Step 2: Define Parameters (from params_opt) parameters <- list(
lambda_h = params_opt[1], lambda_v = params_opt[2], beta_h =
params_opt[3], beta_v = params_opt[4], sigma_h = params_opt[5], gamma_h
= params_opt[6], delta_h = params_opt[7], mu_h = params_opt[8], mu_v =
params_opt[9], nu_v = params_opt[10] ) This line is creating a list of
parameters for the ODE model, which are essential for the disease
transmission process. These values are optimized previously and are now
being passed into the model. These parameters are the variables that
control the dynamics of the disease spread and are key to solving the
ODE system correctly. These values are needed by the ode() function to
simulate how the infection evolves over time.
Step 3: Create a Time Vector - times <- seq(0, max(data$day), by = 1)
times creates a sequence of time steps starting from day 0 up to the maximum day in your data (max(data$day)), with a step of 1 (i.e., it runs day by day). The ODE solver needs a time vector to simulate the model’s evolution. By providing the exact days over which the model should run, the solver can output predictions for each of those days.
Step 4: Solve the ODE System - solution <- ode(y =
initial_conditions, times = times, func = seir_model, parms =
parameters) What happens: This line uses the ode() function to
numerically solve the system of ordinary differential equations (ODEs)
based on the SEIR model.
y = initial_conditions: The initial conditions for the populations
(S, E, I, R, Sv, Iv).
times = times: The vector of times over which the solution will be
calculated.
func = seir_model: Specifies the function (the SEIR model) that
contains the differential equations describing the disease spread.
parms = parameters: Passes in the parameters that control the disease
dynamics (the transmission rates, recovery rates, etc.).
Why it’s needed: This step runs the simulation of the disease spread
using the SEIR model and the parameters you’ve specified. The ode()
function computes the values for each compartment (S, E, I, R, Sv, Iv)
at each time step, generating a solution that reflects how the disease
evolves over time.
In summary: This code sets up the initial conditions for a SEIR
(Susceptible, Exposed, Infected, Recovered) model with both human and
mosquito populations, defines the necessary parameters that control the
transmission dynamics, and then solves the system of ODEs over a given
time period to simulate the disease spread.
The result, stored in solution, contains the predicted values of S,
E, I, R, Sv, Iv over time — which can then be compared with actual data
to check how well the model fits.
Prepare for plotting
#solution_df <- as.data.frame(solution) %>%
# rename(Model_I = I, Model_R = R) %>%
# select(time, Model_I, Model_R)
#plot_data <- data %>%
#select(day, i) %>%
#rename(Actual_I = i) %>%
#left_join(solution_df, by = c("day" = "time"))
# Plot
#ggplot(plot_data, aes(x = day)) +
# geom_line(aes(y = Model_I, color = "Model Infected"), size = 1) +
#geom_line(aes(y = Model_R, color = "Model Recovered"), size = 1) +
#geom_point(aes(y = Actual_I, color = "Actual Infected"), alpha = 0.6, size = 2) +
#geom_vline(xintercept = split_index, linetype = "dashed", color = "gray") +
#labs(title = "Comparison of Actual vs. Model Cases",
# x = "Days",
# y = "Population Count",
# color = "Legend") +
#scale_color_manual(values = c("Model Infected" = "blue",
# "Model Recovered" = "purple",
# "Actual Infected" = "red")) +
#theme_minimal() +
#theme(legend.position = "bottom")
Step 1: Prepare the Model Data for Plotting - solution_df <-
as.data.frame(solution) %>% rename(Model_I = I, Model_R = R) %>%
select(time, Model_I, Model_R) What happens: solution contains the
output of the ODE solver, which is likely a matrix with columns for the
time steps and the values of different compartments (S, E, I, R, Sv,
Iv). This line is converting that matrix into a data frame using
as.data.frame().
rename(Model_I = I, Model_R = R): renames the columns I and R to
Model_I and Model_R, respectively, to better reflect that these are the
model’s predictions for the number of infected (I) and recovered (R)
individuals.
select(time, Model_I, Model_R): selects only the time, Model_I, and
Model_R columns from the solution_df. The time column corresponds to the
days (or time steps) for which the model was solved.
Why it’s needed: This step organizes the model’s output into a tidy
format (a data frame) with only the columns needed for the plot — the
time, model’s predicted infected, and recovered populations.
Step 2: Prepare the Actual Data for Plotting - plot_data <- data
%>% select(day, i) %>% rename(Actual_I = i) %>%
left_join(solution_df, by = c(“day” = “time”)) What happens: data is the
original dataset with actual infection data (e.g., from your CSV
file).
-select(day, i): selects the columns day (the day number) and i (the
number of infected individuals).
-rename(Actual_I = i): renames the column i to Actual_I, reflecting
that this is the observed data for infected individuals.
-left_join(solution_df, by = c(“day” = “time”)): merges the model’s
predictions (solution_df) with the actual data (data). The merge is done
based on the matching day column from the actual data and the time
column from the model output.
Why it’s needed:This step combines the actual infection data with the
model’s predicted data (infected and recovered populations), so both can
be plotted together for comparison.
Step 3: Create the Plot - ggplot(plot_data, aes(x = day)) +
geom_line(aes(y = Model_I, color = “Model Infected”), size = 1) +
#geom_line(aes(y = Model_R, color = “Model Recovered”), size = 1) +
geom_point(aes(y = Actual_I, color = “Actual Infected”), alpha = 0.6,
size = 2) + geom_vline(xintercept = split_index, linetype = “dashed”,
color = “gray”) + labs(title = “Comparison of Actual vs. Model Cases”, x
= “Days”, y = “Population Count”, color = “Legend”) +
scale_color_manual(values = c(“Model Infected” = “blue”, #“Model
Recovered” = “purple”, “Actual Infected” = “red”)) + theme_minimal() +
theme(legend.position = “bottom”) Explanation of the Plot:
ggplot(plot_data, aes(x = day)):
This starts the plot with plot_data and specifies that the x-axis
will represent the day (time).
geom_line(aes(y = Model_I, color = “Model Infected”), size = 1):
Adds a line plot for the model’s infected population (Model_I).
The color of the line is labeled as “Model Infected”, and the line
thickness is set to 1 (size = 1).
geom_line(aes(y = Model_R, color = “Model Recovered”), size = 1):
This line is commented out, but it would add a line for the model’s
recovered population (Model_R). If you wanted to display it, you could
simply uncomment this line.
geom_point(aes(y = Actual_I, color = “Actual Infected”), alpha = 0.6,
size = 2):Adds points for the actual observed infected population
(Actual_I).
The points are given a red color (“Actual Infected”) and are made
semi-transparent with alpha = 0.6.
The size of the points is set to 2.
geom_vline(xintercept = split_index, linetype = “dashed”, color =
“gray”): Adds a vertical dashed line at split_index. This line is used
to mark a data split point (for example, training/testing split) on the
x-axis.
The line is dashed (linetype = “dashed”) and colored gray.
labs(…):
Adds titles and labels to the plot:
Title: “Comparison of Actual vs. Model Cases”.
x-axis label: “Days”.
y-axis label: “Population Count”.
Legend title: “Legend” (for distinguishing actual vs. model
data).
scale_color_manual(values = c(…)):
Customizes the colors used in the plot:
“Model Infected” is blue.
“Actual Infected” is red.
The “Model Recovered” color is commented out, but it would be purple
if included.
theme_minimal():
Applies a minimal theme for a clean look with no grid lines.
theme(legend.position = “bottom”):
Positions the legend at the bottom of the plot for better
readability.
Summary: solution_df holds the model’s predictions, including the
infected and recovered individuals over time.
plot_data combines the model’s predictions with actual infection
data.
ggplot() creates a plot that:
Displays the actual infected cases as red points.
Displays the model’s predicted infected cases as a blue line.
Optionally shows the model’s predicted recovered cases (if
uncommented).
Marks a dashed line to highlight the training/testing split.
The plot helps visually compare the actual infection data with the
model’s predictions, which is essential for assessing the model’s
accuracy.
Parameter estimation using training and testing sets, with comparison of the infected graphs
The full code, with no parameter fixed (all ten parameters are estimated).
# Required Libraries
library(deSolve)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
# Load and preprocess data
data <- read.csv("malaria6.csv") %>%
rename_all(tolower) %>%
rename_all(~ gsub(" ", "", .)) %>%
drop_na() %>%
mutate(day = seq_len(n()) - 1)
# Check for required column
if (!"i" %in% colnames(data)) {
stop("The column 'i' is missing from the CSV file. Please check the column names.")
}
# Split data
split_index <- floor(nrow(data) * 0.8)
train_data <- data[1:split_index, ]
test_data <- data[(split_index + 1):nrow(data), ]
# SEIR vector model
seir_model <- function(t, state, parameters) {
with(as.list(c(state, parameters)), {
dS <- lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
dI <- sigma_h * E - (gamma_h + mu_h) * I
dR <- gamma_h * I - (delta_h + mu_h) * R
dSv <- lambda_v - beta_v * Sv * I - mu_v * Sv
dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
list(c(dS, dE, dI, dR, dSv, dIv))
})
}
# Objective function
objective_function <- function(params, time, observed_I) {
initial_conditions <- c(S = 990, E = 5, I = 5, R = 0, Sv = 500, Iv = 10)
parameters <- list(
lambda_h = params[1], lambda_v = params[2],
beta_h = params[3], beta_v = params[4],
sigma_h = params[5], gamma_h = params[6],
delta_h = params[7], mu_h = params[8],
mu_v = params[9], nu_v = params[10]
)
times <- seq(min(time), max(time), by = 1)
sol <- tryCatch({
ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)
}, error = function(e) {
return(matrix(NA, nrow = length(times), ncol = 7))
})
pred_I <- sol[, "I"]
pred_log_I <- log(pred_I + 1)
return(pred_log_I)
}
# Parameter estimation
estimate_parameters <- function(train_data) {
time <- train_data$day
observed_I <- train_data$i
initial_guess <- rep(0.1, 10)
lower_bounds <- rep(0, 10)
upper_bounds <- rep(1, 10)
result <- optim(
par = initial_guess,
fn = function(params) {
pred_log_I <- objective_function(params, time, observed_I)
sum((pred_log_I - log(observed_I + 1))^2, na.rm = TRUE)
},
method = "L-BFGS-B",
lower = lower_bounds,
upper = upper_bounds,
control = list(maxit = 1000)
)
return(result$par)
}
# Estimate and display parameters
params_opt <- estimate_parameters(train_data)
param_names <- c("Lambda_h", "Lambda_v", "Beta_h", "Beta_v", "Sigma_h",
"Gamma_h", "Delta_h", "Mu_h", "Mu_v", "Nu_v")
cat("Estimated Parameters:\n")
Estimated Parameters:
for (i in seq_along(params_opt)) {
cat(sprintf("%-10s: %.4f\n", param_names[i], params_opt[i]))
}
Lambda_h : 1.0000
Lambda_v : 0.9258
Beta_h : 0.9228
Beta_v : 0.9210
Sigma_h : 0.0220
Gamma_h : 0.0000
Delta_h : 1.0000
Mu_h : 0.0100
Mu_v : 0.0000
Nu_v : 0.0000
# Solve SEIR with estimated parameters
initial_conditions <- c(S = 990, E = 5, I = 5, R = 0, Sv = 500, Iv = 10)
parameters <- list(
lambda_h = params_opt[1], lambda_v = params_opt[2],
beta_h = params_opt[3], beta_v = params_opt[4],
sigma_h = params_opt[5], gamma_h = params_opt[6],
delta_h = params_opt[7], mu_h = params_opt[8],
mu_v = params_opt[9], nu_v = params_opt[10]
)
times <- seq(0, max(data$day), by = 1)
solution <- ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)
solution_df <- as.data.frame(solution) %>%
rename(Model_I = I, Model_R = R) %>%
select(time, Model_I, Model_R)
plot_data <- data %>%
select(day, i) %>%
rename(Actual_I = i) %>%
left_join(solution_df, by = c("day" = "time"))
# Plot
ggplot(plot_data, aes(x = day)) +
geom_line(aes(y = Model_I, color = "Model Infected"), size = 1) +
geom_line(aes(y = Model_R, color = "Model Recovered"), size = 1) +
geom_point(aes(y = Actual_I, color = "Actual Infected"), alpha = 0.6, size = 2) +
geom_vline(xintercept = split_index, linetype = "dashed", color = "gray") +
labs(title = "Comparison of Actual vs. Model Cases",
x = "Days",
y = "Population Count",
color = "Legend") +
scale_color_manual(values = c("Model Infected" = "blue",
"Model Recovered" = "purple",
"Actual Infected" = "red")) +
theme_minimal() +
theme(legend.position = "bottom")

Estimated parameters, Python vs. R:

Parameter   Python    R Studio
Lambda_h    0.1269    1.0000
Lambda_v    0.1297    0.9258
Beta_h      0.0006    0.9228
Beta_v      0.0037    0.9210
Sigma_h     0.1426    0.0220
Gamma_h     0.0408    0.0000
Delta_h     0.0261    1.0000
Mu_h        0.0000    0.0100
Mu_v        0.0000    0.0000
Nu_v        0.0000    0.0000
The Python estimates appear to track the observed infected curve more closely in the graphs.
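Rather than judging only by eye, a hedged sketch (assuming seir_model, initial_conditions, data, times and params_opt from the full code above) can quantify the comparison by simulating the infected curve with each parameter set and computing its RMSE against the observed counts:
# Compare the two parameter sets numerically against the observed infections
python_params <- list(
  lambda_h = 0.1269, lambda_v = 0.1297,
  beta_h = 0.0006,  beta_v = 0.0037,
  sigma_h = 0.1426, gamma_h = 0.0408,
  delta_h = 0.0261, mu_h = 0.0000, mu_v = 0.0000, nu_v = 0.0000
)
r_params <- as.list(setNames(params_opt, c("lambda_h", "lambda_v", "beta_h", "beta_v",
                                           "sigma_h", "gamma_h", "delta_h", "mu_h",
                                           "mu_v", "nu_v")))
rmse_for <- function(p) {
  sol <- ode(y = initial_conditions, times = times, func = seir_model, parms = p)
  sqrt(mean((data$i - sol[, "I"])^2, na.rm = TRUE))
}
cat("RMSE (Python estimates):", rmse_for(python_params), "\n")
cat("RMSE (R estimates)     :", rmse_for(r_params), "\n")
A lower RMSE for the Python parameter set would confirm numerically what the graphs suggest.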