Install and Load the libraries

library(caret) #for training machine learning models
library(pROC) ## For visualizing, smoothing, and comparing ROC curves
library(e1071) ## For statistical modeling and  machine learning tasks
library(class) ## For classification using k-Nearest Neighbors and other methods
library(caTools) ## For splitting data into training and testing sets
library(MASS) ## Provides plotting functions and data sets
library(ISLR) ## for practical applications of statistical learning methods
library(lattice) #for multivariate data visualization.
library(ggplot2) ##for data visualization
library(psych) ##for description of  data
library(tidyverse) ##for data manipulation
library(mlbench) ## For benchmarking ML Models
library(mltools) #for hyper parameter tuning
library(caretEnsemble) ##enables the creation of ensemble models
library(tictoc) #for determining the time taken for a model to run
library(flextable) ## To create and style tables
library(haven) ## For importing and exporting sas, stata, spss data formats.
library(foreign) # Reading a Stata file
library(diffeqr) # Solving differential eqns using deep learning methods
library(lorenz) #  for simulating chaotic systems, especially the Lorenz attractor.
library(deSolve)     # For solving ODEs
library(readr)       # For reading CSV files
library(minpack.lm)  # For nonlinear curve fitting
library(dplyr)       # For data manipulation

1. SIMULATING THE SEIR MODEL IN R

We start by coding the model in R and plotting the model results.This model is embedded in R and uses the (seir_model) function.

Libraries

library(deSolve) #Solving differential equations
library(ggplot2) # For plotting

deSolve: Used to solve the system of differential equations numerically (required for SEIR model).

ggplot2: Used for plotting the results of the simulation in a visually appealing way.

Define the SEIR model function

seir_model <- function(time, state, parameters) {
  S <- state[1]
  E <- state[2]
  I <- state[3]
  R <- state[4]
  N <- S + E + I + R
  
  beta <- parameters["beta"]
  sigma <- parameters["sigma"]
  gamma <- parameters["gamma"]

  
  dSdt <- -beta * S * I / N
  dEdt <- beta  * S * I / N - sigma * E
  dIdt <- sigma * E - gamma * I
  dRdt <- gamma * I
  
  list(c(dSdt, dEdt, dIdt, dRdt))}

This function defines the SEIR model as a system of differential equations (ODEs):

Variables: S: Susceptible individuals (not yet infected).

E: Exposed individuals (infected but not yet infectious).

I: Infected individuals (can spread the disease).

R: Recovered individuals (immune or removed from the population).

The total population N is the sum of S + E + I + R.

Parameters: beta: The transmission rate (how easily the disease spreads).

sigma: The rate at which exposed individuals become infectious (i.e., the incubation period).

gamma: The recovery rate (rate at which infected individuals recover).

Differential Equations: dS/dt: Change in susceptible population due to infection.

dE/dt: Change in exposed population due to exposure (contact with infected individuals).

dI/dt: Change in infected population (from exposed individuals becoming infectious and recovering).

dR/dt: Change in recovered population (from infected individuals recovering).

The model returns a list with the derivatives (rates of change) for each compartment.

Initial conditions

N <- 995  # Total population
E0 <- 10 # Initial exposed
I0 <- 5  # Initial infected
R0 <- 0  # Initial recovered
S0 <- N - E0 - I0 - R0 # Initial susceptible
initial_state <- c(S = S0, E = E0, I = I0, R = R0)

N: Total population (995).

E0, I0, R0: Initial numbers of exposed, infected, and recovered individuals.

S0: Initial number of susceptible individuals (calculated by subtracting the others from N).

initial_state: A vector with the initial values for S, E, I, and R.

Parameters

parameters <- c(beta = 0.8, sigma = 0.0347, gamma = 0.5)

beta = 0.8: Transmission rate.

sigma = 0.0347: Rate at which exposed individuals become infected (1/incubation period).

gamma = 0.5: Recovery rate (probability of recovery per time unit).

Time span for simulation

times <- seq(0, 200, length.out = 500)

times: Creates a sequence of time points (from 0 to 200 days), with 500 points in total. These are the time intervals at which the model will be solved.

Q:check on lower values of t i.e t = 10 and high value i.e t = 300

Solve the system of ODEs.

solution <- ode(y = initial_state, times = times, func = seir_model, parms = parameters)

ode(): The core function from deSolve that numerically solves the system of differential equations (seir_model).

y = initial_state: Initial conditions for S, E, I, and R.

times = times: The time intervals at which the solution is computed.

func = seir_model: The SEIR model function that defines the differential equations.

parms = parameters: The model parameters (beta, sigma, gamma).

This returns a matrix of results for each time step, with values for S, E, I, and R.

Convert solution to a data frame.

solution_df <- as.data.frame(solution)

Converts the output of the ode() function (a matrix) into a data frame for easier manipulation and plotting.

Plot results

ggplot(solution_df, aes(x = time)) +
  geom_line(aes(y = S, color = "Susceptible")) +
  geom_line(aes(y = I, color = "Infected")) +
  geom_line(aes(y = R, color = "Recovered")) +
  #geom_line(aes(y = E, color = "Exposed")) +
  labs(x = "Time (days)", y = "Population", title = "SEIR Model Dynamics") +
  scale_color_manual(values = c("Susceptible" = "blue", "Infected" = "red", "Exposed" = "yellow", "Recovered" = "green")) +
  theme_minimal()

Plot Components: - aes(x = time): Time on the x-axis.

  • geom_line(): Plots a line for each compartment (Susceptible, Infected, and Recovered).

  • labs(): Labels for the plot (x-axis, y-axis, title).

  • scale_color_manual(): Customizes the line colors for each compartment (blue for Susceptible, red for Infected, green for Recovered).

  • theme_minimal(): Applies a minimalist theme to the plot.

Summary

This code models the spread of a disease in a population using the SEIR compartmental model. - It defines the dynamics of susceptible, exposed, infected, and recovered individuals - Solves the system of differential equations - Plots the dynamics over time. The plot helps visualize how the number of susceptible, infected, and recovered individuals change over 200 days.

Key Concepts:

SEIR model: A simple mathematical model of disease spread.

ODE solver: The ode() function numerically solves the system of equations.

ggplot2: Used for visualizing the results.

1.2. Comparing the model results with real or observed data.

Since the real data was not available, we used synthetic data in csv file format.The malaria data set was named malaria5.

We can first look at the descriptive statistics of the data set malaria5 using the code below. ###### Always ensure to check the file before running to check if the datase is OK.

Loading necessary libraries

library(readr)   # For reading CSV files
library(dplyr)   # For data manipulation

Load the CSV file

data <- read_csv("malaria5.csv")

Display the first few rows

cat("First 5 rows of the dataset:\n")
First 5 rows of the dataset:
print(head(data, 5))
# A tibble: 5 × 7
    day     S     E     I     R    Sv    Iv
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     0   980    10     5     0   990    10
2     1   969    28     5     0   983    15
3     2   949    54     7     0   976    20
4     3   920    86    12     0   966    27
5     4   879   127    19     0   950    40

Get basic info about the dataset

cat("\nDataset Summary:\n")

Dataset Summary:
print(str(data))  # Structure of the dataset
spc_tbl_ [100 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ day: num [1:100] 0 1 2 3 4 5 6 7 8 9 ...
 $ S  : num [1:100] 980 969 949 920 879 818 727 601 442 269 ...
 $ E  : num [1:100] 10 28 54 86 127 184 265 373 503 634 ...
 $ I  : num [1:100] 5 5 7 12 19 30 46 70 103 148 ...
 $ R  : num [1:100] 0 0 0 0 0 0 1 3 6 11 ...
 $ Sv : num [1:100] 990 983 976 966 950 925 887 831 752 648 ...
 $ Iv : num [1:100] 10 15 20 27 40 61 93 140 206 291 ...
 - attr(*, "spec")=
  .. cols(
  ..   day = col_double(),
  ..   S = col_double(),
  ..   E = col_double(),
  ..   I = col_double(),
  ..   R = col_double(),
  ..   Sv = col_double(),
  ..   Iv = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
NULL

Check for missing values

cat("\nMissing values per column:\n")

Missing values per column:
print(colSums(is.na(data)))
day   S   E   I   R  Sv  Iv 
  0   0   0   0   0   0   0 

Describe numerical data

cat("\nStatistical Summary:\n")

Statistical Summary:
print(summary(data))
      day              S               E               I        
 Min.   : 0.00   Min.   :  8.0   Min.   : 10.0   Min.   :  5.0  
 1st Qu.:24.75   1st Qu.: 24.0   1st Qu.:164.8   1st Qu.:346.0  
 Median :49.50   Median : 30.5   Median :180.0   Median :358.0  
 Mean   :49.50   Mean   :100.3   Mean   :229.1   Mean   :373.7  
 3rd Qu.:74.25   3rd Qu.: 33.0   3rd Qu.:194.8   3rd Qu.:459.8  
 Max.   :99.00   Max.   :980.0   Max.   :748.0   Max.   :572.0  
       R                Sv               Iv       
 Min.   :   0.0   Min.   : 55.00   Min.   : 10.0  
 1st Qu.: 329.8   1st Qu.: 67.75   1st Qu.:297.0  
 Median : 773.0   Median : 85.00   Median :298.5  
 Mean   : 653.5   Mean   :167.98   Mean   :312.3  
 3rd Qu.: 982.5   3rd Qu.: 87.00   3rd Qu.:316.8  
 Max.   :1131.0   Max.   :990.00   Max.   :617.0  

We can compare the model values vs the observed values graphed together to check whether the model parameters fit well with real data.Here The Comparison is done for the Infected and the Recovered classes.

Load necessary libraries

library(deSolve)    # For solving differential equations
library(ggplot2)    # For plotting results
library(readr)      # For reading/writing CSV files
library(dplyr)      # For data manipulation

SEIR Model Function

seir_model <- function(time, state, parameters) {
  S <- state["S"]
  E <- state["E"]
  I <- state["I"]
  R <- state["R"]
  N <- S + E + I + R
  
  beta <- parameters["beta"]
  sigma <- parameters["sigma"]
  gamma <- parameters["gamma"]
  
  dSdt <- -beta * S * I / N
  dEdt <- beta * S * I / N - sigma * E
  dIdt <- sigma * E - gamma * I
  dRdt <- gamma * I
  
  list(c(dSdt, dEdt, dIdt, dRdt))
}

Initial conditions

N <- 995  # Total population
E0 <- 10 # Initial exposed
I0 <- 5  # Initial infected
R0 <- 0  # Initial recovered
S0 <- N - E0 - I0 - R0 # Initial susceptible
initial_state <- c(S = S0, E = E0, I = I0, R = R0)
  • N = 995: Total population.
  • E0, I0, R0: Initial values for exposed, infected, and recovered individuals.
  • S0: Initial susceptible individuals, calculated as the remainder of the population not in any other compartment

Parameters

parameters <- c(beta = 0.8, sigma = 0.0347, gamma = 0.5)
  • beta = 0.8: Transmission rate (probability of infection per contact between susceptible and infected individuals).
  • sigma = 0.0347: Rate at which exposed individuals become infectious.
  • gamma = 0.5: Recovery rate (probability of recovery per unit time).

Time grid

times <- seq(0, 100, length.out = 500)

times: Generates a sequence of time points (from 0 to 100 days) with 500 equally spaced points. These are the time steps for which the model will be solved.

Solve SEIR model

solution <- ode(y = initial_state, times = times, func = seir_model, parms = parameters)
  • The ode() function from the deSolve package is used to numerically solve the system of differential equations defined in seir_model().
  • y = initial_state: Initial conditions for S, E, I, and R.
  • times = times: The time points at which the solution is computed.
  • func = seir_model: The function defining the SEIR model.
  • parms = parameters: The parameters (beta, sigma, gamma).
  • The solution is a matrix containing the values of S, E, I, and R at each time point.

Convert to a dataframe

solution_df <- as.data.frame(solution)
  • Converts the output from the ode() function (a matrix) to a data frame for easier manipulation and visualization.

Synthetic data deneration

#malaria_data <- solution_df %>%
# mutate(
 #  Infected = I + rnorm(n(), mean = 0, sd = 10),
  # Recovered = R + rnorm(n(), mean = 0, sd = 10),
   #Susceptible = S + rnorm(n(), mean = 0, sd = 10)
 # )
  • This part generates synthetic malaria data which was called malaria5 by adding random noise to the model output (I, R, S).
  • rnorm(): Generates random numbers from a normal distribution with a specified mean and standard deviation (sd = 10).
  • This noise is used to simulate real-world data where values are typically not perfect and are subject to random fluctuations.
  • We don`t need this chunk when using real data

Save synthetic malaria data

#write_csv(malaria_data, "malaria5.csv")

Saves the synthetic malaria data to a CSV file called malaria5.csv.

Load malaria data for comparison

malaria_observed <- read_csv("malaria5.csv")

Reads the synthetic malaria data back from the CSV file for comparison with the model output.

Add time column if missing for plotting

# Add a 'time' column if missing
if (!"time" %in% colnames(malaria_observed)) {
  malaria_observed$time <- seq_len(nrow(malaria_observed)) - 1
}

##Combine model and observed data for plotting

library(dplyr)
library(tidyverse)

# Combine model and observed data
model_long <- solution_df %>%
  select(time, I, R) %>%
  pivot_longer(cols = c(I, R), names_to = "Compartment", values_to = "Count") %>%
  mutate(Source = "Model")
model_long
# A tibble: 1,000 × 4
    time Compartment Count Source
   <dbl> <chr>       <dbl> <chr> 
 1 0     I           5     Model 
 2 0     R           0     Model 
 3 0.200 I           4.59  Model 
 4 0.200 R           0.480 Model 
 5 0.401 I           4.23  Model 
 6 0.401 R           0.922 Model 
 7 0.601 I           3.90  Model 
 8 0.601 R           1.33  Model 
 9 0.802 I           3.61  Model 
10 0.802 R           1.70  Model 
# ℹ 990 more rows
observed_long <- malaria_observed %>%
  select(time, I, R) %>%
  pivot_longer(cols = c(I, R), names_to = "Compartment", values_to = "Count") %>%
  mutate(Source = "Observed")

plot_data <- bind_rows(model_long, observed_long)

Create a label for color

plot_data <- plot_data %>%
  mutate(Group = paste(Source, Compartment))

Now plot using ggplot

ggplot(plot_data, aes(x = time, y = Count, color = Group, linetype = Source)) +
  geom_line() +
  geom_point(data = filter(plot_data, Source == "Observed"), size = 1.5) +  # Only points for observed
  labs(x = "Days", y = "Population", title = "SEIR Model vs. Observed Malaria Data", color = "Legend") +
  scale_color_manual(values = c(
    "Model I" = "red",
    "Observed I" = "darkred",
    "Model R" = "blue",
    "Observed R" = "darkblue"
  )) +
  theme_minimal()

  • ggplot(): Starts a new plot.
  • geom_line(): Adds dashed lines representing the model’s predicted infected and recovered populations (I and R).
  • geom_point(): Adds points representing the observed (synthetic) infected and recovered populations.
  • labs(): Adds labels and a title to the plot.
  • scale_color_manual(): Customizes the colors for the model and observed data.
  • theme_minimal(): Applies a minimalist theme to the plot.
  • Plot Interpretation: Dashed lines represent the model’s predictions for the infected and recovered populations.
  • Points represent the synthetic (observed) data for infected and recovered populations.

This comparison allows us to visually check how well the SEIR model matches the observed malaria data. From the graphs the parameters used to simulate the models gives very erroneous results and therefore unreliable.

The parameters beta = 0.8, sigma = 0.0347, gamma = 0.5 are inaccurate parameters

Summary

This code simulates a SEIR model for disease spread, generates synthetic malaria data with noise, saves it, and compares it visually with the model predictions. The comparison is shown in a plot that overlays the model’s predictions (dashed lines) and the synthetic observed data (points) for the infected and recovered populations.

1.3 PARAMETER ESTIMATION (Curve Fitting using Non Linear least square Method)

In some cases (as in the case above) the model does not compare well with the real data from the field which prompts the need to estimate the parameters from the real data collected. Note that some parameters are fixed due to biological undrerstanding i.e the rate of progression from exposed humans to being infected is given by 1 / incubation period. such a parameter must remain fixed.

Here We use the Nonlinear Least Squares method. The function performs nonlinear least squares fitting using the Levenberg–Marquardt (LM) algorithm which estimates model parameters by minimizing the sum of squared residuals between the model-predicted and the observed data.

We start by estimating all the parameters with none fixed.

Load required libraries

library(deSolve) #For solving differential equations (SEIR model)
library(readr)   # For reading CSV files
library(minpack.lm)  # For nonlinear least squares optimization
library(dplyr)      # For data manipulation

SEIR Model Function

seir_model <- function(t, y, params) {
  with(as.list(c(y, params)), {
    N <- S + E + I + R
    dSdt <- -beta * S * I / N
    dEdt <- beta * S * I / N - sigma * E
    dIdt <- sigma * E - gamma * I
    dRdt <- gamma * I
    return(list(c(dSdt, dEdt, dIdt, dRdt)))
  })
}

Parameter Estimation Function

##Load and prepare data (suppress column type message)

estimate_parameters <- function(data_path) {
  data <- read_csv(data_path, show_col_types = FALSE)
}
  • This function performs the parameter estimation by comparing the model’s predictions to observed data.
  • Reads the observed malaria data from a CSV file (data_path).
  • show_col_types = FALSE: suppresses warnings about column types.

Optionally remove ‘Date’ column if it exists

  if ("Date" %in% colnames(data)) {
    data <- select(data, -Date)
  }
  • Removes the Date column if it exists, as it’s not necessary for the model

If ‘day’ column doesn’t exist, create it

  if (!"day" %in% colnames(data)) {
    data$day <- seq_len(nrow(data)) - 1  # 0-based index
  }
  • Adds a day column if it doesn’t exist, which represents the time steps for the model.

Initial conditions

  N <- 995
  E0 <- 10
  I0 <- 5
  R0 <- 0
  S0 <- N - E0 - I0 - R0

  y0 <- c(S = S0, E = E0, I = I0, R = R0)
  t <- data$day
  observed_I <- data$I
  • Defines the initial conditions: total population (N), initial values for E0 (exposed), I0 (infected), R0 (recovered), and calculates the initial number of susceptible individuals (S0)
  • y0: Initial state vector for S, E, I, and R.
  • t: Time vector (days) from the observed data.
  • observed_I: Observed infected data from the CSV file.

Model output function

  model_output <- function(params, t) {
    out <- ode(y = y0, times = t, func = seir_model,
               parms = list(beta = params[1], sigma = params[2], gamma = params[3]),
               method = "lsoda", atol = 1e-6, rtol = 1e-6)
    return(out[, "I"])
  }
  • model_output(): Solves the SEIR model and returns the predicted infected population (I) at each time point.
  • ode() from the deSolve package is used to solve the SEIR system.
  • The parameters beta, sigma, and gamma are passed as a vector params.
  • method = “lsoda” stands for Livermore Solver for Ordinary Differential equations with Automatic switching.It automatically switches between non-stiff methods (Adams-Moulton) when the problem is easy and stiff methods (Backward Differentiation Formula) when needed.This makes “lsoda” a robust and efficient choice, especially when unsure about the stiffness of the system. Stiff equations are those where some components change much faster than others—common in biological models like SEIR.
  • atol = 1e-6 (Absolute Tolerance). atol controls the absolute error tolerance. It defines how precisely small values (e.g., near-zero states like early infection counts) are calculated.
  • 1e-6 means: the solution to be accurate within ±0.000001 for small values.
  • rtol = 1e-6 (Relative Tolerance), rtol controls the relative error tolerance. It defines the acceptable error relative to the size of the value.

Residuals function

  residuals <- function(params) {
    model_I <- model_output(params, t)
    return(observed_I - model_I)
  }
  • residuals(): Calculates the difference (residuals) between the observed infected population (observed_I) and the model’s predicted infected population (model_I) for a given set of parameters.

Initial guess

  initial_guess <- c(0.4, 0.02, 0.05)
  • initial_guess: Provides initial estimates for the parameters (beta, sigma, and gamma).

Run optimization

estimate_parameters <- function(data_path) {
  {fit <- nls.lm(par = initial_guess, fn = residuals, 
                control = nls.lm.control(maxiter = 1024))
return(fit$par)
}}
  • nls.lm(): Performs nonlinear least squares optimization using the Levenberg-Marquardt algorithm (from minpack.lm).
  • fn = residuals: The function to minimize (i.e., residuals between observed and model predictions).
  • maxiter = 1024: Maximum number of iterations.
  • After fitting, fit$par contains the estimated parameters

Main execution

tryCatch({
  params <- estimate_parameters("malaria5.csv")

  cat("\n=== Estimated Parameters ===\n")
  cat(sprintf("Gamma: %.4f\n", params[3]))
  cat(sprintf("Sigma: %.4f\n", params[2]))
  cat(sprintf("Beta : %.4f\n", params[1]), "\n")
}, error = function(e) {
  cat("Error in parameter estimation:", e$message, "\n")
})

=== Estimated Parameters ===
Gamma: 0.0205
Sigma: 0.0489
Beta : 15.3513
 
  • The tryCatch block is used to handle any potential errors that may occur during the estimation process (e.g., file reading errors or optimization issues).
  • params <- estimate_parameters(“malaria1.csv”), Attempts to estimate model parameters using the data from “malaria1.csv”.
  • cat() prints text to the console.
  • sprintf() formats strings like “Gamma: 0.0312” with 4 decimal places (%.4f).
  • error = function(e) {…}: This block runs only if there’s an error. e$message extracts the error message and printer the error.

Comparing with the python values === Estimated Parameters === python R Gamma: 0.0212 0.0205 Sigma: 0.0604 0.0489 Beta : 5.7807 15.3513

Comparing original parameters and estimated parameters in R studio

NOTE: Original parameters beta = 0.8, sigma = 0.0347, gamma = 0.5 Estimated parameters beta = 15.3513, sigma = 0.0489, gamma = 0.0205

Parameter estimation where some parameters are fixed due to biological knowledge

Some parameters are fixed due to biological understanding i.e the rate of progression from exposed humans to being infected is given by (1 / incubation period). such a parameter must remain fixed.

For the original parameter values the incubation period is about 29 days (sigma = 0.0347), For the estimated parameters the incubation period is about 20 days (sigma = 0.0489)

Which is more accurate biologically?

Libraries

Here we fixed sigma and Gamma and estimate Beta -The transmission rate or the probability of being infected from a single mosquito bite)

library(deSolve)
library(readr)
library(minpack.lm)
library(dplyr)

# SEIR model function
seir_model <- function(t, y, params) {
  with(as.list(c(y, params)), {
    N <- S + E + I + R
    dSdt <- -beta * S * I / N
    dEdt <- beta * S * I / N - sigma * E
    dIdt <- sigma * E - gamma * I
    dRdt <- gamma * I
    return(list(c(dSdt, dEdt, dIdt, dRdt)))
  })
}

# Parameter estimation function
estimate_parameters <- function(data_path) {
  data <- read_csv(data_path, show_col_types = FALSE)

  if ("Date" %in% colnames(data)) {
    data <- select(data, -Date)
  }
  if (!"day" %in% colnames(data)) {
    data$day <- seq_len(nrow(data)) - 1
  }

  # Initial conditions
  N <- 995
  E0 <- 10
  I0 <- 5
  R0 <- 0
  S0 <- N - E0 - I0 - R0
  y0 <- c(S = S0, E = E0, I = I0, R = R0)

  t <- data$day
  observed_I <- data$I

  # === Fixed parameters from biological knowledge ===
  fixed_sigma <- 1/5.2   # e.g., incubation ~5.2 days
  fixed_gamma <- 1/7     # e.g., recovery ~7 days

  # Model output using beta as the only free parameter
  model_output <- function(params, t) {
    beta <- params[1]
        out <- ode(
      y = y0,
      times = t,
      func = seir_model,
      parms = list(beta = beta, sigma = fixed_sigma, gamma = fixed_gamma),
      method = "lsoda", atol = 1e-6, rtol = 1e-6
    )
    return(out[, "I"])
  }

  # Residual function for optimization
  residuals <- function(params) {
    model_I <- model_output(params, t)
    return(observed_I - model_I)
  }

  # Initial guess (only for beta)
  initial_guess <- c(0.4)

  # Optimization
  fit <- nls.lm(par = initial_guess, fn = residuals, 
                control = nls.lm.control(maxiter = 1024))

  return(c(beta = fit$par, sigma = fixed_sigma, gamma = fixed_gamma))
}

# Main execution
tryCatch({
  params <- estimate_parameters("malaria5.csv")

  cat("\n=== Estimated and Fixed Parameters ===\n")
  cat(sprintf("Beta : %.4f (estimated)\n", params["beta"]))
  cat(sprintf("Sigma: %.4f (fixed)\n", params["sigma"]))
  cat(sprintf("Gamma: %.4f (fixed)\n", params["gamma"]), "\n")
}, error = function(e) {
  cat("Error in parameter estimation:", e$message, "\n")
})

=== Estimated and Fixed Parameters ===
Beta : 0.7852 (estimated)
Sigma: 0.1923 (fixed)
Gamma: 0.1429 (fixed)
 

Pthon values === Estimated and Fixed Parameters === Beta : 0.7852 (estimated) Sigma: 0.1923 (fixed) Gamma: 0.1429 (fixed)

R Values === Estimated and Fixed Parameters === Beta : 0.7852 (estimated) Sigma: 0.1923 (fixed) Gamma: 0.1429 (fixed)

When some values are fixed both python and R studio gives same values

1.4 Parameter estimation and curve fitting/Graphing. (Graph of the model infected and the observed values)

This section explain step by step function of estimating parameters and comparing the graph of the observed values versus the model predicted values using the estimated parameters with non fixed.

Load required libraries

library(deSolve)
library(readr)      # For reading CSV files
library(minpack.lm)  # For better curve fitting than base R's nls
library(ggplot2)

# Suppress warnings
options(warn = -1)  #This line tells R to suppress warning messages

Load csv file

data <- read_csv("malaria5.csv")

# If 'day' column is missing, create it starting from 1
if (!"day" %in% colnames(data)) {
  data$day <- seq_len(nrow(data)) - 1  # R is 1-based.
}

SEIR Model Equations

seir_model <- function(t, y, params) {
  with(as.list(c(y, params)), {
    N <- S + E + I + R  # Total population
    dSdt <- -beta * S * I / N
    dEdt <- beta * S * I / N - sigma * E
    dIdt <- sigma * E - gamma * I
    dRdt <- gamma * I
    return(list(c(dSdt, dEdt, dIdt, dRdt)))
  })
}
  • seir_model <- function(t, y, params) { … }: Defines the SEIR model differential equations

Function to estimate parameters using curve fitting

estimate_parameters <- function(data) {
  N <- 995  # Total population
  E0 <- 10; I0 <- 5; R0 <- 0  # Initial conditions
  S0 <- N - E0 - I0 - R0
  
  y0 <- c(S = S0, E = E0, I = I0, R = R0)
  t <- data$day
  observed_infected <- data$I
  
  # Function the optimizer will use
  fit_func <- function(params, t) {
    beta <- params[1]
    sigma <- params[2]
    gamma <- params[3]
    
    out <- ode(y = y0, times = t, func = seir_model,
               parms = c(beta = beta, sigma = sigma, gamma = gamma),
               method = "lsoda", atol = 1e-6, rtol = 1e-6)
    return(out[, "I"])  # Return only infected column
  }
  
  # Objective function for least squares
  objective <- function(params) {
    fit <- fit_func(params, t)
    return(fit - observed_infected)
  }
  
  initial_guess <- c(0.4, 0.02, 0.05)
  
  # Levenberg-Marquardt algorithm
  fit <- nls.lm(par = initial_guess, fn = objective, control = nls.lm.control(maxiter = 5000))
  
  return(fit$par)
}
  • estimate_parameters <- function(data) { … }: This function fits the SEIR model to the actual infected data to estimate the parameters
  • fit_func <- function(params, t) { … }: Given parameters [β, σ, γ], solves the SEIR model and returns only the Infected (I) values over time using ode() from the deSolve package.
  • objective <- function(params) {fit …}: Returns the difference between model prediction and real data for infected cases
  • fit <- nls.lm(par = initial_guess, fn = objective, control = …): Uses nonlinear least squares (Levenberg-Marquardt method via nls.lm) to minimize the objective function. initial_guess is the starting point: β=0.4, σ=0.02, γ=0.05.

Estimate parameters

params_opt <- estimate_parameters(data)   # Estimate parameters

# Extract values
beta_opt <- params_opt[1]
sigma_opt <- params_opt[2]
gamma_opt <- params_opt[3]

# Display results
cat(sprintf("Estimated Parameters:\nGamma=%.4f\nSigma=%.4f\nBeta=%.4f\n",
            gamma_opt, sigma_opt, beta_opt))
Estimated Parameters:
Gamma=0.0205
Sigma=0.0489
Beta=15.3513

1.5. Plot the fitted curve against observed data

if (exists("data") && "I" %in% colnames(data)) {
  fitted_values <- ode(
    y = c(S = N - 10 - 5 - 0, E = 10, I = 5, R = 0),
    times = data$day,
    func = seir_model,
    parms = c(beta = beta_opt, sigma = sigma_opt, gamma = gamma_opt)
  )
  
  plot_data <- data.frame(
    Day = rep(data$day, 2),
    Cases = c(data$I, fitted_values[, "I"]),
    Type = rep(c("Observed", "Fitted"), each = length(data$day))
  )
  
  ggplot(plot_data, aes(x = Day, y = Cases, color = Type)) +
    geom_line() +
    labs(title = "SEIR Model Fit to Malaria Data", y = "Infected Cases") +
    theme_minimal()
}

The comparison is fairly accurate

2.0 MALARIA COMPARTMENTAL MODELS

A compartmental model for malaria must account for both human and mosquito populations since malaria is transmitted by infected mosquitoes. A standard approach is the SEIRS-SI model, which consists of the following compartments:

Human Population

  • S (Susceptible humans): Individuals who can be infected.
  • E (Exposed humans): Individuals who have been bitten by an infectious mosquito and are incubating the parasite but are not yet infectious.
  • I (Infected humans): Individuals who are infectious and can transmit malaria to mosquitoes.
  • R (Recovered humans): Individuals who have recovered and gained temporary immunity but may become susceptible again after a period.

Mosquito Population

  • S_V (Susceptible mosquitoes): Mosquitoes that are not infected.
  • I_V (Infected mosquitoes): Mosquitoes that have acquired the malaria parasite and can transmit it to humans.

We will be using the same data set called malaria5 in csv format. First we check how the data looks like to verify whether it fits our model. This is important to ensure that you are not running a wrong or corrupted file.

Libraries

library(readr)   # For reading CSV files
library(dplyr)   # For data manipulation

Load the CSV file

data <- read_csv("malaria5.csv")

# Display the first few rows
cat("First 5 rows of the dataset:\n")
First 5 rows of the dataset:
print(head(data, 5))
# A tibble: 5 × 7
    day     S     E     I     R    Sv    Iv
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     0   980    10     5     0   990    10
2     1   969    28     5     0   983    15
3     2   949    54     7     0   976    20
4     3   920    86    12     0   966    27
5     4   879   127    19     0   950    40

Get basic info about the dataset

cat("\nDataset Summary:\n")

Dataset Summary:
print(str(data))  # Structure of the dataset
spc_tbl_ [100 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ day: num [1:100] 0 1 2 3 4 5 6 7 8 9 ...
 $ S  : num [1:100] 980 969 949 920 879 818 727 601 442 269 ...
 $ E  : num [1:100] 10 28 54 86 127 184 265 373 503 634 ...
 $ I  : num [1:100] 5 5 7 12 19 30 46 70 103 148 ...
 $ R  : num [1:100] 0 0 0 0 0 0 1 3 6 11 ...
 $ Sv : num [1:100] 990 983 976 966 950 925 887 831 752 648 ...
 $ Iv : num [1:100] 10 15 20 27 40 61 93 140 206 291 ...
 - attr(*, "spec")=
  .. cols(
  ..   day = col_double(),
  ..   S = col_double(),
  ..   E = col_double(),
  ..   I = col_double(),
  ..   R = col_double(),
  ..   Sv = col_double(),
  ..   Iv = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
NULL

Check for missing values

cat("\nMissing values per column:\n")

Missing values per column:
print(colSums(is.na(data)))
day   S   E   I   R  Sv  Iv 
  0   0   0   0   0   0   0 

Describe numerical data

cat("\nStatistical Summary:\n")

Statistical Summary:
print(summary(data))
      day              S               E               I        
 Min.   : 0.00   Min.   :  8.0   Min.   : 10.0   Min.   :  5.0  
 1st Qu.:24.75   1st Qu.: 24.0   1st Qu.:164.8   1st Qu.:346.0  
 Median :49.50   Median : 30.5   Median :180.0   Median :358.0  
 Mean   :49.50   Mean   :100.3   Mean   :229.1   Mean   :373.7  
 3rd Qu.:74.25   3rd Qu.: 33.0   3rd Qu.:194.8   3rd Qu.:459.8  
 Max.   :99.00   Max.   :980.0   Max.   :748.0   Max.   :572.0  
       R                Sv               Iv       
 Min.   :   0.0   Min.   : 55.00   Min.   : 10.0  
 1st Qu.: 329.8   1st Qu.: 67.75   1st Qu.:297.0  
 Median : 773.0   Median : 85.00   Median :298.5  
 Mean   : 653.5   Mean   :167.98   Mean   :312.3  
 3rd Qu.: 982.5   3rd Qu.: 87.00   3rd Qu.:316.8  
 Max.   :1131.0   Max.   :990.00   Max.   :617.0  

2.1. Simulating the S,E,I,R,Iv,Sv Model

The Model is embedded in R and uses the (malaria_model) function

Load required libraries

library(deSolve)
library(ggplot2)
library(tidyr)

Parameters

Lambda_h <- 10      # Human recruitment rate
Lambda_v <- 100     # Mosquito recruitment rate
beta_h <- 0.002     # Transmission rate from mosquitoes to humans
beta_v <- 0.001     # Transmission rate from humans to mosquitoes
sigma_h <- 0.1      # Rate at which exposed humans become infectious
gamma_h <- 0.05     # Recovery rate of infected humans
delta_h <- 0.01     # Rate of loss of immunity
mu_h <- 0.0001      # Natural death rate of humans
mu_v <- 0.1         # Natural death rate of mosquitoes
nu_v <- 0.1         # Extrinsic incubation period rate

Initial conditions

y0 <- c(
  S = 980,   # Susceptible humans
  E = 10,    # Exposed humans
  I = 5,     # Infected humans
  R = 0,     # Recovered humans
  Sv = 990,  # Susceptible mosquitoes
  Iv = 10    # Infected mosquitoes
)

Time vector

t <- seq(0, 100, length.out = 1000)  # Simulate for 200 days
  • seq() is a function that generates a sequence of numbers, 0 is the starting value of the sequence, 200 is the ending value of the sequence.
  • length.out = 1000 tells R to generate exactly 1000 values evenly spaced between 0 and 200. So, this line creates a vector t containing 1000 numbers, starting at 0 and ending at 200, with equal spacing between each number.

System of differential equations

malaria_model <- function(t, y, params) {
  with(as.list(c(y, params)), {
    dS <- Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
    dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
    dI <- sigma_h * E - (gamma_h + mu_h) * I
    dR <- gamma_h * I - (delta_h + mu_h) * R
    dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv
    dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
    
    return(list(c(dS, dE, dI, dR, dSv, dIv)))
  })
}
  • return(list(): Returns the list from the function—this is what R uses to continue solving the ODEs step-by-step.

Parameters list

params <- list(
  Lambda_h = Lambda_h,
  Lambda_v = Lambda_v,
  beta_h = beta_h,
  beta_v = beta_v,
  sigma_h = sigma_h,
  gamma_h = gamma_h,
  delta_h = delta_h,
  mu_h = mu_h,
  mu_v = mu_v,
  nu_v = nu_v
)

Solve ODEs

solution <- ode(
  y = y0,
  times = t,
  func = malaria_model,
  parms = params,
  method = "lsoda"
)
  • y = y0: This is the initial state of the system.
  • y0: is a named vector that contains the initial values for all the compartments, e.g., S = 1000, E = 10, I = 5.
  • times = t: This defines the time sequence over which the ODEs will be solved.
  • func = malaria_model: This is the function that defines the system of differential equations.
  • parms = params: A list or vector of parameters used inside the malaria_model function.
  • method = “lsoda”: lsoda” is a powerful method that automatically switches between stiff and non-stiff solvers depending on the problem. It’s great for systems that might behave differently at different times (e.g., rapid outbreaks followed by slow changes

Convert to data frame

results <- as.data.frame(solution)
  • results <- assigns this data frame to the variable results, which you can now use for further analysis or visualization.

Why convert to a data frame? - summarize results easily. - You can plot it with various tools.

Plot results

ggplot(results, aes(x = time)) +
  geom_line(aes(y = S, color = "Susceptible Humans"), linewidth = 1) +
  #geom_line(aes(y = E, color = "Exposed Humans"), linewidth = 1) +
  geom_line(aes(y = I, color = "Infected Humans"), linewidth = 1) +
  geom_line(aes(y = R, color = "Recovered Humans"), linewidth = 1) +
  # Uncomment to show mosquito populations:
  # geom_line(aes(y = Sv, color = "Susceptible Mosquitoes"), linetype = "dashed", linewidth = 1) +
  # geom_line(aes(y = Iv, color = "Infected Mosquitoes"), linetype = "dashed", linewidth = 1) +
  labs(
    title = "Malaria Transmission Dynamics",
    x = "Time (days)",
    y = "Population",
    color = "Population Type"
  ) +
  scale_color_manual(values = c(
    "Susceptible Humans" = "blue",
    "Exposed Humans" = "orange",
    "Infected Humans" = "red",
    "Recovered Humans" = "green",
    "Susceptible Mosquitoes" = "purple",
    "Infected Mosquitoes" = "brown"
  )) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_rect(fill = "white", color = "black")
  )

  • ggplot(results, aes(x = time)): Starts the ggplot object using the results data frame. Sets the x-axis to the time column. All geom_line() layers will inherit time as the x variable unless otherwise specified.

  • geom_line(aes(y = S, color = “Susceptible Humans”), linewidth = 1): Plots a line showing how the number of susceptible humans (S) changes over time.The line is blue (set later in scale_color_manual). Same idea for:E = Exposed Humans (orange), I = Infected Humans (red), R = Recovered Humans (green)

  • Mosquito Populations are commented out to facilitate visualization geom_line(aes(y = Sv, color = “Susceptible Mosquitoes”), …) geom_line(aes(y = Iv, color = “Infected Mosquitoes”), …) These are commented out, but if you uncomment them, it will add dashed lines for: Sv: Susceptible mosquitoes (purple), Iv: Infected mosquitoes (brown). linetype = “dashed” helps distinguish them from human populations.

    Labels

  • labs( ): Adds a title and labels for the x-axis, y-axis, and color legend.

    Custom Colors

  • scale_color_manual(values = c(…)): Assigns specific colors to each population type. This ensures that “Infected Humans” is always red, etc., regardless of order.

    Theme and Layout

  • theme_minimal(): Uses a clean, minimal background theme.

  • legend.position = “bottom”,..): Moves the legend to the bottom.Removes grid lines and Sets a white background with a black border.

2.1. Numerical plot of the SEIR Compartments.

We may want to visualize the various compartments and compare them with the simulated graphs before parameter estimation. Load the csv file named malaria5 This could be data collected from the field. The code below is used to numerically plot the S E, I and R compartments for visualization

Load libraries

library(ggplot2)   #for data visualization
library(readr)     # For reading CSV files
library(dplyr)      # For data manipulation
library(tidyr)      # reshape and tidy data

Load the CSV file

data <- read_csv("malaria5.csv", show_col_types = FALSE)
  • It reads a CSV file named “malaria5.csv” and loads it as a data frame.
  • show_col_types = FALSE: This tells read_csv() not to print the column type information to the console.

Check if required columns exist

required_columns <- c("day", "S", "I", "R")
missing_columns <- setdiff(required_columns, colnames(data))

if (length(missing_columns) > 0) {
  stop(paste("Dataset is missing required columns:", paste(missing_columns, collapse = ", ")))
}
  • required_columns <- c(“day”, “S”, “I”, “R”): This defines a vector of column names that are required for your analysis or model: “day”, “S”, “I”, and “R”.
  • missing_columns <- setdiff(required_columns, colnames(data)): This finds which required columns are missing from the dataset.
  • colnames(data): returns the names of all columns in data.
  • setdiff(A, B): returns elements in A that are not in B. So if data is missing any of the required columns, those names will go into missing_columns.
  • if (length(missing_columns) > 0) {…}: This checks if any required columns are missing and If so, it stops the program using stop() and prints a message like-“Dataset is missing required columns: I, R” This is useful for early error detection, especially in scripts or models that depend on those columns being present.

Convert to long format for plotting

data_long <- data %>%
  select(day, S, I, R) %>%
  pivot_longer(cols = c(S, I, R), names_to = "Compartment", values_to = "Population")
  • Step 1: data %>%: This is the pipe operator (%>%) from the magrittr package, used to pass data from one function to the next.
  • step 2: select(day, S, I, R) This keeps only the day, S, I, and R columns. It filters out any other columns that might be in data.
  • Step 3: pivot_longer(…): This reshapes the data from wide format to long format:
  • names_to = “Compartment” → The names of the original columns (S, I, R) go into a new column called “Compartment”.
  • values_to = “Population” → The actual values go into a new column called “Population”.

Define color palette

compartment_colors <- c(
  S = "#1f77b4",   # blue
  I = "#d62728",   # red
  R = "#2ca02c"    # green
)
  • defining a named vector of color hex codes for plotting different compartments (S, I, R) in a graph.

Create the plot

ggplot(data_long, aes(x = day, y = Population, color = Compartment)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Time Series of S, I, and R Compartments",
    x = "Time (days)",
    y = "Population",
    color = "Compartment"
  ) +
  scale_color_manual(values = compartment_colors,
                     labels = c(S = "Susceptible", I = "Infected", R = "Recovered")) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    panel.grid = element_line(color = "gray90", linewidth = 0.2),
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    legend.title = element_text(size = 11),
    legend.text = element_text(size = 10)
  )

  • ggplot(data_long, aes(x = day, y = Population, color = Compartment)): Creates a plot object from data_long.
  • geom_line(linewidth = 1): Adds lines for each compartment over time. linewidth = 1 makes the lines a bit thicker than default.
  • labs( ): Adds a title, axis labels, and a legend title.
  • scale_color_manual( ) : uses your predefined compartment_colors vector to set line colors. Gives readable labels in the legend (instead of just “S”, “I”, “R”).

Theme and appearance - theme_minimal(): Applies a clean, minimalistic background with fewer grid lines and no grey background. - theme(legend.position = “bottom”…): Customizes visual elements, Moves the legend to the bottom, Lightens the grid lines, Centers and styles the title and Increases font sizes for clarity.

2.3 Parameter estimation

This is important when the parameter values from literature does not fit well with the csv numerical results. The code below prints the parameter values using the curve fitting method for the SEIR - SvIv model. The code estimates the parameters and gives a graphical comparison of the observed vs model predictions. Here we only compare one compartment (The Infected Compartment) for clarity of graphics

Load required libraries

library(deSolve)   # solving odes
library(readr)    #Reading CSV files
library(minpack.lm)  # For curve fitting
library(ggplot2)   #Plotting (ggplot2)

# Suppress warnings
options(warn = -1)

Load CSV data

data <- read_csv("malaria5.csv")#Loads the data 

# Ensure 'day' column exists (0-based if missing)
if (!"day" %in% colnames(data)) {
  data$day <- seq_len(nrow(data)) - 1
}
  • !“day” %in% colnames(data): if the column ‘day’ is not in the data, it creates one
  • seq_len(nrow(data)): generates a sequence from 1 to the number of rows (i.e., 1, 2, …, N). Subtracting 1 turns it into 0-based indexing: 0, 1, 2, …, N-1. The resulting vector is assigned to data$day, adding a new day column. Summary This code loads data from malaria5.csv and ensures that there’s a day column, even if the original data doesn’t include it—defaulting to a sequence starting from 0.

Malaria Model Equations (SEIRV)

malaria_model <- function(t, y, params) {
  with(as.list(c(y, params)), {
    dS <- Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
    dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
    dI <- sigma_h * E - (gamma_h + mu_h) * I
    dR <- gamma_h * I - (delta_h + mu_h) * R
    dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv
    dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
    
    return(list(c(dS, dE, dI, dR, dSv, dIv)))
  })
}

Function to estimate parameters

estimate_parameters <- function(data) {
  # Initial conditions from first row
  y0 <- c(
    S = data$S[1],
    E = data$E[1],
    I = data$I[1],
    R = data$R[1],
    Sv = data$Sv[1],
    Iv = data$Iv[1]
  )
  
  t <- data$day
  observed_infected <- data$I
  
  # Function that returns model predictions for infected
  fit_func <- function(params, t) {
    out <- ode(
      y = y0,
      times = t,
      func = malaria_model,
      parms = list(
        Lambda_h = params[1],
        Lambda_v = params[2],
        beta_h = params[3],
        beta_v = params[4],
        sigma_h = params[5],
        gamma_h = params[6],
        delta_h = params[7],
        mu_h = params[8],
        mu_v = params[9],
        nu_v = params[10]
      ),
      method = "lsoda"
    )
    return(out[, "I"])  # Return only infected individuals
  }
  
  # Objective function for least squares
  objective <- function(params) {
    fit <- fit_func(params, t)
    return(fit - observed_infected)
  }
  
  # Initial parameter guesses
  initial_guess <- c(10, 10, 0.3, 0.3, 0.2, 0.1, 0.05, 0.01, 0.01, 0.02)
  
  # Perform curve fitting
  fit <- nls.lm(
    par = initial_guess,
    fn = objective,
    control = nls.lm.control(maxiter = 1024)
  )
  
  return(fit$par)
}

Function Definition - estimate_parameters <- function(data) { : Takes in a data frame from malaria5.csv which includes columns for S, E, I, R, Sv, Iv and day.

Initial Conditions - y0: sets the initial state of the model using the first row of the dataset. These represent the initial values for human and mosquito compartments at day 0.

Time and Observed Data - t: is the vector of time points (days). - observed_infected: is the vector of observed infected humans from the data (what the model should fit to).

Model Simulation Function - fit_func(): uses the ode() solver (from the deSolve package) to simulate the malaria model over time. - It takes a parameter vector params and time vector t. The parameters are unpacked and passed to malaria_model (a separate function defining the ODE system). The solver returns a matrix of output values over time; we extract just the “I” (infected humans) column to compare to data.

Objective Function for Optimization - objective <- function(params) { : Defines the error between model predictions and observed data. Returns the difference vector (residuals), which will be minimized using least squares.

Initial Guesses -initial_guess <- c(10, 10, 0.3, 0.3, 0.2, 0.1, 0.05, 0.01, 0.01, 0.02): Initial parameter values for optimization. These should be close enough to the true values for the optimizer to converge.

Least Squares Fitting -fit <- nls.lm(…): nls.lm() is a nonlinear least squares solver from the minpack.lm package. It adjusts parameters to minimize the residuals from objective fn. maxiter = 1024 sets the maximum number of iterations.

Return Estimated Parameters - return(fit$par): Returns the optimized parameter values that best fit the model to the infected data.

Summary Load initial conditions and time/data from input. Define a simulator using an ODE malaria model. Build an objective function comparing model output to actual data. Use nonlinear least squares (nls.lm) to find the best parameters. Return the estimated parameter values.

Estimate parameters

estimated_params <- estimate_parameters(data)

# Parameter names
param_names <- c("Lambda_h", "Lambda_v", "beta_h", "beta_v", "sigma_h", 
                 "gamma_h", "delta_h", "mu_h", "mu_v", "nu_v")

# Print results
cat("\nFinal Estimated Parameters:\n")

Final Estimated Parameters:
for (i in seq_along(param_names)) {
  cat(sprintf("%s: %.4f\n", param_names[i], estimated_params[i]))
}
Lambda_h: 11.0570
Lambda_v: 9.9891
beta_h: 0.3039
beta_v: 0.2991
sigma_h: 0.1792
gamma_h: 0.0999
delta_h: 0.0501
mu_h: 0.0100
mu_v: 0.0099
nu_v: 0.0204

Final Estimated Parameters vs Original parameters:

Lambda_h: 11.0570 Lambda_h <- 10 # Human recruitment rate Lambda_v: 9.9891 Lambda_v <- 100 # Mosquito recruitment rate beta_h: 0.3039 beta_h <- 0.002 # Transmission rate from mosquitoes to humans beta_v: 0.2991 beta_v <- 0.001 # Transmission rate from humans to mosquitoes sigma_h: 0.1792 sigma_h <- 0.1 # Rate at which exposed humans become infectious gamma_h: 0.0999 gamma_h <- 0.05 # Recovery rate of infected humans delta_h: 0.0501 delta_h <- 0.01 # Rate of loss of immunity mu_h: 0.0100 mu_h <- 0.0001 # Natural death rate of humans mu_v: 0.0099 mu_v <- 0.1 # Natural death rate of mosquitoes nu_v: 0.0204 nu_v <- 0.1 # Extrinsic incubation period rate

Python estimated parameters R studio Final Estimated Parameters: Lambda_h: 9.8367 11.0570 Lambda_v: 9.8458 9.9891 beta_h: 0.2977 0.3039 beta_v: 0.2976 0.2991 sigma_h: 0.1439 0.1792 gamma_h: 0.1431 0.0999 delta_h: 0.1073 0.0501 mu_h: 0.0052 0.0100 mu_v: 0.0064 0.0099 nu_v: 0.0175 0.0204

Parameter estimation when some parameters are fixed due to biological knowledge

library(deSolve)
library(readr)
library(minpack.lm)
library(ggplot2)

options(warn = -1)  # Suppress warnings

# Load data
data <- read_csv("malaria5.csv")
if (!"day" %in% colnames(data)) {
  data$day <- seq_len(nrow(data)) - 1
}

# Malaria model equations
malaria_model <- function(t, y, params) {
  with(as.list(c(y, params)), {
    dS <- Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
    dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
    dI <- sigma_h * E - (gamma_h + mu_h) * I
    dR <- gamma_h * I - (delta_h + mu_h) * R
    dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv
    dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
    return(list(c(dS, dE, dI, dR, dSv, dIv)))
  })
}

# Function to estimate parameters
estimate_parameters <- function(data) {
  # Initial conditions
  y0 <- c(
    S = data$S[1],
    E = data$E[1],
    I = data$I[1],
    R = data$R[1],
    Sv = data$Sv[1],
    Iv = data$Iv[1]
  )
  t <- data$day
  observed_I <- data$I

  # === Fixed parameters based on biological knowledge ===
  fixed_params <- list(
    mu_h = 0.01,
    mu_v = 0.01,
    delta_h = 0.05
  )

  # Parameters to estimate and their initial guesses
  param_names_to_estimate <- c("Lambda_h", "Lambda_v", "beta_h", "beta_v",
                               "sigma_h", "gamma_h", "nu_v")
  initial_guess <- c(10, 10, 0.3, 0.3, 0.2, 0.1, 0.02)
  names(initial_guess) <- param_names_to_estimate

  # Helper to build full parameter list for the ODE solver
  assemble_full_params <- function(estimated_values) {
    full_params <- fixed_params
    for (name in param_names_to_estimate) {
      full_params[[name]] <- estimated_values[name]
    }
    return(full_params)
  }

  # Model prediction function
  model_I_output <- function(params_vec, t) {
    full_params <- assemble_full_params(params_vec)
    out <- ode(y = y0, times = t, func = malaria_model,
               parms = full_params, method = "lsoda", atol = 1e-6, rtol = 1e-6)
    return(out[, "I"])
  }

  # Objective function
  residuals_fn <- function(params_vec) {
    model_I <- tryCatch(model_I_output(params_vec, t),
                        error = function(e) rep(NA, length(t)))
    if (any(is.na(model_I)) || any(!is.finite(model_I))) {
      return(rep(1e6, length(t)))  # Penalize invalid solutions
    }
    return(model_I - observed_I)
  }

  # Fit parameters
  fit <- nls.lm(
    par = initial_guess,
    fn = residuals_fn,
    control = nls.lm.control(maxiter = 1024)
  )

  # Combine estimated and fixed parameters
  estimated_params <- as.list(fit$par)
  all_params <- c(estimated_params, fixed_params)
  all_params <- all_params[sort(names(all_params))]  # Sort for readability
  return(all_params)
}

# Run estimation
estimated <- estimate_parameters(data)

# Print results
cat("\nFinal Parameter Estimates (Estimated + Fixed):\n")

Final Parameter Estimates (Estimated + Fixed):
for (name in names(estimated)) {
  cat(sprintf("%s: %.4f\n", name, estimated[[name]]))
}
beta_h: 0.2962
beta_v: 0.2991
delta_h: 0.0500
gamma_h: 0.1003
Lambda_h: 11.3203
Lambda_v: 9.9663
mu_h: 0.0100
mu_v: 0.0100
nu_v: 0.0213
sigma_h: 0.1682
  • The fixed parameters are: delta_h: 0.0500, mu_h: 0.0100, mu_v: 0.0100

Python estimated parameters R Estimated Parameters Final Parameter Estimates (Estimated + Fixed): Lambda_h: 10.0124 11.3203 Lambda_v: 10.1119 9.9663 beta_h: 0.2898 0.2962 beta_v: 0.3256 0.2991 delta_h: 0.0500 0.0500 gamma_h: 0.0768 0.1003 mu_h: 0.0100 0.0100 mu_v: 0.0100 0.0100 nu_v: 0.0902 0.0213 sigma_h: 0.1870 0.1682

The fixed parameters are: mu_v, mu_v and delta_h Q -Which parameters do we need to fix from the model above?

  • cat(“Estimated Parameters:”): Prints a title/header to the console.
  • for (i in seq_along(param_names)) {…}: Loops through each parameter using seq_along(param_names) (which safely handles the correct index range).
  • sprintf() is used to format the output string:
  • %s: inserts the parameter name.
  • %.4f: inserts the numeric estimate with 4 decimal places.
  • cat(): prints that formatted string to the console, line by line.

2.4. The code below can also be used to print the parameters estimated

# After parameter estimation (right after estimated_params <- estimate_parameters(data))

# Extract and display numerical values only
cat("\nEstimated Parameter Values:\n")

Estimated Parameter Values:
cat("-------------------------\n")
-------------------------
cat(sprintf("Lambda_h  = %.6f\n", estimated_params[1]))
Lambda_h  = 11.056969
cat(sprintf("Lambda_v  = %.6f\n", estimated_params[2]))
Lambda_v  = 9.989087
cat(sprintf("beta_h    = %.6f\n", estimated_params[3]))
beta_h    = 0.303945
cat(sprintf("beta_v    = %.6f\n", estimated_params[4]))
beta_v    = 0.299107
cat(sprintf("sigma_h   = %.6f\n", estimated_params[5]))
sigma_h   = 0.179235
cat(sprintf("gamma_h   = %.6f\n", estimated_params[6]))
gamma_h   = 0.099887
cat(sprintf("delta_h   = %.6f\n", estimated_params[7]))
delta_h   = 0.050102
cat(sprintf("mu_h      = %.6f\n", estimated_params[8]))
mu_h      = 0.009984
cat(sprintf("mu_v      = %.6f\n", estimated_params[9]))
mu_v      = 0.009877
cat(sprintf("nu_v      = %.6f\n", estimated_params[10]))
nu_v      = 0.020384
  • cat(): Prints text or values without extra formatting
  • sprintf(): Formats numeric values with precision (e.g., 6 decimal places)
  • estimated_params[n]: Accesses the nth estimated parameter value

2.5. Visualization of the graphics using the estimated parameters by Splitting graphs.

In some cases one wants to sum up some compartments graphed together i.e (R + I + E) vs the Susceptible to visualize those who escape the epidemic OR Comparing the infected in the population vs the uninfected.

library(deSolve)   # For solving differential equations (ODEs)
library(ggplot2)   # For plotting
library(gridExtra)  # (Optional) For arranging multiple plots

Parameters

Lambda_h <- 11.0570      # Human recruitment rate
Lambda_v <- 9.9891     # Mosquito recruitment rate
beta_h <- 0.3039     # Transmission rate from mosquitoes to humans
beta_v <- 0.2991     # Transmission rate from humans to mosquitoes
sigma_h <- 0.0999      # Rate at which exposed humans become infectious
gamma_h <- 0.099887     # Recovery rate of infected humans
delta_h <- 0.0501     # Rate of loss of immunity
mu_h <- 0.01000      # Natural death rate of humans
mu_v <- 0.0099         # Natural death rate of mosquitoes
nu_v <- 0.0204         # Extrinsic incubation period rate

Initial conditions

initial_conditions <- c(
  S = 980,  # Susceptible humans
  E = 10,    # Exposed humans
  I = 5,     # Infected humans
  R = 0,     # Recovered humans
  Sv = 990,  # Susceptible mosquitoes
  Iv = 10    # Infected mosquitoes
)

# Time vector
time <- seq(0, 200, length.out = 1000)  # Simulate for 200 days

Simulate from day 0 to 200 with 1000 points for smoothness.

System of differential equations

malaria_model <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
    dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
    dI <- sigma_h * E - (gamma_h + mu_h) * I
    dR <- gamma_h * I - (delta_h + mu_h) * R
    dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv
    dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
    
    return(list(c(dS, dE, dI, dR, dSv, dIv)))
  })
}

Solve ODEs

parameters <- c(
  Lambda_h = Lambda_h,
  Lambda_v = Lambda_v,
  beta_h = beta_h,
  beta_v = beta_v,
  sigma_h = sigma_h,
  gamma_h = gamma_h,
  delta_h = delta_h,
  mu_h = mu_h,
  mu_v = mu_v,
  nu_v = nu_v
)

solution <- ode(
  y = initial_conditions,
  times = time,
  func = malaria_model,
  parms = parameters
)
  • ode() numerically integrates the ODEs.

Convert to data frame for plotting

results <- as.data.frame(solution)
  • Result is converted into a dataframe for plotting.

Create plots

p1 <- ggplot(results, aes(x = time)) +
  #geom_line(aes(y = E, color = "Exposed Humans")) +
  #geom_line(aes(y = I, color = "Infected Humans")) +
  geom_line(aes(y = (I + E), color = "Infected Humans")) +
  geom_line(aes(y = (S + R), color = "Total Uninfected Humans")) +
  #geom_line(aes(y = Iv, color = "Infected Mosquitoes"), linetype = "dashed") +
  labs(x = "Time (days)", y = "Population", 
       title = "Malaria Transmission Dynamics") +
  scale_color_manual(values = c("Exposed Humans" = "blue", 
                               "Infected Humans" = "red",
                               "Total Uninfected Humans" = "green",
                               "Infected Mosquitoes" = "purple")) +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank())
  • Exposed humans (E), Infected humans (I), Total humans (S + E + R), Infected mosquitoes (Iv)
  • linetype = “dashed” distinguishes mosquito population.

Create a second plot showing all variables for reference

p2 <- ggplot(results, aes(x = time)) +
  geom_line(aes(y = S, color = "Susceptible Humans")) +
  geom_line(aes(y = E, color = "Exposed Humans")) +
  geom_line(aes(y = I, color = "Infected Humans")) +
  geom_line(aes(y = R, color = "Recovered Humans")) +
  #geom_line(aes(y = Sv, color = "Susceptible Mosquitoes"), linetype = "dashed") +
  #geom_line(aes(y = Iv, color = "Infected Mosquitoes"), linetype = "dashed") +
  labs(x = "Time (days)", y = "Population", 
       title = "Complete Malaria Model Dynamics") +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank())

Display plots

grid.arrange(p1, p2, ncol = 1)

To show just the first plot

 print(p1)

To show the second plot

 print(p2)

Summary

Section Description - parameters: Defines the constants for the SEIR-SIv model - initial_conditions: Starting populations for humans and mosquitoes - malaria_model(): Defines how compartments change over time - ode(): Solves the ODE system - ggplot(): Plots the results - p1, p2: Two different visualizations

2.6. MALARIA MODELS WITH MORE INTERVENTIONS/COMPARTMENTS USING THE ESTIMATED PARAMETERS

Simulating the SEIRV - IvSv model with coding which may not be embedded in R. This allows one to use any number of equations, variables and parameters.

Below is a Malaria model System of 8 ODEs in R with vaccination as an intervention strategy included. The proportion vaccinated is assumed to be fully protected from the formulation. The Malaria_model function is used.

library(deSolve)   # Solves differential equations
library(ggplot2)   # For plotting

Parameters

params <- list(
  Lambda_h = 11.0570,   # Human recruitment rate
  Lambda_v = 9.9891,    # Mosquito recruitment rate
  beta_h = 0.3039,     # Transmission rate from mosquitoes to humans
  beta_v = 0.2991,     # Transmission rate from humans to mosquitoes
  sigma_h = 0.0999,      # Rate at which exposed humans become infectious
  gamma_h = 0.099887,     # Recovery rate of infected humans
  delta_h = 0.0501,     # Rate of loss of immunity
  mu_h = 0.0100,      # Natural death rate of humans
  mu_v = 0.0099,         # Natural death rate of mosquitoes
  nu_v = 0.0204,         # Extrinsic incubation period rate
  a = 0.71             # Proportion of humans vaccinated (vaccine is 100% effective)
)

a = 0.71 means 71% of humans are vaccinated. Vaccine is 100% effective, so vaccinated individuals cannot become infected.

Initial conditions

initial_state <- c(
  S = 980,  # Susceptible humans
  E = 10,    # Exposed humans
  I = 5,     # Infected humans
  R = 0,     # Recovered humans
  V = 0,     # Vaccinated humans
  Sv = 990,  # Susceptible mosquitoes
  Iv = 50    # Infected mosquitoes
)

Define the system of ODEs

malaria_model <- function(t, state, params) {
  with(as.list(c(state, params)), {
    dS <- (1 - a) * Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
    dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
    dI <- sigma_h * E - (gamma_h + mu_h) * I
    dR <- gamma_h * I - (delta_h + mu_h) * R
    dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv
    dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
    dV <- a * Lambda_h - mu_h * V
    
    return(list(c(dS, dE, dI, dR, dV, dSv, dIv)))
  })
}

Time span

t_span <- c(0, 200)
t_eval <- seq(t_span[1], t_span[2], length.out = 1000)

Simulates over 50 days with 1000 time points.

Solve the ODEs

solution <- ode(
  y = initial_state,
  times = t_eval,
  func = malaria_model,
  parms = params
)

Convert to data frame

results <- as.data.frame(solution)

Plot results

ggplot(results, aes(x = time)) +
  geom_line(aes(y =  (S + V), color = "Susceptible and Vaccinated Humans")) +
  #geom_line(aes(y = E, color = "Exposed Humans")) +
  #geom_line(aes(y = I, color = "Infected Humans")) +
  geom_line(aes(y = (I + E + R), color = "Total Infected Humans")) +
 #geom_line(aes(y = V, color = "Vaccinated")) +
  #geom_line(aes(y = Sv, color = "Susceptible Mosquitoes"), linetype = "dashed") +
  labs(x = "Time (days)", y = "Population", 
       title = "Malaria Transmission Dynamics with Vaccination") +
  scale_color_manual(values = c(
    "Susceptible and Vaccinated Humans" = "blue",
    "Exposed Humans" = "orange",
    "Total Infected Humans" = "red",
    "Vaccinated" = "green",
    "Susceptible Mosquitoes" = "purple"
  )) +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  guides(color = guide_legend(nrow = 2, byrow = TRUE))

Q -Try reducing time to about seven days to visualize what happens in the first week

  • ggplot(results, aes(x = time)) +: Starts building a ggplot object using the results data frame. Sets the x-axis to the time column (likely days).
  • Plotting Lines for Each Compartment, Each line is given a color label (which is not a column name, but a string that ggplot will treat as a legend entry).
  • geom_line(aes(y = Sv, color = “Susceptible Mosquitoes”), linetype = “dashed”) + : This line would plot susceptible mosquitoes (Sv) if uncommented. It’s styled as dashed, likely to differentiate mosquito populations from human ones.

Axis Labels and Title - labs(x = “Time (days)”, :Adds labels for the x-axis, y-axis, and main title of the plot.

Custom Colors = scale_color_manual(values = c(…: Manually assigns specific colors to the legend labels defined.

Clean Theme & Legend Styling - theme_minimal() +: Applies a minimal theme for a cleaner look (no background gridlines or boxes).

2.7. Parameter estimation using Neural Networks

We use the SEIR Model using true parameters.

Load required libraries

# Install packages if not already installed
if (!requireNamespace("nnet", quietly = TRUE)) install.packages("nnet")
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")

library(nnet) #for neural networks.
library(readr) #for reading/writing CSV files easily.

This checks if the nnet and readr packages are installed.If they are not, it installs them then it loads them into RStudio.

Generate synthetic data

# --------------------------------
# Step 1: Generate synthetic malaria data
# --------------------------------

set.seed(42)  # for reproducibility

# True parameters
beta_true <- 0.8    # transmission rate
sigma_true <- 0.2   # progression rate
gamma_true <- 0.1   # recovery rate


# True parameters
#beta_true <- 0.58    # transmission rate
#sigma_true <- 0.12   # progression rate
#gamma_true <- 0.11   # recovery rate

# Simulate SEIR model

time <- 1:100
S <- numeric(length(time))
E <- numeric(length(time))
I <- numeric(length(time))
R <- numeric(length(time))

# Initial conditions
S[1] <- 9909
E[1] <- 49
I[1] <- 42
R[1] <- 0

N <- S[1] + E[1] + I[1] + R[1]  # Total population
dt <- 1

for (t in 1:(length(time) - 1)) {
  dS <- -beta_true * S[t] * I[t] / N * dt
  dE <- (beta_true * S[t] * I[t] / N - sigma_true * E[t]) * dt
  dI <- (sigma_true * E[t] - gamma_true * I[t]) * dt
  dR <- gamma_true * I[t] * dt
  
  S[t+1] <- S[t] + dS
  E[t+1] <- E[t] + dE
  I[t+1] <- I[t] + dI
  R[t+1] <- R[t] + dR
}

# Add some random noise (small)
noise_level <- 5
S_noisy <- pmax(S + rnorm(length(S), 0, noise_level), 0)
E_noisy <- pmax(E + rnorm(length(E), 0, noise_level), 0)
I_noisy <- pmax(I + rnorm(length(I), 0, noise_level), 0)
R_noisy <- pmax(R + rnorm(length(R), 0, noise_level), 0)

# Save synthetic data
synthetic_data <- data.frame(time = time, S = round(S_noisy), E = round(E_noisy), I = round(I_noisy), R = round(R_noisy))
write_csv(synthetic_data, "malaria_synthetic_data.csv")
cat("Synthetic data saved to 'malaria_synthetic_data.csv'\n")
Synthetic data saved to 'malaria_synthetic_data.csv'
  • set.seed(42) # for reproducibility Sets the random number generator so every time you run the code, you get the same random results.

  • Define “true” malaria parameters beta_true <- 0.8 # transmission rate sigma_true <- 0.2 # progression rate gamma_true <- 0.1 # recovery rate These are the “true” parameters of the SEIR model that will be tried later to estimate using a neural network.

  • Set up SEIR model variables time <- 1:100 # 100 days S <- numeric(length(time)) # Susceptible E <- numeric(length(time)) # Exposed I <- numeric(length(time)) # Infectious R <- numeric(length(time)) # Recovered Creates empty vectors to store how many people are Susceptible, Exposed, Infectious, and Recovered each day

  • Initialize starting values S[1] <- 9909 E[1] <- 49 I[1] <- 42 R[1] <- 0 N <- S[1] + E[1] + I[1] + R[1] # Total population dt <- 1 # Time step (1 day) Sets the starting human population numbers on Day 1.

  • Simulate the SEIR dynamics for (t in 1:(length(time) - 1)) { dS <- -beta_true * S[t] * I[t] / N * dt dE <- (beta_true * S[t] * I[t] / N - sigma_true * E[t]) * dt dI <- (sigma_true * E[t] - gamma_true * I[t]) * dt dR <- gamma_true * I[t] * dt

    S[t+1] <- S[t] + dS E[t+1] <- E[t] + dE I[t+1] <- I[t] + dI R[t+1] <- R[t] + dR } For each day, it calculates how many people move: From Susceptible to Exposed.From Exposed to Infectious.From Infectious to Recovered.

These are simple differential equations discretized by a time step.

  • Add random noise (to make it realistic) noise_level <- 5 S_noisy <- pmax(S + rnorm(length(S), 0, noise_level), 0) E_noisy <- pmax(E + rnorm(length(E), 0, noise_level), 0) I_noisy <- pmax(I + rnorm(length(I), 0, noise_level), 0) R_noisy <- pmax(R + rnorm(length(R), 0, noise_level), 0)

Adds random variation to the numbers (simulate real-world observation errors) - pmax(…, 0) ensures no negative population values. - rnorm() generates random values from a normal distribution.

  • Save the data synthetic_data <- data.frame(time = time, S = round(S_noisy), E = round(E_noisy), I = round(I_noisy), R = round(R_noisy)) write_csv(synthetic_data, “malaria_synthetic_data.csv”) cat(“Synthetic data saved to ‘malaria_synthetic_data.csv’”)

Creates a data.frame with time and SEIR values.Rounds the numbers (because people are whole numbers). Saves it into a CSV file.

Load the synthetic data

# --------------------------------
# Step 2: Load the data
# --------------------------------

# Load the same file you just saved
data <- read_csv("malaria_synthetic_data.csv")
#data <- read_csv("malaria5.csv")

# Inputs: (S, E, I, R)
inputs <- as.data.frame(data[, c("S", "E", "I", "R")])

# Targets: true parameters
targets <- data.frame(beta = rep(beta_true, nrow(inputs)),
                      sigma = rep(sigma_true, nrow(inputs)),
                      gamma = rep(gamma_true, nrow(inputs)))
  • Step 2: Load the data data <- read_csv(“malaria_synthetic_data.csv”) inputs <- as.data.frame(data[, c(“S”, “E”, “I”, “R”)]) Loads the synthetic data from the CSV.

Picks the inputs (S, E, I, R) — these will be fed into the neural network.

  • Define target outputs targets <- data.frame(beta = rep(beta_true, nrow(inputs)), sigma = rep(sigma_true, nrow(inputs)), gamma = rep(gamma_true, nrow(inputs)))

Since you are working with synthetic data (and you know the true parameters), you create the target outputs for training.These are the values you want the neural network to learn.

Train the neural network

# --------------------------------
# Step 3: Train a neural network
# --------------------------------

# Inputs
X <- as.matrix(inputs)

# Outputs (targets)
Y <- as.matrix(targets)

# Train neural network
model <- nnet(X, Y, 
              size = 10,    # hidden neurons
              linout = TRUE, 
              maxit = 1000, 
              decay = 0.01)
# weights:  83
initial  value 67.883456 
iter  10 value 10.857013
iter  20 value 0.360371
iter  30 value 0.083634
iter  40 value 0.004486
iter  50 value 0.003779
iter  60 value 0.003511
iter  70 value 0.003162
iter  80 value 0.002928
iter  90 value 0.002828
iter 100 value 0.002812
iter 110 value 0.002785
iter 120 value 0.002721
iter 130 value 0.002694
iter 140 value 0.002658
iter 150 value 0.002582
iter 160 value 0.002423
iter 170 value 0.002272
iter 180 value 0.002248
iter 190 value 0.002227
iter 200 value 0.002219
iter 210 value 0.002214
iter 220 value 0.002211
iter 230 value 0.002209
iter 240 value 0.002203
iter 250 value 0.002191
iter 260 value 0.002172
iter 270 value 0.002149
iter 280 value 0.002108
iter 290 value 0.002076
iter 300 value 0.002068
iter 310 value 0.002053
iter 320 value 0.002037
iter 330 value 0.001950
iter 340 value 0.001850
iter 350 value 0.001733
iter 360 value 0.001633
iter 370 value 0.001552
iter 380 value 0.001515
iter 390 value 0.001462
iter 400 value 0.001396
iter 410 value 0.001351
iter 420 value 0.001307
iter 430 value 0.001287
iter 440 value 0.001268
iter 450 value 0.001252
iter 460 value 0.001230
iter 470 value 0.001224
iter 480 value 0.001211
iter 490 value 0.001174
iter 500 value 0.001126
iter 510 value 0.001117
iter 520 value 0.001112
iter 530 value 0.001103
iter 540 value 0.001084
iter 550 value 0.001059
iter 560 value 0.001035
iter 570 value 0.001031
iter 580 value 0.001025
iter 590 value 0.001007
iter 600 value 0.000987
iter 610 value 0.000963
iter 620 value 0.000939
iter 630 value 0.000924
iter 640 value 0.000917
iter 650 value 0.000908
iter 660 value 0.000902
iter 670 value 0.000894
iter 680 value 0.000883
iter 690 value 0.000871
iter 700 value 0.000861
iter 710 value 0.000846
iter 720 value 0.000828
iter 730 value 0.000813
iter 740 value 0.000810
iter 750 value 0.000806
iter 760 value 0.000792
iter 770 value 0.000773
iter 780 value 0.000764
iter 790 value 0.000761
iter 800 value 0.000757
iter 810 value 0.000751
iter 820 value 0.000745
iter 830 value 0.000722
iter 840 value 0.000711
iter 850 value 0.000707
iter 860 value 0.000706
iter 870 value 0.000705
iter 880 value 0.000704
iter 890 value 0.000703
iter 900 value 0.000702
iter 910 value 0.000701
iter 920 value 0.000693
iter 930 value 0.000691
iter 940 value 0.000690
iter 950 value 0.000690
iter 960 value 0.000690
iter 970 value 0.000689
iter 980 value 0.000685
iter 990 value 0.000681
iter1000 value 0.000675
final  value 0.000675 
stopped after 1000 iterations
cat("Neural network trained.\n")
Neural network trained.
  • Step 3: Train a neural network X <- as.matrix(inputs) # Inputs matrix Y <- as.matrix(targets) # Outputs matrix

model <- nnet(X, Y, size = 10, # number of hidden neurons linout = TRUE, # linear output (good for regression) maxit = 1000, decay = 0.01) cat(“Neural network trained.”) Converts inputs and targets to matrices.

Trains a small feed-forward neural network with 1 hidden layer of 10 neurons. linout = TRUE means the output is continuous (since beta, sigma, gamma are real numbers).

Predicted parameters

# --------------------------------
# Step 4: Prediction
# --------------------------------

# Predict parameters
predicted_params <- predict(model, X)

# Now it will have 3 columns already
predicted_params <- as.data.frame(predicted_params)
colnames(predicted_params) <- c("beta_pred", "sigma_pred", "gamma_pred")

# Combine results
results <- cbind(inputs, targets, predicted_params)

print(head(results))
     S   E   I  R beta sigma gamma beta_pred sigma_pred gamma_pred
1 9916  55  32  0  0.8   0.2   0.1 0.7999052  0.1999679 0.09997992
2 9873  78  49  8  0.8   0.2   0.1 0.7999330  0.1999749 0.09998339
3 9840  91  63  9  0.8   0.2   0.1 0.7999382  0.1999762 0.09998403
4 9796 131  81 18  0.8   0.2   0.1 0.7999592  0.1999815 0.09998666
5 9740 149  81 21  0.8   0.2   0.1 0.7999566  0.1999809 0.09998634
6 9668 191 104 30  0.8   0.2   0.1 0.7999689  0.1999840 0.09998787
  • Step 4: Prediction

predicted_params <- predict(model, X) predicted_params <- as.data.frame(predicted_params) colnames(predicted_params) <- c(“beta_pred”, “sigma_pred”, “gamma_pred”) results <- cbind(inputs, targets, predicted_params) print(head(results))

Predicts beta, sigma, gamma values using the trained network. Stores the predicted parameters. Combines them with the true values to see how good the model performed.

Step 5: Plotting

par(mfrow = c(1, 3))

plot(results$beta, results$beta_pred,
     main = "Beta: True vs Predicted", xlab = "True beta", ylab = "Predicted beta", pch = 19)
abline(0, 1, col = "red")

plot(results$sigma, results$sigma_pred,
     main = "Sigma: True vs Predicted", xlab = "True sigma", ylab = "Predicted sigma", pch = 19)
abline(0, 1, col = "red")

plot(results$gamma, results$gamma_pred,
     main = "Gamma: True vs Predicted", xlab = "True gamma", ylab = "Predicted gamma", pch = 19)
abline(0, 1, col = "red")

-Step 5: Plotting par(mfrow = c(1, 3)) # 3 plots side-by-side plot(results\(beta, results\)beta_pred, main = “Beta: True vs Predicted”, xlab = “True beta”, ylab = “Predicted beta”, pch = 19) abline(0, 1, col = “red”) # Ideal line y = x plot(results\(sigma, results\)sigma_pred, main = “Sigma: True vs Predicted”, xlab = “True sigma”, ylab = “Predicted sigma”, pch = 19) abline(0, 1, col = “red”) plot(results\(gamma, results\)gamma_pred, main = “Gamma: True vs Predicted”, xlab = “True gamma”, ylab = “Predicted gamma”, pch = 19) abline(0, 1, col = “red”)

Makes 3 plots:True vs Predicted for beta. True vs Predicted for sigma. True vs Predicted for gamma.

Perfect prediction would lie along the red line. This estimation helps one check wether the true parameters correspond to your data set

Why do prediction when you have true values

We are using them as targets to train a neural network, so that the network learns to predict those parameters from the SEIR data (the population states over time). This is a supervised learning task

So, even though you know the true parameters, you’re training the model to learn how to infer those parameters from observational data, which is useful in real-world settings where the true parameters are unknown in another data set. To ensure that the prediction is correct.

Parameter estimation using neural networks

Where we dont have true values but only estimated parameters

# Install required packages if not already installed
if (!requireNamespace("nnet", quietly = TRUE)) install.packages("nnet")
if (!requireNamespace("deSolve", quietly = TRUE)) install.packages("deSolve")
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")

library(nnet)
library(deSolve)
library(readr)
library(ggplot2)

# --------------------------------
# Step 1: Define the SEIR model
# --------------------------------
seir_model <- function(t, state, params) {
  with(as.list(c(state, params)), {
    beta  <- params[1]
    sigma <- params[2]
    gamma <- params[3]
    
    dS <- -beta * S * I / (S + E + I + R)
    dE <- beta * S * I / (S + E + I + R) - sigma * E
    dI <- sigma * E - gamma * I
    dR <- gamma * I
    
    list(c(dS, dE, dI, dR))
  })
}

# --------------------------------
# Step 2: Load observed data (no true parameters known)
# --------------------------------
# Replace this with your real or synthetic dataset path
data <- read_csv("malaria_synthetic_data.csv")  # assuming it's in your working directory
times <- data$time
observed_SEIR <- data[, c("S", "E", "I", "R")]

# Initial conditions
init_state <- c(S = data$S[1], E = data$E[1], I = data$I[1], R = data$R[1])

# --------------------------------
# Step 3: Estimate parameters via ODE + optimization
# --------------------------------
cat("Starting parameter estimation...\n")
Starting parameter estimation...
objective_function <- function(params, times, observed_I, init_state) {
  out <- ode(y = init_state, times = times, func = seir_model, parms = params)
  pred_I <- out[, "I"]
  mse <- mean((pred_I - observed_I)^2, na.rm = TRUE)
  return(mse)
}

initial_guess <- c(beta = 0.6, sigma = 0.15, gamma = 0.08)

fit <- optim(par = initial_guess,
             fn = objective_function,
             times = times,
             observed_I = data$I,
             init_state = init_state,
             method = "L-BFGS-B",
             lower = c(beta = 0.01, sigma = 0.01, gamma = 0.01),
             upper = c(beta = 2, sigma = 1, gamma = 1))

estimated_params <- fit$par
cat("✅ Estimated parameters:\n")
✅ Estimated parameters:
print(estimated_params)
      beta      sigma      gamma 
0.59404315 0.28795939 0.09771527 
# --------------------------------
# Step 4: Generate synthetic data using estimated parameters
#         This simulates pseudo-labeled data for training a neural network
# --------------------------------

set.seed(123)
num_datasets <- 50   # Number of pseudo-datasets to generate
time_steps <- nrow(data)

dataset_list <- list()

cat("Generating pseudo-labeled datasets using estimated parameters...\n")
Generating pseudo-labeled datasets using estimated parameters...
for (i in 1:num_datasets) {
  
  # Perturb the estimated parameters slightly to simulate variation
  beta_perturbed <- estimated_params[1] * runif(1, 0.9, 1.1)
  sigma_perturbed <- estimated_params[2] * runif(1, 0.9, 1.1)
  gamma_perturbed <- estimated_params[3] * runif(1, 0.9, 1.1)
  
  # Randomized initial conditions
  S0 <- round(runif(1, 9000, 10000))
  E0 <- round(runif(1, 20, 100))
  I0 <- round(runif(1, 20, 100))
  R0 <- 0
  
  init_state_sim <- c(S = S0, E = E0, I = I0, R = R0)
  times_sim <- 1:time_steps
  
  # Solve ODE
  out_sim <- ode(y = init_state_sim, times = times_sim, func = seir_model, parms = c(beta_perturbed, sigma_perturbed, gamma_perturbed))
  
  # Add noise
  noise_level <- 10
  noisy_sim <- out_sim[, 2:5] + matrix(rnorm(4 * nrow(out_sim), 0, noise_level), ncol = 4)
  noisy_sim <- pmax(noisy_sim, 0)
  
  # Store input (SEIR) and output (parameters)
  dataset_list[[i]] <- list(
    input = noisy_sim,
    output = c(beta = beta_perturbed, sigma = sigma_perturbed, gamma = gamma_perturbed)
  )
}

# --------------------------------
# Step 5: Prepare training data
# --------------------------------
X_train <- do.call(rbind, lapply(dataset_list, function(x) x$input))
Y_train <- do.call(rbind, lapply(dataset_list, function(x) matrix(x$output, nrow = nrow(x$input), ncol = 3, byrow = TRUE)))

# Normalize inputs
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
X_train_scaled <- apply(X_train, 2, normalize)

# --------------------------------
# Step 6: Train Neural Network
# --------------------------------
cat("Training neural network...\n")
Training neural network...
nn_model <- nnet(
  X_train_scaled, Y_train,
  size = 10,
  linout = TRUE,
  maxit = 1000,
  decay = 0.01,
  trace = FALSE
)

cat(" Neural network trained.\n")
 Neural network trained.
# --------------------------------
# Step 7: Predict on original observed data
# --------------------------------
# Normalize observed data
observed_scaled <- t(apply(observed_SEIR, 1, function(row) (row - min(X_train)) / (max(X_train) - min(X_train))))

predicted_params <- predict(nn_model, observed_scaled)
avg_prediction <- colMeans(predicted_params)

# Print predicted vs estimated
results <- rbind(
  "Estimated (from optimization)" = estimated_params,
  "Predicted (from neural network)" = avg_prediction
)

print(results)
                                     beta     sigma      gamma
Estimated (from optimization)   0.5940431 0.2879594 0.09771527
Predicted (from neural network) 0.5893953 0.2870263 0.09727007
# --------------------------------
# Step 8: Plot prediction
# --------------------------------
plot_data <- data.frame(
  time = rep(times, 2),
  Infected = c(data$I, observed_SEIR$I),  # dummy plot since we don't know true
  type = rep(c("Observed", "Neural Net Prediction"), each = length(times))
)

# Just visualize predicted parameters
cat("Plotting infected curve using predicted parameters...\n")
Plotting infected curve using predicted parameters...
simulated_with_pred <- ode(y = init_state, times = times, func = seir_model, parms = avg_prediction)

plot_data_nn <- data.frame(
  time = rep(times, 2),
  Infected = c(data$I, simulated_with_pred[, "I"]),
  type = rep(c("Observed", "Simulated w/ Pred Params"), each = length(times))
)

ggplot(plot_data_nn, aes(x = time, y = Infected, color = type)) +
  geom_line(size = 1) +
  labs(title = "Observed vs Simulated Infections Using Predicted Parameters",
       x = "Time", y = "Infected Individuals") +
  theme_minimal()

Parameter estimation using tensor flow

  • If your data set comes from real or observed data, and you don’t know the true parameters (β, σ, γ), then the goal shifts from supervised learning to optimizing parameters by minimizing prediction error. - Instead of a neural network trained on labeled targets, we’ll use TensorFlow/Keras to build a regression model that learns the best-fitting parameters by minimizing the difference between simulated and observed infections over time.

Libraries

# REequired Libraries
library(deSolve)
library(readr)
library(tensorflow)
library(keras)

# LOoad real data
#data <- read_csv("malaria5.csv")
#if (!"day" %in% colnames(data)) {
  #data$day <- seq_len(nrow(data)) - 1}

# EXxtract time and observed infections
#t <- data$day
#observed_I <- data$I

# INnitial Conditions
#N <- 995
#E0 <- 10; I0 <- 5; R0 <- 0
#S0 <- N - E0 - I0 - R0
#y0 <- c(S = S0, E = E0, I = I0, R = R0)

# SeEIR ODE Model

#seir_model <- function(t, y, params) {
  #with(as.list(c(y, params)), {
   # N <- S + E + I + R
   # dS <- -beta * S * I / N
   # dE <- beta * S * I / N - sigma * E
   # dI <- sigma * E - gamma * I
   # dR <- gamma * I
   # list(c(dS, dE, dI, dR))  })}

# WRrapper to simulate infections

#simulate_infections <- function(beta, sigma, gamma) {
#  out <- ode(y = y0, times = t, func = seir_model, parms = c(beta=beta, sigma=sigma, gamma=gamma))
#  return(out[, "I"])}

# TEensorFlow Parameter Variables

#beta <- tf$Variable(0.4, dtype = "float32")
#sigma <- tf$Variable(0.02, dtype = "float32")
#gamma <- tf$Variable(0.05, dtype = "float32")

# OPptimizer

#optimizer <- optimizer_adam(learning_rate = 0.01)

# Observed values as Tensor

#observed_I_tensor <- tf$constant(observed_I, dtype = "float32")

# LOoss function (Mean Squared Error)

#loss_fn <- function() {
 # pred_I <- simulate_infections(beta$numpy(), sigma$numpy(), gamma$numpy())
 # pred_tensor <- tf$constant(pred_I, dtype = "float32")
 # tf$reduce_mean(tf$square(pred_tensor - observed_I_tensor))}

# TRraining Loop
#for (epoch in 1:300) {
#  with(tf$GradientTape() %as% tape, {
 #   loss <- loss_fn()  })
  #grads <- tape$gradient(loss, list(beta, sigma, gamma))
  #optimizer$apply_gradients(purrr::transpose(list(grads, list(beta, sigma, gamma))))
  
 # if (epoch %% 50 == 0) {
 #   cat(sprintf("Epoch %d | Loss = %.4f | beta = %.4f | sigma = %.4f | gamma = %.4f\n",
  #              epoch, loss$numpy(), beta$numpy(), sigma$numpy(), gamma$numpy()))  }}

# FIinal Estimated Parameters
#cat(sprintf("\nEstimated Parameters from Real Data:\nBeta  = %.4f\nSigma = %.4f\nGamma = %.4f\n",
           # beta$numpy(), sigma$numpy(), gamma$numpy()))

3.1. PARAMETER ESTIMATION BY SPLITTING THE DATA INTO TRAINING AND TESTING SET

  1. Generalization Check: By training on one portion of the data and testing on another, we can evaluate whether the estimated parameters generalize well beyond the training period.
  2. Model Validation: The testing set acts as an independent evaluation, helping to identify overfitting, where the model performs well on training data but poorly on unseen data.
  3. Robustness of Parameter Estimates: Ensuring that parameters work for future projections rather than just fitting past data is crucial for forecasting.
  4. Tuning and Comparison: Different model variations (e.g., different parameter constraints, alternative assumptions) can be compared using training-test performance.

We will use the csv file malaria6. We will allow some parameters to be fixed especially those already known from biological knowledge.

Checking the descriptive statistics of the csv file. ## Load libraries

library(readr)   # For reading CSV files
library(dplyr)   # For data manipulation
library(deSolve) # For solving ODEs
library(minpack.lm)  # For nonlinear curve fitting
library(dplyr)  # For data manipulation
library(tidyr)

Load the CSV file

data <- read_csv("malaria6.csv")

Loads the CSV file containing malaria data into a dataframe.

Display the first few rows

cat("First 5 rows of the dataset:\n")
First 5 rows of the dataset:
print(head(data, 5))
# A tibble: 5 × 7
    day     s     E     i     r    Sv    Iv
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     0   980    10     5     0   990    10
2     1   969    28     5     0   983    15
3     2   949    54     7     0   976    20
4     3   920    86    12     0   966    27
5     4   879   127    19     0   950    40

Get basic info about the dataset

cat("\nDataset Summary:\n")

Dataset Summary:
print(str(data))  # Structure of the dataset
spc_tbl_ [100 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ day: num [1:100] 0 1 2 3 4 5 6 7 8 9 ...
 $ s  : num [1:100] 980 969 949 920 879 818 727 601 442 269 ...
 $ E  : num [1:100] 10 28 54 86 127 184 265 373 503 634 ...
 $ i  : num [1:100] 5 5 7 12 19 30 46 70 103 148 ...
 $ r  : num [1:100] 0 0 0 0 0 0 1 3 6 11 ...
 $ Sv : num [1:100] 990 983 976 966 950 925 887 831 752 648 ...
 $ Iv : num [1:100] 10 15 20 27 40 61 93 140 206 291 ...
 - attr(*, "spec")=
  .. cols(
  ..   day = col_double(),
  ..   s = col_double(),
  ..   E = col_double(),
  ..   i = col_double(),
  ..   r = col_double(),
  ..   Sv = col_double(),
  ..   Iv = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
NULL

Check for missing values

cat("\nMissing values per column:\n")

Missing values per column:
print(colSums(is.na(data)))
day   s   E   i   r  Sv  Iv 
  0   0   0   0   0   0   0 

Describe numerical data

cat("\nStatistical Summary:\n")

Statistical Summary:
print(summary(data))
      day              s               E               i        
 Min.   : 0.00   Min.   :  8.0   Min.   : 10.0   Min.   :  5.0  
 1st Qu.:24.75   1st Qu.: 24.0   1st Qu.:164.8   1st Qu.:346.0  
 Median :49.50   Median : 30.5   Median :180.0   Median :358.0  
 Mean   :49.50   Mean   :100.3   Mean   :229.1   Mean   :373.7  
 3rd Qu.:74.25   3rd Qu.: 33.0   3rd Qu.:194.8   3rd Qu.:459.8  
 Max.   :99.00   Max.   :980.0   Max.   :748.0   Max.   :572.0  
       r                Sv               Iv       
 Min.   :   0.0   Min.   : 55.00   Min.   : 10.0  
 1st Qu.: 329.8   1st Qu.: 67.75   1st Qu.:297.0  
 Median : 773.0   Median : 85.00   Median :298.5  
 Mean   : 653.5   Mean   :167.98   Mean   :312.3  
 3rd Qu.: 982.5   3rd Qu.: 87.00   3rd Qu.:316.8  
 Max.   :1131.0   Max.   :990.00   Max.   :617.0  

This helps you understand what the dataset looks like and whether it’s clean.

Load and preprocess data

Data Preprocessing

data <- read.csv("malaria6.csv") %>% 
  rename_with(tolower) %>%
  mutate(day = seq_len(n()) - 1) %>%  # Start day count at 0
  rename(infected = i) %>%  # Rename to avoid conflicts
  drop_na()
  1. Load the CSV
  • data <- read.csv(“malaria6.csv”) %>%: Reads the CSV file “malaria6.csv” using base R’s read.csv().
  1. Convert Column Names to Lowercase
  • rename_with(tolower) %>%: converts all column names in the data frame to lowercase. This helps avoid issues with inconsistent naming like “I” vs “i” or “Time” vs “time”.
  1. Add a time Column
  • mutate(day = seq_len(n()) - 1) %>%: mutate() adds a new column called day.
  • seq_len(n()): creates a sequence from 1 to the number of rows (n() is the number of rows in the current data). Subtracting 1 makes the sequence start at 0: 0, 1, 2, …, N-1. This is often used for time steps in simulations.
  1. Rename the Infected Column
  • rename(infected = i) %>%: Renames the column i to infected. This might be to Make the name clearer and avoid naming conflicts with variables or functions.
  1. Drop Any Missing Data
  • drop_na(): Removes any rows in the dataset that contain NA (missing values). Ensures the data is clean for modeling or plotting.

The Final Result: Has all-lowercase column names, Includes a day column starting at 0.Uses infected instead of i and has no missing values.

Check required column exists

if(!"infected" %in% colnames(data)) {
  stop("Required column 'infected' missing. Check your CSV file.")
}

cat("Available columns in CSV:", paste(colnames(data), collapse = ", "), "\n")
Available columns in CSV: day, s, e, infected, r, sv, iv 
  • if(!“infected” %in%: This checks whether the column “infected” exists in the data frame.
  • colnames(data): gets a vector of all column names.
  • “infected” %in% colnames(data): returns TRUE if “infected” is among them.
  • The ! negates it: if “infected” is not present, then stop() halts the execution of the program and prints the given error message.

️ Print Column Names - cat(“Available columns in CSV:”, paste(colnames(data), collapse = “,”), “”): This prints out all available column names, separated by commas. - paste(…, collapse = “,”): combines them into one string like: “s, e, infected, r, day” - cat() prints that message without quotes or line breaks, adding a newline at the end.

Summary Ensures the “infected” column is in the dataset. Stops the script with a clear error message if it’s missing. Prints all the current column names in the data, so the user can quickly verify or debug.

Split data into training and testing sets (80% training)

split_index <- floor(nrow(data) * 0.8)
train_data <- data[1:split_index, ]
test_data <- data[(split_index+1):nrow(data), ]

80% training data for fitting the model, 20% test data for evaluation (used later) - split_index <- floor(nrow(data) * 0.8): Gives the total number of rows in your dataset. - 0.8 means taking 80% of the data for training. - floor() ensures the index is a whole number (it rounds down).

This takes the first 80% of the rows (from row 1 to split_index).Stores them in train_data. These rows are used to fit/train your model.

Create Testing Set - test_data <- data[(split_index+1):nrow(data), ]: This takes the remaining 20% of the rows (from split_index + 1 to the end). Stores them in test_data. These are used to test/evaluate the model’s performance.

Why Split Like This: Helps avoid overfitting (where a model memorizes the training data) and allows you to assess how well the model generalizes to new, unseen data.

80/20 splits are a common rule of thumb, though you might also use 70/30.

SEIR Model with Vector Dynamics

malaria_model <- function(t, state, params) {
  with(as.list(c(state, params)), {
    dS <- Lambda_h - beta_h * S * Iv - mu_h * S + delta_h * Rec
    dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
    dI <- sigma_h * E - (gamma_h + mu_h) * I
    dRec <- gamma_h * I - (delta_h + mu_h) * Rec
    dSv <- Lambda_v - beta_v * Sv * I - mu_v * Sv
    dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
    
    return(list(c(dS, dE, dI, dRec, dSv, dIv)))
  })
}

Simplified wrapper function for parameter estimation

Some parametrs are fixed based on biological knowledge

model_fit <- function(t, beta_h, beta_v, gamma_h, nu_v) {
  # Fixed parameters based on biological knowledge
  fixed_params <- c(
    Lambda_h = 10, Lambda_v = 100,
    sigma_h = 0.1, delta_h = 0.01,
    mu_h = 0.0001, mu_v = 0.1
  )
  
  initial_conditions <- c(S = 990, E = 5, I = 5, Rec = 0, Sv = 500, Iv = 10)
  params <- c(fixed_params, beta_h = beta_h, beta_v = beta_v, 
              gamma_h = gamma_h, nu_v = nu_v)
  
  sol <- ode(y = initial_conditions, times = t, func = malaria_model, 
             parms = params, method = "lsoda")
  return(sol[,"I, S, E, R"])  # Return only infected counts
}
    • model_fit <- function(t, beta_h, beta_v, gamma_h, nu_v) { :This defines a function model_fit which takes four parameters as input.
  • t: A vector of time points at which the model should be evaluated.
  • beta_h: Transmission rate from mosquitoes to humans, beta_v: Transmission rate from humans to mosquitoes, gamma_h: Recovery rate of humans, nu_v: Extrinsic incubation period (the time it takes for a mosquito to become infected after feeding on an infected human).
  1. Fixed Parameters
  • fixed_params <- c( Lambda_h = 10, Lambda_v = 100, sigma_h = 0.1, delta_h = 0.01, mu_h = 0.0001, mu_v = 0.1 ) : This sets up a vector fixed_params containing fixed values for various biological parameters that are assumed to be constant across the simulation. These parameters include: Lambda_h: Human birth rate, Lambda_v: Mosquito birth rate, sigma_h: Rate at which exposed humans become infected, delta_h: Rate at which infected humans recover, mu_h: Natural death rate of humans, mu_v: Natural death rate of mosquitoes.
  1. Initial Conditions
  • initial_conditions <- c(S = 990, E = 5, I = 5, Rec = 0, Sv = 500, Iv = 10) : This sets up the initial conditions for the model:
  1. Parameters Combination
  • params <- c(fixed_params, beta_h = beta_h, beta_v = beta_v, gamma_h = gamma_h, nu_v = nu_v): This combines the fixed parameters and the parameters passed into the function (beta_h, beta_v, gamma_h, and nu_v) into a single vector called params. This vector will be passed to the ODE solver.
  1. ODE Solver Call
  • sol <- ode(y = initial_conditions, times = t, func = malaria_model, parms = params, method = “lsoda”) :This calls the ode function (from the deSolve package) to solve the system of differential equations:
  • y = initial_conditions: Specifies the initial values of the compartments (S, E, I, Rec, Sv, Iv).
  • times = t: A vector of time points at which the solution is to be computed.
  • func = malaria_model: The function malaria_model contains the system of differential equations for the malaria spread.
  • parms = params: The parameters (both fixed and dynamic) that will be used in the model.
  • method = “lsoda”: Specifies the numerical integration method (lsoda is a robust method for solving stiff ODEs).
  1. Return Infected Population
  • return(sol[,“I”]) # Return only infected counts. :The ode function returns a matrix where each row corresponds to a time point, and each column corresponds to a compartment (S, E, I, Rec, Sv, Iv). This line extracts only the infected population (I) from the solution and returns it.

Summary: Purpose: The model_fit function is used to simulate the number of infected humans (I) over time for a malaria model.

Inputs: - Time points t at which the model is evaluated. - Parameters for transmission rates (beta_h, beta_v), recovery rate (gamma_h), and extrinsic incubation period (nu_v). - Outputs: The function returns a vector of infected individuals (I) at each time step in t.

This function is useful for fitting the model to real data, as it computes the predicted infected population over time given the parameters and initial conditions.

Parameter estimation function

estimate_parameters <- function(data) {
  t <- data$day
  I_data <- data$infected  # Use actual infected counts

  # Starting values based on typical malaria parameters
  start_vals <- c(
    beta_h = 0.002,  # Transmission rate from mosquitoes to humans
    beta_v = 0.001,  # Transmission rate from humans to mosquitoes
    gamma_h = 0.05,  # Recovery rate
    nu_v = 0.1       # Extrinsic incubation period
  )
  
  # Set reasonable bounds
  lower_bounds <- c(1e-4, 1e-4, 0.01, 0.01)
  upper_bounds <- c(0.1, 0.1, 1, 1)
  
  # Perform fitting with error handling
  tryCatch({
    fit <- nlsLM(
      I_data ~ model_fit(t, beta_h, beta_v, gamma_h, nu_v),
      start = start_vals,
      lower = lower_bounds,
      upper = upper_bounds,
      control = nls.lm.control(maxiter = 500, ftol = 1e-6)
    )
    return(coef(fit))
  }, error = function(e) {
    message("Parameter estimation failed with error: ", e$message)
    message("Returning starting values as fallback")
    return(start_vals)
  })
}

step 1 - estimate_parameters <- function(data) { This defines a function called estimate_parameters. It takes data as input. data must have at least two columns: day (time points), infected (observed infected human counts).

Step 2: Extract day and infected from the data - t <- data\(day - I_data <- data\)infected

t becomes the vector of time points. I_data becomes the real-world data for the number of infected people.

Step 3: Set starting values for parameters - start_vals <- c( - beta_h = 0.002, # mosquito → human transmission rate - beta_v = 0.001, # human → mosquito transmission rate - gamma_h = 0.05, # human recovery rate - nu_v = 0.1 # mosquito incubation rate )

Starting guesses for the four parameters you want to estimate. Good starting points help the algorithm converge faster.

Step 4: Set bounds for the parameters -lower_bounds <- c(1e-4, 1e-4, 0.01, 0.01) -upper_bounds <- c(0.1, 0.1, 1, 1) -Lower and upper limits for each parameter.

Makes sure the fitted parameters stay realistic and biological.

Step 5: Fit the model to the data using nlsLM fit <- nlsLM( -I_data ~ model_fit(t, beta_h, beta_v, gamma_h, nu_v), -start = start_vals, -lower = lower_bounds, -upper = upper_bounds, -control = nls.lm.control(maxiter = 500, ftol = 1e-6) )

nlsLM = non-linear least squares fitting (from the minpack.lm package). It tries to find values of beta_h, beta_v, gamma_h, and nu_v that make model_fit(t, …) match I_data.

  • maxiter = 500 = allow up to 500 iterations.
  • ftol = 1e-6 = very small tolerance for convergence (tight fitting).

Step 6: Return the fitted parameters -return(coef(fit)) If fitting succeeds, coef(fit) gives the estimated values of the parameters.

Step 7: Handle errors gracefully -}, error = function(e) { -message(“Parameter estimation failed with error:”, e$message) -message(“Returning starting values as fallback”) -return(start_vals) })

If fitting fails (for example, the model doesn’t converge), the function: Prints an error message. Returns the starting values instead of crashing.

Estimate parameters using training data

params_opt <- estimate_parameters(train_data)
param_names <- c("Beta_h", "Beta_v", "Gamma_h", "Nu_v")
cat("\nEstimated Parameters:\n")

Estimated Parameters:
for(i in seq_along(params_opt)) {
  cat(sprintf("%-8s: %.6f\n", param_names[i], params_opt[i]))
}
Beta_h  : 0.002000
Beta_v  : 0.001000
Gamma_h : 0.050000
Nu_v    : 0.100000
  • The above parameters are the ones to be estimated. the others were fixed.

  • params_opt <- estimate_parameters(train_data) Calls the function with your train_data dataset.Stores the output (the estimated parameters) into params_opt.

  • param_names <- c(“Beta_h”, “Beta_v”, “Gamma_h”, “Nu_v”) -cat(“Parameters:”) -for(i in seq_along(params_opt)) { -cat(sprintf(“%-8s: %.6f”, param_names[i], params_opt[i])) }

Prepares nice names for printing (param_names). Loops through the estimated values (params_opt) and prints them one by one neatly.

In summary: This code takes your real malaria data and tries to find the best parameters (β_h, β_v, γ_h, ν_v) so that your model matches the data.

Set up the problem, Fit the model, Handle errors, Print results

Return estimated parameters as a named list

final_params <- list(
  beta_h = params_opt["beta_h"],
  beta_v = params_opt["beta_v"],
  gamma_h = params_opt["gamma_h"],
  nu_v = params_opt["nu_v"],
  
  # Fixed parameters
  Lambda_h = 10,
  Lambda_v = 100,
  sigma_h = 0.1,
  delta_h = 0.01,
  mu_h = 0.0001,
  mu_v = 0.1
)

The code is creating a list of all parameters needed to run your malaria model. This list combines the The estimated parameters you just fit (beta_h, beta_v, gamma_h, nu_v) Plus the fixed biological parameters you already knew (Lambda_h, Lambda_v, etc.)

Steps 1. final_params <- list(…) Create a new list called final_params. 2. beta_h = params_opt[“beta_h”] Pull the estimated beta_h value from your fitted parameters (params_opt). 3. beta_v = params_opt[“beta_v”] Pull the estimated beta_v value. 4. gamma_h = params_opt[“gamma_h”] Pull the estimated human recovery rate. 5. nu_v = params_opt[“nu_v”] Pull the estimated mosquito incubation rate. 6. Lambda_h = 10 Set human recruitment (birth rate) fixed at 10. 7. Lambda_v = 100 Set mosquito recruitment rate fixed at 100. 8. sigma_h = 0.1 Set human progression rate (E → I) fixed at 0.1. 9. delta_h = 0.01 Human disease-induced death rate fixed at 0.01. 10. mu_h = 0.0001 Natural human death rate fixed at 0.0001. 11. mu_v = 0.1 Natural mosquito death rate fixed at 0.1.

Why build final_params? Because when you run your ODE model (using ode() again), it needs all the parameters — both estimated and fixed.

We then have a clean list final_params we can easily plug into our model.

3.2. Estimating the parameters with no parameter fixed by splitting the data sets

The code below estimates all the parameters from the training set. No parameter is fixed.

library(deSolve)
library(ggplot2)
library(dplyr)
library(tidyr)

Load and preprocess data

data <- read.csv("malaria6.csv") %>%
  rename_all(tolower) %>%
  rename_all(~ gsub(" ", "", .)) %>%
  drop_na() %>%
  mutate(day = seq_len(n()) - 1)  # Ensure day starts at 0

-data <- read.csv(“malaria6.csv”) %>% -rename_all(tolower) %>% - rename_all(~ gsub(” “,”“, .)) %>% - drop_na() %>% - mutate(day = seq_len(n()) - 1) # Ensure day starts at 0

  1. read.csv(“malaria6.csv”): Read the CSV file into R. Loads your malaria data into memory.
  2. %>% rename_all(tolower): Make all column names lowercase. Avoids issues like Infected vs. infected.
  3. %>% rename_all(~ gsub(” “,”“, .)): Remove any spaces from column names. Prevents errors if columns were named like”Infected Cases”.
  4. %>% drop_na(): Drop any rows with missing values (NA). So your model won’t break from missing data.
  5. %>% mutate(day = seq_len(n()) - 1): Create a new day column: 0, 1, 2, …, (n-1). Sets up a time variable for model fitting, starting at day 0. The data is cleaned — lowercase, no spaces, no missing values, the data gets a new day column for time-series modeling, All this happens in one smooth pipeline using the dplyr package (because of %>% pipes).

Check for required column

if (!"i" %in% colnames(data)) {
  stop("The column 'i' is missing from the CSV file. Please check the column names.")
}

-if (!“i” %in% colnames(data)) { - stop(“The column ‘i’ is missing from the CSV file. Please check the column names.”) }

Steps 1 colnames(data): Get the list of all column names in your data. Check what columns are available. 2 “i” %in% colnames(data): Check if “i” exists in the column names. Logical test: TRUE if “i” is found, FALSE if missing. 3 ! Negate the result (i.e., if “i” is NOT found). So the if triggers only if “i” is missing. 4 stop(“…”) If “i” is missing, stop the program immediately and print an error message. Prevents running the model with wrong/missing data!

Why is this important? The data must have a column called “i” for infected humans for the model to work. If it’s missing (maybe a typo in the CSV), you want to catch the problem early rather than get strange errors later in the simulation. Stop() completely halts the script, so you don’t waste time running broken code.

In short: Check if your important column “i” exists. If not, immediately stop with a clear error message.

Split data into training and testing

split_index <- floor(nrow(data) * 0.8)
train_data <- data[1:split_index, ]
test_data <- data[(split_index + 1):nrow(data), ]

-split_index <- floor(nrow(data) * 0.8) -train_data <- data[1:split_index, ] -test_data <- data[(split_index + 1):nrow(data), ]

Steps 1 nrow(data): Get the number of rows in the dataset. Find out how many time points you have. 2 nrow(data) * 0.8: Multiply by 0.8. Take 80% of the data for training. 3 floor(…): Round down to the nearest whole number. You can’t have partial rows, so use whole numbers only. 4 split_index <- …: Store the index where you split the data. So you know where training ends and testing begins. 5 train_data <- data[1:split_index, ]: Select the first 80% of rows. These will be used for training the model (fitting parameters). 6 test_data <- data[(split_index + 1):nrow(data), ]: Select the remaining 20% of rows. These will be used for testing (seeing how well the model predicts new data).

Example: Suppose you have nrow(data) = 100 rows. split_index = floor(100 × 0.8) = 80. Then train_data = rows 1 to 80, test_data = rows 81 to 100

Why do this? We need training data to estimate the model’s parameters and testing data to check if the model generalizes well (doesn’t just “memorize” the training data). This is standard in machine learning and time series modeling too.

SEIR with vector dynamics

seir_model <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
    dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
    dI <- sigma_h * E - (gamma_h + mu_h) * I
    dR <- gamma_h * I - (delta_h + mu_h) * R
    dSv <- lambda_v - beta_v * Sv * I - mu_v * Sv
    dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
    list(c(dS, dE, dI, dR, dSv, dIv))
  })
}

Objective function for optimization

objective_function <- function(params, time, observed_I) {
  initial_conditions <- c(S = 990, E = 5, I = 5, R = 0, Sv = 500, Iv = 10)
  parameters <- list(
    lambda_h = params[1], lambda_v = params[2],
      beta_h = params[3], beta_v = params[4],
    sigma_h = params[5], gamma_h = params[6],
    delta_h = params[7], mu_h = params[8],
    mu_v = params[9], nu_v = params[10]
  )
  
  times <- seq(min(time), max(time), by = 1)
  sol <- tryCatch({
    ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)
  }, error = function(e) {
    return(matrix(NA, nrow = length(times), ncol = 7))
  })
  
  pred_I <- sol[, "I"]
  pred_log_I <- log(pred_I + 1)
  return(pred_log_I)
}

Steps 1 objective_function <- function(params, time, observed_I) {: Define a new function that takes: params, time, and observed_I. params are what you are trying to estimate (like beta, gamma, etc.). 2 initial_conditions <- c(…): Set the starting populations for humans and mosquitoes. You need initial values for the ODE solver. 3 parameters <- list(…): Map the incoming params vector into named model parameters. The ODE solver needs named parameters (like beta_h, mu_h, etc.). 4 times <- seq(min(time), max(time), by = 1): Create a time sequence from minimum to maximum day. You want the model to simulate at each day (1-day steps). 5 sol <- tryCatch({…}): Solve the system of ODEs using the seir_model function. This gives predicted values over time. 6 Inside tryCatch: If solving fails (for bad params), return a matrix full of NAs. Avoid crashes during fitting. 7 pred_I <- sol[, “I”]: Extract only the predicted number of infected humans (I). Focus on matching infections to observed data. 8 pred_log_I <- log(pred_I + 1): Take the log of predicted infected (+1 to avoid log(0)). Helps stabilize variance and avoid huge errors. 9 return(pred_log_I): Return the log-transformed predictions. Later used to compare against actual data.

Why log-transform the predictions? - smooths large spikes in infection numbers. - reduces the effect of very large numbers overpowering the fitting. - makes loss functions behave nicely.

Parameter estimation function

estimate_parameters <- function(train_data) {
  time <- train_data$day
  observed_I <- train_data$i
  
  #initial_guess <- rep(0.1, 10)
  #lower_bounds <- rep(0, 10)
  #upper_bounds <- rep(1, 10)
  
  initial_guess <- rep(0.1, 10)
  lower_bounds <- rep(0, 10)
  upper_bounds <- rep(1, 10)
  
  result <- optim(
    par = initial_guess,
    fn = function(params) {
      pred_log_I <- objective_function(params, time, observed_I)
      sum((pred_log_I - log(observed_I + 1))^2, na.rm = TRUE)
    },
    method = "L-BFGS-B",
    lower = lower_bounds,
    upper = upper_bounds,
    control = list(maxit = 1000)
  )
  
  return(result$par)
}
  • estimate_parameters <- function(train_data) { time <- train_data\(day observed_I <- train_data\)i

This function prepares the training set by by estimating the parameters time = the days (0, 1, 2, …). and observed_I = the number of infected individuals (“i” column).

  • initial_guess <- rep(0.1, 10) lower_bounds <- rep(0, 10) upper_bounds <- rep(1, 10)

Step 2: Set starting points and boundaries for parameters initial_guess = rep(0.1, 10) → Start of guessing all parameters are 0.1. lower_bounds = rep(0, 10) → Minimum values for 10 parameters (all ≥ 0). upper_bounds = rep(1, 10) → Maximum values for 13 parameters (all ≤ 1).

  • result <- optim( par = initial_guess, fn = function(params) { pred_log_I <- objective_function(params, time, observed_I) sum((pred_log_I - log(observed_I + 1))^2, na.rm = TRUE) }, method = “L-BFGS-B”, lower = lower_bounds, upper = upper_bounds, control = list(maxit = 1000) )

Step 3: Optimize the parameters using optim()

par = initial_guess: Start optimization from initial guesses. fn = function(params): Define what we are minimizing: the sum of squared differences between predicted and observed (both log-transformed). method = “L-BFGS-B”: A good optimization algorithm that allows lower and upper bounds. lower, upper Limits for parameters. control = list(maxit = 1000): Allow up to 1000 iterations for better convergence.

This code finds the set of parameters that makes your model match the real infection data as closely as possible.

Step 4: Return optimized parameter values result$par contains the best set of parameter estimates after fitting.

  • params_opt <- estimate_parameters(train_data): Run the estimate_parameters function to fit the model on your training data and Store the best parameters into params_opt.

  • param_names <- c(“Lambda_h”, “Lambda_v”, “Beta_h”, “Beta_v”, “Sigma_h”, “Gamma_h”, “Delta_h”, “Mu_h”, “Mu_v”, “Nu_v”) List the names of the parameters you’re estimating.

  • cat(“Estimated Parameters:”) for (i in seq_along(params_opt)) { cat(sprintf(“%-10s: %.4f”, param_names[i], params_opt[i])) }: Nicely print the results For each parameter, print its name and its estimated value (formatted to 4 decimals).

In short: Define a function to find the best parameters.

Use optim() to minimize prediction error.

Log-transform infections to smooth variability.

Print out your best estimates nicely.

Quick Summary Table:

NOTE optim() Fits the model by minimizing prediction errors. initial_guess Where optimizer starts looking. lower_bounds, upper_bounds Constrain parameters to be realistic (e.g., [0,1]). log(I + 1) Stabilizes the error by reducing influence of big values. cat(sprintf(…)) Print the estimated values cleanly for inspection.

Solve with estimated parameters

initial_conditions <- c(S = 990, E = 5, I = 5, R = 0, Sv = 500, Iv = 10)
parameters <- list(
  lambda_h = params_opt[1], lambda_v = params_opt[2],
  beta_h = params_opt[3], beta_v = params_opt[4],
  sigma_h = params_opt[5], gamma_h = params_opt[6],
  delta_h = params_opt[7], mu_h = params_opt[8],
  mu_v = params_opt[9], nu_v = params_opt[10]
)

times <- seq(0, max(data$day), by = 1)
solution <- ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)
DLSODA-  At T (=R1), too much accuracy requested  
      for precision of machine..  See TOLSF (=R2) 
In above message, R1 = 1, R2 = nan
 
  • times <- seq(0, max(data$day), by = 1) solution <- ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)

Step 1: Define Initial Conditions initial_conditions <- c(S = 990, E = 5, I = 5, R = 0, Sv = 500, Iv = 10) What happens:

You define the starting values for each compartment in your SEIR model:

S = 990 → Susceptible humans (how many are initially at risk of infection).

E = 5 → Exposed humans (those who are infected but not yet infectious).

I = 5 → Infected humans (currently infectious individuals).

R = 0 → Recovered humans (those who have recovered and are now immune).

Sv = 500 → Susceptible mosquitoes.

Iv = 10 → Infected mosquitoes.

Why it’s needed: These initial values serve as the starting point for your differential equation (ODE) solver. They tell the model how to begin, and the solver will evolve these values over time based on the disease dynamics.

Estimate parameters

params_opt <- estimate_parameters(train_data)
param_names <- c("Lambda_h", "Lambda_v", "Beta_h", "Beta_v", "Sigma_h", 
                 "Gamma_h", "Delta_h", "Mu_h", "Mu_v", "Nu_v")

cat("Estimated Parameters:\n")
Estimated Parameters:
for (i in seq_along(params_opt)) {
  cat(sprintf("%-10s: %.4f\n", param_names[i], params_opt[i]))
}
Lambda_h  : 1.0000
Lambda_v  : 0.9258
Beta_h    : 0.9228
Beta_v    : 0.9210
Sigma_h   : 0.0220
Gamma_h   : 0.0000
Delta_h   : 1.0000
Mu_h      : 0.0100
Mu_v      : 0.0000
Nu_v      : 0.0000

Step 2: Define Parameters (from params_opt) parameters <- list( lambda_h = params_opt[1], lambda_v = params_opt[2], beta_h = params_opt[3], beta_v = params_opt[4], sigma_h = params_opt[5], gamma_h = params_opt[6], delta_h = params_opt[7], mu_h = params_opt[8], mu_v = params_opt[9], nu_v = params_opt[10] ) This line is creating a list of parameters for the ODE model, which are essential for the disease transmission process. These values are optimized previously and are now being passed into the model. These parameters are the variables that control the dynamics of the disease spread and are key to solving the ODE system correctly. These values are needed by the ode() function to simulate how the infection evolves over time.

Step 3: Create a Time Vector - times <- seq(0, max(data\(day), by = 1) times creates a sequence of time steps starting from day 0 up to the maximum day in your data (max(data\)day)), with a step of 1 (i.e., it runs day-by-day). The ODE solver needs a time vector to simulate the model’s evolution. By providing the exact days over which the model should run, the solver can output predictions for each of those days.

Step 4: Solve the ODE System - solution <- ode(y = initial_conditions, times = times, func = seir_model, parms = parameters) What happens: This line uses the ode() function to numerically solve the system of ordinary differential equations (ODEs) based on the SEIR model.

y = initial_conditions: The initial conditions for the populations (S, E, I, R, Sv, Iv).

times = times: The vector of times over which the solution will be calculated.

func = seir_model: Specifies the function (the SEIR model) that contains the differential equations describing the disease spread.

parms = parameters: Passes in the parameters that control the disease dynamics (the transmission rates, recovery rates, etc.).

Why it’s needed: This step runs the simulation of the disease spread using the SEIR model and the parameters you’ve specified. The ode() function computes the values for each compartment (S, E, I, R, Sv, Iv) at each time step, generating a solution that reflects how the disease evolves over time.

In summary: This code sets up the initial conditions for a SEIR (Susceptible, Exposed, Infected, Recovered) model with both human and mosquito populations, defines the necessary parameters that control the transmission dynamics, and then solves the system of ODEs over a given time period to simulate the disease spread.

The result, stored in solution, contains the predicted values of S, E, I, R, Sv, Iv over time — which can then be compared with actual data to check how well the model fits.

Prepare for plotting

#solution_df <- as.data.frame(solution) %>%
 # rename(Model_I = I, Model_R = R) %>%
 # select(time, Model_I, Model_R)

#plot_data <- data %>%
  #select(day, i) %>%
  #rename(Actual_I = i) %>%
  #left_join(solution_df, by = c("day" = "time"))
# Plot
#ggplot(plot_data, aes(x = day)) +
 # geom_line(aes(y = Model_I, color = "Model Infected"), size = 1) +
  #geom_line(aes(y = Model_R, color = "Model Recovered"), size = 1) +
  #geom_point(aes(y = Actual_I, color = "Actual Infected"), alpha = 0.6, size = 2) +
  #geom_vline(xintercept = split_index, linetype = "dashed", color = "gray") +
  #labs(title = "Comparison of Actual vs. Model Cases",
   #    x = "Days",
    #   y = "Population Count",
     #  color = "Legend") +
  #scale_color_manual(values = c("Model Infected" = "blue", 
   #                             "Model Recovered" = "purple", 
    #                            "Actual Infected" = "red")) +
  #theme_minimal() +
  #theme(legend.position = "bottom")

Step 1: Prepare the Model Data for Plotting - solution_df <- as.data.frame(solution) %>% rename(Model_I = I, Model_R = R) %>% select(time, Model_I, Model_R) What happens: solution contains the output of the ODE solver, which is likely a matrix with columns for the time steps and the values of different compartments (S, E, I, R, Sv, Iv). This line is converting that matrix into a data frame using as.data.frame().

rename(Model_I = I, Model_R = R): renames the columns I and R to Model_I and Model_R, respectively, to better reflect that these are the model’s predictions for the number of infected (I) and recovered (R) individuals.

select(time, Model_I, Model_R): selects only the time, Model_I, and Model_R columns from the solution_df. The time column corresponds to the days (or time steps) for which the model was solved.

Why it’s needed: This step organizes the model’s output into a tidy format (a data frame) with only the columns needed for the plot — the time, model’s predicted infected, and recovered populations.

Step 2: Prepare the Actual Data for Plotting - plot_data <- data %>% select(day, i) %>% rename(Actual_I = i) %>% left_join(solution_df, by = c(“day” = “time”)) What happens: data is the original dataset with actual infection data (e.g., from your CSV file).

-select(day, i): selects the columns day (the day number) and i (the number of infected individuals).

-rename(Actual_I = i): renames the column i to Actual_I, reflecting that this is the observed data for infected individuals. -left_join(solution_df, by = c(“day” = “time”)): merges the model’s predictions (solution_df) with the actual data (data). The merge is done based on the matching day column from the actual data and the time column from the model output.

Why it’s needed:This step combines the actual infection data with the model’s predicted data (infected and recovered populations), so both can be plotted together for comparison.

Step 3: Create the Plot - ggplot(plot_data, aes(x = day)) + geom_line(aes(y = Model_I, color = “Model Infected”), size = 1) + #geom_line(aes(y = Model_R, color = “Model Recovered”), size = 1) + geom_point(aes(y = Actual_I, color = “Actual Infected”), alpha = 0.6, size = 2) + geom_vline(xintercept = split_index, linetype = “dashed”, color = “gray”) + labs(title = “Comparison of Actual vs. Model Cases”, x = “Days”, y = “Population Count”, color = “Legend”) + scale_color_manual(values = c(“Model Infected” = “blue”, #“Model Recovered” = “purple”, “Actual Infected” = “red”)) + theme_minimal() + theme(legend.position = “bottom”) Explanation of the Plot: ggplot(plot_data, aes(x = day)):

This starts the plot with plot_data and specifies that the x-axis will represent the day (time).

geom_line(aes(y = Model_I, color = “Model Infected”), size = 1):

Adds a line plot for the model’s infected population (Model_I).

The color of the line is labeled as “Model Infected”, and the line thickness is set to 1 (size = 1).

geom_line(aes(y = Model_R, color = “Model Recovered”), size = 1): This line is commented out, but it would add a line for the model’s recovered population (Model_R). If you wanted to display it, you could simply uncomment this line.

geom_point(aes(y = Actual_I, color = “Actual Infected”), alpha = 0.6, size = 2):Adds points for the actual observed infected population (Actual_I).

The points are given a red color (“Actual Infected”) and are made semi-transparent with alpha = 0.6.

The size of the points is set to 2.

geom_vline(xintercept = split_index, linetype = “dashed”, color = “gray”): Adds a vertical dashed line at split_index. This line is used to mark a data split point (for example, training/testing split) on the x-axis.

The line is dashed (linetype = “dashed”) and colored gray.

labs(…):

Adds titles and labels to the plot:

Title: “Comparison of Actual vs. Model Cases”.

x-axis label: “Days”.

y-axis label: “Population Count”.

Legend title: “Legend” (for distinguishing actual vs. model data).

scale_color_manual(values = c(…)):

Customizes the colors used in the plot:

“Model Infected” is blue.

“Actual Infected” is red.

The “Model Recovered” color is commented out, but it would be purple if included.

theme_minimal():

Applies a minimal theme for a clean look with no grid lines.

theme(legend.position = “bottom”):

Positions the legend at the bottom of the plot for better readability.

Summary: solution_df holds the model’s predictions, including the infected and recovered individuals over time.

plot_data combines the model’s predictions with actual infection data.

ggplot() creates a plot that:

Displays the actual infected cases as red points.

Displays the model’s predicted infected cases as a blue line.

Optionally shows the model’s predicted recovered cases (if uncommented).

Marks a dashed line to highlight the training/testing split.

The plot helps visually compare the actual infection data with the model’s predictions, which is essential for assessing the model’s accuracy.

Parameter estimation using training and testing set with comparison of the infected graphs

The full code with no parameter estimated

# Required Libraries
library(deSolve)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# Load and preprocess data
data <- read.csv("malaria6.csv") %>%
  rename_all(tolower) %>%
  rename_all(~ gsub(" ", "", .)) %>%
  drop_na() %>%
  mutate(day = seq_len(n()) - 1)

# Check for required column
if (!"i" %in% colnames(data)) {
  stop("The column 'i' is missing from the CSV file. Please check the column names.")
}

# Split data
split_index <- floor(nrow(data) * 0.8)
train_data <- data[1:split_index, ]
test_data <- data[(split_index + 1):nrow(data), ]

# SEIR vector model
seir_model <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- lambda_h - beta_h * S * Iv - mu_h * S + delta_h * R
    dE <- beta_h * S * Iv - (sigma_h + mu_h) * E
    dI <- sigma_h * E - (gamma_h + mu_h) * I
    dR <- gamma_h * I - (delta_h + mu_h) * R
    dSv <- lambda_v - beta_v * Sv * I - mu_v * Sv
    dIv <- beta_v * Sv * I - (nu_v + mu_v) * Iv
    list(c(dS, dE, dI, dR, dSv, dIv))
  })
}

# Objective function
objective_function <- function(params, time, observed_I) {
  initial_conditions <- c(S = 990, E = 5, I = 5, R = 0, Sv = 500, Iv = 10)
  parameters <- list(
    lambda_h = params[1], lambda_v = params[2],
    beta_h = params[3], beta_v = params[4],
    sigma_h = params[5], gamma_h = params[6],
    delta_h = params[7], mu_h = params[8],
    mu_v = params[9], nu_v = params[10]
  )
  
  times <- seq(min(time), max(time), by = 1)
  sol <- tryCatch({
    ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)
  }, error = function(e) {
    return(matrix(NA, nrow = length(times), ncol = 7))
  })
  
  pred_I <- sol[, "I"]
  pred_log_I <- log(pred_I + 1)
  return(pred_log_I)
}

# Parameter estimation
estimate_parameters <- function(train_data) {
  time <- train_data$day
  observed_I <- train_data$i
  
  initial_guess <- rep(0.1, 10)
  lower_bounds <- rep(0, 10)
  upper_bounds <- rep(1, 10)
  
  result <- optim(
    par = initial_guess,
    fn = function(params) {
      pred_log_I <- objective_function(params, time, observed_I)
      sum((pred_log_I - log(observed_I + 1))^2, na.rm = TRUE)
    },
    method = "L-BFGS-B",
    lower = lower_bounds,
    upper = upper_bounds,
    control = list(maxit = 1000)
  )
  
  return(result$par)
}

# Estimate and display parameters
params_opt <- estimate_parameters(train_data)
param_names <- c("Lambda_h", "Lambda_v", "Beta_h", "Beta_v", "Sigma_h", 
                 "Gamma_h", "Delta_h", "Mu_h", "Mu_v", "Nu_v")

cat("Estimated Parameters:\n")
Estimated Parameters:
for (i in seq_along(params_opt)) {
  cat(sprintf("%-10s: %.4f\n", param_names[i], params_opt[i]))
}
Lambda_h  : 1.0000
Lambda_v  : 0.9258
Beta_h    : 0.9228
Beta_v    : 0.9210
Sigma_h   : 0.0220
Gamma_h   : 0.0000
Delta_h   : 1.0000
Mu_h      : 0.0100
Mu_v      : 0.0000
Nu_v      : 0.0000
# Solve SEIR with estimated parameters
initial_conditions <- c(S = 990, E = 5, I = 5, R = 0, Sv = 500, Iv = 10)
parameters <- list(
  lambda_h = params_opt[1], lambda_v = params_opt[2],
  beta_h = params_opt[3], beta_v = params_opt[4],
  sigma_h = params_opt[5], gamma_h = params_opt[6],
  delta_h = params_opt[7], mu_h = params_opt[8],
  mu_v = params_opt[9], nu_v = params_opt[10]
)

times <- seq(0, max(data$day), by = 1)
solution <- ode(y = initial_conditions, times = times, func = seir_model, parms = parameters)

solution_df <- as.data.frame(solution) %>%
  rename(Model_I = I, Model_R = R) %>%
  select(time, Model_I, Model_R)

plot_data <- data %>%
  select(day, i) %>%
  rename(Actual_I = i) %>%
  left_join(solution_df, by = c("day" = "time"))

# Plot
ggplot(plot_data, aes(x = day)) +
  geom_line(aes(y = Model_I, color = "Model Infected"), size = 1) +
  geom_line(aes(y = Model_R, color = "Model Recovered"), size = 1) +
  geom_point(aes(y = Actual_I, color = "Actual Infected"), alpha = 0.6, size = 2) +
  geom_vline(xintercept = split_index, linetype = "dashed", color = "gray") +
  labs(title = "Comparison of Actual vs. Model Cases",
       x = "Days",
       y = "Population Count",
       color = "Legend") +
  scale_color_manual(values = c("Model Infected" = "blue", 
                                #"Model Recovered" = "purple", 
                                "Actual Infected" = "red")) +
  theme_minimal() +
  theme(legend.position = "bottom")

Python values R Studio Values Estimated Parameters: Lambda_h : 0.1269 1.000 Lambda_v : 0.1297 0.9258 Beta_h : 0.0006 0.9228 Beta_v : 0.0037 0.9210 Sigma_h : 0.1426 0.0220 Gamma_h : 0.0408 0.0000 Delta_h : 0.0261 1.0000 Mu_h : 0.0000 0.0100 Mu_v : 0.0000 0.0000 Nu_v : 0.0000 0.0000

The python estimates looks better from the graphs