Definition

The Kalman Filter is a widely used estimation algorithm that plays a critical role in many fields. It is designed to estimate the hidden states of a system whose observations are subject to noise, and it is particularly useful when there is uncertainty about the true state and that state must be estimated from noisy measurements. The Kalman Filter can also predict future system states based on past estimates.

The Kalman Filter recursively updates its estimate of the system’s state using a two-step process: prediction and correction (filtering). It combines information from the previous state estimate, the current observation, and the dynamics of the system to generate an optimal estimate of the current state.

Formulation

The Kalman filter model assumes the true/hidden state of the system at time \(k\) is evolved from the state at \((k-1)\) according to \[ \mathbf{x}_k=\mathbf{F}_k \mathbf{x}_{k-1}+\mathbf{B}_k \mathbf{u}_k+\mathbf{w}_k, \; \mathbf{w}_k \sim N\left(0, \mathbf{Q}_k\right) \] where

  • \(\mathbf{F}_k\) is the state transition model which is applied to the previous state \(\mathbf{x}_{k-1}\). It predicts the state at the next time step based on the current state.

  • \(\mathbf{B}_k\) is the control-input model which is applied to the control vector \(\mathbf{u}_k\).

  • \(\mathbf{Q}_k\) is the process noise covariance matrix. It captures the uncertainty in the state transition process.

At time \(k\) an observation (or measurement) \(\mathbf{z}_k\) of the true state \(\mathbf{x}_k\) is made according to \[ \mathbf{z}_k=\mathbf{H}_k \mathbf{x}_k+\mathbf{v}_k, \hspace{4mm} \mathbf{v}_k \sim N\left(0, \mathbf{R}_k\right)\] where

  • \(\mathbf{H}_k\) is the observation model, which maps the true state space into the observed space. It defines how the state is measured.

  • \(\mathbf{R}_k\) is the covariance of the measurement noise. It captures the uncertainty in the measurements of the system's output.

The initial state and the noise vectors at each time step, \(\left\{\mathbf{x}_0, \mathbf{w}_1, \ldots, \mathbf{w}_k, \mathbf{v}_1, \ldots, \mathbf{v}_k\right\}\), are all assumed to be mutually independent.

In what follows, the notation \(\hat{\mathbf{x}}_{n \mid m}\) represents the estimate of \(\mathbf{x}\) at time \(n\) given observations up to and including time \(m \leq n\).

The state of the filter is represented by two variables:

- \(\hat{\mathbf{x}}_{k \mid k}\), the a posteriori state estimate mean at time \(k\) given observations up to and including time \(k\),

- \(\mathbf{P}_{k \mid k}\), the a posteriori estimate covariance matrix (a measure of the estimated accuracy of the state estimate).

The Kalman filter can be written using a two-step process: “Predict” and “Update”. The predict phase uses the state estimate from the previous time step to produce an estimate of the state at the current time step. In the update phase, the innovation (the pre-fit residual), i.e. the difference between the current a priori prediction and the current observation information, is multiplied by the optimal Kalman gain and combined with the previous state estimate to refine the state estimate. This improved estimate based on the current observation is termed the a posteriori state estimate.

Predict

\[ \begin{align*} \text{Predicted (a priori) state estimate } & : \hat{\mathbf{x}}_{k \mid k-1}=\mathbf{F}_k \hat{\mathbf{x}}_{k-1 \mid k-1}+\mathbf{B}_k \mathbf{u}_{k} \\ \text{Predicted (a priori) estimate covariance } & : \mathbf{P}_{k \mid k-1}=\mathbf{F}_k \mathbf{P}_{k-1 \mid k-1} \mathbf{F}_k^{\top}+\mathbf{Q}_{k} \end{align*} \]

Update (Correction)

\[ \begin{align*} \text{Innovation or measurement pre-fit residual } & : \tilde{\mathbf{y}}_k = \mathbf{z}_k - \mathbf{H}_k \hat{\mathbf{x}}_{k \mid k-1} \\ \text{Innovation (or pre-fit residual) covariance } & : \mathbf{S}_k = \mathbf{H}_k \mathbf{P}_{k \mid k-1} \mathbf{H}_k^{\top} + \mathbf{R}_k \\ \text{Optimal Kalman gain } & : \mathbf{K}_k = \mathbf{P}_{k \mid k-1} \mathbf{H}_k^{\top} \mathbf{S}_k^{-1} \\ \text{Updated (a posteriori) state estimate } & : \hat{\mathbf{x}}_{k \mid k} = \hat{\mathbf{x}}_{k \mid k-1} + \mathbf{K}_k \tilde{\mathbf{y}}_k \\ \text{Updated (a posteriori) covariance } & : \mathbf{P}_{k \mid k} = (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k) \mathbf{P}_{k \mid k-1} \\ \text{Measurement post-fit residual } & : \tilde{\mathbf{y}}_{k \mid k} = \mathbf{z}_k - \mathbf{H}_k \hat{\mathbf{x}}_{k \mid k} \end{align*} \]

The formula for the updated (a posteriori) estimate covariance above is valid for the optimal gain \(\mathbf{K}_k\) that minimizes the residual error, in which form it is most widely used in applications. A proof of the formula can be found in [REF], where the formula valid for an arbitrary \(\mathbf{K}_k\) is also shown.
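
For intuition, consider the scalar special case with \(\mathbf{F}_k = \mathbf{H}_k = 1\) (the setting used in the example below); this is an illustrative reduction, not an additional assumption of the general filter. The update step then becomes \[ K_k = \frac{P_{k \mid k-1}}{P_{k \mid k-1} + R_k}, \hspace{4mm} \hat{x}_{k \mid k} = \hat{x}_{k \mid k-1} + K_k\left(z_k - \hat{x}_{k \mid k-1}\right), \hspace{4mm} P_{k \mid k} = \left(1 - K_k\right) P_{k \mid k-1}, \] so the gain \(K_k \in (0,1)\) balances trust in the prediction against trust in the measurement: a large measurement noise \(R_k\) pushes \(K_k\) toward 0 and the estimate stays close to the prediction, while a small \(R_k\) pushes \(K_k\) toward 1 and the estimate follows the observation.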

We can simplify these formulas for practical implementation:

Predict:

\(X_{\text{predicted}} = FX + BU\)

\(P_{\text{predicted}} = FPF^T + Q\)

Update (Correction):

\(K = P_{\text{predicted}} H^T (H P_{\text{predicted}} H^T + R)^{-1}\)

\(X = X_{\text{predicted}} + K (Y - H X_{\text{predicted}})\)

\(P = (I - K H) P_{\text{predicted}}\)
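
As a minimal sketch of one predict-update cycle in R, with hypothetical one-dimensional values that are not taken from the example below:

# One predict-update cycle with hypothetical scalar values
F <- matrix(1); B <- matrix(0); U <- matrix(0)  # random-walk state model, no control input
H <- matrix(1)                                  # the state is observed directly
Q <- 0.05; R <- 0.1                             # assumed process and measurement noise variances
X <- matrix(0.5); P <- matrix(1)                # initial state estimate and covariance
Y <- 0.8                                        # incoming measurement

# Predict
X_predicted <- F %*% X + B %*% U
P_predicted <- F %*% P %*% t(F) + Q

# Update (correction)
K <- P_predicted %*% t(H) %*% solve(H %*% P_predicted %*% t(H) + R)
X <- X_predicted + K %*% (Y - H %*% X_predicted)
P <- (diag(nrow(F)) - K %*% H) %*% P_predicted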

Example

We use measurements of SARS-CoV-2 RNA concentrations in wastewater (expressed as N gene copies per gram of dry weight solids) for the City of Davis. These measurements are part of a published study [Reference]. Wastewater data can be influenced by various factors beyond local COVID-19 prevalence, such as weather conditions, characteristics of the sewershed, and wastewater composition.

To estimate the underlying true concentrations, we employ a Kalman filter. As key inputs to this process, we use:

  • BCoV Recovery Percentage: The Bovine Coronavirus (BCoV) recovery rate, which serves as an indicator of the RNA recovery rate in the wastewater. We include this in the measurement noise model, since it directly influences how accurately the RNA signal is recovered.

  • Precipitation data: We use precipitation data from the CHIRPS (Climate Hazards Group InfraRed Precipitation with Station) dataset to consider weather-related impacts on the wastewater measurements. Rainfall can dilute wastewater or introduce errors, which can affect RNA concentration measurements. By including precipitation data, we can adjust for these weather-related variations.

# Read data
library(stats) # For Kalman filter
library(ggplot2)
library(zoo)
library(dplyr)
df.all <- read.csv("data_all_wheater.csv")
Plants = unique(df.all$Plant)

#df_ex = df.all[df.all$Plant==Plants[2],]
df_ex = df.all[df.all$City=="Davis",]
df_ex = df_ex[500: nrow(df_ex),]
df_ex <- df_ex %>% rename(precipitation = chirps)
#unique(df.all$City)
#colnames(df_ex )

Time-independent parameters

We set the parameters for the Kalman Filter model, such as the process noise, measurement noise, and initial state estimates. For this example, we selected intuitive values for these parameters based on the observed variance in our dataset. Alternatively, these parameters could be estimated using the Expectation-Maximization (EM) algorithm, which we may demonstrate in a separate example.

# Initialize parameters
n <- length(df_ex$BCoV.Recovery)
F <- matrix(c(1), nrow = 1)  # State transition matrix (random walk: the state is carried over unchanged)
H <- matrix(rep(1, n), nrow = 1)  # Observation coefficient at each time step (all 1 for now)
Q <- rep(0.05, n)  # Process noise variance at each time step
R <- rep(0.1, n)   # Measurement noise variance at each time step

# Initialize state and covariance
P <- 1  # Initial state covariance
Y <- df_ex$SC2_N_gc_g_dry_weight/max(df_ex$SC2_N_gc_g_dry_weight, na.rm = TRUE)  # Normalized signal
#Y <- df_ex$SC2_Norm#/max(df_ex$SC2_Norm,na.rm=T)
X <- Y[1]  # Initial state estimate (first observation)
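
As a purely hypothetical alternative to these fixed constants (not the values used in the rest of this example), the noise variances could be seeded from the data itself, for instance taking R from the variance of the normalized signal and Q from the variance of its step-to-step changes:

# Hypothetical data-driven initialization (an assumption for illustration, not used below)
obs_var  <- var(Y, na.rm = TRUE)        # overall variance of the normalized signal
step_var <- var(diff(Y), na.rm = TRUE)  # variance of step-to-step changes
R_alt <- rep(obs_var, n)                # alternative measurement noise variance
Q_alt <- rep(step_var/2, n)             # alternative process noise variance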

To handle missing (NA) values, we skip the correction step and use the prediction step to estimate the missing data.

sparse_kalman_filter <- function(Y, Q, R, H, F) {
        X <- Y[1]  # Initial state estimate
        P <- 1     # Initial state covariance
        smoothed_signal <- numeric(length(Y))  # Vector to store the filtered signal
        smoothed_signal[1] <- X
        
        # Kalman filter loop
        for (i in 2:length(Y)){
            # Prediction step
            X_predicted <- F %*% X
            P_predicted <- F %*% P %*% t(F) + Q[i]
            
            # Filtering (correction) step, skipped when the observation is missing
            if (!is.na(Y[i])){
              K <- P_predicted %*% t(H[i]) %*% solve(H[i] %*% P_predicted %*% t(H[i]) + R[i])
              X <- X_predicted + K %*% (Y[i] - H[i] %*% X_predicted)
              P <- (1 - K %*% H[i]) %*% P_predicted
            } 
            else {
              # No observation: carry the prediction forward
              X <- X_predicted
              P <- P_predicted
            }
            
            # Save filtered estimate
            smoothed_signal[i] <- X
        }
        
        return(smoothed_signal)
}

We utilized the Kalman filter on our data to produce a filtered signal, and the resulting plot is displayed below:

smoothed_signals1 <- list(sparse_kalman_filter(Y, Q, R, H, F))
add_smoothed_signals <- function(original_signal, smoothed_signals, metadata) {
        df_kalman <- data.frame(
        Date = as.Date(metadata$Date),
        Original_Signal = original_signal,
        prec = metadata$precipitation,
        temp = metadata$temperature)
      
        for (i in seq_along(smoothed_signals)) {
            signal_name = paste0("Filtered_", i)
            df_kalman[[signal_name]] = smoothed_signals[[i]]
        }
      
        return(df_kalman)}

df_kalman1 = add_smoothed_signals(original_signal=Y, smoothed_signals= smoothed_signals1, metadata=df_ex)
library(ggplot2)
library(RColorBrewer)

plot_results<- function(smoothed_signals, df_kalman){
      num_signals <- length(smoothed_signals)
      colors <- brewer.pal(max(3, num_signals+1), "Set1")  
      colors <- rep(colors, length.out = num_signals+1)  
      breaks <- seq(min(df_kalman$Date), max(df_kalman$Date), by = "3 month")
      
      p <- ggplot(df_kalman, aes(x = Date)) +
        geom_point(aes(y = Original_Signal, color = "Original"))+
        labs(x = "Date", y = "N gene gc/g dry weight") +
        theme_minimal() + scale_x_date(breaks = breaks, date_labels = "%b %Y") + labs(color='Wastewater signal') 
      
      
      for (i in seq_along(smoothed_signals)) {
          signal_name <- paste0("Filtered_", i)
          p <- p + geom_point(aes(y = !!sym(signal_name), color = signal_name))
      }
      return(p)
}
plot_results(smoothed_signals1, df_kalman1)
## Warning: Removed 240 rows containing missing values or values outside the scale range
## (`geom_point()`).

The figure above demonstrates how the Kalman filtering process corrects measurements that visually appear to be outliers.

Adding the BCoV recovery

Next, we incorporate BCoV recovery into our measurement noise model. To assess how BCoV recovery impacts our estimation of the underlying process, we plot the differences between the Kalman-filtered signal and the original data, against the BCoV recovery.

#print(head(df_ex))
df_ex$BCoV.Recovery <- na.aggregate(df_ex$BCoV.Recovery)
df_kalman1 = df_kalman1 %>% mutate(res=(Filtered_1-Original_Signal)/Original_Signal)
plot( df_ex$BCoV.Recovery,df_kalman1$res)

In the previous plot, we observe that beyond a certain threshold, approximately at a BCoV recovery value of 2, further increases in BCoV recovery do not significantly affect the variability of the residuals. This suggests a sigmoidal weighting function that assigns lower weight to data points with BCoV recovery below 2; beyond this threshold a uniform weight is suitable, since additional recovery no longer changes the variability. With this approach, we define the variance of the measurement noise to be inversely proportional to the BCoV weight, so smaller BCoV recovery values yield a higher measurement noise variance. This provides a framework for adjusting the model's sensitivity to the observations based on BCoV recovery levels.

f_bcov <- function(x) {
  # Adjust the offset (lower bound) and steepness of the transition
  base_value = 0.7  # starting value
  max_value = 1     # maximum value
  scale_factor = 3 # controls steepness
  transition_point = 2 # controls where the steepness occurs
  
  # Calculate the sigmoidal function
  base_value + (max_value - base_value) / (1 + exp(-scale_factor * (x - transition_point)))
}
plot(seq(0, 5, length.out = 100), f_bcov(seq(0, 5, length.out = 100)))

df_ex$BCoV.proc = sapply(df_ex$BCoV.Recovery, f_bcov)
Rr = R[1]/df_ex$BCoV.proc  # Measurement noise variance, inflated when the BCoV weight is low

smoothed_signals2 <- list(sparse_kalman_filter(Y, Q, R, H, F),
                          sparse_kalman_filter(Y, Q, Rr, H, F))

df_kalman2 = add_smoothed_signals(original_signal=Y,
                                  smoothed_signals= smoothed_signals2,
                                  metadata=df_ex)
plot_results(smoothed_signals2, df_kalman2)
## Warning: Removed 240 rows containing missing values or values outside the scale range
## (`geom_point()`).

The plot above shows both the estimate obtained with a constant measurement noise covariance R, as computed in the first example, and the estimate with BCoV recovery incorporated into R via \(R_t = R/\mathrm{BCoV}_t\). Including BCoV recovery in R allows for more measurement variability when recovery is low. The differences between the two estimates are small, with the \(R_t\)-based predictions slightly lower than the predictions obtained with constant R.
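
To make this comparison concrete, one could, as a hypothetical check that is not part of the original analysis, summarize the average gap between the two filtered signals directly:

# Hypothetical check: mean absolute difference between constant-R and BCoV-adjusted estimates
mean(abs(df_kalman2$Filtered_2 - df_kalman2$Filtered_1), na.rm = TRUE)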

Adding precipitation data

#### Precipitation
df_ex$Date=as.Date(df_ex$Date)
breaks <- seq(min(df_ex$Date), max(df_ex$Date), by = "3 month")
factor_prec <- max(df_ex$precipitation, na.rm = TRUE) / max(df_ex$SC2_N_gc_g_dry_weight, na.rm = TRUE)
plot_prec <- ggplot(df_ex, aes(x = Date)) +
  geom_point(aes(y = precipitation, color = "Precipitation")) +
  geom_point(aes(y = factor_prec * SC2_N_gc_g_dry_weight, color = "SARS-CoV2 conc")) +
  scale_y_continuous("SARS-CoV2", sec.axis = sec_axis(~. * factor_prec, name = "Precipitation"))+scale_x_date(breaks = breaks, date_labels = "%b %Y")

plot_prec
## Warning: Removed 240 rows containing missing values or values outside the scale range
## (`geom_point()`).

#df_ex$precipitation
df_kalman2 = df_kalman2 %>% mutate(res=(Filtered_2-Original_Signal)/Original_Signal)
#plot( df_kalman2$prec,df_kalman2$Original_Signal)
ggplot(df_kalman2  %>% filter(prec>0), aes(x=prec, y=Original_Signal))+geom_point()+geom_smooth(method='lm')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 29 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).


We now introduce precipitation into the observation operator of the Kalman filter, scaling the observation coefficient with the normalized precipitation (\(H_t = 1 + 4\,\mathrm{prec}_s\) in the code below) so that the mapping from the underlying state to the measurement changes on days with rainfall.

df_kalman2$prec <- as.numeric(as.character(df_kalman2$prec))
df_kalman2 = df_kalman2 %>% mutate(prec_s = prec/max(prec, na.rm = TRUE))
Ht= 1+4*df_kalman2$prec_s
smoothed_signals <- list(sparse_kalman_filter(Y, Q, Rr,H,F), sparse_kalman_filter(Y, Q, Rr,Ht,F))

df_kalman_new = add_smoothed_signals(original_signal=Y, smoothed_signals= smoothed_signals, metadata=df_ex)
plot_results(smoothed_signals, df_kalman_new)
## Warning: Removed 240 rows containing missing values or values outside the scale range
## (`geom_point()`).