Box-plots are quite differents between sites. This is interesting to the model transferability.

1 - Caribou watershed data

There are no data in winter as the river is frozen. For a first work, we choose to take only the data from 2019 spring to 2019 end.

1 - Modelling nitrate at caribou with benchmark linear model

The best linear model linking water quality variables to nitrate concentration is chosen with a stepwise process. The best linear model is with all variables :

\[ \begin{aligned} Nitrate_t = & \beta_0 + \beta_1 Conductivity_t + \beta_2 Dissolved Oxygen_t + \\ & \beta_3 Turbidity_t + \beta_4 Temperature_t + \epsilon_t\\ \end{aligned} \] with \(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)

Adjusted R-squared is about 0.36 for this model. All variables are *** relevant.

model_lm<-lm(nitrate_mean ~ temp_mean + spec_cond + oxygen 
              + turbidity , data = data)
library(car)
avPlots(model_lm)

library(MASS)
best_lm_model <- stepAIC(model_lm, direction = "both", 
                      trace = FALSE)
summary(best_lm_model)

## 
## Call:
## lm(formula = nitrate_mean ~ temp_mean + spec_cond + oxygen + 
##     turbidity, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6692  -1.5079   0.1193   2.4988   6.2609 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.6884520  1.4636632  11.402   <2e-16 ***
## temp_mean   -0.0755560  0.0331963  -2.276   0.0229 *  
## spec_cond   -0.0832385  0.0022356 -37.233   <2e-16 ***
## oxygen       1.5064953  0.1040898  14.473   <2e-16 ***
## turbidity   -0.0042587  0.0001988 -21.427   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.365 on 10502 degrees of freedom
##   (12341 observations deleted due to missingness)
## Multiple R-squared:  0.3678, Adjusted R-squared:  0.3675 
## F-statistic:  1527 on 4 and 10502 DF,  p-value: < 2.2e-16

forecast::checkresiduals(best_lm_model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 10456, df = 10, p-value < 2.2e-16

Quick residuals study highlight the temporal autocorrelation is present in residuals.

2 - Modelling nitrate at Caribou with a two-step modelling

#split data into training dataset and predict dataset
data4model<-data_cari[7010:22300,]
data_train<- data4model[1:10000,]
data_predict<- data4model[10001:15291,]

training data set = 10000 point, predicting data set = 5839

2 - 1 Step one: gam model

I used a stepwise model selection to have the best gam model with avaliable environmental variables; The equation of this model is :

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t \end{aligned} \]

with

\(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)

best_model_gam_cari<-mgcv::gam(nitrate_mean ~  s(spec_cond, k = 12) + s(oxygen, k = 12) + 
                            +  s(temp_mean, k = 12) +
                            s(log(turbidity+1) ,  k = 12), data = data_train)

# no_best_model_gam<-mgcv::gam(nitrate_mean ~  s(spec_cond, k = 12) + s(oxygen, k = 12) + 
#                             s(oxygen_sat, k = 12) + s(chloro, k =  12) + s(ph, k = 12) + 
#                             s(turbidity,  k = 12), data = data_train)




p <- getViz(best_model_gam_cari)
(plot(p) +  l_ciPoly() + l_points(alpha=0.2) + l_fitLine() + l_rug()) %>%
  print(pages=1)

best_model_gam_cari<-mgcv::gam(nitrate_mean ~  s(spec_cond, k = 12) + s(oxygen, k = 12) + 
                           + s(temp_mean, k = 12) + s(log(turbidity+1) ,  k = 12), data = data_train, na.action=na.exclude)

#fill where we deleted na
data_train <- data_train %>%
  mutate(gam_res = residuals(best_model_gam_cari)) %>%
  as_tsibble(index=start_date_time) %>%
  fill_gaps()

data_train %>% gg_tsdisplay(gam_res, plot_type='histogram')

Pacf(data_train$gam_res)

This gam model reach a better \(r^2\) (0.87 ) than the benchmark linear model (0.54). But a high temporal auto-correlation is also present in the model residuals \(\epsilon_t\). GCV = 3.2506 ## 3 - 2 - 2 Step two: Modelling temporal auto-correlation in residuals

## Series: best_model_gam_cari$residuals 
## ARIMA(3,1,1) 
## 
## Coefficients:
##         ar1     ar2     ar3      ma1
##       0.498  0.1023  0.0949  -0.6996
## s.e.  0.062  0.0180  0.0129   0.0614
## 
## sigma^2 estimated as 0.1302:  log likelihood=-2457.38
## AIC=4924.75   AICc=4924.76   BIC=4958.38

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,1)
## Q* = 24.766, df = 6, p-value = 0.0003773
## 
## Model df: 4.   Total lags used: 10

P-value of Ljung-Box test is < 0.01 so there is no more information in the residuals. I fit the ARIMA model on the GAM residuals and look at prediction and RMSE

ARIMA predict very weel GAM residuals.

The equation of this model is:

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ u_t\\ \end{aligned} \]

with \(u_t\) ARIMA(3,1,1) errors \(u_t = \rho_1u_{t-1} + \rho_2u_{t-2} + \rho_3u_{t-3}+ \epsilon_t + \phi1\epsilon_{t-1}\) with \(\rho\) the autocorrelation parameters (AR = 3 for us) and \(\phi\) the moving average parameters (MA = 1)

\(\epsilon_t\) term is a new error term that follows the usual assumptions that we make about regression errors: \(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)

3 - 3 Predict with the two-step model

3 - 3 - 1 Future values

I used a training data set to build the model and a test data set to assess the model predictive power. In the following plots, real data is in dark, predicted nirate concentration is in lightblue and associated confidence interval is in darkblue.

3 - 3 - 1 NA values

NA values can come from 2 case : 1/ the sensor did wored, 2/ the sensor gives an anomaly.

The major problem with ARIMA modelling is if there is no data before and after the point we are trying to predict (ex : an anomaly during 2 weeks) the model do not work. As Gam + arima is better than just GAM, When GAM + ARIMA can predict we use it, when GAM only can predict we use it. Confidence intervals are calculated accordingly. This allow to predict more nitrate values.

If gam and ar worked , gam fit + ar on gam residuals:

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]

If gam prediction worked but ar not == take gam model only:

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]

Exemple where GAM + ARIMA didn’t worked because of the ARIMA process do not find neighbors.

plot(as.numeric(nitrate_predictions[7500:8100]), ylab = "NO3- predictions", 
     type = "p",  pch = 19, col = "cyan",
      ylim = c(min(lower[7500:8100], na.rm = TRUE), max(upper[7500:8100], na.rm = TRUE))) #, type="n", ylim=range(lower,upper)
points(as.numeric(data_train$nitrate_mean[7500:8100]), 
       type = "p",  pch = 19) #, type="n", ylim=range(lower,upper)
lines(as.numeric(upper[7500:8100]), ylab = "", col = 4, type = "l", cex = 0.25)
lines(as.numeric(lower[7500:8100]), ylab = "", col = 4,  cex = 0.25)

We can predict NA!!!!

2 - Model Nitrate in Arikaree River creek in Colorado

Model transferability to other geographical region is important for the model to be used for a lot of people. We will find the best model to predict nitrate concentration at new places and compare the used variables and the parameters between the two models.

The Arikaree River is located on the plains of Eastern Colorado and is a tributary to the Republican River.

## # A tibble: 25,041 x 20
##    start_date_time     nitrate_mean label_nitrate_m… temp_mean label_temp_mean
##    <dttm>                     <dbl>            <dbl>     <dbl>           <dbl>
##  1 2019-04-15 03:45:00           NA                1      11.8               0
##  2 2019-04-15 04:00:00           NA                1      11.7               0
##  3 2019-04-15 04:15:00           NA                1      11.6               0
##  4 2019-04-15 04:30:00           NA                1      11.5               0
##  5 2019-04-15 04:45:00           NA                1      11.4               0
##  6 2019-04-15 05:00:00           NA                1      11.3               0
##  7 2019-04-15 05:15:00           NA                1      11.2               0
##  8 2019-04-15 05:30:00           NA                1      11.1               0
##  9 2019-04-15 05:45:00           NA                1      11.0               0
## 10 2019-04-15 06:00:00           NA                1      10.9               0
## # … with 25,031 more rows, and 15 more variables: spec_cond <dbl>,
## #   label_spec_cond <dbl>, oxygen <dbl>, label_oxygen <dbl>, oxygen_sat <dbl>,
## #   label_oxygen_sat <dbl>, ph <dbl>, label_ph <dbl>, chloro <dbl>,
## #   label_chloro <dbl>, turbidity <dbl>, label_turbidity <dbl>, f_dom <dbl>,
## #   label_f_dom <dbl>, time <int>

** linear benchmark model **

model_lm<-lm(nitrate_mean ~ temp_mean + spec_cond + oxygen +
            turbidity , data = data_arik)

summary(model_lm)

## 
## Call:
## lm(formula = nitrate_mean ~ temp_mean + spec_cond + oxygen + 
##     turbidity, data = data_arik)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4333 -1.2985 -0.0447  1.1372 15.0967 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.787e+00  2.478e-01  19.316  < 2e-16 ***
## temp_mean   -1.063e-01  2.368e-03 -44.911  < 2e-16 ***
## spec_cond    2.982e-03  3.783e-04   7.882  3.4e-15 ***
## oxygen       1.930e-02  8.228e-03   2.345    0.019 *  
## turbidity   -5.318e-04  4.725e-05 -11.254  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.94 on 17821 degrees of freedom
##   (17214 observations deleted due to missingness)
## Multiple R-squared:  0.229,  Adjusted R-squared:  0.2288 
## F-statistic:  1323 on 4 and 17821 DF,  p-value: < 2.2e-16

2 - 2 Step one: gam model

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2 (Dissolved Oxygen_t) \\ & f_3 (Turbidity_t) + f_4 (Temperature_t) +\epsilon_t \end{aligned} \]

with

\(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)

Same variables than in Caribou creek were selected, but parameters of smoothing functions are differents

This gam model reach a r-sq of = 0.83. But a high temporal auto-correlation is also present in the model residuals \(\epsilon_t\). GCV = 1.2796

5 - 1 - 2 Step two: Modelling temporal auto-correlation in residuals

We want to predict the best gam model residuals \(\epsilon_t\) with an ARMA model.

## Series: best_model_gam_arik$residuals 
## ARIMA(2,1,1) 
## 
## Coefficients:
##          ar1      ar2      ma1
##       0.0688  -0.1236  -0.3726
## s.e.  0.0350   0.0141   0.0349
## 
## sigma^2 estimated as 0.1304:  log likelihood=-4828.57
## AIC=9665.14   AICc=9665.14   BIC=9694.73

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1)
## Q* = 81.201, df = 7, p-value = 7.883e-15
## 
## Model df: 3.   Total lags used: 10

The best ARIMA model is ARIMA(2.1.1). P-value of Ljung-Box test is < 0.01 so there is no more information in the residuals. I fit the ARIMA model on the GAM residuals and look at prediction and RMSE

The equation of this model can be wrote as:

\[ \begin{aligned} Nitrate_t = & \beta_0 + f_1 (Conductivity_t) + f_2 (Dissolved Oxygen_t) \\ & + f_5 (Chlorophyl a_t) + f_6 (Turbidity_t) + u_t \end{aligned} \]

with \(u_t\) ARIMA(2,1,1) errors \(u_t = \rho_1u_{t-1} + \rho_2u_{t-2} +\epsilon_t + \phi1\epsilon_{t-1}\) with \(\rho\) the autocorrelation parameters (AR = 2 here) and \(\phi\) the moving average parameters (MA = 1 here)

\(\epsilon_t\) term is a new error term that follows the usual assumptions that we make about regression errors: \(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)

5 - 2 - 1 Predict with the two-step model in future

5 - 2 - 2 NA values

If gam and ar worked , gam fit + ar on gam residuals:

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]

If gam prediction worked but ar not == take gam model only:

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]

plot(as.numeric(nitrate_predictions[0:100]), ylab = "NO3- predictions", 
     type = "p",  pch = 19, col = "cyan",
      ylim = c(min(lower[0:100], na.rm = TRUE), max(upper[0:100], na.rm = TRUE))) #, type="n", ylim=range(lower,upper)
points(as.numeric(data_train$nitrate_mean[0:100]), 
       type = "p",  pch = 19) #, type="n", ylim=range(lower,upper)
lines(as.numeric(upper[0:100]), ylab = "", col = 4, type = "l", cex = 0.25)
lines(as.numeric(lower[0:100]), ylab = "", col = 4,  cex = 0.25)

3 - Model nitrate at Kings creek in Kansas

Plot of the data I use:

##6 Model nitrate at Kings creek

6 - 1 - 1 Step one: gam model

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2 (Dissolved Oxygen_t) \\ & f_3 (Turbidity_t) + f_4 (Temperature_t) + \epsilon_t \end{aligned} \]

with

\(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)

This gam model reach a \(r-sq\) of = 0.93 . But a high temporal auto-correlation is also present in the model residuals \(\epsilon_t\). GCV = 0.93288

6 - 1 - 2 Step two: Modelling temporal auto-correlation in residuals

We want to predict the best gam model residuals \(\epsilon_t\) with an ARMA model.

## Series: best_model_gam_king$residuals 
## ARIMA(2,1,2) 
## 
## Coefficients:
##         ar1      ar2      ma1     ma2
##       0.225  -0.0961  -0.9553  0.1841
## s.e.  0.126   0.0140   0.1262  0.0974
## 
## sigma^2 estimated as 0.4335:  log likelihood=-10184.15
## AIC=20378.3   AICc=20378.3   BIC=20414.44

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 26.962, df = 6, p-value = 0.0001472
## 
## Model df: 4.   Total lags used: 10

The best ARMA model is ARIMA(2,1,2). P_value of Ljung-Box test is < 0.01 so no information is still in the residuals.

The equation of this model can be wrote as:

\[ \begin{aligned} Nitrate_t = & \beta_0 + f_1 (Conductivity_t) + f_2 (Dissolved Oxygen_t) \\ & + f_3 (Turbidity_t) + f_4 (Temperature_t) + u_t \end{aligned} \]

with \(u_t\) ARIMA(2,1,2) errors \(u_t = \rho_1u_{t-1} + \rho_2u_{t-2} + \epsilon_t + \phi_1\epsilon_{t-1} + \phi_2\epsilon_{t-2}\) with \(\rho\) the autocorrelation parameters (AR = 12 for us) and \(\phi\) the moving average parameters (MA = 2 here)

\(\epsilon_t\) term is a new error term that follows the usual assumptions that we make about regression errors: \(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)

6 - 2 Predict with the two-step model in future

6 - 2 - 2 Predict NA values

If gam and ar worked , gam fit + ar on gam residuals:

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]

If gam prediction worked but ar not == take gam model only:

\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]

Modelling nitrate without ph chloro and fdom

1 - Caribou watershed data

1 - Modelling nitrate at caribou with benchmark linear model

2 - Modelling nitrate at Caribou with a two-step modelling

2 - 1 Step one: gam model

3 - 3 Predict with the two-step model

3 - 3 - 1 Future values

3 - 3 - 1 NA values

2 - Model Nitrate in Arikaree River creek in Colorado

2 - 2 Step one: gam model

5 - 1 - 2 Step two: Modelling temporal auto-correlation in residuals

5 - 2 - 1 Predict with the two-step model in future

5 - 2 - 2 NA values

3 - Model nitrate at Kings creek in Kansas

6 - 1 - 1 Step one: gam model

6 - 1 - 2 Step two: Modelling temporal auto-correlation in residuals

6 - 2 Predict with the two-step model in future

6 - 2 - 2 Predict NA values