Box-plots are quite differents between sites. This is interesting to the model transferability.
There are no data in winter as the river is frozen. For a first work, we choose to take only the data from 2019 spring to 2019 end.
The best linear model linking water quality variables to nitrate concentration is chosen with a stepwise process. The best linear model is with all variables :
\[ \begin{aligned} Nitrate_t = & \beta_0 + \beta_1 Conductivity_t + \beta_2 Dissolved Oxygen_t + \\ & \beta_3 Turbidity_t + \beta_4 Temperature_t + \epsilon_t\\ \end{aligned} \] with \(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)
Adjusted R-squared is about 0.36 for this model. All variables are *** relevant.
model_lm<-lm(nitrate_mean ~ temp_mean + spec_cond + oxygen
+ turbidity , data = data)
library(car)
avPlots(model_lm)
library(MASS)
best_lm_model <- stepAIC(model_lm, direction = "both",
trace = FALSE)
summary(best_lm_model)
##
## Call:
## lm(formula = nitrate_mean ~ temp_mean + spec_cond + oxygen +
## turbidity, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6692 -1.5079 0.1193 2.4988 6.2609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.6884520 1.4636632 11.402 <2e-16 ***
## temp_mean -0.0755560 0.0331963 -2.276 0.0229 *
## spec_cond -0.0832385 0.0022356 -37.233 <2e-16 ***
## oxygen 1.5064953 0.1040898 14.473 <2e-16 ***
## turbidity -0.0042587 0.0001988 -21.427 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.365 on 10502 degrees of freedom
## (12341 observations deleted due to missingness)
## Multiple R-squared: 0.3678, Adjusted R-squared: 0.3675
## F-statistic: 1527 on 4 and 10502 DF, p-value: < 2.2e-16
forecast::checkresiduals(best_lm_model)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 10456, df = 10, p-value < 2.2e-16
Quick residuals study highlight the temporal autocorrelation is present in residuals.
#split data into training dataset and predict dataset
data4model<-data_cari[7010:22300,]
data_train<- data4model[1:10000,]
data_predict<- data4model[10001:15291,]
training data set = 10000 point, predicting data set = 5839
I used a stepwise model selection to have the best gam model with avaliable environmental variables; The equation of this model is :
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t \end{aligned} \]
with
\(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)
best_model_gam_cari<-mgcv::gam(nitrate_mean ~ s(spec_cond, k = 12) + s(oxygen, k = 12) +
+ s(temp_mean, k = 12) +
s(log(turbidity+1) , k = 12), data = data_train)
# no_best_model_gam<-mgcv::gam(nitrate_mean ~ s(spec_cond, k = 12) + s(oxygen, k = 12) +
# s(oxygen_sat, k = 12) + s(chloro, k = 12) + s(ph, k = 12) +
# s(turbidity, k = 12), data = data_train)
p <- getViz(best_model_gam_cari)
(plot(p) + l_ciPoly() + l_points(alpha=0.2) + l_fitLine() + l_rug()) %>%
print(pages=1)
best_model_gam_cari<-mgcv::gam(nitrate_mean ~ s(spec_cond, k = 12) + s(oxygen, k = 12) +
+ s(temp_mean, k = 12) + s(log(turbidity+1) , k = 12), data = data_train, na.action=na.exclude)
#fill where we deleted na
data_train <- data_train %>%
mutate(gam_res = residuals(best_model_gam_cari)) %>%
as_tsibble(index=start_date_time) %>%
fill_gaps()
data_train %>% gg_tsdisplay(gam_res, plot_type='histogram')
Pacf(data_train$gam_res)
This gam model reach a better \(r^2\) (0.87 ) than the benchmark linear model (0.54). But a high temporal auto-correlation is also present in the model residuals \(\epsilon_t\). GCV = 3.2506 ## 3 - 2 - 2 Step two: Modelling temporal auto-correlation in residuals
## Series: best_model_gam_cari$residuals
## ARIMA(3,1,1)
##
## Coefficients:
## ar1 ar2 ar3 ma1
## 0.498 0.1023 0.0949 -0.6996
## s.e. 0.062 0.0180 0.0129 0.0614
##
## sigma^2 estimated as 0.1302: log likelihood=-2457.38
## AIC=4924.75 AICc=4924.76 BIC=4958.38
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,1)
## Q* = 24.766, df = 6, p-value = 0.0003773
##
## Model df: 4. Total lags used: 10
P-value of Ljung-Box test is < 0.01 so there is no more information in the residuals. I fit the ARIMA model on the GAM residuals and look at prediction and RMSE
ARIMA predict very weel GAM residuals.
The equation of this model is:
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ u_t\\ \end{aligned} \]
with \(u_t\) ARIMA(3,1,1) errors \(u_t = \rho_1u_{t-1} + \rho_2u_{t-2} + \rho_3u_{t-3}+ \epsilon_t + \phi1\epsilon_{t-1}\) with \(\rho\) the autocorrelation parameters (AR = 3 for us) and \(\phi\) the moving average parameters (MA = 1)
\(\epsilon_t\) term is a new error term that follows the usual assumptions that we make about regression errors: \(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)
I used a training data set to build the model and a test data set to assess the model predictive power. In the following plots, real data is in dark, predicted nirate concentration is in lightblue and associated confidence interval is in darkblue.
NA values can come from 2 case : 1/ the sensor did wored, 2/ the sensor gives an anomaly.
The major problem with ARIMA modelling is if there is no data before and after the point we are trying to predict (ex : an anomaly during 2 weeks) the model do not work. As Gam + arima is better than just GAM, When GAM + ARIMA can predict we use it, when GAM only can predict we use it. Confidence intervals are calculated accordingly. This allow to predict more nitrate values.
If gam and ar worked , gam fit + ar on gam residuals:
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]
If gam prediction worked but ar not == take gam model only:
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]
Exemple where GAM + ARIMA didn’t worked because of the ARIMA process do not find neighbors.
plot(as.numeric(nitrate_predictions[7500:8100]), ylab = "NO3- predictions",
type = "p", pch = 19, col = "cyan",
ylim = c(min(lower[7500:8100], na.rm = TRUE), max(upper[7500:8100], na.rm = TRUE))) #, type="n", ylim=range(lower,upper)
points(as.numeric(data_train$nitrate_mean[7500:8100]),
type = "p", pch = 19) #, type="n", ylim=range(lower,upper)
lines(as.numeric(upper[7500:8100]), ylab = "", col = 4, type = "l", cex = 0.25)
lines(as.numeric(lower[7500:8100]), ylab = "", col = 4, cex = 0.25)
We can predict NA!!!!
Model transferability to other geographical region is important for the model to be used for a lot of people. We will find the best model to predict nitrate concentration at new places and compare the used variables and the parameters between the two models.
The Arikaree River is located on the plains of Eastern Colorado and is a tributary to the Republican River.
## # A tibble: 25,041 x 20
## start_date_time nitrate_mean label_nitrate_m… temp_mean label_temp_mean
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2019-04-15 03:45:00 NA 1 11.8 0
## 2 2019-04-15 04:00:00 NA 1 11.7 0
## 3 2019-04-15 04:15:00 NA 1 11.6 0
## 4 2019-04-15 04:30:00 NA 1 11.5 0
## 5 2019-04-15 04:45:00 NA 1 11.4 0
## 6 2019-04-15 05:00:00 NA 1 11.3 0
## 7 2019-04-15 05:15:00 NA 1 11.2 0
## 8 2019-04-15 05:30:00 NA 1 11.1 0
## 9 2019-04-15 05:45:00 NA 1 11.0 0
## 10 2019-04-15 06:00:00 NA 1 10.9 0
## # … with 25,031 more rows, and 15 more variables: spec_cond <dbl>,
## # label_spec_cond <dbl>, oxygen <dbl>, label_oxygen <dbl>, oxygen_sat <dbl>,
## # label_oxygen_sat <dbl>, ph <dbl>, label_ph <dbl>, chloro <dbl>,
## # label_chloro <dbl>, turbidity <dbl>, label_turbidity <dbl>, f_dom <dbl>,
## # label_f_dom <dbl>, time <int>
** linear benchmark model **
model_lm<-lm(nitrate_mean ~ temp_mean + spec_cond + oxygen +
turbidity , data = data_arik)
summary(model_lm)
##
## Call:
## lm(formula = nitrate_mean ~ temp_mean + spec_cond + oxygen +
## turbidity, data = data_arik)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4333 -1.2985 -0.0447 1.1372 15.0967
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.787e+00 2.478e-01 19.316 < 2e-16 ***
## temp_mean -1.063e-01 2.368e-03 -44.911 < 2e-16 ***
## spec_cond 2.982e-03 3.783e-04 7.882 3.4e-15 ***
## oxygen 1.930e-02 8.228e-03 2.345 0.019 *
## turbidity -5.318e-04 4.725e-05 -11.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.94 on 17821 degrees of freedom
## (17214 observations deleted due to missingness)
## Multiple R-squared: 0.229, Adjusted R-squared: 0.2288
## F-statistic: 1323 on 4 and 17821 DF, p-value: < 2.2e-16
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2 (Dissolved Oxygen_t) \\ & f_3 (Turbidity_t) + f_4 (Temperature_t) +\epsilon_t \end{aligned} \]
with
\(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)
Same variables than in Caribou creek were selected, but parameters of smoothing functions are differents
This gam model reach a r-sq of = 0.83. But a high temporal auto-correlation is also present in the model residuals \(\epsilon_t\). GCV = 1.2796
We want to predict the best gam model residuals \(\epsilon_t\) with an ARMA model.
## Series: best_model_gam_arik$residuals
## ARIMA(2,1,1)
##
## Coefficients:
## ar1 ar2 ma1
## 0.0688 -0.1236 -0.3726
## s.e. 0.0350 0.0141 0.0349
##
## sigma^2 estimated as 0.1304: log likelihood=-4828.57
## AIC=9665.14 AICc=9665.14 BIC=9694.73
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)
## Q* = 81.201, df = 7, p-value = 7.883e-15
##
## Model df: 3. Total lags used: 10
The best ARIMA model is ARIMA(2.1.1). P-value of Ljung-Box test is < 0.01 so there is no more information in the residuals. I fit the ARIMA model on the GAM residuals and look at prediction and RMSE
The equation of this model can be wrote as:
\[ \begin{aligned} Nitrate_t = & \beta_0 + f_1 (Conductivity_t) + f_2 (Dissolved Oxygen_t) \\ & + f_5 (Chlorophyl a_t) + f_6 (Turbidity_t) + u_t \end{aligned} \]
with \(u_t\) ARIMA(2,1,1) errors \(u_t = \rho_1u_{t-1} + \rho_2u_{t-2} +\epsilon_t + \phi1\epsilon_{t-1}\) with \(\rho\) the autocorrelation parameters (AR = 2 here) and \(\phi\) the moving average parameters (MA = 1 here)
\(\epsilon_t\) term is a new error term that follows the usual assumptions that we make about regression errors: \(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)
I used a training data set to build the model and a test data set to assess the model predictive power. In the following plots, real data is in dark, predicted nirate concentration is in lightblue and associated confidence interval is in darkblue.
If gam and ar worked , gam fit + ar on gam residuals:
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]
If gam prediction worked but ar not == take gam model only:
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]
plot(as.numeric(nitrate_predictions[0:100]), ylab = "NO3- predictions",
type = "p", pch = 19, col = "cyan",
ylim = c(min(lower[0:100], na.rm = TRUE), max(upper[0:100], na.rm = TRUE))) #, type="n", ylim=range(lower,upper)
points(as.numeric(data_train$nitrate_mean[0:100]),
type = "p", pch = 19) #, type="n", ylim=range(lower,upper)
lines(as.numeric(upper[0:100]), ylab = "", col = 4, type = "l", cex = 0.25)
lines(as.numeric(lower[0:100]), ylab = "", col = 4, cex = 0.25)
Plot of the data I use:
##6 Model nitrate at Kings creek
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2 (Dissolved Oxygen_t) \\ & f_3 (Turbidity_t) + f_4 (Temperature_t) + \epsilon_t \end{aligned} \]
with
\(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)
This gam model reach a \(r-sq\) of = 0.93 . But a high temporal auto-correlation is also present in the model residuals \(\epsilon_t\). GCV = 0.93288
We want to predict the best gam model residuals \(\epsilon_t\) with an ARMA model.
## Series: best_model_gam_king$residuals
## ARIMA(2,1,2)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.225 -0.0961 -0.9553 0.1841
## s.e. 0.126 0.0140 0.1262 0.0974
##
## sigma^2 estimated as 0.4335: log likelihood=-10184.15
## AIC=20378.3 AICc=20378.3 BIC=20414.44
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)
## Q* = 26.962, df = 6, p-value = 0.0001472
##
## Model df: 4. Total lags used: 10
The best ARMA model is ARIMA(2,1,2). P_value of Ljung-Box test is < 0.01 so no information is still in the residuals.
The equation of this model can be wrote as:
\[ \begin{aligned} Nitrate_t = & \beta_0 + f_1 (Conductivity_t) + f_2 (Dissolved Oxygen_t) \\ & + f_3 (Turbidity_t) + f_4 (Temperature_t) + u_t \end{aligned} \]
with \(u_t\) ARIMA(2,1,2) errors \(u_t = \rho_1u_{t-1} + \rho_2u_{t-2} + \epsilon_t + \phi_1\epsilon_{t-1} + \phi_2\epsilon_{t-2}\) with \(\rho\) the autocorrelation parameters (AR = 12 for us) and \(\phi\) the moving average parameters (MA = 2 here)
\(\epsilon_t\) term is a new error term that follows the usual assumptions that we make about regression errors: \(\epsilon_t \sim \mathcal{N}(0,\sigma^{2})\)
I used a training data set to build the model and a test data set to assess the model predictive power. In the following plots, real data is in dark, predicted nirate concentration is in lightblue and associated confidence interval is in darkblue.
If gam and ar worked , gam fit + ar on gam residuals:
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]
If gam prediction worked but ar not == take gam model only:
\[ \begin{aligned} Nitrate_t = \beta_0 & + f_1 (Conductivity_t) + f_2(Dissolved Oxygen_t) \\ & + f_3 (log(Turbidity_t + 1)) + f_4 (Temperature_t)+ \epsilon_t\\ \end{aligned} \]