Change log
Remaining To-Dos
Data Cleaning
Daily Aggregation
Features
Setting
Transfer Model: Yampa River
- FDOM-TOC
- FDOM-Phosphorus
Transfer Function: Yampa River
Investigate influence of FDOM/Turbidity
Evaluate: Poudre River
- FDOM-TOC
Simple linear regression
Sandbox

Change log

Read incoming data from second data collection campaign (Spring - Summer 2024) and refresh models
Refreshed cleaned Poudre sensor data (but didn’t gain that many observations)
Separate script for FDOM-TOC transfer function
Removed lat/long from the transfer function covariates to address spatial autocorrelation
Add turbidity as potential confounding variable in the prediction of TOC from FDOM
Compare FDOM to lab-measured total phosphorus
Applied transfer function to Poudre data (and cleaned time of collection, but still need more data)
Uncertainty estimates, i.e. prediction intervals
Showed no loss in performance when replacing the ensemble learner with a GLM, or even a simple linear regression

Remaining To-Dos

Refresh Poudre data again once Kat has reviewed it (expecting ~28 observations distributed among sites)

Data Cleaning

Applied the limits suggested by Kat in Notion (‘Basic wq data cleaning’)

## Key: <river, parameterName>
##     river parameterName n_prefilter n_cleaned pct_cleaned
##    <char>        <fctr>       <int>     <int>       <num>
## 1: poudre         Chl-a       12637     12613         100
## 2: poudre  Conductivity       42760     21717          51
## 3: poudre          FDOM       12635     12628         100
## 4: poudre     Turbidity       42758     42659         100
## 5:  yampa         Chl-a      135956    135109          99
## 6:  yampa  Conductivity      135414    109227          81
## 7:  yampa          FDOM      135955    133323          98
## 8:  yampa     Turbidity      132344    105170          79

## Key: <river, lab_parameter>
##      river lab_parameter n_prefilter n_cleaned pct_cleaned
##     <char>        <fctr>       <int>     <int>       <num>
##  1: poudre    Phosphorus          53        53         100
##  2: poudre     Potassium          53        53         100
##  3: poudre           TOC          53        53         100
##  4: poudre  Conductivity          53        11          21
##  5: poudre       Nitrate          53        53         100
##  6: poudre      Kjeldahl          53        53         100
##  7: poudre            TN          53        46          87
##  8: poudre     Turbidity          53        51          96
##  9: poudre         Chl-a          53        42          79
## 10:  yampa    Phosphorus         300       300         100
## 11:  yampa     Potassium         300       300         100
## 12:  yampa           TOC         300       300         100
## 13:  yampa  Conductivity         300       264          88
## 14:  yampa       Nitrate         300       300         100
## 15:  yampa      Kjeldahl         300       300         100
## 16:  yampa            TN         300       292          97
## 17:  yampa     Turbidity         300       300         100

Removed spikes in sensor data, defined as readings 3 standard deviations away from both the previous and the following observation. Relatively high values that persist (i.e. are observed at more than one consecutive 15-min reading) are retained. Below is an example from two days in August for turbidity measurements before and after cleaning.
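
The spike rule can be sketched as follows. This is an illustrative Python sketch, not the code used for this report: the function name `remove_spikes` is hypothetical, and it assumes the standard deviation is computed over the whole series being cleaned, which the text does not specify.

```python
from statistics import stdev


def remove_spikes(values, n_sd=3.0):
    """Return a copy of `values` with isolated spikes replaced by None.

    A reading is treated as a spike only if it differs from BOTH its
    previous and its following reading by more than n_sd standard
    deviations. Assumption: the SD is taken over the full input series.
    """
    threshold = n_sd * stdev(values)
    cleaned = list(values)
    # The first and last readings have only one neighbour, so they are kept.
    for i in range(1, len(values) - 1):
        gap_to_previous = abs(values[i] - values[i - 1])
        gap_to_next = abs(values[i] - values[i + 1])
        if gap_to_previous > threshold and gap_to_next > threshold:
            cleaned[i] = None
    return cleaned


# One isolated high reading is removed ...
isolated = [5.0] * 20 + [100.0] + [5.0] * 20
# ... but a high value that persists for two readings is retained,
# because each high reading is close to its high neighbour.
persistent = [5.0] * 20 + [100.0, 100.0] + [5.0] * 20

print(remove_spikes(isolated).count(None))
print(remove_spikes(persistent).count(None))
```

Requiring a large gap on both sides is what lets sustained high values through without a separate persistence check.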

Do these times of sample collection make sense?

##     time_of_day_sample_collected     N
##                           <char> <int>
##  1:                           01     3
##  2:                           02    14
##  3:                           03     5
##  4:                           04     3
##  5:                           06     3
##  6:                           07    13
##  7:                           08    16
##  8:                           09    15
##  9:                           10    17
## 10:                           11    52
## 11:                           12    60
## 12:                           13    49
## 13:                           14    33
## 14:                           15    38
## 15:                           16    19
## 16:                           17     6
## 17:                           18     3
## 18:                           19     1
## 19:                           20     1
## 20:                           21     2
##     time_of_day_sample_collected     N

Daily Aggregation

Rolling 24-hour mean, demonstrated for turbidity from two dates in August.
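
A rolling 24-hour mean over 15-minute readings might look like the sketch below. It is a hedged illustration rather than the report's implementation: `rolling_24h_mean` is a hypothetical name, the window is assumed to be trailing (the 24 hours up to and including each reading), and readings removed during cleaning are assumed to be stored as `None`.

```python
from datetime import datetime, timedelta


def rolling_24h_mean(times, values, window=timedelta(hours=24)):
    """Trailing rolling mean over sorted timestamps.

    For each timestamp t, averages the non-missing values whose
    timestamps fall in the half-open interval (t - window, t].
    """
    means = []
    start = 0      # index of the oldest reading still inside the window
    total = 0.0    # running sum of non-missing values in the window
    count = 0      # number of non-missing values in the window
    for t, v in zip(times, values):
        if v is not None:
            total += v
            count += 1
        # Drop readings that are now 24 hours old or older.
        while times[start] <= t - window:
            if values[start] is not None:
                total -= values[start]
                count -= 1
            start += 1
        means.append(total / count if count else None)
    return means


# Two days of 15-minute readings with a steadily increasing value.
t0 = datetime(2024, 8, 1)
times = [t0 + timedelta(minutes=15 * i) for i in range(192)]
values = [float(i) for i in range(192)]
means = rolling_24h_mean(times, values)
```

A centred window would be an equally reasonable reading of "rolling 24-hour mean"; only the window bounds in the `while` loop would change.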

Features

sensor_value - rolling 24-hour mean of cleaned sensor FDOM readings, within 1 hour of grab sample
sensor_value_turbidity - rolling 24-hour mean of cleaned sensor Turbidity readings, within 30 minutes of sensor FDOM reading
hours_since_last_cleaning - number of hours since last cleaning of sensor, including first install
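
The time-window matching behind `sensor_value` can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper name `nearest_within` is hypothetical, and the sketch assumes that the nearest reading in time is used whenever several fall inside the 1-hour tolerance.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_within(sensor_times, sensor_values, sample_time,
                   tolerance=timedelta(hours=1)):
    """Return the sensor value closest in time to `sample_time`.

    `sensor_times` must be sorted ascending. Returns None when no
    reading lies within `tolerance` of the grab sample.
    """
    i = bisect_left(sensor_times, sample_time)
    # Only the readings just before and just after can be the nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sensor_times)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(sensor_times[j] - sample_time))
    if abs(sensor_times[best] - sample_time) <= tolerance:
        return sensor_values[best]
    return None


# Eight 15-minute readings (hypothetical rolling means) starting at midnight.
t0 = datetime(2024, 8, 1)
sensor_times = [t0 + timedelta(minutes=15 * i) for i in range(8)]
sensor_values = [float(i) for i in range(8)]

matched = nearest_within(sensor_times, sensor_values, t0 + timedelta(minutes=40))
unmatched = nearest_within(sensor_times, sensor_values, t0 + timedelta(hours=4))
```

The same helper with a 30-minute tolerance would cover the turbidity feature, which is matched to the FDOM reading rather than to the grab sample.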

Setting

Develop model on Yampa River, test on Poudre River.

Transfer Model: Yampa River

Predict water quality parameters with rolling 24-hour sensor value mean

FDOM-TOC

## [1] "Lab parameter: TOC - Sensor_parameter: FDOM"

##           learner coefficients   MSE    se fold_sd fold_min_MSE fold_max_MSE
##            <fctr>        <num> <num> <num>   <num>        <num>        <num>
##  1:          mean        0.000   1.9 0.134    0.77         1.29          3.2
##  2:           glm        0.866   1.0 0.077    0.23         0.85          1.5
##  3: xgb.xgboost_1        0.000  24.1 0.809    3.36        20.69         29.9
##  4: xgb.xgboost_2        0.000  24.1 0.809    3.38        20.69         29.9
##  5: xgb.xgboost_3        0.000  24.1 0.809    3.38        20.69         29.9
##  6: xgb.xgboost_4        0.000  10.5 0.449    2.64         7.62         14.8
##  7: xgb.xgboost_5        0.000  10.5 0.449    2.73         7.62         15.0
##  8: xgb.xgboost_6        0.000  10.5 0.450    2.80         7.62         15.2
##  9:         knn_1        0.045   1.6 0.124    0.76         1.01          2.9
## 10:         knn_2        0.045   1.6 0.124    0.76         1.01          2.9
## 11:         knn_3        0.045   1.6 0.124    0.76         1.01          2.9
## 12:  SuperLearner           NA   1.0 0.075    0.20         0.84          1.4
## [1] "RMSE: 0.999321706891901"
## [1] "RMSE normalized: 0.183053892271452"
## [1] "MAE: 0.827308062709728"
## [1] "Coefficient of Variation: 52.2576785330135"
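
For reference, the RMSE and MAE lines in the output above follow the standard definitions sketched here in Python. The report does not state how "RMSE normalized" is scaled, so the normalization by the range of the observed values in `normalized_rmse` is an assumption, and all function names are hypothetical.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


def mae(observed, predicted):
    """Mean absolute error."""
    absolute = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)


def normalized_rmse(observed, predicted):
    """RMSE divided by the range of the observed values.

    Assumption: range normalization. Dividing by the mean or the
    standard deviation of the observations are common alternatives.
    """
    return rmse(observed, predicted) / (max(observed) - min(observed))


# Made-up values, only to show the calls.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
print(rmse(observed, predicted), mae(observed, predicted))
```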

FDOM-Phosphorus

## [1] "Lab parameter: Phosphorus - Sensor_parameter: FDOM"

##           learner coefficients    MSE      se fold_sd fold_min_MSE fold_max_MSE
##            <fctr>        <num>  <num>   <num>   <num>        <num>        <num>
##  1:          mean       0.0617 0.0063 0.00075  0.0034       0.0039        0.012
##  2:           glm       0.4740 0.0061 0.00068  0.0041       0.0030        0.013
##  3: xgb.xgboost_1       0.0014 0.1910 0.00369  0.0272       0.1426        0.210
##  4: xgb.xgboost_2       0.0014 0.1910 0.00369  0.0272       0.1426        0.210
##  5: xgb.xgboost_3       0.0014 0.1910 0.00369  0.0272       0.1426        0.210
##  6: xgb.xgboost_4       0.0000 0.0814 0.00226  0.0180       0.0489        0.095
##  7: xgb.xgboost_5       0.0000 0.0814 0.00226  0.0180       0.0489        0.095
##  8: xgb.xgboost_6       0.0000 0.0814 0.00226  0.0180       0.0489        0.095
##  9:         knn_1       0.1534 0.0061 0.00080  0.0046       0.0029        0.014
## 10:         knn_2       0.1534 0.0061 0.00080  0.0046       0.0029        0.014
## 11:         knn_3       0.1534 0.0061 0.00080  0.0046       0.0029        0.014
## 12:  SuperLearner           NA 0.0058 0.00069  0.0042       0.0027        0.013
## [1] "RMSE: 0.0760407253397713"
## [1] "RMSE normalized: 1.59777395215769"
## [1] "MAE: 0.0628990899574583"
## [1] "Coefficient of Variation: 102.60815729867"

Transfer Function: Yampa River

Investigate influence of FDOM/Turbidity

Evaluate: Poudre River

Not a strong comparison, because there are only 14 valid lab measurements at two locations (Lincoln St and Chambers Outflow).

FDOM-TOC

## [1] "RMSE: 1.56555631228574"
## [1] "RMSE normalized: 0.473770080994333"
## [1] "MAE: 1.42150505582266"
## [1] "Coefficient of Variation: 108.729426942472"

Simple linear regression

glm is the only learner (other than mean, which represents a null learner) used in the transfer function built on the Yampa data, so it makes sense to investigate a simple linear regression.

## [1] "RMSE: 1.04433010849974"
## [1] "RMSE normalized: 0.191298447595739"
## [1] "MAE: 0.87838692536327"
## [1] "Coefficient of Variation: 57.0709456747822"

## [1] "Width of 95% prediction interval: 4.110938771567"

## [1] "Coverage: 97.5 %"
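
The interval width and coverage reported above can be understood through the textbook prediction interval for simple linear regression. The sketch below is illustrative Python, not the report's code: the function names are hypothetical, the data are synthetic, and it assumes a normal quantile in place of the exact t quantile, which is a close approximation only for samples of roughly a hundred or more.

```python
from math import sqrt
from statistics import NormalDist, mean


def fit_slr(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma = sqrt(rss / (n - 2))  # residual standard error
    return {"n": n, "x_bar": x_bar, "sxx": sxx,
            "slope": slope, "intercept": intercept, "sigma": sigma}


def prediction_interval(fit, x0, level=0.95):
    """Prediction interval for a new observation at x0.

    Assumption: normal quantile instead of the t quantile.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = fit["intercept"] + fit["slope"] * x0
    half_width = z * fit["sigma"] * sqrt(
        1 + 1 / fit["n"] + (x0 - fit["x_bar"]) ** 2 / fit["sxx"])
    return centre - half_width, centre + half_width


def coverage(fit, x, y, level=0.95):
    """Fraction of observations that fall inside their interval."""
    inside = 0
    for xi, yi in zip(x, y):
        lower, upper = prediction_interval(fit, xi, level)
        inside += lower <= yi <= upper
    return inside / len(x)


# Synthetic data with deterministic, bounded noise in [-1, 1].
x = [i / 10 for i in range(200)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(200)]
y = [1.0 + 0.5 * xi + e for xi, e in zip(x, noise)]

fit = fit_slr(x, y)
lower, upper = prediction_interval(fit, 10.0)
print("width:", upper - lower, "coverage:", coverage(fit, x, y))
```

The width grows slightly with distance from the mean sensor value, so a single reported width is best read as a typical or average value across the samples.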

SLR with the sensor value as the only covariate (this does not increase the number of usable samples from the Poudre)

## [1] "RMSE: 1.06818703195333"
## [1] "RMSE normalized: 0.195668514477789"
## [1] "MAE: 0.916026960209324"
## [1] "Coefficient of Variation: 59.7082129708858"

## [1] "Width of 95% prediction interval: 4.20485002701913"

## [1] "Coverage: 98.3333333333333 %"

US Watershed Carbon - Transfer Function

Katie Fankhauser

2024-06-04
