Change log

Remaining To-Dos

Data Cleaning

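Applied the limits suggested by Kat in the Notion page ‘Basic wq data cleaning’. The tables below show row counts before and after filtering for the sensor data (first table) and the lab data (second table).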

## Key: <river, parameterName>
##     river parameterName n_prefilter n_cleaned pct_cleaned
##    <char>        <fctr>       <int>     <int>       <num>
## 1: poudre         Chl-a       12637     12613         100
## 2: poudre  Conductivity       42760     21717          51
## 3: poudre          FDOM       12635     12628         100
## 4: poudre     Turbidity       42758     42659         100
## 5:  yampa         Chl-a      135956    135109          99
## 6:  yampa  Conductivity      135414    109227          81
## 7:  yampa          FDOM      135955    133323          98
## 8:  yampa     Turbidity      132344    105170          79
## Key: <river, lab_parameter>
##      river lab_parameter n_prefilter n_cleaned pct_cleaned
##     <char>        <fctr>       <int>     <int>       <num>
##  1: poudre    Phosphorus          53        53         100
##  2: poudre     Potassium          53        53         100
##  3: poudre           TOC          53        53         100
##  4: poudre  Conductivity          53        11          21
##  5: poudre       Nitrate          53        53         100
##  6: poudre      Kjeldahl          53        53         100
##  7: poudre            TN          53        46          87
##  8: poudre     Turbidity          53        51          96
##  9: poudre         Chl-a          53        42          79
## 10:  yampa    Phosphorus         300       300         100
## 11:  yampa     Potassium         300       300         100
## 12:  yampa           TOC         300       300         100
## 13:  yampa  Conductivity         300       264          88
## 14:  yampa       Nitrate         300       300         100
## 15:  yampa      Kjeldahl         300       300         100
## 16:  yampa            TN         300       292          97
## 17:  yampa     Turbidity         300       300         100
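As a rough sketch of the range filter and the summary columns above: pct_cleaned appears to be n_cleaned as a whole-number percentage of n_prefilter (for example 21717 of 42760 gives 51). The limit values and readings below are hypothetical, and dropping missing values is an assumption; the actual limits live in the Notion page.

```python
def apply_limits(values, lower, upper):
    """Keep readings inside [lower, upper]; drop missing values (None)."""
    return [v for v in values if v is not None and lower <= v <= upper]


def pct_retained(n_prefilter, n_cleaned):
    """Share of rows that survive cleaning, as a whole-number percentage."""
    return round(100 * n_cleaned / n_prefilter)


# Hypothetical conductivity readings and limits, for illustration only.
readings = [120.0, 95.5, None, 4000.0, 310.2, -5.0]
cleaned = apply_limits(readings, lower=0.0, upper=2000.0)
print(cleaned, pct_retained(len(readings), len(cleaned)))
```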

Removed spikes in the sensor data, defined as readings more than 3 standard deviations away from both the previous and the following observation. Relatively high values that persist (i.e. are observed at more than one consecutive 15-min reading) are retained. Below is an example from two days in August for turbidity measurements before and after cleaning.
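A minimal sketch of the spike rule, under stated assumptions: the standard deviation defaults to that of the whole series (the window it is computed over is not specified here), the first and last readings are always kept, and neighbours are compared in a single pass over the uncleaned series. The function name and the readings are hypothetical.

```python
from statistics import stdev


def remove_spikes(values, n_sd=3.0, sd=None):
    """Drop points that differ from BOTH neighbours by more than n_sd * sd.

    Assumption: sd defaults to the standard deviation of the whole series.
    """
    if sd is None:
        sd = stdev(values)
    threshold = n_sd * sd
    kept = []
    for i, v in enumerate(values):
        is_interior = 0 < i < len(values) - 1
        if (is_interior
                and abs(v - values[i - 1]) > threshold
                and abs(v - values[i + 1]) > threshold):
            continue  # isolated spike: far from both neighbours
        kept.append(v)  # endpoints and persistent values are retained
    return kept


# Hypothetical 15-minute turbidity readings: a two-reading plateau (60, 60)
# and an isolated spike (90). sd=10 is an illustrative value.
series = [5, 6, 60, 60, 5, 6, 90, 5, 6]
print(remove_spikes(series, sd=10))
```

The plateau survives because each of its readings is close to at least one neighbour, while the isolated 90 is far from both.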

Do these times of sample collection make sense? Counts are by hour of day (00-23).

##     time_of_day_sample_collected     N
##                           <char> <int>
##  1:                           01     3
##  2:                           02    14
##  3:                           03     5
##  4:                           04     3
##  5:                           06     3
##  6:                           07    13
##  7:                           08    16
##  8:                           09    15
##  9:                           10    17
## 10:                           11    52
## 11:                           12    60
## 12:                           13    49
## 13:                           14    33
## 14:                           15    38
## 15:                           16    19
## 16:                           17     6
## 17:                           18     3
## 18:                           19     1
## 19:                           20     1
## 20:                           21     2
##     time_of_day_sample_collected     N

Daily Aggregation

Rolling 24-hour mean, demonstrated for turbidity from two dates in August.
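One way to compute a time-based rolling mean, sketched with a trailing window. Whether the window is trailing or centred is not stated above, so the trailing choice is an assumption, as are the function name and the example timestamps.

```python
from collections import deque
from datetime import datetime, timedelta


def rolling_mean(times, values, window=timedelta(hours=24)):
    """Trailing time-window mean: average of readings in (t - window, t].

    Assumes `times` is sorted in ascending order.
    """
    buf = deque()  # (time, value) pairs currently inside the window
    total = 0.0
    out = []
    for t, v in zip(times, values):
        buf.append((t, v))
        total += v
        while buf[0][0] <= t - window:  # drop readings older than the window
            _, old_v = buf.popleft()
            total -= old_v
        out.append(total / len(buf))
    return out


# Hypothetical readings every 12 hours.
start = datetime(2023, 8, 1)
times = [start + timedelta(hours=12 * i) for i in range(4)]
print(rolling_mean(times, [10.0, 20.0, 30.0, 40.0]))
```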

Features

  1. sensor_value - rolling 24-hour mean of cleaned sensor FDOM readings, taken within 1 hour of the grab sample
  2. sensor_value_turbidity - rolling 24-hour mean of cleaned sensor Turbidity readings, taken within 30 minutes of the sensor FDOM reading
  3. hours_since_last_cleaning - number of hours since the sensor was last cleaned, counting the first install as a cleaning
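The time matching behind these features could look roughly like the sketch below. It assumes "within 1 hour" means the nearest reading inside a 1-hour tolerance, which is not confirmed above; the helper names and all timestamps are hypothetical. The same helper with a 30-minute tolerance would cover the turbidity match.

```python
from datetime import datetime, timedelta


def nearest_within(target, times, values, tolerance):
    """Return the value whose timestamp is closest to `target`,
    or None when no timestamp lies within `tolerance`."""
    best = None
    for t, v in zip(times, values):
        gap = abs(t - target)
        if gap <= tolerance and (best is None or gap < best[0]):
            best = (gap, v)
    return None if best is None else best[1]


def hours_since_last_cleaning(target, cleaning_times):
    """Hours between `target` and the most recent cleaning (or install)."""
    past = [c for c in cleaning_times if c <= target]
    if not past:
        return None
    return (target - max(past)).total_seconds() / 3600


# Hypothetical grab sample, rolling FDOM means, and cleaning log.
grab = datetime(2023, 8, 2, 11, 40)
sensor_times = [datetime(2023, 8, 2, 11, 0),
                datetime(2023, 8, 2, 11, 30),
                datetime(2023, 8, 2, 12, 0)]
fdom_means = [4.0, 4.2, 4.4]
cleanings = [datetime(2023, 7, 20, 8, 0), datetime(2023, 8, 1, 9, 40)]

print(nearest_within(grab, sensor_times, fdom_means, timedelta(hours=1)))
print(hours_since_last_cleaning(grab, cleanings))
```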

Setting

Develop model on Yampa River, test on Poudre River.

Transfer Model: Yampa River

Predict water quality parameters with rolling 24-hour sensor value mean

FDOM-TOC

## [1] "Lab parameter: TOC - Sensor_parameter: FDOM"

##           learner coefficients   MSE    se fold_sd fold_min_MSE fold_max_MSE
##            <fctr>        <num> <num> <num>   <num>        <num>        <num>
##  1:          mean        0.000   1.9 0.134    0.77         1.29          3.2
##  2:           glm        0.866   1.0 0.077    0.23         0.85          1.5
##  3: xgb.xgboost_1        0.000  24.1 0.809    3.36        20.69         29.9
##  4: xgb.xgboost_2        0.000  24.1 0.809    3.38        20.69         29.9
##  5: xgb.xgboost_3        0.000  24.1 0.809    3.38        20.69         29.9
##  6: xgb.xgboost_4        0.000  10.5 0.449    2.64         7.62         14.8
##  7: xgb.xgboost_5        0.000  10.5 0.449    2.73         7.62         15.0
##  8: xgb.xgboost_6        0.000  10.5 0.450    2.80         7.62         15.2
##  9:         knn_1        0.045   1.6 0.124    0.76         1.01          2.9
## 10:         knn_2        0.045   1.6 0.124    0.76         1.01          2.9
## 11:         knn_3        0.045   1.6 0.124    0.76         1.01          2.9
## 12:  SuperLearner           NA   1.0 0.075    0.20         0.84          1.4
## [1] "RMSE: 0.999321706891901"
## [1] "RMSE normalized: 0.183053892271452"
## [1] "MAE: 0.827308062709728"
## [1] "Coefficient of Variation: 52.2576785330135"
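For reference, a sketch of how error metrics like those printed above are commonly computed. The normalisation is an assumption: RMSE is divided by the mean of the observed values here (dividing by the range is another common choice), and the coefficient of variation is taken as 100 * sd / mean of whatever series is passed in. Neither definition is confirmed by the output above, and the example data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def rmse(obs, pred):
    """Root mean squared error."""
    return sqrt(mean([(o - p) ** 2 for o, p in zip(obs, pred)]))


def mae(obs, pred):
    """Mean absolute error."""
    return mean([abs(o - p) for o, p in zip(obs, pred)])


def rmse_normalized(obs, pred):
    """RMSE divided by the mean observed value (assumed normalisation)."""
    return rmse(obs, pred) / mean(obs)


def coefficient_of_variation(x):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * stdev(x) / mean(x)


# Hypothetical observed and predicted lab values.
obs = [2.0, 4.0, 6.0]
pred = [3.0, 4.0, 5.0]
print(rmse(obs, pred), rmse_normalized(obs, pred), mae(obs, pred))
print(coefficient_of_variation(obs))
```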

FDOM-Phosphorus

## [1] "Lab parameter: Phosphorus - Sensor_parameter: FDOM"

##           learner coefficients    MSE      se fold_sd fold_min_MSE fold_max_MSE
##            <fctr>        <num>  <num>   <num>   <num>        <num>        <num>
##  1:          mean       0.0617 0.0063 0.00075  0.0034       0.0039        0.012
##  2:           glm       0.4740 0.0061 0.00068  0.0041       0.0030        0.013
##  3: xgb.xgboost_1       0.0014 0.1910 0.00369  0.0272       0.1426        0.210
##  4: xgb.xgboost_2       0.0014 0.1910 0.00369  0.0272       0.1426        0.210
##  5: xgb.xgboost_3       0.0014 0.1910 0.00369  0.0272       0.1426        0.210
##  6: xgb.xgboost_4       0.0000 0.0814 0.00226  0.0180       0.0489        0.095
##  7: xgb.xgboost_5       0.0000 0.0814 0.00226  0.0180       0.0489        0.095
##  8: xgb.xgboost_6       0.0000 0.0814 0.00226  0.0180       0.0489        0.095
##  9:         knn_1       0.1534 0.0061 0.00080  0.0046       0.0029        0.014
## 10:         knn_2       0.1534 0.0061 0.00080  0.0046       0.0029        0.014
## 11:         knn_3       0.1534 0.0061 0.00080  0.0046       0.0029        0.014
## 12:  SuperLearner           NA 0.0058 0.00069  0.0042       0.0027        0.013
## [1] "RMSE: 0.0760407253397713"
## [1] "RMSE normalized: 1.59777395215769"
## [1] "MAE: 0.0628990899574583"
## [1] "Coefficient of Variation: 102.60815729867"
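The coefficients column in the tables above can be read as ensemble weights: with the default non-negative least squares method, the SuperLearner prediction is a weighted sum of the individual learners' predictions, and the weights sum to roughly 1. A minimal sketch using the rounded FDOM-TOC weights follows; the per-learner predictions are hypothetical.

```python
def ensemble_prediction(weights, learner_predictions):
    """Weighted sum of per-learner predictions for one observation."""
    return sum(w * learner_predictions[name] for name, w in weights.items())


# Rounded weights from the FDOM-TOC table; zero-weight learners are omitted.
weights = {"glm": 0.866, "knn_1": 0.045, "knn_2": 0.045, "knn_3": 0.045}
# Hypothetical per-learner TOC predictions for a single sample.
preds = {"glm": 5.0, "knn_1": 6.0, "knn_2": 6.0, "knn_3": 6.0}
print(round(ensemble_prediction(weights, preds), 3))
```

Learners with a weight of 0 (here mean and all xgboost variants) drop out of the sum entirely.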

Transfer Function: Yampa River

Investigate influence of FDOM/Turbidity

Evaluate: Poudre River

Not a great comparison, because there are only 14 valid lab measurements, taken at two locations (Lincoln St and Chambers Outflow).

FDOM-TOC

## [1] "RMSE: 1.56555631228574"
## [1] "RMSE normalized: 0.473770080994333"
## [1] "MAE: 1.42150505582266"
## [1] "Coefficient of Variation: 108.729426942472"

Simple linear regression

glm is the only learner (other than mean, which represents a null learner) being used in the transfer function built on the Yampa data, so it makes sense to investigate a simple linear regression.

## [1] "RMSE: 1.04433010849974"
## [1] "RMSE normalized: 0.191298447595739"
## [1] "MAE: 0.87838692536327"
## [1] "Coefficient of Variation: 57.0709456747822"

## [1] "Width of 95% prediction interval: 4.110938771567"
## [1] "Coverage: 97.5 %"
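A sketch of how a prediction interval and its coverage can be computed for a one-covariate linear regression. It uses a normal quantile in place of the t quantile that regression software typically uses, so intervals come out slightly narrower for small samples. The function names and data are hypothetical, and the "width" reported above is assumed to be an average over observations.

```python
from math import sqrt
from statistics import NormalDist, mean


def slr_prediction_intervals(x, y, new_x, level=0.95):
    """Fit y = a + b*x by least squares; return (lower, upper) per new_x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    # Residual standard error with n - 2 degrees of freedom.
    s = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    # Normal quantile: an approximation to the usual t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    out = []
    for x0 in new_x:
        half = z * s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
        fit = a + b * x0
        out.append((fit - half, fit + half))
    return out


def coverage(intervals, observed):
    """Percentage of observed values that fall inside their interval."""
    hits = sum(lo <= o <= hi for (lo, hi), o in zip(intervals, observed))
    return 100 * hits / len(observed)


# Hypothetical sensor values (x) and lab values (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
intervals = slr_prediction_intervals(x, y, x)
print(mean([hi - lo for lo, hi in intervals]), coverage(intervals, y))
```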

SLR with sensor value as the only covariate (this does not increase the number of usable samples from the Poudre)

## [1] "RMSE: 1.06818703195333"
## [1] "RMSE normalized: 0.195668514477789"
## [1] "MAE: 0.916026960209324"
## [1] "Coefficient of Variation: 59.7082129708858"

## [1] "Width of 95% prediction interval: 4.20485002701913"
## [1] "Coverage: 98.3333333333333 %"

Sandbox