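Applied the limits suggested by Kat in the Notion page ‘Basic wq data cleaning’. The tables below give the number of observations per river and parameter before (n_prefilter) and after (n_cleaned) filtering, first for the sensor data and then for the lab data; pct_cleaned is the percentage of observations retained.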
## Key: <river, parameterName>
## river parameterName n_prefilter n_cleaned pct_cleaned
## <char> <fctr> <int> <int> <num>
## 1: poudre Chl-a 12637 12613 100
## 2: poudre Conductivity 42760 21717 51
## 3: poudre FDOM 12635 12628 100
## 4: poudre Turbidity 42758 42659 100
## 5: yampa Chl-a 135956 135109 99
## 6: yampa Conductivity 135414 109227 81
## 7: yampa FDOM 135955 133323 98
## 8: yampa Turbidity 132344 105170 79
## Key: <river, lab_parameter>
## river lab_parameter n_prefilter n_cleaned pct_cleaned
## <char> <fctr> <int> <int> <num>
## 1: poudre Phosphorus 53 53 100
## 2: poudre Potassium 53 53 100
## 3: poudre TOC 53 53 100
## 4: poudre Conductivity 53 11 21
## 5: poudre Nitrate 53 53 100
## 6: poudre Kjeldahl 53 53 100
## 7: poudre TN 53 46 87
## 8: poudre Turbidity 53 51 96
## 9: poudre Chl-a 53 42 79
## 10: yampa Phosphorus 300 300 100
## 11: yampa Potassium 300 300 100
## 12: yampa TOC 300 300 100
## 13: yampa Conductivity 300 264 88
## 14: yampa Nitrate 300 300 100
## 15: yampa Kjeldahl 300 300 100
## 16: yampa TN 300 292 97
## 17: yampa Turbidity 300 300 100
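The limit filter and the summary tables above can be sketched roughly as follows. This is a minimal Python illustration, not the original analysis code: the `LIMITS` values, the record layout, and the function names `apply_limits` and `summarize` are all hypothetical, and the real limits live in the Notion page.

```python
from collections import defaultdict

# Hypothetical plausibility limits per parameter (low, high). The limits
# actually applied are in the Notion page and are not reproduced here.
LIMITS = {"Turbidity": (0.0, 1000.0), "Conductivity": (50.0, 2000.0)}


def apply_limits(records, limits):
    """Keep records whose value lies inside the limits for its parameter."""
    kept = []
    for rec in records:
        low, high = limits[rec["parameter"]]
        if low <= rec["value"] <= high:
            kept.append(rec)
    return kept


def summarize(before, after):
    """Count rows per (river, parameter) before and after filtering."""
    n_pre, n_clean = defaultdict(int), defaultdict(int)
    for rec in before:
        n_pre[(rec["river"], rec["parameter"])] += 1
    for rec in after:
        n_clean[(rec["river"], rec["parameter"])] += 1
    return {
        key: {
            "n_prefilter": n,
            "n_cleaned": n_clean[key],
            # Percentage of observations retained, rounded as in the tables.
            "pct_cleaned": round(100 * n_clean[key] / n),
        }
        for key, n in n_pre.items()
    }


# Made-up example records, for illustration only.
records = [
    {"river": "poudre", "parameter": "Conductivity", "value": 12.0},
    {"river": "poudre", "parameter": "Conductivity", "value": 310.0},
    {"river": "yampa", "parameter": "Turbidity", "value": 4.2},
    {"river": "yampa", "parameter": "Turbidity", "value": 5000.0},
    {"river": "yampa", "parameter": "Turbidity", "value": 7.9},
    {"river": "yampa", "parameter": "Turbidity", "value": 3.1},
]
cleaned = apply_limits(records, LIMITS)
summary = summarize(records, cleaned)
```

Computing the summary from the pre- and post-filter records keeps the counts tied to the filter that produced them, which makes a large drop (such as Poudre conductivity) easy to trace back to a specific limit.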
Removed spikes in the sensor data, defined as observations more than 3 standard deviations away from both the previous and the following observation. Relatively high values that persist (i.e. are observed in more than one consecutive 15-min reading) are retained. Below is an example of turbidity measurements from two days in August, before and after cleaning.
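One way to express this spike rule is sketched below in Python. It is an illustration under assumptions, not the code used for the report: the text does not say how the standard deviation is computed, so `sd` is passed in by the caller, and the function name `flag_spikes` and the example values are hypothetical. The first and last readings are never flagged because they lack one neighbour.

```python
def flag_spikes(values, sd, n_sd=3.0):
    """Flag readings that differ from BOTH neighbours by more than n_sd * sd.

    `sd` is supplied by the caller; how it was derived in the original
    analysis is not stated, so that choice is left open here.
    """
    threshold = n_sd * sd
    flags = [False] * len(values)
    for i in range(1, len(values) - 1):
        jump_from_prev = abs(values[i] - values[i - 1])
        jump_to_next = abs(values[i] - values[i + 1])
        flags[i] = jump_from_prev > threshold and jump_to_next > threshold
    return flags


# Made-up 15-min turbidity readings: one isolated spike (60.0) and one
# high value that persists over two readings (40.0, 41.0).
turbidity = [5.0, 5.2, 5.1, 60.0, 5.0, 5.3, 40.0, 41.0, 5.2, 5.1]
flags = flag_spikes(turbidity, sd=2.0)  # sd=2.0 is a hypothetical value
cleaned = [v for v, is_spike in zip(turbidity, flags) if not is_spike]
```

Requiring a large jump on both sides is what lets the persistent high values through: each of the two elevated readings has a neighbour close to it, so neither is flagged.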
Do these sample collection times make sense?
## time_of_day_sample_collected N
## <char> <int>
## 1: 01 3
## 2: 02 14
## 3: 03 5
## 4: 04 3
## 5: 06 3
## 6: 07 13
## 7: 08 16
## 8: 09 15
## 9: 10 17
## 10: 11 52
## 11: 12 60
## 12: 13 49
## 13: 14 33
## 14: 15 38
## 15: 16 19
## 16: 17 6
## 17: 18 3
## 18: 19 1
## 19: 20 1
## 20: 21 2
Rolling 24-hour mean, demonstrated for turbidity from two dates in August.
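A rolling 24-hour mean over irregular or regular timestamps can be sketched as below. This Python example assumes a trailing window (the 24 hours up to and including each reading); the report does not say whether the window is trailing or centred. The function name `rolling_mean_24h`, the year, and the values are made up for illustration.

```python
from collections import deque
from datetime import datetime, timedelta


def rolling_mean_24h(timestamps, values, window=timedelta(hours=24)):
    """Trailing mean of all readings in (t - window, t] for each timestamp t.

    Timestamps must be sorted in ascending order.
    """
    buf = deque()  # (timestamp, value) pairs currently inside the window
    total = 0.0
    out = []
    for t, v in zip(timestamps, values):
        buf.append((t, v))
        total += v
        # Drop readings that have fallen out of the trailing window.
        while buf[0][0] <= t - window:
            _, old_value = buf.popleft()
            total -= old_value
        out.append(total / len(buf))
    return out


# Two days of 15-min readings (96 per day): 10.0 on day one, 20.0 on day two.
start = datetime(2023, 8, 1)  # hypothetical date; the source only says August
times = [start + timedelta(minutes=15 * i) for i in range(192)]
readings = [10.0 if i < 96 else 20.0 for i in range(192)]
rolled = rolling_mean_24h(times, readings)
```

A time-based window, rather than a fixed count of 96 readings, keeps the average meaningful when readings have been removed by the cleaning steps.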
Develop the model on the Yampa River and test it on the Poudre River.
Predict water quality parameters from the rolling 24-hour mean of the sensor values.
## [1] "Lab parameter: TOC - Sensor_parameter: FDOM"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## <fctr> <num> <num> <num> <num> <num> <num>
## 1: mean 0.000 1.9 0.134 0.77 1.29 3.2
## 2: glm 0.866 1.0 0.077 0.23 0.85 1.5
## 3: xgb.xgboost_1 0.000 24.1 0.809 3.36 20.69 29.9
## 4: xgb.xgboost_2 0.000 24.1 0.809 3.38 20.69 29.9
## 5: xgb.xgboost_3 0.000 24.1 0.809 3.38 20.69 29.9
## 6: xgb.xgboost_4 0.000 10.5 0.449 2.64 7.62 14.8
## 7: xgb.xgboost_5 0.000 10.5 0.449 2.73 7.62 15.0
## 8: xgb.xgboost_6 0.000 10.5 0.450 2.80 7.62 15.2
## 9: knn_1 0.045 1.6 0.124 0.76 1.01 2.9
## 10: knn_2 0.045 1.6 0.124 0.76 1.01 2.9
## 11: knn_3 0.045 1.6 0.124 0.76 1.01 2.9
## 12: SuperLearner NA 1.0 0.075 0.20 0.84 1.4
## [1] "RMSE: 0.999321706891901"
## [1] "RMSE normalized: 0.183053892271452"
## [1] "MAE: 0.827308062709728"
## [1] "Coefficient of Variation: 52.2576785330135"
## [1] "Lab parameter: Phosphorus - Sensor_parameter: FDOM"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## <fctr> <num> <num> <num> <num> <num> <num>
## 1: mean 0.0617 0.0063 0.00075 0.0034 0.0039 0.012
## 2: glm 0.4740 0.0061 0.00068 0.0041 0.0030 0.013
## 3: xgb.xgboost_1 0.0014 0.1910 0.00369 0.0272 0.1426 0.210
## 4: xgb.xgboost_2 0.0014 0.1910 0.00369 0.0272 0.1426 0.210
## 5: xgb.xgboost_3 0.0014 0.1910 0.00369 0.0272 0.1426 0.210
## 6: xgb.xgboost_4 0.0000 0.0814 0.00226 0.0180 0.0489 0.095
## 7: xgb.xgboost_5 0.0000 0.0814 0.00226 0.0180 0.0489 0.095
## 8: xgb.xgboost_6 0.0000 0.0814 0.00226 0.0180 0.0489 0.095
## 9: knn_1 0.1534 0.0061 0.00080 0.0046 0.0029 0.014
## 10: knn_2 0.1534 0.0061 0.00080 0.0046 0.0029 0.014
## 11: knn_3 0.1534 0.0061 0.00080 0.0046 0.0029 0.014
## 12: SuperLearner NA 0.0058 0.00069 0.0042 0.0027 0.013
## [1] "RMSE: 0.0760407253397713"
## [1] "RMSE normalized: 1.59777395215769"
## [1] "MAE: 0.0628990899574583"
## [1] "Coefficient of Variation: 102.60815729867"
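The summary metrics printed after each model can be computed along the following lines. The report does not define "RMSE normalized" or "Coefficient of Variation", so this Python sketch makes two assumptions that may not match the original: RMSE is normalized by the range of the observed values, and the coefficient of variation is the sample standard deviation of the observed values divided by their mean, in percent. The function name `error_metrics` and the numbers are hypothetical.

```python
import math
import statistics


def error_metrics(observed, predicted):
    """RMSE, MAE, and two assumed summary ratios for a set of predictions."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # ASSUMPTION: normalize by the observed range (other conventions divide
    # by the mean or the standard deviation).
    rmse_normalized = rmse / (max(observed) - min(observed))
    # ASSUMPTION: CV of the observed values, as a percentage.
    cv_percent = 100 * statistics.stdev(observed) / statistics.mean(observed)
    return {
        "rmse": rmse,
        "rmse_normalized": rmse_normalized,
        "mae": mae,
        "cv_percent": cv_percent,
    }


# Made-up lab values and predictions, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 10.0]
metrics = error_metrics(observed, predicted)
```

Whichever normalization is used, stating it next to the output makes values such as a normalized RMSE above 1 easier to interpret.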
Not a great comparison, because there are only 14 valid lab measurements, from two locations (Lincoln St and Chambers Outflow).
## [1] "RMSE: 1.56555631228574"
## [1] "RMSE normalized: 0.473770080994333"
## [1] "MAE: 1.42150505582266"
## [1] "Coefficient of Variation: 108.729426942472"
glm is the only learner (other than mean, which serves as a null learner) used in the transfer function built on the Yampa data, so it makes sense to investigate a simple linear regression.
## [1] "RMSE: 1.04433010849974"
## [1] "RMSE normalized: 0.191298447595739"
## [1] "MAE: 0.87838692536327"
## [1] "Coefficient of Variation: 57.0709456747822"
## [1] "Width of 95% prediction interval: 4.110938771567"
## [1] "Coverage: 97.5 %"
Simple linear regression (SLR) with the sensor value as the only covariate. This does not increase the number of samples available from the Poudre.
## [1] "RMSE: 1.06818703195333"
## [1] "RMSE normalized: 0.195668514477789"
## [1] "MAE: 0.916026960209324"
## [1] "Coefficient of Variation: 59.7082129708858"
## [1] "Width of 95% prediction interval: 4.20485002701913"
## [1] "Coverage: 98.3333333333333 %"
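The prediction-interval width and coverage reported for the SLR fits can be illustrated with the Python sketch below. It is not the original code and rests on labeled assumptions: "width" is taken to be the mean interval width over the test points, "coverage" the percentage of test observations falling inside their interval, and the t quantile is approximated by a normal quantile to stay within the standard library (this gives slightly narrower intervals for small samples). All names and numbers are hypothetical.

```python
import math
from statistics import NormalDist, mean


def fit_slr(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    return {"intercept": intercept, "slope": slope, "sigma": sigma,
            "x_bar": x_bar, "sxx": sxx, "n": len(x)}


def prediction_interval(model, x_new, level=0.95):
    """Approximate prediction interval for a new observation at x_new."""
    # ASSUMPTION: normal quantile used in place of the t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    fit = model["intercept"] + model["slope"] * x_new
    spread = math.sqrt(1 + 1 / model["n"]
                       + (x_new - model["x_bar"]) ** 2 / model["sxx"])
    half_width = z * model["sigma"] * spread
    return fit - half_width, fit + half_width


def width_and_coverage(model, x_test, y_test, level=0.95):
    """Mean interval width and percentage of test values inside it."""
    intervals = [prediction_interval(model, xi, level) for xi in x_test]
    mean_width = mean(hi - lo for lo, hi in intervals)
    covered = sum(lo <= yi <= hi for (lo, hi), yi in zip(intervals, y_test))
    return mean_width, 100 * covered / len(y_test)


# Made-up training data (standing in for Yampa) and test data (Poudre).
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
model = fit_slr(x_train, y_train)
width, coverage = width_and_coverage(model, [2.5, 4.5], [2.6, 9.0])
```

Reporting width alongside coverage matters because a very wide interval can reach high coverage without being informative; the two numbers are best read together.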