Applied the limits suggested by Kat in the Notion page 'Basic wq data cleaning'.
## river parameterName n_prefilter n_cleaned pct_cleaned
## 1: yampa Chl-a 99127 34431 35
## 2: yampa Conductivity 97267 27934 29
## 3: yampa FDOM 99139 34140 34
## 4: yampa Turbidity 93806 25792 27
## river lab_parameter n_prefilter n_cleaned pct_cleaned
## 1: poudre Phosphorus 148 9 6.1
## 2: poudre Potassium 148 9 6.1
## 3: poudre TOC 148 9 6.1
## 4: poudre Conductivity 148 9 6.1
## 5: poudre Nitrate 148 9 6.1
## 6: poudre Kjeldahl 148 9 6.1
## 7: poudre TN 148 9 6.1
## 8: poudre Turbidity 148 9 6.1
## 9: poudre Chl-a 148 9 6.1
## 10: yampa Phosphorus 268 225 84.0
## 11: yampa Potassium 268 225 84.0
## 12: yampa TOC 268 225 84.0
## 13: yampa Conductivity 268 225 84.0
## 14: yampa Nitrate 268 225 84.0
## 15: yampa Kjeldahl 268 225 84.0
## 16: yampa TN 268 225 84.0
## 17: yampa Turbidity 268 225 84.0
## 18: yampa Chl-a 268 225 84.0
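The limit values themselves live in the Notion page; as a rough illustration, a limit filter of this kind can be applied with a lookup table of parameter-specific bounds. The limits, object names, and columns below are placeholders, not the documented values.

```r
library(data.table)

# Hypothetical parameter limits (placeholders, not the values from the Notion page).
limits <- data.table(
  parameterName = c("Chl-a", "Conductivity", "FDOM", "Turbidity"),
  min_value     = c(0, 0, 0, 0),
  max_value     = c(150, 2000, 300, 1000)
)

# Keep only readings inside the allowed range for their parameter.
clean_by_limits <- function(sensor_dt, limits) {
  merged <- merge(sensor_dt, limits, by = "parameterName")
  merged[value >= min_value & value <= max_value]
}
```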
Removed spikes in the sensor data that were more than 3 standard deviations away from both the preceding and the following observations. If relatively high values persist (i.e. are observed at more than one consecutive 15-min reading), they are retained. Below is an example from two days in August of turbidity measurements before and after cleaning.
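A minimal sketch of this despiking rule, assuming a data.table `sensor_dt` ordered by time with a numeric `value` column (object and column names are assumptions):

```r
library(data.table)

flag_spikes <- function(x, n_sd = 3) {
  s    <- sd(x, na.rm = TRUE)
  prev <- shift(x, 1L, type = "lag")
  nxt  <- shift(x, 1L, type = "lead")
  # A spike deviates from BOTH neighbours by more than n_sd standard deviations,
  # so a high value that persists across consecutive 15-min readings is retained.
  abs(x - prev) > n_sd * s & abs(x - nxt) > n_sd * s
}

# sensor_dt[, spike := flag_spikes(value), by = .(river, parameterName)]
# cleaned <- sensor_dt[is.na(spike) | spike == FALSE]
```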
Do these times of sample collection make sense?
## time_of_day_sample_collected N
## 1: 06 3
## 2: 07 11
## 3: 08 10
## 4: 09 6
## 5: 10 8
## 6: 11 35
## 7: 12 45
## 8: 13 40
## 9: 14 16
## 10: 15 30
## 11: 16 18
## 12: 17 8
## 13: 18 1
## 14: 20 1
## 15: 21 2
Rolling 24-hour mean, demonstrated for turbidity from two dates in August.
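A minimal sketch of the rolling window, assuming 15-minute readings (96 per 24 hours); the synthetic series and column names are illustrative only.

```r
library(data.table)

# Synthetic 15-minute turbidity series for illustration (96 readings per day).
dt <- data.table(
  datetime = seq(as.POSIXct("2021-08-01"), by = "15 min", length.out = 96 * 2),
  value    = runif(96 * 2, 0, 50)
)

# Right-aligned 24-hour rolling mean: each reading gets the mean of the preceding 24 hours.
dt[, roll24 := frollmean(value, n = 96, align = "right", na.rm = TRUE)]
```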
Predict lab water quality parameters from the rolling 24-hour mean of the sensor values.
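The results below come from a Super Learner ensemble (mean, glm, several xgboost and knn configurations). The exact package and learner wrappers behind the printed output are not reproduced here; as a hedged sketch of that kind of stack using the SuperLearner package and simulated data:

```r
library(SuperLearner)

set.seed(1)
# Simulated stand-in for the paired lab samples and their rolling 24-hour sensor means.
n <- 200
X <- data.frame(sensor_roll24 = runif(n, 100, 600))
Y <- 0.8 * X$sensor_roll24 + rnorm(n, sd = 20)

fit <- SuperLearner(
  Y = Y, X = X, family = gaussian(),
  SL.library = c("SL.mean", "SL.glm", "SL.xgboost"),
  cvControl  = list(V = 5)
)
fit$coef  # ensemble weights, analogous to the `coefficients` column printed below
```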
## [1] "Lab parameter: Conductivity - Sensor_parameter: Conductivity"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.000 10530 680 2369 7160 12448
## 2: glm 0.831 1127 193 543 505 1925
## 3: xgb.xgboost_1 0.000 128413 3964 46744 74985 184283
## 4: xgb.xgboost_2 0.000 128449 3963 46778 74985 184283
## 5: xgb.xgboost_3 0.000 128449 3963 46778 74985 184283
## 6: xgb.xgboost_4 0.000 54404 1855 22475 23227 78481
## 7: xgb.xgboost_5 0.000 54417 1849 22662 23230 78481
## 8: xgb.xgboost_6 0.000 54407 1849 22665 23230 78481
## 9: knn_1 0.056 1964 243 514 1270 2570
## 10: knn_2 0.056 1964 243 514 1270 2570
## 11: knn_3 0.056 1964 243 514 1270 2570
## 12: SuperLearner NA 1091 191 564 558 1909
## [1] "RMSE: 33.0308039545134"
## [1] "RMSE normalized: 0.0902492451323614"
## [1] "MAE: 24.026196364925"
## [1] "Coefficient of Variation: 13.0689618271284"
## [1] "Lab parameter: Turbidity - Sensor_parameter: Turbidity"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.0 133 47 89 32.4 234
## 2: glm 0.7 83 24 66 6.1 173
## 3: xgb.xgboost_1 0.0 169 54 136 9.8 311
## 4: xgb.xgboost_2 0.0 169 54 136 9.7 311
## 5: xgb.xgboost_3 0.0 169 54 136 9.7 312
## 6: xgb.xgboost_4 0.0 120 43 98 7.7 200
## 7: xgb.xgboost_5 0.0 121 43 99 7.9 205
## 8: xgb.xgboost_6 0.3 120 43 98 8.6 203
## 9: knn_1 0.0 114 39 83 27.4 216
## 10: knn_2 0.0 114 39 83 27.4 216
## 11: knn_3 0.0 114 39 83 27.4 216
## 12: SuperLearner NA 74 26 71 6.1 161
## [1] "RMSE: 8.62004900148416"
## [1] "RMSE normalized: 1.11657946246983"
## [1] "MAE: 4.36896670927391"
## [1] "Coefficient of Variation: 59.2185555754875"
## [1] "Lab parameter: TOC - Sensor_parameter: FDOM"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.000 1.88 0.136 1.29 0.44 3.56
## 2: glm 0.874 0.62 0.071 0.22 0.30 0.85
## 3: xgb.xgboost_1 0.000 22.99 0.759 5.72 15.89 28.81
## 4: xgb.xgboost_2 0.000 22.94 0.751 5.49 16.25 28.63
## 5: xgb.xgboost_3 0.000 22.94 0.751 5.49 16.25 28.63
## 6: xgb.xgboost_4 0.000 10.48 0.444 2.34 6.55 12.56
## 7: xgb.xgboost_5 0.013 10.36 0.456 1.91 7.02 11.78
## 8: xgb.xgboost_6 0.013 10.36 0.456 1.91 7.02 11.78
## 9: knn_1 0.034 1.86 0.131 1.22 0.44 3.43
## 10: knn_2 0.034 1.86 0.131 1.22 0.44 3.43
## 11: knn_3 0.034 1.86 0.131 1.22 0.44 3.43
## 12: SuperLearner NA 0.59 0.072 0.19 0.33 0.81
## [1] "RMSE: 0.770026385842371"
## [1] "RMSE normalized: 0.143845386786441"
## [1] "MAE: 0.590310180752545"
## [1] "Coefficient of Variation: 37.9637645133302"
## [1] "Lab parameter: TN - Sensor_parameter: FDOM"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.000 0.064 0.0056 0.032 0.023 0.091
## 2: glm 0.646 0.052 0.0071 0.031 0.028 0.103
## 3: xgb.xgboost_1 0.090 0.092 0.0066 0.057 0.037 0.172
## 4: xgb.xgboost_2 0.000 0.093 0.0067 0.059 0.038 0.179
## 5: xgb.xgboost_3 0.000 0.093 0.0067 0.059 0.040 0.179
## 6: xgb.xgboost_4 0.000 0.069 0.0056 0.052 0.026 0.148
## 7: xgb.xgboost_5 0.000 0.074 0.0062 0.059 0.029 0.170
## 8: xgb.xgboost_6 0.000 0.077 0.0066 0.062 0.030 0.179
## 9: knn_1 0.088 0.062 0.0055 0.030 0.023 0.087
## 10: knn_2 0.088 0.062 0.0055 0.030 0.023 0.087
## 11: knn_3 0.088 0.062 0.0055 0.030 0.023 0.087
## 12: SuperLearner NA 0.048 0.0057 0.028 0.022 0.084
## [1] "RMSE: 0.219782205340004"
## [1] "RMSE normalized: 0.731026752749025"
## [1] "MAE: 0.170337280157402"
## [1] "Coefficient of Variation: 83.9673769708658"
## [1] "Lab parameter: Kjeldahl - Sensor_parameter: FDOM"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.000 0.043 0.0035 0.018 0.017 0.064
## 2: glm 0.760 0.028 0.0031 0.009 0.019 0.042
## 3: xgb.xgboost_1 0.000 0.094 0.0064 0.054 0.045 0.168
## 4: xgb.xgboost_2 0.000 0.094 0.0064 0.054 0.046 0.169
## 5: xgb.xgboost_3 0.000 0.094 0.0064 0.054 0.046 0.169
## 6: xgb.xgboost_4 0.000 0.053 0.0039 0.031 0.026 0.101
## 7: xgb.xgboost_5 0.000 0.054 0.0040 0.032 0.025 0.104
## 8: xgb.xgboost_6 0.078 0.054 0.0040 0.031 0.024 0.101
## 9: knn_1 0.054 0.041 0.0035 0.017 0.017 0.062
## 10: knn_2 0.054 0.041 0.0035 0.017 0.017 0.062
## 11: knn_3 0.054 0.041 0.0035 0.017 0.017 0.062
## 12: SuperLearner NA 0.027 0.0028 0.010 0.015 0.037
## [1] "RMSE: 0.164654874689819"
## [1] "RMSE normalized: 0.665090651039663"
## [1] "MAE: 0.134745653812427"
## [1] "Coefficient of Variation: 71.0062427941777"
## [1] "Lab parameter: Nitrate - Sensor_parameter: FDOM"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.6521 0.0087 0.0028 0.0111 0.00046 0.027
## 2: glm 0.2343 0.0108 0.0029 0.0098 0.00495 0.028
## 3: xgb.xgboost_1 0.0000 0.1932 0.0037 0.0269 0.17034 0.238
## 4: xgb.xgboost_2 0.0039 0.1921 0.0036 0.0250 0.17034 0.233
## 5: xgb.xgboost_3 0.0039 0.1921 0.0036 0.0250 0.17034 0.233
## 6: xgb.xgboost_4 0.0000 0.0969 0.0035 0.0463 0.06343 0.178
## 7: xgb.xgboost_5 0.0000 0.1025 0.0045 0.0598 0.06343 0.208
## 8: xgb.xgboost_6 0.0000 0.0999 0.0041 0.0545 0.06343 0.196
## 9: knn_1 0.0352 0.0091 0.0028 0.0109 0.00102 0.027
## 10: knn_2 0.0352 0.0091 0.0028 0.0109 0.00102 0.027
## 11: knn_3 0.0352 0.0091 0.0028 0.0109 0.00102 0.027
## 12: SuperLearner NA 0.0085 0.0028 0.0109 0.00112 0.027
## [1] "RMSE: 0.092088224013393"
## [1] "RMSE normalized: 1.80900678975075"
## [1] "MAE: 0.0593522868363946"
## [1] "Coefficient of Variation: 104.810380544096"
## [1] "Lab parameter: TN - Sensor_parameter: Chl-a"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.000 0.064 0.0057 0.016 0.048 0.091
## 2: glm 0.865 0.045 0.0050 0.017 0.021 0.063
## 3: xgb.xgboost_1 0.000 0.094 0.0067 0.041 0.049 0.138
## 4: xgb.xgboost_2 0.000 0.094 0.0067 0.041 0.049 0.139
## 5: xgb.xgboost_3 0.013 0.094 0.0067 0.041 0.049 0.139
## 6: xgb.xgboost_4 0.000 0.072 0.0057 0.026 0.046 0.106
## 7: xgb.xgboost_5 0.000 0.073 0.0058 0.028 0.047 0.111
## 8: xgb.xgboost_6 0.000 0.074 0.0059 0.029 0.045 0.114
## 9: knn_1 0.041 0.064 0.0057 0.013 0.049 0.085
## 10: knn_2 0.041 0.064 0.0057 0.013 0.049 0.085
## 11: knn_3 0.041 0.064 0.0057 0.013 0.049 0.085
## 12: SuperLearner NA 0.044 0.0049 0.015 0.024 0.061
## [1] "RMSE: 0.210514196966526"
## [1] "RMSE normalized: 0.703354098099235"
## [1] "MAE: 0.168207705388649"
## [1] "Coefficient of Variation: 76.8406299098867"
## [1] "Lab parameter: Kjeldahl - Sensor_parameter: Chl-a"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.000 0.041 0.0034 0.013 0.022 0.055
## 2: glm 0.875 0.030 0.0036 0.016 0.017 0.058
## 3: xgb.xgboost_1 0.000 0.096 0.0065 0.039 0.051 0.138
## 4: xgb.xgboost_2 0.000 0.096 0.0065 0.039 0.051 0.138
## 5: xgb.xgboost_3 0.000 0.096 0.0065 0.039 0.051 0.138
## 6: xgb.xgboost_4 0.000 0.060 0.0044 0.023 0.034 0.089
## 7: xgb.xgboost_5 0.000 0.060 0.0044 0.024 0.034 0.090
## 8: xgb.xgboost_6 0.000 0.060 0.0043 0.023 0.035 0.089
## 9: knn_1 0.042 0.041 0.0035 0.010 0.023 0.048
## 10: knn_2 0.042 0.041 0.0035 0.010 0.023 0.048
## 11: knn_3 0.042 0.041 0.0035 0.010 0.023 0.048
## 12: SuperLearner NA 0.030 0.0034 0.015 0.019 0.055
## [1] "RMSE: 0.172179941701376"
## [1] "RMSE normalized: 0.69861948688877"
## [1] "MAE: 0.13784452263106"
## [1] "Coefficient of Variation: 77.4371506982431"
## [1] "Lab parameter: Nitrate - Sensor_parameter: Chl-a"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.4223 0.0095 0.0029 0.013 0.0022 0.033
## 2: glm 0.5719 0.0089 0.0028 0.012 0.0007 0.030
## 3: xgb.xgboost_1 0.0046 0.1911 0.0035 0.041 0.1388 0.226
## 4: xgb.xgboost_2 0.0000 0.1911 0.0035 0.041 0.1388 0.226
## 5: xgb.xgboost_3 0.0012 0.1911 0.0035 0.041 0.1388 0.226
## 6: xgb.xgboost_4 0.0000 0.0837 0.0026 0.027 0.0543 0.118
## 7: xgb.xgboost_5 0.0000 0.0829 0.0027 0.027 0.0543 0.117
## 8: xgb.xgboost_6 0.0000 0.0831 0.0027 0.027 0.0543 0.118
## 9: knn_1 0.0000 0.0099 0.0029 0.013 0.0027 0.034
## 10: knn_2 0.0000 0.0099 0.0029 0.013 0.0027 0.034
## 11: knn_3 0.0000 0.0099 0.0029 0.013 0.0027 0.034
## 12: SuperLearner NA 0.0082 0.0028 0.012 0.0010 0.031
## [1] "RMSE: 0.0902906125251513"
## [1] "RMSE normalized: 1.78168362030871"
## [1] "MAE: 0.0585938849504407"
## [1] "Coefficient of Variation: 101.067121804499"
Proceed with non-bias-corrected sensor data (bias correction is not necessary) and the rolling 24-hour sensor means as predictors (better performance).
For this outcome, the transfer function will use only sensor_value as the predictor.
## [1] "Lab parameter: Conductivity - Sensor_parameter: Conductivity"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.0038 10530 680 2369 7160 12448
## 2: glm 0.9949 1115 185 604 482 1913
## 3: xgb.xgboost_1 0.0000 128710 3973 47066 74989 185101
## 4: xgb.xgboost_2 0.0000 128710 3973 47066 74989 185101
## 5: xgb.xgboost_3 0.0000 128710 3973 47066 74989 185101
## 6: xgb.xgboost_4 0.0000 53394 1861 22990 25516 79218
## 7: xgb.xgboost_5 0.0000 53406 1861 22990 25566 79264
## 8: xgb.xgboost_6 0.0013 53417 1862 23001 25566 79264
## 9: knn_1 0.0000 1386 207 958 564 2851
## 10: knn_2 0.0000 1386 207 958 564 2851
## 11: knn_3 0.0000 1386 207 958 564 2851
## 12: SuperLearner NA 1114 184 600 498 1920
## [1] "RMSE: 33.3820393016348"
## [1] "RMSE normalized: 0.091208916746324"
## [1] "MAE: 23.9534414956678"
## [1] "Coefficient of Variation: 13.3483789971074"
Demonstration of uncertainty estimates for the transfer function, based on bias-corrected bootstrap replicates of the transfer function and normal-approximation, equi-tailed, two-sided 95% confidence intervals (i.e. statistic ± 1.96 * SD). In the final product, we will need to increase the number of replicates and consider the bias-corrected and accelerated (BCa) bootstrap interval.
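A minimal sketch of that interval using the boot package and simulated data; the statistic, replicate count, and data objects are illustrative assumptions, not the project settings.

```r
library(boot)

set.seed(1)
# Simulated paired data standing in for the lab samples and rolling sensor means.
dat <- data.frame(sensor_roll24 = runif(100, 100, 600))
dat$lab_value <- 0.8 * dat$sensor_roll24 + rnorm(100, sd = 20)

# Refit the transfer function on each bootstrap resample and predict at a new sensor value.
boot_pred <- function(d, idx, newdata) {
  fit <- glm(lab_value ~ sensor_roll24, data = d[idx, ])
  predict(fit, newdata = newdata)
}

b <- boot(dat, boot_pred, R = 500, newdata = data.frame(sensor_roll24 = 300))
boot.ci(b, type = "norm")  # bias-corrected normal approximation: estimate +/- 1.96 * SD
# For the final product, increase R and also compare type = "bca".
```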
Did not apply a transfer function here, since this sensor parameter does not describe any lab-measured parameter well; its output is not an informative predictor.
## [1] "Lab parameter: TN - Sensor_parameter: Chl-a"
## learner coefficients MSE se fold_sd fold_min_MSE fold_max_MSE
## 1: mean 0.23 0.064 0.0057 0.0164 0.048 0.091
## 2: glm 0.00 0.065 0.0057 0.0166 0.049 0.093
## 3: xgb.xgboost_1 0.00 0.094 0.0066 0.0408 0.049 0.140
## 4: xgb.xgboost_2 0.00 0.094 0.0066 0.0408 0.049 0.140
## 5: xgb.xgboost_3 0.00 0.094 0.0066 0.0406 0.049 0.139
## 6: xgb.xgboost_4 0.00 0.072 0.0053 0.0237 0.044 0.101
## 7: xgb.xgboost_5 0.00 0.072 0.0053 0.0234 0.043 0.098
## 8: xgb.xgboost_6 0.15 0.071 0.0052 0.0219 0.044 0.097
## 9: knn_1 0.21 0.059 0.0051 0.0101 0.051 0.077
## 10: knn_2 0.21 0.059 0.0051 0.0101 0.051 0.077
## 11: knn_3 0.21 0.059 0.0051 0.0101 0.051 0.077
## 12: SuperLearner NA 0.058 0.0049 0.0095 0.050 0.074
## [1] "RMSE: 0.240290701506969"
## [1] "RMSE normalized: 0.802841100863809"
## [1] "MAE: 0.193874136473176"
## [1] "Coefficient of Variation: 100.11567215922"