For the Vitens Open Data Challenge we analysed sensor data measured by the Smart Water Grid sensors that Vitens employs to monitor its network of drinking water pipes and conduits. Its central question: What are the correlations between changes in water quality measued by Eventlab sensors and other real-time measurements, statuses and alarm values? As a first step in our analysis we present the results of our exploratory analysis in this report.
Our most important result so far is that the variance of one of the turbidity variables (FR-PTW_TR01QI02PV) correlates strongly (negatively) and significantly (p<0.05) with the variance in two of the Eventlab variables (FR-MOALD-vitnor1 and FR-MOALD-vitnor3).
Additional documentation can be viewed here: http://bit.ly/1PBn7h2
Our data set contains 885 files and its total size is a little over 5 GB. All files are in .csv format and contain Timestamp-Value pairs. Furthermore the data files contain some additional information in the header comments.
The file name contains the variable that was measured, in format Dump_<variable>.csv:
## [1] "Dump_ACE-FR-Balansgebied-Buitenpost-levering.csv"
## [2] "Dump_ACE-FR-Balansgebied-Buitenpost-levering meetwaarden opgeteld.csv"
## [3] "Dump_ACE-FR-Balansgebied-Dokkum_Holwerd-levering.csv"
## [4] "Dump_ACE-FR-Balansgebied-Gorredijk-levering.csv"
According to the documentation measurements are done in the time range January 2014 to December 2015. We confirmed this by looking at the data. We also observe that some variables are only available in specific time ranges, often data is missing in large blocks of this 2-year time range (see graph below).
## Time Variable Value
## Min. :2014-01-21 14:03:00 Length:1916092 Min. : 0.000000
## 1st Qu.:2014-12-08 12:59:00 Class :character 1st Qu.: 0.006357
## Median :2015-04-02 12:17:00 Mode :character Median : 0.030431
## Mean :2015-03-23 07:40:38 Mean : 0.062391
## 3rd Qu.:2015-08-11 10:01:15 3rd Qu.: 0.079799
## Max. :2015-12-01 07:13:00 Max. :25.631880
The frequency of data logging is 1 measurement per minute at most, though certainly not for all variables.
In the documentation we see the following explanation of variable names:
| Quantity | Variable code | Remarks |
|---|---|---|
| Eventlab data | *vitnor* |
3 per sensor (vitnor1, vitnor2, vitnor3). Alarm codes are given in case 1 or more values above threshold (code Orange for 1 > max(vitnor*) >= 1.5, code Red for max(vitnor*) > 1.5) |
| Temperature | *TM* |
|
| Flow | *FT*, *VO* |
Negative values allowed (in case of opposite flow direction) |
| Pressure | *PT*, *DO* |
We found that these had to be followed by 2 digits (*PTxx*, *DOxx*) in order not to select incorrect variables |
| Conductivity | FR-PNB_TR00QI03PV*, FR-POH_-TR00QI03PV* |
|
| Acidity (pH) | FR-PNB_TR00QI01PV*, FR-POH_-TR00QI02PV*, FR-PSP-TR00QI01*, FR-PTW_TR01QI01PV* |
|
| Turbidity | FR-PTW_TR01QI02PV*, FR-PSP-TR00QI02*, FR-PNB_TR00QI02PV*, FR-POH_-TR00QI01PV* |
|
| Other | (None of the above) | Status values of pumps, reservoirs, valves, etc |
We are interested in Eventlab measurements (variable names *vitnor*) as response variables:
*vitnor1*, *vitnor2* and *vitnor3*)That leaves 699 variables that we can use as potential predictors:
The data quality of these variables varies. Some files do not even contain any data points.
Based on large blocks of consecutive data seen in exploratory plots and multiple peaks of Eventlab variables exceeding the threshold we decided to focus on the FR.MOBMS.vitnor1 variable as response variable as a first exploratory step:
Zooming in on a specific Eventlab alarm event starting around 2015-06-29 02:00:00 we observe that the Eventlab values can rise and exceed the threshold (>1 resp >1.5) in a matter of minutes, sometimes falling abruptly, sometimes more gradually, and that this process can repeat itself in less than an hour.
Some challenges we encountered:
Start of our approach was to do the following:
We decided to aggregate data for multiple reasons: in order to compress our data set, to increase the chances of different variables both having a value for a given time so that they can be correlated/compared, to reduce the number of peak Eventlab events (instead of one event per minute) and also to increase our chances of finding correlated signals in which a small time lag between variables is present (~ hour time scale).
With regard to subsetting, we selected temperature, flow, pressure, conductivity, acidity and turbidity as potential predictor variables and Eventlab measurements as response variables. We chose not to look into the ‘other’ measured quantities for now, especially since we have limited background of their meaning. Also, we looked at specific time ranges in order to simplify our investigation.
We then try to find relationships between measured quantities by looking at correlations of their aggregated statistics. For example: Since we are interested in changes (deltas), one approach is to correlate Eventlab hourly variance with hourly variance in flow measurements.
We notice many peak events above the orange and red thresholds (>1 resp >1.5) for Eventlab sensor data.
Furthermore, the graphs of hourly maximums and hourly variance look very similar because a high hourly max usually corresponds with an abrupt peak (within hour time window) meaning that the hourly variance is high as well.
We observe what looks like steam conduits with temperature 100 degrees Celcius, some hot water measurements in the 20 to 75 degrees region, plus seasonal changes in mean water temperature.
Mean flow is in the 0 to 700 m^3/hour region, with some outliers on both the positive and negative sides (could be leaks). In the graph below we also see negative outliers in mean pressure:
We have 2 measured variables for conductivity. In the plot above we see large drops in one of these conductivity variables between March and May 2014.
The single outlier in this unvalidated data makes the fluctuations in the above plot harder to read, although we see some clear positive and negative outliers.
We have 4 measured variables describing turbidity. We see some fluctuations in the long time range (months) and peaks/regions with higher turbidity. Turbidity could be related to water quality, although we are lacking specific domain knowledge here.
Because we are interested in correlated peak events, we also looked at the hourly variance of these variables, as a measure of their variability on the timescale of less than an hour.
Again, we see the seasonal differences in the hourly temperature variance.
Some flow variables change a lot more (2 orders of magnitude) than others.
Again, we see some peaks in which the hourly pressure variance is more than 1 order of magnitude higher than normal.
Hourly variance data is missing for a lot of the time, except for the PR.POH_TR
In order to investigate the relationship between different Eventlab alarm events within a specific time period, we plot a quick heat map:
We observe small groups (often triples) of more closely related variables. One triple displays a somewhat larger distance to the others, but furthermore no clear pattern is visible. The grey area on the right corresponds to Eventlab measurements that did not have an alarm value (above the threshold) at all in the given time frame.
A correlation plot for Eventlab and flow variances mainly seems to display correlations within the flow and Eventlab datasets, not between them:
A correlation plot of Eventlab and pressure variances displays a lot of correlations within the pressure and Eventlab datasets, not between them:
We find the following correlation plot for Eventlab, acidity, turbidity and conductivity variances:
And observe that one of the turbidity variables correlates strongly with two of the Eventlab variables! Note that this is a negative correlation though, requiring further research.
Last but not least: it’s safe to say that including more domain knowledge about measured quantities and how they were gathered may make it substantially easier to focus on more effective approaches and to draw meaningful conclusions.
Data set provided by Vitens.