This document discusses the analysis of hourly digital water meter data. The logic described in this report forms the basis of a Proof of Concept developed in MS SQL. The proof of concept is be the starting point for our final business intelligence solution for digital metering.
The data used in this paper is simulated using a standard diurnal curve, a stochastic number of occupants, randomly assigned leaks and level of accuracy. A portion of the data has been removed to simulate missing meter reads. Refer to a previous report for details of the simulation algorithm. The complete data set can be downloaded here.
The code in this paper is developed in R, using the Tidyverse data science library. The code is developed as a first step towards developing an Open Source library for digital metering data analysis in the R language.
This first code snippet reads the simulated data. In the production setting, data will be provided daily through an FTP service and stored on a database.
meter_reads <- read_csv("meter_reads.csv")
## Parsed with column specification:
## cols(
## DevEUI = col_integer(),
## TimeStampUTC = col_datetime(format = ""),
## Count = col_integer(),
## Forced = col_integer(),
## Tamper = col_integer()
## )
rtu <- unique(meter_reads$DevEUI)
All data is read into memory (meter_reads). The rtu variable contains the list of transmitters ids (DevEUI). The data contains 100 connection points from 2020-01-01 00:01:35 to 2020-04-09 23:59:50. The data consists of 230760 meter reads (23.0762804 reads per RTU per day).
Time stamps of the meter reads are stored in Universal Time Coordinate (UTC). All analysis is undertaken in the local time zone (Australia/Melbourne, AEST). The input for the functions described in this report is in AEST.
Digital meter data is a univariate time series with data generally Missing Completely At Random (MCAR). There is no confounding variable of interest that causes transmission issues. Because missing data is not caused by consumer behaviour (the variable of interest), there is no need to impute the missing points by default. The digital meters use cumulative reads, so consumption between the last two successful transmissions is known. The amount of missing data is limited through a contractual performance indicator. Missing data will thus not be imputed. Meter reads required at specific times not available in the data stream are determined through interpolation.
To facilitate analysis, two auxiliary functions are required:
Data analysis is undertaken on slices of the complete data set. This generic function slices the available data by a vector of RTU ids and a date range. This function adds a new timestamp variable in AEST. If no date range is provided, all available data for the selected RTUs is provided.
slice_reads <- function(rtus, dates = range(meter_reads$TimeStampUTC)) {
filter(meter_reads, DevEUI %in% rtus) %>%
mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC,
tz = "Australia/Melbourne"))) %>%
filter(TimeStampAEST >= as.POSIXct(dates[1]) &
TimeStampAEST <= as.POSIXct(dates[2])) %>%
arrange(DevEUI, TimeStampAEST)
}
The graph below shows the meter reads for the second RTU in the list, over two days. The RTUs transmit data at hourly intervals at a fixed random offset from the whole hour. The measured variable are the cumulative impeller counts. Each count represents a flow of five litres (for the Elster V100 meter). The graph shows zero flow at night, which indicates this property does not have a leak. The graph also shows some randomly missing data points.
slice_reads(rtu[2], c("2020-02-06", "2020-02-08")) %>%
ggplot(aes(x = TimeStampAEST, y = Count, colour = factor(DevEUI))) +
geom_line() + geom_point()
This function interpolates the cumulative counts for a series of RTUs over a vector of time stamps. The output is a data frame with DevEUI, timestamp in AEST and the cumulative count. This function forms the basis for further analysis. The example shows the counts for two meters over two days an the graph superimposes an interpolated point over the raw data.
Although the actual data only will show integer counts, interpolated values are numeric values. The decimals are retained to distinguish them from actual reads.
interpolate_count <- function(rtus, timestamps) {
timestamps <- as.POSIXct(timestamps, tz = "Australia/Melbourne")
results <- vector("list", length(rtus))
for (r in seq_along(rtus)) {
interp <- slice_reads(rtus[r]) %$%
approx(TimeStampAEST, Count, timestamps)
results[[r]] <- data_frame(DevEUI = rep(rtus[r], length(timestamps)),
TimeStampAEST = timestamps,
Count = interp$y)
}
return(do.call(rbind, results))
}
interpolate_count(rtu[2:3], seq.POSIXt(as.POSIXct("2020-02-01"), as.POSIXct("2020-02-2"), by = "day"))
slice_reads(rtu[2], c("2020-02-06 00:00", "2020-02-06 08:00")) %>%
ggplot(aes(x = TimeStampAEST, y = Count, colour = factor(DevEUI))) +
geom_line() + geom_point() +
geom_point(data = interpolate_count(rtu[2], "2020-02-06 06:00"), colour = "blue")
Daily flows are calculated by estimating the cumulative count at midnight through linear interpolation between the two nearest data points for each day, for each RTU. This function calculates daily consumption for a list of RTUs, between two dates. To determine the network flow, the consumption is grouped by date. There is minimal variation between daily flows due to the nature of the simulated data.
daily_consumption <- function(rtus, dates) {
timestamps <- seq.POSIXt(as.POSIXct(min(dates)) - 24 * 3600,
as.POSIXct(max(dates)), by = "day")
interpolate_count(rtus, timestamps) %>%
group_by(DevEUI) %>%
mutate(Consumption = c(0, diff(Count)) * 5,
Date = format(TimeStampAEST, "%F")) %>%
filter(TimeStampAEST != timestamps[1]) %>%
select(DevEUI, Date, Consumption)
}
daily_consumption(rtu[32:35], c("2020-02-01", "2020-02-7")) %>%
ggplot(aes(x = Date, y = Consumption)) + geom_col() +
facet_wrap(~DevEUI) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
daily_consumption(rtu[32:35], c("2020-02-01", "2020-02-7")) %>%
group_by(Date) %>%
summarise(Consumption = sum(Consumption)) %>%
ggplot(aes(x = Date, y = Consumption)) + geom_col() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The diurnal flow function use the slicing and interpolation functions described above. The function slides the data set is sliced and interpolates the cumulative reads for each hour. The function calculates the flow rate in litres per hour and groups the results by hour.
The diurnal curve function shows the range of flows for one or more connections for each hour over a period of time. The graph below shows the diurnal curve for RTU 12 to 20 for the month of February. The orange line indicates the mean flow and the shaded area indicates the maximum and minimum values.
plot_diurnal_connections <- function(rtus, dates) {
timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour")
interpolate_count(rtus, timestamps) %>%
mutate(Rate = c(0, diff(Count * 5)),
Hour = as.integer(format(TimeStampAEST, "%H"))) %>%
filter(Rate >= 0) %>%
group_by(Hour) %>%
summarise(min = min(Rate), mean = mean(Rate), max = max(Rate)) %>%
ggplot(aes(x = Hour, ymin = min, ymax = max)) +
geom_ribbon(fill = "lightblue", alpha = 0.5) +
geom_line(aes(x = Hour, y = mean), col = "orange", size = 1) +
ggtitle("Connections Diurnal flow") + ylab("Flow rate [L/h]")
}
plot_diurnal_connections(rtu[12:20], c("2020-02-01", "2020-03-01"))
The summed diurnal flow shows the minimum, average and maximum daily flow for a set of RTUs over a time period. This diurnal curve approximates the curve of the associated network flow meter.
plot_diurnal_network <- function(rtus, dates) {
timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour")
interpolate_count(rtus, timestamps) %>%
mutate(Rate = c(0, diff(Count * 5))) %>%
filter(Rate >= 0) %>%
group_by(TimeStampAEST) %>%
summarise (Rate = sum(Rate)) %>%
mutate(Hour = as.integer(format(TimeStampAEST, "%H"))) %>%
group_by(Hour) %>%
summarise(min = min(Rate), mean = mean(Rate), max = max(Rate)) %>%
ggplot(aes(x = Hour, ymin = min, ymax = max)) +
geom_ribbon(fill = "lightblue", alpha = 0.5) +
geom_line(aes(x = Hour, y = mean), col = "orange", size = 1) +
ggtitle("Network Diurnal flow") + ylab("Flow rate [L/h]")
}
plot_diurnal_network(rtu[12:20], c("2020-02-01", "2020-03-01"))
Diurnal flows can also be presented through a series of box plots. This approach provides more detailed statistical information about each data point. The horizontal line visualises the median flow and the box indicates the the 25th and 75th percentiles. The line is determined by the interquartile distance. The dots indicate outliers.
plot_diurnal_box <- function(rtus, dates) {
timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour")
interpolate_count(rtus, timestamps) %>%
mutate(Rate = c(0, diff(Count * 5)),
Hour = as.integer(format(TimeStampAEST, "%H"))) %>%
filter(Rate >= 0) %>%
group_by(Hour) %>%
ggplot(aes(x = factor(Hour), y = Rate)) +
geom_boxplot() +
ggtitle("Diurnal flow") + ylab("Flow rate [L/h]") + xlab("Time")
}
plot_diurnal_box(rtu[12:20], c("2020-02-01", "2020-03-01"))
The only determinant of water consumption is human behaviour. Customers use water to create value in their lives. A leak is defined by water consumption without a behavioural cause, in other words, water that has no value. Digital metering data can be used to identify properties with a possible leak. Although these leaks are not non-revenue-water, they impose a cost on the consumer and the natural environment. To detect a leak at a property, the algorithm needs to detect flows of water that wee not caused by human behaviour. Leak detection thus requires an understanding of consumer behaviour. Given the low amount of data we have about customers, leak detection needs to be based on assumptions and a stochastic approach to model human behaviour.
The main assumption is that the diurnal flow for urban connections without leaks will have periods of zero flow when occupants are either asleep or away. If a connection has a continuous flow over a certain period, this should be flagged as a possible leak. There could, however, be behavioural reasons for a continuous flow, such as automated sprinklers, evaporative cooling, shift workers an so on.
In this phase of benefits analysis, leak detection is limited to identifying connections without zero flow. This function investigates a set of RTUs over a time period and shows the ones without zero flow. The flow rate of the possible leak is defined by the lowest flow between reads in the data set. The diurnal curve shows water consumption for connection 1335858.
detect_leaks <- function(rtus, dates) {
slice_reads(rtus, dates) %>%
mutate(Time = c(0, diff(TimeStampAEST)),
Flow = c(0, diff(Count * 5)),
Rate = Flow / Time) %>%
filter(Time > 0) %>%
group_by(DevEUI) %>%
summarise(Leak= min(Flow)) %>%
filter(Leak > 0)
}
detect_leaks(rtu, c("2020-02-01", "2020-03-01"))
plot_diurnal_box(1335858, c("2020-02-01", "2020-03-01"))
This leak detection rule will be further developed to reduce false positives. This will be achieved by adding dynamic bandwith monitoring, based on code developed by J.H. FitiƩ of Vitens N.V. in the Netherlands. Their algorithm will be enhanced by adding maximum temperature as an input.
This work is copyright of Coliban Water and licensed under a Creative Commons Attribution-ShareAlike 3.0 Australia License.