Introduction

This script simulates digital metering data to assist with the development of data storage and analysis for the Coliban Water digital metering project, contracted to Ventia.

Using simulated data to develop databases and algorithms has two advantages. Firstly, there is no need to wait for the first data sets from Ventia. Secondly, the simulated data is calibrated because the consumption patterns, leaks and missing data points are predetermined. The predetermined nature of the data can be used to validate the algorithms. The purpose of this simulation is not to develop an accurate model of water consumption but simulate a pre-determined data set to assist with algorithm development.

This simulation is written in the R programming language, using the Tidyverse library.

Data structure

The raw data obtained from Ventia will contain five main variables:

  • \(DevEUI\): Unique ID of the Radio Telemetry Unit (RTU) attached to the water meter.
  • \(TimeStampUTC\): Timestamp of data received at the base station in UTC.
  • \(Count\): Cumulative number of impeller revolutions. Each revolution in an Elster V100 meter indicates five litres.
  • \(Forced\): Indicates whether the read was a regular value or a forced read (0/1)
  • \(Tamper\): Tamper flag (state of the RTU) in three different states:
    • No tamper
    • Tamper start (to help identify the beginning of tamper)
    • Ongoing tamper

The code in this report simulates these five variables.

Data storage and analysis

Ventia provides the raw data through daily CSV files through a secured FTP server. Each CSV file contains one day of data. The CSV file is downloaded daily by Coliban Water and stored in a database. This raw data is used for:

  • Performance monitoring (contractual KPIs)
  • Data aggregation for billing purposes
  • Anomaly detection
  • Specific data science projects (water resources, modelling etc.)

Initiatilisation

The first code snippet defines the boundary conditions and simulation parameters.

# Boundary conditions
n <- 100 # Number of simluated meters
d <- 100 # Number of days to simulate
s <- as.POSIXct("2020-01-01", tz = "UTC") # Start of simulation

All pseudo-random number sequences in this code are seeded so that the simulation is repeatable.

Device id numbers consists of random numbers with 6 digits.

Each RTU transmits hourly cumulative meter reads. To prevent network congestion, each unit transmits at a predetermined random time, e.g. 21:36 past every hour. The offset for each time stamp is a random number between 0 and 3599 seconds.

set.seed(1969) # Seed random number generator for reproducibility
rtu <- sample(1E6:2E6, n, replace = FALSE) # 6-digit id
offset <- sample(0:3599, n, replace = TRUE) # Unique Random offset for each RTU

Generic Diurnal Curve

Water consumption is simulated using a generic diurnal curve for indoor water use (Gurung et al., 2014). This curve is based on real data, including leaking connections. To create an idealised diurnal curve, night-time flows (0:00 - 3:00) are truncated to zero. The curve is also shifted by eleven hours because the raw data is stored in UTC.

diurnal <- round(c(1.36, 1.085, 0.98, 1.05, 1.58, 3.87, 9.37, 13.3, 12.1, 10.3, 8.44, 7.04, 6.11, 5.68, 5.58, 6.67, 8.32, 10.0, 9.37, 7.73, 6.59, 5.18, 3.55, 2.11)) - 1
tdiff <- 11
diurnal <- c(diurnal[(tdiff + 1): 24], diurnal[1:tdiff])

data.frame(TimeUTC = 0:23,
           Flow = diurnal) %>% 
  ggplot(aes(x = TimeUTC, y = Flow)) + 
  geom_area(fill = "dodgerblue2", alpha = 0.5) +
  scale_x_continuous(breaks = 0:23) + ylab("Flow [L/h/p]")

Each connection services a number of occupants based on a Poisson distribution with a mean of approximately 2.5 occupants per connection (\(\lambda=1.5\)).

occupants <- rpois(n, 1.5) + 1 # Number of occupants per connection

as.data.frame(occupants) %>%
  ggplot(aes(occupants)) + geom_bar(fill = "dodgerblue2", alpha = 0.5) + 
  xlab("Occupants") + ylab("Connections") + ggtitle("Occupants per connection")

The diurnal curve is not adjusted for seasonal influences.

Leak simulation

A leak is defined by a constant flow through the meter, in addition to the idealised diurnal curve. A weighted binomial distribution (\(\theta\) = 0.1) models approximately one in ten properties with a leak. The size of the leak is derived from a random number between 10 and 50 litres per hour.

leaks <- rbinom(n, 1, prob = .1) * sample(10:50, n, replace = TRUE)

data.frame(DevEUI = rtu, Leak = leaks) %>%
  subset(Leak > 0) %>%
  kable(row.names = FALSE, col.names = c("RTU", "Leak [L/h]"))
RTU Leak [L/h]
1335858 25
1176781 13
1648095 33
1794270 41
1242535 45
1022428 28
1878904 33

Digital metering data simulation

The next code snippet simulates the digital metering data using the assumptions and parameters outlined above.

The data is stored in a matrix through a loop that cycles through each connection. The DevEUI is repeated over the simulated time period (24 times the number of days). The second variable is the time stamp plus the predetermined offset for each RTU. The meter count is defied by the cumulative sum of the diurnal flow, multiplied by the number of occupants. Each point in the diurnal deviates from the model curve by ±10%. Any predetermined leakage is added on top of each meter read over the whole period of $d days. The simulated data does not contain any forced reads. Tampering is simulated by randomly assigning the number 0, 1 or 2 to the read, weighted towards selecting 0. The table shows the first rows of the simulated data.

meter_reads <- matrix(ncol = 5, nrow = 24 * n * d)
colnames(meter_reads) <- c("DevEUI", "TimeStampUTC", "Count", "Forced", "Tamper")

for (i in 1:n) {
  r <- ((i - 1) * 24 * d + 1):(i * 24 * d)
  meter_reads[r, 1] <- rep(rtu[i], each = (24 * d))
  meter_reads[r, 2] <- seq.POSIXt(s, by = "hour", length.out = 24 * d) + offset[i]
  meter_reads[r, 3] <- round(cumsum((rep(diurnal * occupants[i], d) + leaks[i]) * 
                                  runif(24 * d, 0.9, 1.1))/5)
  meter_reads[r,4] <- 0
  meter_reads[r,5] <- sample(c(rep(0,1000), 1, 2), (24 * d), replace = TRUE)
} 

meter_reads <- meter_reads %>% 
  as_data_frame() %>%
  mutate(TimeStampUTC = as.POSIXct(TimeStampUTC, origin = "1970-01-01", tz ="UTC"))

kable(head(meter_reads), row.names = FALSE)
DevEUI TimeStampUTC Count Forced Tamper
1217941 2020-01-01 00:39:33 5 0 0
1217941 2020-01-01 01:39:33 9 0 0
1217941 2020-01-01 02:39:33 13 0 0
1217941 2020-01-01 03:39:33 17 0 0
1217941 2020-01-01 04:39:33 22 0 0
1217941 2020-01-01 05:39:33 28 0 0

Missing Data Points

The data transmission process is not 100% reliable and the base station will not received some reads. This simulation identifies reads to be removed from the data through the temporary variable \(remove\).

This simulation includes two types of failures:

# Initialise temp variable
meter_reads <- mutate(meter_reads, remove = 0)

# Define faulty RTUs (2% of fleet)
faulty <- rtu[rbinom(n, 1, prob = 0.02) == 1]
meter_reads$remove[meter_reads$DevEUI %in% faulty] <- rbinom(sum(meter_reads$DevEUI %in% faulty), 1, prob = .95)

# Data loss
missing <- sample(1:(nrow(meter_reads) - 5), 0.005 * nrow(meter_reads))
for (m in missing){
  meter_reads[m:(m + sample(1:5, 1)), "remove"] <- 1
}

# Remove data points
meter_reads <- filter(meter_reads, remove == 0) %>%
  select(-remove)

To simulate missing data, a binomial distribution with \(\theta=0.02\) is used to determine the faulty meters. The faulty meters (1912240, 1584604) only transmit 5% of the collected data, also determined with a binomial distribution.

To simulate randomly missing data points, 0.5% of data points is identified and from those points, between 1 and 5 reads are removed. The as missing identified data points and the temporary variable are removed from the data.

Store data

The last code snippet stores the complete data file and the daily files on disk as CSV files. The daily files are used to develop FTP scripts and the total set to develop algorithms.

# Write complete set to disk
write.csv(meter_reads, "meter_reads.csv", row.names = FALSE)

# Write daily files to disk
for (i in 1:d) {
  day <- s + 24 * 3600 * (i - 1)
  write.csv(subset(meter_reads, as.Date(TimeStampUTC) == day), 
            paste0("digsim/meter_reads_", day, ".csv"),
            row.names = FALSE)
  }

The complete data set can be downloaded here.