This script simulates digital metering data to assist with the development of data storage and analysis for the Coliban Water digital metering project, contracted to Ventia.
Using simulated data to develop databases and algorithms has two advantages. Firstly, there is no need to wait for the first data sets from Ventia. Secondly, the simulated data is calibrated because the consumption patterns, leaks and missing data points are predetermined. The predetermined nature of the data can be used to validate the algorithms. The purpose of this simulation is not to develop an accurate model of water consumption but simulate a pre-determined data set to assist with algorithm development.
This simulation is written in the R programming language, using the Tidyverse library.
The raw data obtained from Ventia will contain five main variables:
The code in this report simulates these five variables.
Ventia provides the raw data through daily CSV files through a secured FTP server. Each CSV file contains one day of data. The CSV file is downloaded daily by Coliban Water and stored in a database. This raw data is used for:
The first code snippet defines the boundary conditions and simulation parameters.
# Boundary conditions
n <- 100 # Number of simluated meters
d <- 100 # Number of days to simulate
s <- as.POSIXct("2020-01-01", tz = "UTC") # Start of simulation
All pseudo-random number sequences in this code are seeded so that the simulation is repeatable.
Device id numbers consists of random numbers with 6 digits.
Each RTU transmits hourly cumulative meter reads. To prevent network congestion, each unit transmits at a predetermined random time, e.g. 21:36 past every hour. The offset for each time stamp is a random number between 0 and 3599 seconds.
set.seed(1969) # Seed random number generator for reproducibility
rtu <- sample(1E6:2E6, n, replace = FALSE) # 6-digit id
offset <- sample(0:3599, n, replace = TRUE) # Unique Random offset for each RTU
Water consumption is simulated using a generic diurnal curve for indoor water use (Gurung et al., 2014). This curve is based on real data, including leaking connections. To create an idealised diurnal curve, night-time flows (0:00 - 3:00) are truncated to zero. The curve is also shifted by eleven hours because the raw data is stored in UTC.
diurnal <- round(c(1.36, 1.085, 0.98, 1.05, 1.58, 3.87, 9.37, 13.3, 12.1, 10.3, 8.44, 7.04, 6.11, 5.68, 5.58, 6.67, 8.32, 10.0, 9.37, 7.73, 6.59, 5.18, 3.55, 2.11)) - 1
tdiff <- 11
diurnal <- c(diurnal[(tdiff + 1): 24], diurnal[1:tdiff])
data.frame(TimeUTC = 0:23,
Flow = diurnal) %>%
ggplot(aes(x = TimeUTC, y = Flow)) +
geom_area(fill = "dodgerblue2", alpha = 0.5) +
scale_x_continuous(breaks = 0:23) + ylab("Flow [L/h/p]")
Each connection services a number of occupants based on a Poisson distribution with a mean of approximately 2.5 occupants per connection (\(\lambda=1.5\)).
occupants <- rpois(n, 1.5) + 1 # Number of occupants per connection
as.data.frame(occupants) %>%
ggplot(aes(occupants)) + geom_bar(fill = "dodgerblue2", alpha = 0.5) +
xlab("Occupants") + ylab("Connections") + ggtitle("Occupants per connection")
The diurnal curve is not adjusted for seasonal influences.
A leak is defined by a constant flow through the meter, in addition to the idealised diurnal curve. A weighted binomial distribution (\(\theta\) = 0.1) models approximately one in ten properties with a leak. The size of the leak is derived from a random number between 10 and 50 litres per hour.
leaks <- rbinom(n, 1, prob = .1) * sample(10:50, n, replace = TRUE)
data.frame(DevEUI = rtu, Leak = leaks) %>%
subset(Leak > 0) %>%
kable(row.names = FALSE, col.names = c("RTU", "Leak [L/h]"))
RTU | Leak [L/h] |
---|---|
1335858 | 25 |
1176781 | 13 |
1648095 | 33 |
1794270 | 41 |
1242535 | 45 |
1022428 | 28 |
1878904 | 33 |
The next code snippet simulates the digital metering data using the assumptions and parameters outlined above.
The data is stored in a matrix through a loop that cycles through each connection. The DevEUI is repeated over the simulated time period (24 times the number of days). The second variable is the time stamp plus the predetermined offset for each RTU. The meter count is defied by the cumulative sum of the diurnal flow, multiplied by the number of occupants. Each point in the diurnal deviates from the model curve by ±10%. Any predetermined leakage is added on top of each meter read over the whole period of $d
days. The simulated data does not contain any forced reads. Tampering is simulated by randomly assigning the number 0, 1 or 2 to the read, weighted towards selecting 0. The table shows the first rows of the simulated data.
meter_reads <- matrix(ncol = 5, nrow = 24 * n * d)
colnames(meter_reads) <- c("DevEUI", "TimeStampUTC", "Count", "Forced", "Tamper")
for (i in 1:n) {
r <- ((i - 1) * 24 * d + 1):(i * 24 * d)
meter_reads[r, 1] <- rep(rtu[i], each = (24 * d))
meter_reads[r, 2] <- seq.POSIXt(s, by = "hour", length.out = 24 * d) + offset[i]
meter_reads[r, 3] <- round(cumsum((rep(diurnal * occupants[i], d) + leaks[i]) *
runif(24 * d, 0.9, 1.1))/5)
meter_reads[r,4] <- 0
meter_reads[r,5] <- sample(c(rep(0,1000), 1, 2), (24 * d), replace = TRUE)
}
meter_reads <- meter_reads %>%
as_data_frame() %>%
mutate(TimeStampUTC = as.POSIXct(TimeStampUTC, origin = "1970-01-01", tz ="UTC"))
kable(head(meter_reads), row.names = FALSE)
DevEUI | TimeStampUTC | Count | Forced | Tamper |
---|---|---|---|---|
1217941 | 2020-01-01 00:39:33 | 5 | 0 | 0 |
1217941 | 2020-01-01 01:39:33 | 9 | 0 | 0 |
1217941 | 2020-01-01 02:39:33 | 13 | 0 | 0 |
1217941 | 2020-01-01 03:39:33 | 17 | 0 | 0 |
1217941 | 2020-01-01 04:39:33 | 22 | 0 | 0 |
1217941 | 2020-01-01 05:39:33 | 28 | 0 | 0 |
The data transmission process is not 100% reliable and the base station will not received some reads. This simulation identifies reads to be removed from the data through the temporary variable \(remove\).
This simulation includes two types of failures:
# Initialise temp variable
meter_reads <- mutate(meter_reads, remove = 0)
# Define faulty RTUs (2% of fleet)
faulty <- rtu[rbinom(n, 1, prob = 0.02) == 1]
meter_reads$remove[meter_reads$DevEUI %in% faulty] <- rbinom(sum(meter_reads$DevEUI %in% faulty), 1, prob = .95)
# Data loss
missing <- sample(1:(nrow(meter_reads) - 5), 0.005 * nrow(meter_reads))
for (m in missing){
meter_reads[m:(m + sample(1:5, 1)), "remove"] <- 1
}
# Remove data points
meter_reads <- filter(meter_reads, remove == 0) %>%
select(-remove)
To simulate missing data, a binomial distribution with \(\theta=0.02\) is used to determine the faulty meters. The faulty meters (1912240, 1584604) only transmit 5% of the collected data, also determined with a binomial distribution.
To simulate randomly missing data points, 0.5% of data points is identified and from those points, between 1 and 5 reads are removed. The as missing identified data points and the temporary variable are removed from the data.
The last code snippet stores the complete data file and the daily files on disk as CSV files. The daily files are used to develop FTP scripts and the total set to develop algorithms.
# Write complete set to disk
write.csv(meter_reads, "meter_reads.csv", row.names = FALSE)
# Write daily files to disk
for (i in 1:d) {
day <- s + 24 * 3600 * (i - 1)
write.csv(subset(meter_reads, as.Date(TimeStampUTC) == day),
paste0("digsim/meter_reads_", day, ".csv"),
row.names = FALSE)
}
The complete data set can be downloaded here.
This work is copyright of Coliban Water and licensed under a Creative Commons Attribution-ShareAlike 3.0 Australia License.