This is in response to my forum post on RStudio Community: https://community.rstudio.com/t/combining-data-frames-based-on-column-between-another-column/62539

It’s important to post the solution you found after publicly asking for help.

I’m trying to combine two data tables: one of events that have been grouped into zones, and another of sensor measurements. My goal is to compute summary statistics based on the grouped events.

library(ggplot2)
library(dplyr)
library(lubridate)
library(tidyr) # For `fill`ing in values up and down
library(forcats) # For factors in ggplot2
library(knitr) # For tables to look nice
set.seed(1)

The first table of events and zones looks like this:

events2 <- data.frame(
  event_id = seq(1, 6),
  note = c("a", "e", "i", "o", "u", "y"),
  client_time = c("2020-02-01 08:12:00", "2020-02-01 08:16:00", "2020-02-01 08:22:00",
                  "2020-02-01 13:38:00", "2020-02-01 21:02:00", "2020-02-01 21:20:00"),
  zone_id = c(1, 1, 1, 2, 3, 3),
  zone_start = c(rep("2020-02-01 08:12:00", 3), "2020-02-01 13:38:00", rep("2020-02-01 21:02:00", 2)),
  zone_end = c(rep("2020-02-01 12:30:00", 3), "2020-02-01 15:38:00", rep("2020-02-01 22:34:00", 2))
) %>%
  mutate(client_time = ymd_hms(client_time, tz = "UTC"),
         zone_start = ymd_hms(zone_start, tz = "UTC"),
         zone_end = ymd_hms(zone_end, tz = "UTC"))
head(events2) %>% kable(caption = "A table of events and zones")
Table: A table of events and zones

| event_id | note | client_time         | zone_id | zone_start          | zone_end            |
|---------:|:-----|:--------------------|--------:|:--------------------|:--------------------|
|        1 | a    | 2020-02-01 08:12:00 |       1 | 2020-02-01 08:12:00 | 2020-02-01 12:30:00 |
|        2 | e    | 2020-02-01 08:16:00 |       1 | 2020-02-01 08:12:00 | 2020-02-01 12:30:00 |
|        3 | i    | 2020-02-01 08:22:00 |       1 | 2020-02-01 08:12:00 | 2020-02-01 12:30:00 |
|        4 | o    | 2020-02-01 13:38:00 |       2 | 2020-02-01 13:38:00 | 2020-02-01 15:38:00 |
|        5 | u    | 2020-02-01 21:02:00 |       3 | 2020-02-01 21:02:00 | 2020-02-01 22:34:00 |
|        6 | y    | 2020-02-01 21:20:00 |       3 | 2020-02-01 21:02:00 | 2020-02-01 22:34:00 |

The second table of measurements is much, much longer, and looks like this:

measurements <- data.frame(
  client_time = seq(as.POSIXct("2020-02-01 00:00:00", tz = "UTC"),
                    length.out = 1440 / 15, by = "15 min"), # One day of 15-minute readings
  value = round(abs(120 * rnorm(1440 / 15)) + 40) # 40 plus half-normal noise (sd 120), so roughly 40 to 400
)
head(measurements) %>% kable(caption = "Measurements from a sensor, maybe one nearby?")
Table: Measurements from a sensor, maybe one nearby?

| client_time         | value |
|:--------------------|------:|
| 2020-02-01 00:00:00 |   115 |
| 2020-02-01 00:15:00 |    62 |
| 2020-02-01 00:30:00 |   140 |
| 2020-02-01 00:45:00 |   231 |
| 2020-02-01 01:00:00 |    80 |
| 2020-02-01 01:15:00 |   138 |

I’m trying to create a new data frame that joins the measurements to the zoned events based on whether a measurement’s client_time falls between the zone start and end. I’d also like to add a column indicating in which zone each measurement took place.

The zones do not overlap, and I do not anticipate the zones overlapping in the future.
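
Because everything below leans on that assumption, it’s cheap to verify in code. A small sketch: sort the distinct zones by start time and compare each one to the zone before it.

events2 %>%
  distinct(zone_id, zone_start, zone_end) %>%
  arrange(zone_start) %>%
  mutate(overlaps_previous = zone_start < lag(zone_end))
# All FALSE (NA for the first zone) means no overlaps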

ggplot() + 
  geom_rect(data=events2, aes(xmin=zone_start, xmax=zone_end, ymin=0, ymax=400, color=as.factor(zone_id)), fill=NA) +
  geom_line(data=measurements, aes(x=client_time, y=value)) +
  scale_x_datetime(date_breaks = '3 hours', date_labels="%I %p", minor_breaks = '1 hour', timezone = "GMT") +
  labs(title="On the same graph, but far from ideal", color="Zone ID")

In this graph, I’d like the lines that make up the measurements to be associated/joined with the zones.
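
As an aside, “is this time between that start and end” is an interval join, and there are packages that express it directly. Here is a sketch with the fuzzyjoin package (not what I use below, and assuming fuzzyjoin is installed):

zones <- distinct(events2, zone_id, zone_start, zone_end)
fuzzyjoin::fuzzy_left_join(
  measurements, zones,
  by = c("client_time" = "zone_start",
         "client_time" = "zone_end"),
  match_fun = list(`>=`, `<=`) # client_time >= zone_start, client_time <= zone_end
)

Measurements outside every zone come back with NA in the zone columns, so a drop_na(zone_id) at the end would give the trimmed version in one shot. I wanted to keep the dependencies down, so I took a fill-based route instead.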

Solution

To start, I bind the two tables together, then arrange by the timestamp column. Before grouping, each sensor measurement is given the zone_id of the nearest preceding event, falling back to the nearest following event for measurements that come before the first event.

I intuited that down was the way to start, and settled on downup after experimenting.
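
A toy illustration of the difference: “down” carries the last seen value forward, which strands anything before the first non-missing row, while “downup” then back-fills those leading rows. In this post’s data, that covers the measurements taken before the first event of the day (they get zone 1 for now, and are trimmed off later by the time filter).

tibble(x = c(NA, 1, NA, NA, 2, NA)) %>%
  mutate(down = x, downup = x) %>%
  fill(down, .direction = "down") %>%   # down:   NA 1 1 1 2 2
  fill(downup, .direction = "downup")   # downup:  1 1 1 1 2 2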

zoned <- bind_rows(events2, measurements) %>%
  as_tibble() %>%
  arrange(client_time) %>%
  fill(zone_id, .direction = "downup")
zoned %>% 
  ggplot(aes(x=client_time)) + 
  geom_rect(aes(xmin=zone_start, xmax=zone_end, ymin=0, ymax=400, color=as.factor(zone_id)), fill=NA) +
  geom_point(aes(y=100, color=as.factor(zone_id), shape=as.factor(event_id)), fill=NA) +
  geom_line(aes(y=value,  color=as_factor(zone_id))) +
  scale_x_datetime(date_breaks = '3 hours', date_labels="%I %p", minor_breaks = '1 hour', timezone = "GMT") +
  labs(title="Measurements have zones, but still too many of them", color="Zone ID", shape="Event ID")
## Warning: Removed 96 rows containing missing values (geom_rect).
## Warning: Removed 96 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_path).

From the above graphic, we have the measurements associated with the zones, but we still have excess measurements outside the zones that are not needed.

Before those measurements can be trimmed, we need to fill zone_start and zone_end into the measurement rows. I tried filter(client_time < zone_end) first, but that didn’t work: zone_end was still NA on the measurement rows, so the comparison evaluated to NA and filter() dropped those rows.
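
That failure is easy to reproduce in miniature: filter() keeps rows only where the condition is TRUE, so a condition that evaluates to NA drops the row rather than keeping it.

tibble(x = 1:3, keep = c(TRUE, NA, FALSE)) %>%
  filter(keep) # returns just the x == 1 row; the NA row is silently dropped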

It’s important to use >= and <= so that events falling exactly on a zone’s start or end get included.

final <- zoned %>%
  fill(zone_end, .direction = "downup") %>%
  fill(zone_start, .direction = "downup") %>%
  group_by(zone_id) %>%
  filter(client_time >= zone_start, client_time <= zone_end)
ggplot(final) + 
  geom_rect(aes(xmin=zone_start, xmax=zone_end, ymin=0, ymax=400, color=as.factor(zone_id)), fill=NA) +
  geom_point(aes(x=client_time, y=100, color=as.factor(zone_id), shape=as.factor(event_id)), fill=NA) +
  geom_line(aes(x=client_time, y=value,color=as_factor(zone_id))) +
  scale_x_datetime(date_breaks = '3 hours', date_labels="%I %p", minor_breaks = '1 hour', timezone = "GMT") +
  labs(title="The ideal, exactly what I need", color="Zone ID", shape="Event ID")
## Warning: Removed 32 rows containing missing values (geom_point).
## Warning: Removed 3 rows containing missing values (geom_path).

From the above graph, it looks like we’ve snipped the measurements and associated them in the way I need. The event rows still carry NA values; if you want those interpolated, add a mutate(value = zoo::na.approx(value)) on the end. The zoo:: prefix keeps the list of libraries loaded for this example short.
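
For reference, here is a sketch of that interpolation, assuming zoo is installed. Since final is still grouped by zone_id, the interpolation won’t bridge the gaps between zones, and na.rm = FALSE keeps the vector length intact so mutate() is happy.

final_interpolated <- final %>%
  mutate(value = zoo::na.approx(value, x = as.numeric(client_time), na.rm = FALSE))
# Leading NAs within a zone (events before the first measurement) stay NA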

bind_rows(events2, measurements) %>% # Combine into one frame
  arrange(client_time) %>% # Sort by timestamp
  fill(zone_start, .direction = "downup") %>% # Fill zone starts onto measurements, for filtering
  fill(zone_end, .direction = "downup") %>% # Fill zone ends onto measurements, for filtering
  fill(zone_id, .direction = "downup") %>% # Fill zone ids onto measurements, for grouping
  group_by(zone_id) %>% # Group by zone
  filter(client_time >= zone_start, client_time <= zone_end) %>% # Drop rows outside their zone
  ungroup() %>%
  slice(29:34) # A slice from the middle of the loaf, to show it worked
## # A tibble: 6 x 7
##   event_id note  client_time         zone_id zone_start         
##      <int> <fct> <dttm>                <dbl> <dttm>             
## 1       NA <NA>  2020-02-01 15:15:00       2 2020-02-01 13:38:00
## 2       NA <NA>  2020-02-01 15:30:00       2 2020-02-01 13:38:00
## 3        5 u     2020-02-01 21:02:00       3 2020-02-01 21:02:00
## 4       NA <NA>  2020-02-01 21:15:00       3 2020-02-01 21:02:00
## 5        6 y     2020-02-01 21:20:00       3 2020-02-01 21:02:00
## 6       NA <NA>  2020-02-01 21:30:00       3 2020-02-01 21:02:00
## # … with 2 more variables: zone_end <dttm>, value <dbl>

As you can see, the resulting table contains measurements only within the relevant zones.
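
With the join done, the original goal of summary statistics per zone falls out directly. A sketch (final is still grouped by zone_id, so summarise() works per zone; pick whatever statistics you actually need):

final %>%
  filter(!is.na(value)) %>% # keep the sensor rows; event rows carry NA values
  summarise(
    n_measurements = n(),
    mean_value = mean(value),
    min_value = min(value),
    max_value = max(value)
  )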

Time to get back to my day job, and on to the change point detection that needs to be done.