Imagine a large, convoluted table produced by joining several other tables together. After joining everything, more rows came out than expected! We need to group everything by the primary key (the one we thought would keep everything straight and guide us through the join operations) and find out which columns prevented the joins from working as expected. Here is a very simple, small example:
library(tidyverse) # tibble, dplyr, and purrr are all used below

bigtable <- tibble(
  id = c(1, 1, 2, 3, 3, 3),
  val1 = c(12, 12, 13, 14, 15, 15),
  val2 = c("orange", "orange", "carrot", "grape", "grape", "grape"),
  val3 = c("a", "b", "c", "f", "g", "d")
)
bigtable
## # A tibble: 6 × 4
## id val1 val2 val3
## <dbl> <dbl> <chr> <chr>
## 1 1 12 orange a
## 2 1 12 orange b
## 3 2 13 carrot c
## 4 3 14 grape f
## 5 3 15 grape g
## 6 3 15 grape d
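For context, here is a hedged sketch (with made-up table names, not the original data) of the kind of join that produces a table like this - a duplicated key on one side fans out the matching rows:
measurements <- tibble(id = c(1, 2, 3), val1 = c(12, 13, 14))
labels <- tibble(id = c(1, 1, 2, 3), val3 = c("a", "b", "c", "f"))
# id 1 matches two rows in labels, so the join returns four rows instead of three
left_join(measurements, labels, by = "id")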
And we want to figure out what sets the rows within each group apart. That seems easy enough: just keep only the columns whose values differ within each ID group! In reality this problem came from a much, much larger table, but the small one will do for a demonstration.
So this was my first pass - filtering by n() > 1 keeps only the multi-row groups, which are the problem ones we are interested in.
bigtable %>%
  group_by(id) %>%
  filter( n() > 1 ) %>%
  group_modify(function(dat, key) {
    dat %>%
      select(!where(~ is_equal(.x)))
  })
## Error in is_equal(.x): could not find function "is_equal"
And I was surprised! The purrr package has functions like is_atomic, is_integer, is_list, is_logical, even is_formula! But no is_equal.
So I did some digging for base R functions, and the closest thing I found that I thought I could use was combn - since we are just looking for simple comparisons, I figured it would at least give me the possible pairs from any vector. I was a little surprised that combn doesn't repeat elements (i.e. it chooses without replacement), but at least for our purposes that's actually more efficient.
combn(c(1, 2, 3), m = 2, simplify = FALSE)
## [[1]]
## [1] 1 2
##
## [[2]]
## [1] 1 3
##
## [[3]]
## [1] 2 3
And you can see how we could easily turn this into the rest of the function:
combn_equal <- function(vec) {
  vec %>%
    combn(m = 2, simplify = FALSE) %>%
    map_lgl(function(pair) pair[1] == pair[2]) %>%
    all()
}
combn_equal(c(1, 2, 3))
## [1] FALSE
Once we throw in some NA values, there would obviously need to be some control flow depending on how you interpret NA == val; we'll deal with that properly later.
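Just to make the issue concrete: comparing anything against NA doesn't return TRUE or FALSE at all, it returns NA, which is why == on its own isn't enough.
NA == "orange"
## [1] NA
all(c(TRUE, NA))
## [1] NA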
This wasn't particularly hard to come up with, but I was (and am) surprised that this functionality isn't included in the otherwise extremely comprehensive purrr and dplyr packages.
Let's get rid of the combn call we're hanging onto and just do the work ourselves. All we really need is to walk the n * n grid of pairs while skipping the repeats, which is exactly what combn so generously does for us.
make_pairs <- function(vec) {
  results <- list()
  index <- 1
  for (i in seq(1, length(vec))) {
    for (j in seq(i, length(vec))) {
      results[[index]] <- c(vec[i], vec[j])
      index <- index + 1
    }
  }
  results
}
But obviously this includes the trivial pairs where i == j, which we know will always compare equal, so we can offset the inner loop by one to skip over those.
make_pairs <- function(vec) {
  results <- list()
  index <- 1
  for (i in seq(1, length(vec) - 1)) {
    for (j in seq(i + 1, length(vec))) {
      results[[index]] <- c(vec[i], vec[j])
      index <- index + 1
    }
  }
  results
}
make_pairs(c("a", "b", "c"))
## [[1]]
## [1] "a" "b"
##
## [[2]]
## [1] "a" "c"
##
## [[3]]
## [1] "b" "c"
So now that we have our own combn replacement set up, we can add short-circuiting comparisons to save some time - once one comparison fails, we know we can return FALSE for the whole set.
We also need to add some checks for NA and for length-one (or empty) inputs. In this instance (in my opinion), partially missing values should return FALSE, although in tidyverse style I'm sure someone could add an argument controlling how NA is handled - a rough sketch of one possible argument follows the example below.
If this looks sloppy (it is), see the bonus section below for a more elegant, functional solution.
is_equal <- function(vec) {
  if ( length(vec) <= 1 | all(is.na(vec)) ) return(TRUE)
  else if (any(is.na(vec))) return(FALSE)
  else {
    for ( i in seq(1, length(vec) - 1) ) {
      for ( j in seq(i + 1, length(vec)) ) {
        if (vec[i] != vec[j]) return(FALSE)
      }
    }
  }
  return(TRUE)
}
is_equal(c("a", "b", NA))
## [1] FALSE
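As an aside, here is a rough sketch of what such an NA-handling argument could look like - the na_rm name is purely my own invention, loosely echoing base R's na.rm:
is_equal_na <- function(vec, na_rm = FALSE) {
  # optionally drop missing values before comparing
  if (na_rm) vec <- vec[!is.na(vec)]
  if (length(vec) <= 1 | all(is.na(vec))) return(TRUE)
  if (any(is.na(vec))) return(FALSE)
  # with NAs out of the way, equality is just "one distinct value"
  length(unique(vec)) == 1
}
is_equal_na(c("a", "a", NA))
## [1] FALSE
is_equal_na(c("a", "a", NA), na_rm = TRUE)
## [1] TRUE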
So after all of this, we can finally use our is_equal function to figure out what was causing the problem with our joined table.
bigtable %>%
  group_by(id) %>%
  filter( n() > 1 ) %>%
  group_modify(function(dat, key) {
    dat %>%
      select(!where( ~is_equal(.x) ))
  }) %>%
  ungroup()
## # A tibble: 5 × 3
## id val3 val1
## <dbl> <chr> <dbl>
## 1 1 a NA
## 2 1 b NA
## 3 3 f 14
## 4 3 g 15
## 5 3 d 15
Now we can see where the issues were - even better, thanks to the grouping we can see that more than one column was problematic for id == 3. This could be made more explicit by using group_map, so that each group keeps only the differentiating columns associated with it - a quick sketch of that follows.
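Here's a quick sketch of that group_map variant; it returns a list with one tibble per id, each holding only the columns that actually vary within that group (just val3 for id 1, and val1 plus val3 for id 3), with no NA-padded columns:
bigtable %>%
  group_by(id) %>%
  filter( n() > 1 ) %>%
  group_map(function(dat, key) {
    # keep only the columns that vary within this group
    dat %>%
      select(!where( ~is_equal(.x) ))
  })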
When we have datasets with a silly number of unknown columns, being able to quickly tell which columns are the differentiating factors within groups - other than our presumed primary key - is extremely valuable, especially with very-not-tidy data. I hope this simple application demonstrates the utility of an is_equal function.
I wanted to see if I could replicate this using a more functional style of programming, with something like accumulate or reduce. Obviously anything that works with a for loop could also be done with recursion, so we'll just skip that part.
A fun way to do it is with Boolean logic - as far as I can tell, this only works if we negate the final result, since we're using the or operator to flip the running result to TRUE as soon as any pair is not equal. We never build the full list of combinations, only adjacent pairs, but that doesn't matter: if every adjacent pair is equal, then by transitivity every other combination must be equal too, so if we make it to the end and the result is still FALSE we know the whole vector is equal.
We still need a few lines to deal with missing values, but overall it's a lot cleaner and, I think, more clever!
mapreduce_equal <- function(vec) {
  if (all(is.na(vec)) | length(vec) <= 1) TRUE
  else if (any(is.na(vec))) FALSE
  else {
    1:(length(vec) - 1) %>%
      map_lgl(function(i) vec[i] != vec[i + 1]) %>%
      reduce(`|`) %>%
      {!.}
  }
}
mapreduce_equal(c(1, 2, 2, 2))
## [1] FALSE
And here it is in action with the original application:
bigtable %>%
  group_by(id) %>%
  filter( n() > 1 ) %>%
  group_modify(function(dat, key) {
    dat %>%
      select(!where( ~mapreduce_equal(.x) ))
  }) %>%
  ungroup()
## # A tibble: 5 × 3
## id val3 val1
## <dbl> <chr> <dbl>
## 1 1 a NA
## 2 1 b NA
## 3 3 f 14
## 4 3 g 15
## 5 3 d 15
I couldn’t resist comparing things, so here is a microbenchmark setup
to compare the nested for loop solution to the purrr
functional solution. I also threw in the complete solution using
combn, listed below:
combn_equal <- function(vec) {
  if (all(is.na(vec)) | length(vec) <= 1) TRUE
  else if (any(is.na(vec))) FALSE
  else {
    vec %>%
      combn(m = 2, simplify = FALSE) %>%
      map_lgl(function(pair) pair[1] == pair[2]) %>%
      all()
  }
}
And now we can compare our for loop, mapreduce, and combn implementations.
benchmark <- function(x) {
  microbenchmark::microbenchmark(
    forloop = is_equal(x),
    comb = combn_equal(x),
    mapreduce = mapreduce_equal(x),
    times = 999,
    check = 'equal',
    unit = "microseconds"
  ) %>%
    print()
}
sample(c(1:5), 100, replace = TRUE) %>%
benchmark()
## Unit: microseconds
## expr min lq mean median uq max neval
## forloop 5.207 5.9040 8.061076 6.929 9.3685 26.240 999
## comb 2918.626 3000.5235 3226.369743 3048.473 3110.8340 45003.486 999
## mapreduce 213.446 224.2495 262.651294 235.258 262.0720 1930.157 999
In this test (no missing values) the nested for loop (is_equal) comes out on top, as expected. Thanks to the short-circuiting return statements we rarely have to get through the entire vec, while both of the other functions consider the input in its entirety.
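If you want to see the short-circuiting at work, one quick (purely illustrative, timings not shown here) comparison is to hand is_equal a vector that fails on the very first comparison versus one where every pair has to be checked:
fails_fast <- c(1, 2, rep(1, 98)) # first comparison already differs, returns immediately
all_same <- rep(1, 100)           # all 4950 pairs must be checked before returning TRUE
microbenchmark::microbenchmark(
  fast = is_equal(fails_fast),
  slow = is_equal(all_same)
)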
However, the surprise is that mapreduce_equal actually beats combn_equal by a large margin. I'm no R performance expert, but I'd imagine there's a significant amount of overhead in combn returning a list of vectors versus keeping everything in plain vectors in our mapreduce implementation.
Once we add missing values into the mix, though, the performance becomes essentially identical, thanks to the quick checks we can run before touching the rest of the vector at all:
sample(c(1:10, NA), 100, replace = TRUE) %>%
benchmark()
## Unit: microseconds
## expr min lq mean median uq max neval
## forloop 0.656 0.697 0.7736647 0.697 0.697 10.086 999
## comb 0.615 0.656 0.7657848 0.656 0.697 13.489 999
## mapreduce 0.615 0.656 0.7870440 0.656 0.697 7.462 999
Here, since an NA is virtually guaranteed to be present in the test vector, the results are more or less equivalent. Just for fun, we can test a much shorter vec to give our non-short-circuiting functions a chance:
sample(c(1:5), 5, replace = TRUE) %>%
benchmark()
## Unit: microseconds
## expr min lq mean median uq max neval
## forloop 4.674 5.289 5.731135 5.535 5.904 13.981 999
## comb 22.345 23.739 25.381175 24.436 26.650 60.844 999
## mapreduce 57.892 59.860 69.311464 61.049 66.338 4661.905 999
Here I can only assume that the overhead of the list produced by combn is much smaller for a short input, and that perhaps base R's all() allows it to beat the reduce call in mapreduce_equal. Either way, I think mapreduce_equal is still my personal favorite of the bunch!