The Problem

Imagine a large, convoluted table that was produced by the joining of multiple other tables. After joining everything together, more rows were produced than expected! We need to group everything together by the primary key (the one we thought would keep everything straight and guide us through the join operations) and find out which columns prevented the joins from working as expected. This is a very simple, small example:

library(tibble)
library(dplyr)
library(purrr)

bigtable <- tibble(
  id = c(1, 1, 2, 3, 3, 3),
  val1 = c(12, 12, 13, 14, 15, 15),
  val2 = c("orange", "orange", "carrot", "grape", "grape", "grape"),
  val3 = c("a", "b", "c", "f", "g", "d")
)
bigtable
## # A tibble: 6 × 4
##      id  val1 val2   val3 
##   <dbl> <dbl> <chr>  <chr>
## 1     1    12 orange a    
## 2     1    12 orange b    
## 3     2    13 carrot c    
## 4     3    14 grape  f    
## 5     3    15 grape  g    
## 6     3    15 grape  d

We want to figure out what sets apart the otherwise-duplicated rows. That seems easy enough: just keep only the columns whose values differ within each id group! In reality, this problem was presented by a much, much larger table.
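
Just to double-check which ids picked up extra rows in the first place, a quick count does the trick:

bigtable %>% 
  count(id) %>% 
  filter(n > 1)
## # A tibble: 2 × 2
##      id     n
##   <dbl> <int>
## 1     1     2
## 2     3     3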

So this was my first pass: filtering by n() > 1 keeps only the multi-row groups, which are the problem groups we are interested in.

bigtable %>% 
  group_by(id) %>% 
  filter( n() > 1 ) %>% 
  group_modify(function(dat, key) {
    dat %>% 
      select(!where(~ is_equal(.x)))
  })
## Error in is_equal(.x): could not find function "is_equal"

And I was surprised! purrr has functions like is_atomic, is_integer, is_list, is_logical, even is_formula! But no is_equal.

The (quick) Solution

So I did some digging for base R functions, and the closest thing I found that I thought I could use was combn. Since we are just looking for simple comparisons, I figured that would at least give me the possible combinations of any vector’s elements. I was a little surprised that combn doesn’t repeat values (i.e. it draws combinations without replacement), but for our purposes that’s actually more efficient.

combn(c(1, 2, 3), m = 2, simplify = FALSE)
## [[1]]
## [1] 1 2
## 
## [[2]]
## [1] 1 3
## 
## [[3]]
## [1] 2 3

And you can see how we could easily turn this into the rest of the function:

combn_equal <- function(vec) {
  # compare every pair of values; the vector is "equal" only if every pair matches
  vec %>% 
    combn(m = 2, simplify = FALSE) %>% 
    map_lgl(function(pair) pair[1] == pair[2]) %>% 
    all()
}
combn_equal(c(1, 2, 3))
## [1] FALSE

Once we throw in some NA values, there would obviously need to be some control flow depending on how you interpret NA == val, which we’ll deal with later.
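
For example, without any NA handling the missing value just propagates straight through (NA == 1 is NA, and all() keeps it that way):

combn_equal(c(1, 1, NA))
## [1] NA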

This wasn’t particularly hard to come up with, but I was (and am) surprised that this functionality isn’t included in the extremely comprehensive purrr or dplyr packages.

The (base) Solution

Let’s get rid of the combn that we’re hanging onto and just do the work ourselves.

We can see that all we need to do is walk over an n × n grid of index pairs ourselves, while avoiding the repeats, just as combn so generously does for us.

make_pairs <- function(vec) {
  results <- list()
  index <- 1
  for (i in seq(1, length(vec))) {
    for (j in seq(i, length(vec))) {
      results[[index]] <- c(vec[i], vec[j])
      index <- index + 1
    }
  }
  results
}

But obviously this still includes the self-pairs (where i == j) that we know would always compare equal, so we can offset the inner index by one to skip over those.

make_pairs <- function(vec) {
  results <- list()
  index <- 1
  for (i in seq(1, length(vec) - 1)) {
    for (j in seq(i + 1, length(vec))) {
      results[[index]] <- c(vec[i], vec[j])
      index <- index + 1
    }
  }
  results
}
make_pairs(c("a", "b", "c"))
## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "a" "c"
## 
## [[3]]
## [1] "b" "c"

So now that we have our own pair generation set up, we can add short-circuiting comparisons to save some time: once one comparison fails, we know we can return FALSE for the whole set.

We also need to add some checks for NA, and for length-one (or empty) inputs. In this instance (in my opinion), partially missing values should return FALSE, although in tidyverse style I’m sure someone could add an argument controlling how NAs are treated.

If this looks sloppy (it is), see the bonus section below for a more elegant, functional solution.

is_equal <- function(vec) {
  # empty, length-one, and all-NA inputs are trivially "equal";
  # partially missing values count as unequal
  if (length(vec) <= 1 || all(is.na(vec))) return(TRUE)
  else if (any(is.na(vec))) return(FALSE)
  else {
    for (i in seq(1, length(vec) - 1)) {
      for (j in seq(i + 1, length(vec))) {
        # short-circuit as soon as any pair differs
        if (vec[i] != vec[j]) return(FALSE)
      }
    }
  }
  return(TRUE)
}
is_equal(c("a", "b", NA))
## [1] FALSE
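
As an aside, here is a rough sketch of what that NA-handling argument could look like. The is_equal_na name and the na.rm behavior are purely my own invention here, not anything from purrr or dplyr:

# hypothetical wrapper: optionally drop NAs before handing the vector to is_equal()
is_equal_na <- function(vec, na.rm = FALSE) {
  if (na.rm) vec <- vec[!is.na(vec)]
  is_equal(vec)
}
is_equal_na(c("a", "a", NA), na.rm = TRUE)
## [1] TRUE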

The Application

So after all of this, we can finally use our is_equal function to figure out what was causing the problem with our joined table.

bigtable %>% 
  group_by(id) %>% 
  filter( n() > 1 ) %>% 
  group_modify(function(dat, key) {
    dat %>% 
      select(!where( ~is_equal(.x) ))
  }) %>% 
  ungroup()
## # A tibble: 5 × 3
##      id val3   val1
##   <dbl> <chr> <dbl>
## 1     1 a        NA
## 2     1 b        NA
## 3     3 f        14
## 4     3 g        15
## 5     3 d        15

Now we can see where the issues were. Even better, thanks to the grouping we can see that more than one column was problematic for id == 3. This could be made even more explicit by using group_map, so that each group keeps only the differing columns associated with it.
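
For what it’s worth, that group_map version might look something like this, returning a list with one tibble per group so each element keeps only its own offending columns:

bigtable %>% 
  group_by(id) %>% 
  filter( n() > 1 ) %>% 
  group_map(function(dat, key) {
    dat %>% 
      select(!where( ~is_equal(.x) ))
  })
## [[1]]
## # A tibble: 2 × 1
##   val3 
##   <chr>
## 1 a    
## 2 b    
## 
## [[2]]
## # A tibble: 3 × 2
##    val1 val3 
##   <dbl> <chr>
## 1    14 f    
## 2    15 g    
## 3    15 d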

When we have datasets with a silly number of unknown columns, being able to quickly tell which columns are the differentiating factors within groups, beyond our presumed primary key, is extremely valuable, especially with very-not-tidy data. I hope this simple application demonstrates the utility of an is_equal function.

Bonus Functional Solution

I wanted to see if I could replicate this using a more functional style of programming, with something like accumulate or reduce. Obviously anything that works with a for loop could also be done with recursion, so we’ll just skip that part.

A fun way to do it is with Boolean logic. As far as I can tell, this only works if we negate the final result, since we’re using the or operator to carry forward whether any of the adjacent pairs were unequal.

We never build the full list of combinations, but that doesn’t matter: we only compare adjacent values, and by transitivity, if every adjacent pair is equal then every other combination must be equal too. So if we make it to the end and the accumulated result is still FALSE, we know the whole vector is equal.

We still need a few lines to deal with missing values, but overall it’s a lot cleaner and, I think, a lot more clever!

mapreduce_equal <- function(vec) {
  if (all(is.na(vec)) || length(vec) <= 1) TRUE
  else if (any(is.na(vec))) FALSE
  else {
    # OR together the adjacent-pair inequalities, then negate the result
    1:(length(vec) - 1) %>%
      map_lgl(function(i) vec[i] != vec[i + 1]) %>%
      reduce(`|`) %>%
      {!.}
  }
}
mapreduce_equal(c(1, 2, 2, 2))
## [1] FALSE
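
To make the intermediate steps concrete, here are the adjacent-pair comparisons for that same input; reduce(`|`) collapses them to TRUE (some adjacent pair differs), and negating gives the final FALSE:

vec <- c(1, 2, 2, 2)
map_lgl(1:(length(vec) - 1), function(i) vec[i] != vec[i + 1])
## [1]  TRUE FALSE FALSE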

And here it is in action with the original application:

bigtable %>% 
  group_by(id) %>% 
  filter( n() > 1 ) %>% 
  group_modify(function(dat, key) {
    dat %>% 
      select(!where( ~mapreduce_equal(.x) ))
  }) %>% 
  ungroup()
## # A tibble: 5 × 3
##      id val3   val1
##   <dbl> <chr> <dbl>
## 1     1 a        NA
## 2     1 b        NA
## 3     3 f        14
## 4     3 g        15
## 5     3 d        15

Performance Comparison

I couldn’t resist comparing things, so here is a microbenchmark setup to compare the nested for loop solution to the purrr functional solution. I also threw in the complete solution using combn, listed below:

combn_equal <- function(vec) {
  if (all(is.na(vec)) || length(vec) <= 1) TRUE
  else if (any(is.na(vec))) FALSE
  else {
    vec %>% 
      combn(m = 2, simplify = FALSE) %>% 
      map_lgl(function(pair) pair[1] == pair[2]) %>% 
      all()
  }
}

And now we can compare our for loop, mapreduce, and combn implementations.

benchmark <- function(x) {
  microbenchmark::microbenchmark(
    forloop = is_equal(x),
    comb = combn_equal(x),
    mapreduce = mapreduce_equal(x),
    times = 999,
    check = 'equal',
    unit = "microseconds"
  ) %>% 
    print
}
sample(c(1:5), 100, replace = TRUE) %>% 
  benchmark()
## Unit: microseconds
##       expr      min        lq        mean   median        uq       max neval
##    forloop    5.207    5.9040    8.061076    6.929    9.3685    26.240   999
##       comb 2918.626 3000.5235 3226.369743 3048.473 3110.8340 45003.486   999
##  mapreduce  213.446  224.2495  262.651294  235.258  262.0720  1930.157   999

In this test (no missing values) the nested for loop (is_equal) comes out on top, as expected. Thanks to the short-circuiting return statements we rarely have to work through the entire vector, while both of the other functions consider the input in its entirety.

However, the surprise comes from mapreduce_equal actually beating out combn_equal by a large margin. I’m no R performance expert, but I’d imagine there’s a significant amount of overhead in combn returning a list of vectors versus keeping everything in vector format in our mapreduce implementation.

Once we add missing values into the mix, though, the performance becomes mostly the same, thanks to the quick checks we can run before actually touching the vector at all:

sample(c(1:10, NA), 100, replace = TRUE) %>% 
  benchmark()
## Unit: microseconds
##       expr   min    lq      mean median    uq    max neval
##    forloop 0.656 0.697 0.7736647  0.697 0.697 10.086   999
##       comb 0.615 0.656 0.7657848  0.656 0.697 13.489   999
##  mapreduce 0.615 0.656 0.7870440  0.656 0.697  7.462   999

Here, since an NA is virtually guaranteed to be present in our test set, the results are more or less equivalent. Just for fun, we can test a much shorter vector to give the non-short-circuiting functions a chance:

sample(c(1:5), 5, replace = TRUE) %>% 
  benchmark()
## Unit: microseconds
##       expr    min     lq      mean median     uq      max neval
##    forloop  4.674  5.289  5.731135  5.535  5.904   13.981   999
##       comb 22.345 23.739 25.381175 24.436 26.650   60.844   999
##  mapreduce 57.892 59.860 69.311464 61.049 66.338 4661.905   999

Here I can only assume that the overhead of the list produced by combn is much smaller, and that perhaps using the base R all() let it beat out the reduce call in mapreduce_equal. Either way, I think mapreduce_equal is still my personal favorite of the bunch!