Intro

In this guide, I’ll share my learnings on how to leverage diffdf and flextable to spot differences between two data frames. I provide a real world use case that builds on the concepts presented at the end.

From Scratch

I’ll begin by creating two data frames that have slight differences. Assume that the cell values between the two data frames can differ while the number of observations (rows) and variables (columns) are similar.

We can create a flextable for each data frame quite easily.

id

sex

smk

1

m

former

2

f

never

3

f

current

4

m

current

5

f

never

id

sex

smk

1

f

former

2

f

never

3

f

current

4

m

former

5

f

never

Comparing with diffdf

The diffdf package can be used to find and return differences between two data frames.

Below I create an object named deltas that will store the results of the diffdf() call on the two data frames. I’ve specified that df1 is the base (or “original”) and that df2 is the comparator.

The cell differences between the two data frames are stored in deltas as tibbles and are indexed by variable.

You may access them like so:

# A tibble: 1 x 4
  VARIABLE ..ROWNUMBER.. BASE  COMPARE
* <chr>            <int> <chr> <chr>  
1 sex                  1 m     f      
# A tibble: 1 x 4
  VARIABLE ..ROWNUMBER.. BASE    COMPARE
* <chr>            <int> <chr>   <chr>  
1 smk                  4 current former 

Within each tibble, the value of VARIABLE gives a natural column index, while the value of ..ROWNUMBER.. gives a row index. Together, these form coordinates that provide us a road map of precisely which cells in df2 differ from those in df1.

Now, we need to extract, clean and store each variables results so that this information can be readily used for styling. In this example, I use a loop to account for many variables at once. My end result is stored in delta_data.

    variable rownumber    base compare
sex      sex         1       m       f
smk      smk         4 current  former

Styling Differences

We can now use delta_data to style the flextable for df2. Below are a few examples of how one might do this.

My general approach to styling is to iterate over the rows of coordinates in delta_data.

Background colors

Change the background color for cells that differ.

id

sex

smk

1

f

former

2

f

never

3

f

current

4

m

former

5

f

never

Here is df1 again for comparisons.

id

sex

smk

1

m

former

2

f

never

3

f

current

4

m

current

5

f

never

Font Color

You can change the font color in a very similar way.

id

sex

smk

1

f

former

2

f

never

3

f

current

4

m

former

5

f

never

Cell Borders

Adding a border around cells that differ. Note: This requires the officer package. I’ve found officer and flextable actually complement each other really well!

id

sex

smk

1

f

former

2

f

never

3

f

current

4

m

former

5

f

never

Combining Styles

It’s straight forward to combine or stack styles using the pipe operator.

id

sex

smk

1

f

former

2

f

never

3

f

current

4

m

former

5

f

never

Extensions

The examples presented for in this tutorial were kept simple for demonstration purposes. Going beyond, the combination of diffdf and flextable together seem to have the potential to form a powerful work flow tool.

diffdf

In addition to finding individual cell differences, you can also use diffdf to return other useful information such as the presence or absence of columns and rows. More examples can be found in the package vignette.

flextable

In flextable, you can use a variety of formatting techniques to draw attention to differences. I’ve attempted to demonstrate a few of them, and you can view other appraoches here. I found it very straight forward to test, adapt and combine different approaches due to the intuitive design of flextable.

Use Case

As part of my current role, I receieve pre-computed statistical tables that are saved as data sets (e.g. .sas7bdat, .csv). I import these into R as data frames for further processing for a variety of projects.

Given that this work flow is well established, I looked at whether comparisons between statistical tables were feasible with the approach outlined in this tutorial. One of the primary aims would be to detect changes between versions of pre-computed statistical tables.

In the first panel, I have the initial version of a table.

In the second panel, I have an updated version of a table.

I’ve setup the following schema to highlight differences:

  • Cell differences have a sky blue background with a border

  • New columns have a pink background with a border

  • New rows have text highlighted in green

  • I have a hardcoded a rule to highlight the prescence of -9999 in the P column in yellow.

    • The background for this cell is blue, because this cell value is also different from the original table.

If you are interested in the code for this example, you can view some of my progress here.

I’d like to thank my colleague Robert (R.A.) for the inspiration to scale my learnings into something potentially useful across our team.

Matt

Use Case App

2020-11-10: Since publishing this original tutorial, I’ve made some progress in porting the use case into an app!

In the panel below, I have an interface that let’s the user specify which kinds of differences should be highlighted and by which color. The eventual goal is give the user the ability to specify their own tables through an fileInput widget.

Click the image to enlarge.

Once the tables are reviewed, the user can download the updated table as a word document (.docx). They can also specify whether the “styling” should be retained. Below is a screen shot of the downloaded document. This is partly enabled by the ever awesome officer package I mentioned earlier!

Click the image to enlarge.

Stay tuned for future developments. I hope to be able to post the code related to this progress very soon! In the meantime, you can still view the guts here.

Thanks for your time!

Matt