In this guide, I’ll share my learnings on how to leverage diffdf and flextable to spot differences between two data frames. I provide a real world use case that builds on the concepts presented at the end.
I’ll begin by creating two data frames that have slight differences. Assume that the cell values between the two data frames can differ while the number of observations (rows) and variables (columns) are similar.
library(flextable)
library(diffdf)
library(dplyr)
df1 <- data.frame(id = 1:5,
sex = c('m','f','f','m','f'),
smk = c('former','never','current','current','never'))
df2 <- data.frame(id = 1:5,
sex = c('f','f','f','m','f'),
smk = c('former','never','current','former','never'))We can create a flextable for each data frame quite easily.
id | sex | smk |
1 | m | former |
2 | f | never |
3 | f | current |
4 | m | current |
5 | f | never |
id | sex | smk |
1 | f | former |
2 | f | never |
3 | f | current |
4 | m | former |
5 | f | never |
The diffdf package can be used to find and return differences between two data frames.
Below I create an object named deltas that will store the results of the diffdf() call on the two data frames. I’ve specified that df1 is the base (or “original”) and that df2 is the comparator.
The cell differences between the two data frames are stored in deltas as tibbles and are indexed by variable.
You may access them like so:
# A tibble: 1 x 4
VARIABLE ..ROWNUMBER.. BASE COMPARE
* <chr> <int> <chr> <chr>
1 sex 1 m f
# A tibble: 1 x 4
VARIABLE ..ROWNUMBER.. BASE COMPARE
* <chr> <int> <chr> <chr>
1 smk 4 current former
Within each tibble, the value of VARIABLE gives a natural column index, while the value of ..ROWNUMBER.. gives a row index. Together, these form coordinates that provide us a road map of precisely which cells in df2 differ from those in df1.
Now, we need to extract, clean and store each variables results so that this information can be readily used for styling. In this example, I use a loop to account for many variables at once. My end result is stored in delta_data.
#get variable names
var_list <- colnames(df1)
#create a place holder
delta_list <- list()
#iterate over variables to find if cell differences was detected - get "coordinates"
for(i in var_list) {
dat <- data.frame(eval(parse(text=paste0("deltas$VarDiff_",i)))) %>%
janitor::clean_names()
delta_list[[i]] <- dat
}
#put all found cell difference coordinates into a single data frame
delta_data <- do.call(rbind, delta_list)
#view results
delta_data variable rownumber base compare
sex sex 1 m f
smk smk 4 current former
We can now use delta_data to style the flextable for df2. Below are a few examples of how one might do this.
My general approach to styling is to iterate over the rows of coordinates in delta_data.
Change the background color for cells that differ.
df2 %>%
flextable() %>%
#Style rows individually from delta_data
(function(x){
for (i in 1:nrow(delta_data)) {
#bg() is the flextable styling function
x <- bg(x,
i=delta_data$rownumber[[i]],
j=delta_data$variable[[i]],
bg="yellow",
part="body")}
x
})()id | sex | smk |
1 | f | former |
2 | f | never |
3 | f | current |
4 | m | former |
5 | f | never |
Here is df1 again for comparisons.
id | sex | smk |
1 | m | former |
2 | f | never |
3 | f | current |
4 | m | current |
5 | f | never |
You can change the font color in a very similar way.
df2 %>%
flextable() %>%
#Style rows individually from delta_data
(function(x){
for (i in 1:nrow(delta_data)) {
#color() is the flextable styling function
x <- color(x,
i=delta_data$rownumber[[i]],
j=delta_data$variable[[i]],
color="red",
part="body")
}
x
})()id | sex | smk |
1 | f | former |
2 | f | never |
3 | f | current |
4 | m | former |
5 | f | never |
Adding a border around cells that differ. Note: This requires the officer package. I’ve found officer and flextable actually complement each other really well!
library(officer)
df2 %>%
flextable() %>%
#Style rows individually from delta_data
(function(x){
for (i in 1:nrow(delta_data)) {
#border() is the flextable styling function
x <- border(x,
i=delta_data$rownumber[[i]],
j=delta_data$variable[[i]],
#fp_border() is from officer
border=fp_border(color = "orange", style="dashed", width=1.5),
part="body")
}
x
})()id | sex | smk |
1 | f | former |
2 | f | never |
3 | f | current |
4 | m | former |
5 | f | never |
It’s straight forward to combine or stack styles using the pipe operator.
df2 %>%
flextable() %>%
#Style rows individually from delta_data
(function(x){
for (i in 1:nrow(delta_data)) {
#piping together two flextable styling functions
x <- border(x,
i=delta_data$rownumber[[i]],
j=delta_data$variable[[i]],
border=fp_border(color = "black", style="dashed", width=1.5),
part="body") %>%
bg(x,
i=delta_data$rownumber[[i]],
j=delta_data$variable[[i]],
bg="skyblue",
part="body")
}
x
})()id | sex | smk |
1 | f | former |
2 | f | never |
3 | f | current |
4 | m | former |
5 | f | never |
The examples presented for in this tutorial were kept simple for demonstration purposes. Going beyond, the combination of diffdf and flextable together seem to have the potential to form a powerful work flow tool.
In addition to finding individual cell differences, you can also use diffdf to return other useful information such as the presence or absence of columns and rows. More examples can be found in the package vignette.
In flextable, you can use a variety of formatting techniques to draw attention to differences. I’ve attempted to demonstrate a few of them, and you can view other appraoches here. I found it very straight forward to test, adapt and combine different approaches due to the intuitive design of flextable.
As part of my current role, I receieve pre-computed statistical tables that are saved as data sets (e.g. .sas7bdat, .csv). I import these into R as data frames for further processing for a variety of projects.
Given that this work flow is well established, I looked at whether comparisons between statistical tables were feasible with the approach outlined in this tutorial. One of the primary aims would be to detect changes between versions of pre-computed statistical tables.
In the first panel, I have the initial version of a table.
In the second panel, I have an updated version of a table.
I’ve setup the following schema to highlight differences:
Cell differences have a sky blue background with a border
New columns have a pink background with a border
New rows have text highlighted in green
I have a hardcoded a rule to highlight the prescence of -9999 in the P column in yellow.
If you are interested in the code for this example, you can view some of my progress here.
I’d like to thank my colleague Robert (R.A.) for the inspiration to scale my learnings into something potentially useful across our team.
Matt
2020-11-10: Since publishing this original tutorial, I’ve made some progress in porting the use case into an app!
In the panel below, I have an interface that let’s the user specify which kinds of differences should be highlighted and by which color. The eventual goal is give the user the ability to specify their own tables through an fileInput widget.
Click the image to enlarge.
Once the tables are reviewed, the user can download the updated table as a word document (.docx). They can also specify whether the “styling” should be retained. Below is a screen shot of the downloaded document. This is partly enabled by the ever awesome officer package I mentioned earlier!
Click the image to enlarge.
Stay tuned for future developments. I hope to be able to post the code related to this progress very soon! In the meantime, you can still view the guts here.
Thanks for your time!
Matt