We’re going to mine this book’s data

The Flavor Bible is a kitchen reference for embellishing and developing recipes. Tons of chefs swear by it. And now that you’re aware of it, you might notice it hanging out in the bookshelves of your favorite restaurant.

Despite first appearances, The Flavor Bible is not a cookbook. More accurately, it’s a thesaurus for food pairings. It contains hundreds of pages of flavor matching charts telling you which flavors will go well with a given heading, based on the collective opinion of several hundred world-renowned chefs.

The flavor matching charts: flavors marked green, headings marked red

Chefs use TFB with a technique dubbed “flavor webbing” – sequentially adding flavors to a recipe as long as they pair with what’s already been added, until you get a set of flavors that inspires a recipe.

Chef Grant Achatz builds a flavor web that looks like the start of a baked beans recipe.
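As a rough illustration of the idea (not the book’s or any chef’s actual procedure), flavor webbing can be sketched as a set intersection: keep only the flavors that pair with every ingredient chosen so far. The pairings table and the candidates() helper below are made up for this example, and the sketch uses base R only.

```r
# Toy pairing table (hypothetical data, not taken from the book)
pairings <- data.frame(
  main    = c("ham", "ham", "ham", "honey", "honey", "honey"),
  pairing = c("cloves", "mustard", "sweet potatoes",
              "almonds", "cloves", "sweet potatoes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flavors that pair with every chosen ingredient
candidates <- function(chosen, table = pairings) {
  # one character vector of pairings per chosen ingredient
  per_main <- lapply(chosen, function(m) table$pairing[table$main == m])
  # keep only the flavors common to all of them
  Reduce(intersect, per_main)
}

candidates("ham")
candidates(c("ham", "honey"))
```

Each ingredient added to the search narrows the candidate list, which is the behavior the flavor webs in this post are meant to make visible.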

In this blog post we’re going to visualize flavor webs in R using network visualization packages. Our flavor webs will keep track of the flavors we’re searching and tell us which flavors pair with everything else. We will start with a PDF of the book, extract its text, munge it into “tidy data” format, create a database, and visualize flavor webs built from the database using the networkD3 package. By the end of the blog post, you’ll know how to make interesting plots like this one, which I reused from my web app:

Flavor web with chicken and tomatoes as search terms. Cauliflower connects to both search terms.

If you want to skip straight to the app, here’s a link to it. Here’s a link to the GitHub page for all the data and code.

Finding a digital copy

I found a few digital versions of the Flavor Bible online (I won’t link directly but they all came from LibGen) and assessed their formatting. Some formats were better for parsing than others. I’ve marked the features of interest in different colors:

In purple are sections of text that we want to ignore
In green are flavors that pair well with the most recent heading
In red are headings, which tell us what the pairings that follow belong to

Compare two electronic versions of TFB below. Note how the version on the right differs from the version on the left. The right-hand version is single-column and text is at perfect right angles to the page. That makes it much easier to parse text.

The text on the right is much easier to parse.

So I chose the example on the right as the source text and then figured out which pages contain the flavor matching charts. I saw that the flavor matching charts began on page 42, line 16 and continued until the end of page 811. By extracting only the flavor matching charts we will have an easier time coming up with rules for munging the data into tidy format.

Automatically identifying headings and flavors

We will be using the tm library for extracting the content from the book and tidyverse for munging and transforming the data. Eventually we will end up with a two-column dataframe with headings in one column, flavors in the other, and each row corresponding to a flavor pairing between a heading and a flavor.

library(tm) # for extracting text from the pdf
library(tidyverse) #for piping command and munging into tidy format

I prefer working with dataframes over lists, so I transformed the VCorpus object we get from the tm::Corpus method into a dataframe with a “page” column indicating the page number and a “text” column indicating the text found for a given line on a given page.

#preserve layout for distinguishing headings
read <- readPDF(control = list(text = "-layout")) 
#extract the text while preserving layout
document <- Corpus(URISource("./Flavor-Bible-epub.pdf"),
                   readerControl = list(reader = read))
#create new list elements at line breaks ("\r?\n" matches Windows and Unix endings)
text <- content(document[[1]]) %>%
        strsplit("\r?\n") 
#keep only the flavor matching charts
#transform lists to single array
lines <- text[42:811] %>% unlist() 
#index the line breaks by page number
page_num <- sapply(42:811, function(x) rep(x, length(text[[x]]))) %>% unlist() 
#convert to a tibble & exclude first 15 lines (data_frame() is deprecated)
extraction <- tibble(page = page_num, text = lines)[-c(1:15),]
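To see what the extraction step produces without needing the PDF, here is a minimal base-R sketch on two made-up “pages” of text. The page contents and the toy_extraction name are assumptions for illustration only; the pattern "\r?\n" is used so the toy input splits regardless of line-ending style.

```r
# Two made-up "pages" of text, standing in for content(document[[1]])
pages <- c("ACHIOTE SEEDS\nbeef\nchicken", "ALLSPICE\napples")

# Split each page into its lines; the result is a list with one element per page
text <- strsplit(pages, "\r?\n")

# Flatten to one entry per line and tag each line with the page it came from
lines    <- unlist(text)
page_num <- rep(seq_along(text), times = lengths(text))

toy_extraction <- data.frame(page = page_num, text = lines,
                             stringsAsFactors = FALSE)
toy_extraction
```

The result has one row per line of text, which is the shape the classification rules in the next section operate on.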

Next we’ll develop some heuristics for turning the dataframe into a tidy, two-column database of headings and flavors. In other words, we’ll add columns to our extraction dataframe that help us classify each row in the “text” column as a heading, a flavor, or a line that we should ignore. The book’s editor used a consistent set of rules for the formatting, so picking out headings and flavors is a matter of reverse engineering the formatting that the editor used. Easy to say, not always so easy to do.

Fortunately, in this case, the heuristics were simple enough. I won’t bore you with the process I went through to find these heuristics, since it was essentially Guess and Check. You could also call it Feature Engineering.

Let’s eschew the ado and just show you the rules:

Headings are lines where

the first three letters of the string are uppercase AND
the whole page has 2 or more indents AND
there is no leading dash AND
there is no indent.

Flavors are lines where

Either there are < 2 indents on the page AND
- no leading dashes
- no pronouns
Or there are >= 2 indents on the page AND
- an indent
- no dashes
- no pronouns

And we ignore anything that isn’t a heading or flavor.

I can only talk about how I determined these rules in the broadest of terms. Feature engineering is more of a creative, artistic, iterative process than a planned, deterministic, scientific one. I repeated the guess-and-check process many times until I was retrieving most of the flavors. Importantly, I was looking only at the flavor matching charts.

Here are those rules implemented in R code:

#exclude lines with pronouns - tend not to be flavors
pronouns <- c(" I | YOU | WE | THEY | THEIR | MY | OUR ")

extraction <- extraction %>% 
          mutate(#detect uppercase, indenting, leading dashes, pronouns
                 caps = str_detect(text, pattern = "^[[:upper:]]{3,}"),
                 indent = str_detect(text, pattern = "^[[:space:]]{1,}"),
                 dashes = str_trim(text,"left") %>%
                           str_detect(., pattern = "^[-–—]"),
                 sentence = str_detect(text %>% toupper(), pronouns)) %>% 
          group_by(page) %>% 
          #label pages with fewer than two indents total
          mutate(few_indents = sum(indent) < 2) %>% 
          ungroup %>% 
          mutate(#then write the new rules for each class
                 heading = caps & !few_indents & !dashes & !indent,
                 flavor = ifelse(few_indents, 
                                 !dashes & !sentence, 
                                 indent & !sentence & !dashes),
                 ignore = !heading & !flavor)
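To get a feel for what each feature column detects, here is a small base-R sketch that applies similar patterns to a handful of made-up lines. The sample lines are invented, and the sketch checks only for an ASCII hyphen and a shortened pronoun list, so it is a simplification of the pipeline above rather than a copy of it.

```r
# Made-up sample lines in the style of a flavor chart
sample_lines <- c("ACHIOTE SEEDS",
                  "  beef",
                  "  chicken",
                  "- I love this with pork",
                  "Taste: earthy")

# starts with three or more capital letters
caps    <- grepl("^[[:upper:]]{3,}", sample_lines)
# starts with whitespace
indent  <- grepl("^[[:space:]]+", sample_lines)
# starts with a hyphen once leading whitespace is trimmed
dashes  <- grepl("^-", trimws(sample_lines, which = "left"))
# contains a standalone pronoun (shortened list for the example)
pronoun <- grepl(" I | YOU | WE ", toupper(sample_lines))

data.frame(text = sample_lines, caps, indent, dashes, pronoun)
```

Only the first line is flagged as all caps, the two indented lines are flagged as indents, and the quoted sentence trips both the dash and the pronoun checks.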

Admittedly, the rules are not perfect. I encourage you to find the data and try cleaning it yourself. See if you can improve on this pipeline!

Now that we’ve calculated the relevant features, let’s use them to build our database.

Munging the flavor charts into a tidy two-column database

Now we are going to make a tidy database for our flavor web visualizations. We want two columns of data in long form: the first column is the heading and the second column is a flavor that pairs with that heading. To get there, we’ll transform our 10-column dataframe with the following steps:

remove all the rows where the “ignore” column is TRUE.
make a vector with all the headings, in order of appearance
repeat each heading in a vector until the index of the next heading
make a vector of all the flavors
combine the headings vector and the flavors vector in a dataframe

#remove all the rows where "ignore" is TRUE.
flavor_matches <- extraction[!extraction$ignore,]
#make a vector of all the headings
headings_vec <- flavor_matches$text[flavor_matches$heading]
#repeat each heading until the index of the next heading
headings <- headings_vec[cumsum(flavor_matches$heading)]
#make a vector of all the text (minus ignored rows) and trim whitespace
flavor_matches$text <- str_trim(flavor_matches$text, "both")
#combine the headings vector and the text vector, filter headings from flavors
tidy_flavors <- data.frame(main = headings, pairing = flavor_matches$text) %>% 
                filter(pairing != headings)
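The cumsum() indexing is the least obvious step, so here is a minimal base-R sketch of it on made-up rows. The toy values and the toy_tidy name are assumptions for illustration.

```r
# Made-up rows that have already been classified
text       <- c("ACHIOTE SEEDS", "beef", "chicken", "ALLSPICE", "apples")
is_heading <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# cumsum() counts how many headings have been seen so far: 1 1 1 2 2
heading_index <- cumsum(is_heading)

# Indexing the headings by that counter carries each heading forward
# until the next one appears
headings <- text[is_heading][heading_index]

toy_tidy <- data.frame(main = headings, pairing = text,
                       stringsAsFactors = FALSE)

# Drop the rows where a heading is paired with itself
toy_tidy <- toy_tidy[toy_tidy$pairing != toy_tidy$main, ]
toy_tidy
```

Note that the trick relies on the first row being a heading; otherwise the counter starts at 0 and the vectors no longer line up.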

Then we examine the first few rows and write to csv if we like what we see.

head(tidy_flavors)

##            main                        pairing
## 1 ACHIOTE SEEDS                           beef
## 2 ACHIOTE SEEDS                        chicken
## 3 ACHIOTE SEEDS                         chiles
## 4 ACHIOTE SEEDS     citrus (e.g., sour orange)
## 5 ACHIOTE SEEDS                           fish
## 6 ACHIOTE SEEDS game birds (e.g., duck, quail)

#write.csv(x = tidy_flavors, file = "flavor_bible_full.csv", row.names = FALSE)

Well done! We’ve got a tidy two-column database of main flavors and their suggested pairings. This will be the back end for all our flavor webs.

Building flavor webs in NetworkD3

We will use the network visualization package networkD3 and a convenience function from igraph to display a web of the flavors the user inputs.

library(igraph)
library(networkD3)

We start from the tidy, two-column database we created earlier. I convert it to lowercase just to make searching and filtering easier.

tidy_flavors <- apply(tidy_flavors, 2, tolower) %>% as.data.frame

In the context of networks, you can think of the data as a list of connections where the “main” column is the source node, the “pairing” column is the target node, and each row is an edge that connects the two nodes. Although this data format makes sense intuitively, the networkD3 package needs the data in a slightly different format. We use the igraph::graph_from_data_frame function to get us part of the way there, and then we have to do a bit more munging to get the rest of the way before we can visualize flavor webs.
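To make the target format concrete, here is a hand-rolled base-R sketch that turns a toy edge list into a nodes table and a links table whose entries are zero-based row positions in the nodes table. It only approximates the shape of what the conversion produces; it is not how igraph or networkD3 implement it, and the edge data is made up.

```r
# Toy edge list (made-up pairings)
edges <- data.frame(main    = c("ham", "ham", "honey"),
                    pairing = c("cloves", "honey", "cloves"),
                    stringsAsFactors = FALSE)

# Every distinct flavor becomes one node
nodes <- data.frame(name = unique(c(edges$main, edges$pairing)),
                    stringsAsFactors = FALSE)

# Links refer to nodes by zero-based row position, hence the "- 1"
links <- data.frame(source = match(edges$main, nodes$name) - 1,
                    target = match(edges$pairing, nodes$name) - 1)

nodes
links
```

Each flavor appears once in nodes no matter how many edges touch it, and the links table stores only integer positions, which is the layout the force-directed plot consumes.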

So our objective is to write a function that takes a list of ingredients as input and gives a networkD3 graph object as output.

make_web <- function(flavors){
        #visualize only listed flavors
        flavor_network <- tidy_flavors[tidy_flavors$main %in% flavors,]
        #use function from igraph library to turn data frame into graph object
        flavor_net <- igraph::graph_from_data_frame(flavor_network, directed = FALSE)
        #de-duplicate nodes 
        flavor_network <- flavor_network[!duplicated(flavor_network$pairing),]
        flavor_network <- flavor_network[!flavor_network$pairing %in% flavors,]
        flavor_network <- rbind(data.frame(main = sort(flavors),
                                           pairing = sort(flavors)),
                                flavor_network)
        #assign flavors to their respective group
        groups <- setNames(as.character(flavor_network$pairing), 
                           flavor_network$main)
        #convert igraph object to networkD3 object
        g <- networkD3::igraph_to_networkD3(flavor_net, group = names(groups))
        
        return(g)
}

Now we have a function for turning our tidy_flavors dataframe into a networkD3 graph object. Let’s inspect the structure of its output before moving on to the final step.

flavor_web <- make_web(c("ham", "honey"))
str(flavor_web)

## List of 2
##  $ links:'data.frame':   152 obs. of  2 variables:
##   ..$ source: num [1:152] 0 1 1 0 1 0 0 1 1 0 ...
##   ..$ target: num [1:152] 6 63 64 7 65 8 9 66 67 10 ...
##  $ nodes:'data.frame':   140 obs. of  2 variables:
##   ..$ name : Factor w/ 140 levels "allspice","almonds",..: 50 56 124 136 134 127 1 4 6 7 ...
##   ..$ group: Factor w/ 2 levels "ham","honey": 1 2 1 1 1 1 1 1 1 1 ...

Pay attention to the column names in each of the two dataframes contained within the graph object because we refer to them in the next step.

Once we have a networkD3 graph object, we can display it with the forceNetwork function. There are a whole bunch of options to set. The basic ones are Links and Nodes, which take the corresponding list elements of our graph object. Source, Target, Group, and NodeID then take the names of the appropriate columns in those dataframes. Finally there are the aesthetic settings, which I won’t belabor.

Let’s write the function for turning the networkD3 graph object into a force network.

display_web <- function(g){
        
        forceNetwork(
                Links = g$links, Nodes = g$nodes,
                Source = "source", Target = "target",
                Group = "group", NodeID = "name",
                opacity = 0.9, fontSize = 16, zoom = TRUE,
                colourScale = JS("d3.scaleOrdinal(d3.schemeCategory10);"),
                opacityNoHover = 0, legend = TRUE, linkDistance = 70, charge = -30,
                linkColour = "#CCC", fontFamily = "Fantasy")
}

Now we can take a list of ingredients and generate something resembling a flavor web!

display_web(flavor_web)

What I like about networkD3 compared to other network visualization packages is that it can make these interactive graphs. The data is so much more alive and interesting this way, don’t you think? Notice that the two main nodes – “ham” and “honey” – are what we entered into our make_web function earlier. There are three types of nodes other than the main nodes: the nodes connected to ham, the nodes connected to honey, and the nodes connected to both. Adding “sweet potatoes” – which connects to both “ham” and “honey” – to our list of main nodes would extend our web of flavors and get us closer to a recipe. We’re flavor webbing now!
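The nodes connected to both main flavors can also be picked out programmatically. The base-R sketch below works on a small, made-up graph object with the same links and nodes layout as the str() output above; the object g, its contents other than the ham, honey, and sweet potatoes relationship described in this post, and the variable names are assumptions for illustration.

```r
# Toy graph object with the same layout as the output of make_web()
g <- list(
  nodes = data.frame(name = c("ham", "honey", "sweet potatoes",
                              "mustard", "almonds"),
                     stringsAsFactors = FALSE),
  links = data.frame(source = c(0, 0, 1, 1),
                     target = c(2, 3, 2, 4))
)

# Translate the zero-based indices back into flavor names
from <- g$nodes$name[g$links$source + 1]
to   <- g$nodes$name[g$links$target + 1]

# Count how many distinct main flavors each pairing is linked to
n_mains <- sapply(split(from, to), function(m) length(unique(m)))

# Pairings linked to more than one main flavor bridge the web
shared <- names(n_mains)[n_mains > 1]
shared
```

Flavors that show up in shared are the natural next search terms when extending a web.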

But to really begin flavor webbing, we’ll want to make it reactive – and that’s where Shiny apps come in handy. Stay tuned for the next post where I dive into the mechanics of Shiny apps and show you how to make cool, reactive network visualizations. Or to cut straight to the app you can follow this link.

Conclusions and future directions

It’s been a fun project and I’m proud to say I genuinely prefer this format to the physical copy of the book. My wife and I recently cooked up a vanilla fish recipe that we would’ve never tried had we not gotten the suggestion from the app. I encourage you to give it a shot the next time you are cooking at home.

Late in the project I discovered tools for extracting text and font information directly from epub files, without having to convert them to PDF. This is possibly a faster way to capture which text is bold, and could be a good exercise in text mining.

Well, that’s it. If you have any thoughts or reactions to this post, feel free to hit me up at my GitHub page for the project.

Thanks for reading!

Flavor Bible Visualizations

Alexander Reeves

May 10, 2018
