Introduction

For centuries, people have experimented in the kitchen and behind the bar to prepare the best possible versions of dishes and drinks. The result is hundreds of thousands of combinations, recipes and flavor variations. In trying to produce delicious combinations of ingredients, have we found a method in this madness? Perhaps an algorithm? Are we able to judge the type and category of a product by the ingredients alone?

We will try to answer these questions in the following study. Using data from https://www.kaggle.com/datasets/ai-first/cocktail-ingredients, we are going to examine the recipes of the most famous cocktails. We will check whether text mining methods can help us identify cocktail categories, and we will also try to distinguish non-alcoholic cocktails from those with alcohol.

Data description

The data set consists of 546 observations and 41 variables. For the purposes of our analysis we are going to use only 3 of them.

## spc_tbl_ [546 × 41] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1           : num [1:546] 0 1 2 3 4 5 6 7 8 9 ...
##  $ strDrink       : chr [1:546] "'57 Chevy with a White License Plate" "1-900-FUK-MEUP" "110 in the shade" "151 Florida Bushwacker" ...
##  $ dateModified   : POSIXct[1:546], format: "2016-07-18 22:49:04" "2016-07-18 22:27:04" ...
##  $ idDrink        : num [1:546] 14029 15395 15423 14588 15346 ...
##  $ strAlcoholic   : chr [1:546] "Alcoholic" "Alcoholic" "Alcoholic" "Alcoholic" ...
##  $ strCategory    : chr [1:546] "Cocktail" "Shot" "Beer" "Milk / Float / Shake" ...
##  $ strDrinkThumb  : chr [1:546] "http://www.thecocktaildb.com/images/media/drink/qyyvtu1468878544.jpg" "http://www.thecocktaildb.com/images/media/drink/uxywyw1468877224.jpg" "http://www.thecocktaildb.com/images/media/drink/xxyywq1454511117.jpg" "http://www.thecocktaildb.com/images/media/drink/rvwrvv1468877323.jpg" ...
##  $ strGlass       : chr [1:546] "Highball glass" "Old-fashioned glass" "Beer Glass" "Beer mug" ...
##  $ strIBA         : chr [1:546] NA NA NA NA ...
##  $ strIngredient1 : chr [1:546] "Creme de Cacao" "Absolut Kurant" "Lager" "Malibu rum" ...
##  $ strIngredient10: chr [1:546] NA NA NA NA ...
##  $ strIngredient11: chr [1:546] NA NA NA NA ...
##  $ strIngredient12: chr [1:546] NA NA NA NA ...
##  $ strIngredient13: logi [1:546] NA NA NA NA NA NA ...
##  $ strIngredient14: logi [1:546] NA NA NA NA NA NA ...
##  $ strIngredient15: logi [1:546] NA NA NA NA NA NA ...
##  $ strIngredient2 : chr [1:546] "Vodka" "Grand Marnier" "Tequila" "Light rum" ...
##  $ strIngredient3 : chr [1:546] NA "Chambord raspberry liqueur" NA "151 proof rum" ...
##  $ strIngredient4 : chr [1:546] NA "Midori melon liqueur" NA "Dark Creme de Cacao" ...
##  $ strIngredient5 : chr [1:546] NA "Malibu rum" NA "Cointreau" ...
##  $ strIngredient6 : chr [1:546] NA "Amaretto" NA "Milk" ...
##  $ strIngredient7 : chr [1:546] NA "Cranberry juice" NA "Coconut liqueur" ...
##  $ strIngredient8 : chr [1:546] NA "Pineapple juice" NA "Vanilla ice-cream" ...
##  $ strIngredient9 : chr [1:546] NA NA NA NA ...
##  $ strInstructions: chr [1:546] "1. Fill a rocks glass with ice 2.add white creme de cacao and vodka 3.stir" "Shake ingredients in a mixing tin filled with ice cubes. Strain into a rocks glass." "Drop shooter in glass. Fill with beer" "Combine all ingredients. Blend until smooth. Garnish with chocolate shavings if desired." ...
##  $ strMeasure1    : chr [1:546] "1 oz white" "1/2 oz" "16 oz" "1/2 oz" ...
##  $ strMeasure10   : chr [1:546] NA NA NA NA ...
##  $ strMeasure11   : chr [1:546] NA NA NA NA ...
##  $ strMeasure12   : chr [1:546] NA NA NA NA ...
##  $ strMeasure13   : logi [1:546] NA NA NA NA NA NA ...
##  $ strMeasure14   : logi [1:546] NA NA NA NA NA NA ...
##  $ strMeasure15   : logi [1:546] NA NA NA NA NA NA ...
##  $ strMeasure2    : chr [1:546] "1 oz" "1/4 oz" "1.5 oz" "1/2 oz" ...
##  $ strMeasure3    : chr [1:546] NA "1/4 oz" NA "1/2 oz Bacardi" ...
##  $ strMeasure4    : chr [1:546] NA "1/4 oz" NA "1 oz" ...
##  $ strMeasure5    : chr [1:546] NA "1/4 oz" NA "1 oz" ...
##  $ strMeasure6    : chr [1:546] NA "1/4 oz" NA "3 oz" ...
##  $ strMeasure7    : chr [1:546] NA "1/2 oz" NA "1 oz" ...
##  $ strMeasure8    : chr [1:546] NA "1/4 oz" NA "1 cup" ...
##  $ strMeasure9    : chr [1:546] NA NA NA NA ...
##  $ strVideo       : logi [1:546] NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   strDrink = col_character(),
##   ..   dateModified = col_datetime(format = ""),
##   ..   idDrink = col_double(),
##   ..   strAlcoholic = col_character(),
##   ..   strCategory = col_character(),
##   ..   strDrinkThumb = col_character(),
##   ..   strGlass = col_character(),
##   ..   strIBA = col_character(),
##   ..   strIngredient1 = col_character(),
##   ..   strIngredient10 = col_character(),
##   ..   strIngredient11 = col_character(),
##   ..   strIngredient12 = col_character(),
##   ..   strIngredient13 = col_logical(),
##   ..   strIngredient14 = col_logical(),
##   ..   strIngredient15 = col_logical(),
##   ..   strIngredient2 = col_character(),
##   ..   strIngredient3 = col_character(),
##   ..   strIngredient4 = col_character(),
##   ..   strIngredient5 = col_character(),
##   ..   strIngredient6 = col_character(),
##   ..   strIngredient7 = col_character(),
##   ..   strIngredient8 = col_character(),
##   ..   strIngredient9 = col_character(),
##   ..   strInstructions = col_character(),
##   ..   strMeasure1 = col_character(),
##   ..   strMeasure10 = col_character(),
##   ..   strMeasure11 = col_character(),
##   ..   strMeasure12 = col_character(),
##   ..   strMeasure13 = col_logical(),
##   ..   strMeasure14 = col_logical(),
##   ..   strMeasure15 = col_logical(),
##   ..   strMeasure2 = col_character(),
##   ..   strMeasure3 = col_character(),
##   ..   strMeasure4 = col_character(),
##   ..   strMeasure5 = col_character(),
##   ..   strMeasure6 = col_character(),
##   ..   strMeasure7 = col_character(),
##   ..   strMeasure8 = col_character(),
##   ..   strMeasure9 = col_character(),
##   ..   strVideo = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>

We are interested in ‘strAlcoholic’, ‘strCategory’ and ‘strInstructions’.

table(cocktails_ds$strAlcoholic)
## 
##        Alcoholic    Non alcoholic    Non Alcoholic Optional alcohol 
##              478               57                1                9
table(cocktails_ds$strCategory)
## 
##                 Beer             Cocktail                Cocoa 
##                   13                   64                    9 
##         Coffee / Tea     Homemade Liqueur Milk / Float / Shake 
##                   25                   12                   17 
##       Ordinary Drink        Other/Unknown  Punch / Party Drink 
##                  275                   34                   37 
##                 Shot    Soft Drink / Soda 
##                   49                   11

The feature ‘strAlcoholic’ indicates whether a cocktail contains alcohol, whether alcohol is optional, or whether the cocktail is non-alcoholic. The feature ‘strCategory’ describes the type of drink, such as “Beer” or “Coffee / Tea”. The feature ‘strInstructions’ is a string containing the preparation instructions for the cocktail. It will be the main source for our text mining.

Data preparation

library(tm)
library(SnowballC)
library(pander)
library(tidyverse)
library(caret)
library(insight)
library(wordcloud)
library(LiblineaR)
library(e1071)
library(textmineR)
library(cvms)
library(ape)
library(cluster)

table(cocktails_ds$strAlcoholic)
## 
##        Alcoholic    Non alcoholic    Non Alcoholic Optional alcohol 
##              478               57                1                9
#these 2 categories differ only in capitalization, so we merge them into one
cocktails_ds$strAlcoholic <- ifelse(cocktails_ds$strAlcoholic=="Non alcoholic","Non Alcoholic",cocktails_ds$strAlcoholic)

#we strip quotes and keep only alphanumeric characters and spaces
cocktails_ds$strInstructions <- noquote(cocktails_ds$strInstructions)
cocktails_ds$strInstructions <- gsub('[^[:alnum:] ]','',cocktails_ds$strInstructions)

#at the end of data preparation the data is partitioned into train and test sets (70/30, stratified by category)
training_obs <- createDataPartition(cocktails_ds$strCategory, 
                                    p = 0.7, 
                                    list = FALSE) 
cocktails_ds.train <- cocktails_ds[training_obs, ]
cocktails_ds.test  <- cocktails_ds[-training_obs, ]
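
The cleaning step above deletes every character that is not alphanumeric or a space. The sketch below applies the same pattern to the first recipe shown in the `str()` output, next to an assumed alternative (not used in this study) that replaces such characters with a space instead. It suggests that deleting punctuation can glue neighbouring tokens together when the original text has no space after a period.

```r
# First recipe string, copied from the str() output above
recipe <- "1. Fill a rocks glass with ice 2.add white creme de cacao and vodka 3.stir"

# The pattern used in the study: delete non-alphanumeric, non-space characters
deleted <- gsub("[^[:alnum:] ]", "", recipe)

# Assumed alternative: replace them with a space, then collapse repeated spaces
spaced <- gsub("[^[:alnum:] ]", " ", recipe)
spaced <- trimws(gsub(" +", " ", spaced))

print(deleted)  # contains the glued tokens "2add" and "3stir"
print(spaced)   # keeps "2 add" and "3 stir" as separate tokens
```

Whether the glued tokens matter depends on the later steps: removing numbers afterwards turns "2add" back into "add", but only where that option is actually applied.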

To perform text categorization on our cocktail data we are going to prepare a document-term matrix using the ‘tm’ library, removing punctuation and numbers from the cocktail recipes.

corpus_cock <- Corpus(VectorSource(cocktails_ds.train$strInstructions))
tdm_cock <- DocumentTermMatrix(corpus_cock, list(removePunctuation = TRUE, 
                                               removeNumbers = TRUE))

str(tdm_cock)
## List of 6
##  $ i       : int [1:6273] 1 1 1 1 1 1 1 1 1 1 ...
##  $ j       : int [1:6273] 1 2 3 4 5 6 7 8 9 10 ...
##  $ v       : num [1:6273] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 387
##  $ ncol    : int 991
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:387] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:991] "add" "and" "cacao" "creme" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

As we can see, we have produced 991 unique terms from the 387 training recipes.
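
To make the structure of a document-term matrix concrete, here is a minimal base-R sketch on three made-up recipe strings. It does not use ‘tm’ and skips all preprocessing options; the documents and object names are illustrative assumptions only.

```r
# Three made-up documents (not taken from the data set)
docs <- c("shake with ice", "pour over ice", "shake and pour")

# Tokenize on spaces and build the vocabulary of unique terms
tokens <- strsplit(tolower(docs), " +")
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document: rows = documents, columns = terms
dtm_toy <- t(vapply(tokens,
                    function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                    integer(length(vocab))))
dimnames(dtm_toy) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)

print(dtm_toy)
```

Each row is one recipe and each column one term, which is the same layout as the 387 x 991 matrix reported by `str(tdm_cock)`.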

In the next step we combine the resulting matrix with the values of the dependent variable and convert the variable types. The test data is prepared as well.

training_set_cock <- as.matrix(tdm_cock)
training_set_cock <- cbind(training_set_cock, cocktails_ds.train$strCategory)
colnames(training_set_cock)[ncol(training_set_cock)] <- "y"
training_set_cock <- as.data.frame(training_set_cock)
training_set_cock$y <- as.factor(training_set_cock$y)
training_set_cock[sapply(training_set_cock, is.character)] <- lapply(training_set_cock[sapply(training_set_cock, is.character)],  as.numeric)

#data for testing
test_cock_corpus <- Corpus(VectorSource(cocktails_ds.test$strInstructions))
test_cock_tdm <- DocumentTermMatrix(test_cock_corpus, control=list(dictionary = Terms(tdm_cock)))
test_cock_tdm <- as.matrix(test_cock_tdm)

Data is ready for categorization.
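
The test matrix is built with the training terms as its dictionary, so its columns line up with what the model was trained on. A minimal base-R sketch of that idea, with a hypothetical three-term vocabulary and a made-up test recipe:

```r
# Hypothetical training vocabulary and the tokens of one made-up test recipe
train_vocab <- c("add", "ice", "shake")
test_tokens <- c("shake", "with", "ice", "and", "ice")

# Count only terms from the training vocabulary; unseen terms are dropped,
# and vocabulary terms missing from the recipe get a count of zero
test_counts <- table(factor(test_tokens, levels = train_vocab))

print(test_counts)
```

Words the model never saw during training ("with", "and" here) simply do not contribute to the prediction.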

In the next step we are going to create the input data for the clustering analysis.

#Clustering preparation
dtm_cocktails <- CreateDtm(doc_vec = cocktails_ds$strInstructions, # character vector of documents
                 doc_names = cocktails_ds$strAlcoholic, # document names
                 ngram_window = c(1,1), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # English stopwords from the 'stopwords' package
                                  stopwords::stopwords(source = "smart")), 
                 lower = TRUE, 
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE, 
                 verbose = FALSE,
                 cpus = 2) 

# developing the matrix of term counts to get the IDF vector
tf_cocktails <- TermDocFreq(dtm_cocktails)
tfidf <- t(dtm_cocktails[ , tf_cocktails$term ]) * tf_cocktails$idf
tfidf <- t(tfidf)
csim <- tfidf / sqrt(rowSums(tfidf * tfidf))
csim <- csim %*% t(csim)
cdist <- as.dist(1 - csim)

Now we have all the objects ready for modelling.
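
The clustering input above is a cosine distance: each TF-IDF row is scaled to unit length, the dot products of the scaled rows give the cosine similarity, and the distance is one minus that similarity. A small worked sketch on a made-up 3 x 3 matrix (all values and names are illustrative):

```r
# Toy TF-IDF matrix: 3 documents (rows) x 3 terms (columns), made-up values
tfidf_toy <- matrix(c(1, 0, 0,
                      2, 0, 0,
                      0, 3, 4),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("d1", "d2", "d3"), c("ice", "rum", "milk")))

# Scale every row to unit Euclidean length
unit_rows <- tfidf_toy / sqrt(rowSums(tfidf_toy^2))

# Cosine similarity is the dot product of unit-length rows
csim_toy <- unit_rows %*% t(unit_rows)

# Cosine distance, in the same form as cdist above
cdist_toy <- as.dist(1 - csim_toy)

print(round(csim_toy, 3))
print(cdist_toy)
```

Documents d1 and d2 point in the same direction and get distance 0 even though their lengths differ, while d1 and d3 share no terms and get distance 1. Note that a document with an all-zero row would produce NaN here, since its length is zero.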

Beer, cocktail or soda? - drinks categorization

With ML methods we are going to categorize cocktails into 11 classes.

At the beginning - a Support Vector Machine model.

modelSVM <- train(y ~., data = training_set_cock, method = 'svmLinear3')

modelSVM
## L2 Regularized Support Vector Machine (dual) with Linear Kernel 
## 
## 387 samples
## 991 predictors
##  11 classes: 'Beer', 'Cocktail', 'Cocoa', 'Coffee / Tea', 'Homemade Liqueur', 'Milk / Float / Shake', 'Ordinary Drink', 'Other/Unknown', 'Punch / Party Drink', 'Shot', 'Soft Drink / Soda' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 387, 387, 387, 387, 387, 387, ... 
## Resampling results across tuning parameters:
## 
##   cost  Loss  Accuracy   Kappa    
##   0.25  L1    0.5971650  0.4135591
##   0.25  L2    0.5963030  0.4182925
##   0.50  L1    0.5814070  0.4021432
##   0.50  L2    0.5804802  0.4016740
##   1.00  L1    0.5617666  0.3821602
##   1.00  L2    0.5676438  0.3904007
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25 and Loss = L1.

As we can observe, we have got ca. 60% resampled accuracy on the training data when categorizing cocktails into 11 classes. Let’s try our model out on the test data.

model_test_pred_result <- predict(modelSVM, newdata = test_cock_tdm)

accuracy <- mean(model_test_pred_result == cocktails_ds.test$strCategory)*100
accuracy
## [1] 67.92453

The accuracy on the test set is even better - ca. 68%.
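
To put the accuracy figures in context, it helps to know how well a trivial model would do by always predicting the most frequent category. The sketch below computes that baseline from the category counts printed earlier for the full data set. Since the split is stratified by category, the share in the test set should be similar, but that is an assumption rather than something measured here.

```r
# Category counts copied from the table(cocktails_ds$strCategory) output
category_counts <- c("Beer" = 13, "Cocktail" = 64, "Cocoa" = 9,
                     "Coffee / Tea" = 25, "Homemade Liqueur" = 12,
                     "Milk / Float / Shake" = 17, "Ordinary Drink" = 275,
                     "Other/Unknown" = 34, "Punch / Party Drink" = 37,
                     "Shot" = 49, "Soft Drink / Soda" = 11)

# Accuracy of always predicting the most frequent class
majority_class <- names(which.max(category_counts))
baseline_pct <- 100 * max(category_counts) / sum(category_counts)

print(majority_class)
print(round(baseline_pct, 1))
```

Any model accuracy should be read against this share of “Ordinary Drink” recipes rather than against zero.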

conf_mat <- confusion_matrix(targets =as.factor(cocktails_ds.test$strCategory),
                             predictions = model_test_pred_result)
plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]], rotate_y_text = FALSE)

The confusion matrix can help us better understand the accuracy of the predictions. “Ordinary Drink” is the most common class and also one of the best predicted. One class, “Cocoa”, was predicted with 100% accuracy - job well done! On the other hand, “Cocktail” is the second most poorly predicted class. Interestingly, the worst predicted class is “Soft Drink / Soda”: all cases of that class were predicted as “Ordinary Drink”. Probably the only difference between “Ordinary Drink” and “Soft Drink / Soda” is the addition of alcohol, which was not clear to our model.
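
The per-class reading of a confusion matrix can also be expressed as numbers. Here is a minimal base-R sketch that computes the share of correctly predicted cases per actual class (recall). The labels are made up for illustration and are not the study's predictions.

```r
# Made-up actual and predicted labels, for illustration only
actual <- factor(c("Shot", "Shot", "Cocoa", "Soda", "Soda",
                   "Ordinary", "Ordinary", "Ordinary"))
predicted <- factor(c("Shot", "Ordinary", "Cocoa", "Ordinary", "Ordinary",
                      "Ordinary", "Ordinary", "Shot"),
                    levels = levels(actual))

# Rows are actual classes, columns are predicted classes
cm <- table(Actual = actual, Predicted = predicted)

# Recall per class: correct predictions divided by the number of actual cases
recall <- diag(cm) / rowSums(cm)

print(cm)
print(round(recall, 2))
```

In this toy case “Cocoa” has recall 1 and “Soda” has recall 0 because every “Soda” case lands in the “Ordinary” column, which is the same pattern described above.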

Now we are going to try a different model - random forest.

modelRf <- train(y ~., data = training_set_cock, method = 'rf')
modelRf
## Random Forest 
## 
## 387 samples
## 991 predictors
##  11 classes: 'Beer', 'Cocktail', 'Cocoa', 'Coffee / Tea', 'Homemade Liqueur', 'Milk / Float / Shake', 'Ordinary Drink', 'Other/Unknown', 'Punch / Party Drink', 'Shot', 'Soft Drink / Soda' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 387, 387, 387, 387, 387, 387, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     2   0.4936899  0.0000000
##    44   0.6139069  0.3465717
##   990   0.6004534  0.3832486
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 44.
#test
modelRf_test_pred_result <- predict(modelRf, newdata = test_cock_tdm)
accuracy <- mean(modelRf_test_pred_result == cocktails_ds.test$strCategory)*100
accuracy
## [1] 69.18239

The resampled accuracy on the training data is ca. 61%, slightly better than for SVM. Accuracy on the test set is also slightly better than for SVM - ca. 69%.
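
The tuning table above reports Kappa next to accuracy, and the mtry = 2 row pairs an accuracy of ca. 0.49 with a Kappa of 0. That pattern is what one would expect from a model that predicts a single class for everything, although the output alone does not confirm it. The sketch below computes Cohen's Kappa by hand on made-up labels. `cohen_kappa` is a hypothetical helper, not a function from ‘caret’.

```r
# Hypothetical helper: Cohen's Kappa from two label vectors
cohen_kappa <- function(actual, predicted) {
  lv <- union(unique(actual), unique(predicted))
  cm <- table(factor(actual, levels = lv), factor(predicted, levels = lv))
  n <- sum(cm)
  p_observed <- sum(diag(cm)) / n                   # plain accuracy
  p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (p_observed - p_expected) / (1 - p_expected)
}

# Made-up labels: half of the cases belong to the majority class
actual <- c(rep("Ordinary Drink", 5), rep("Shot", 3), rep("Cocktail", 2))
always_majority <- rep("Ordinary Drink", length(actual))

acc_majority <- mean(always_majority == actual)
kappa_majority <- cohen_kappa(actual, always_majority)
kappa_perfect <- cohen_kappa(actual, actual)

print(c(accuracy = acc_majority, kappa = kappa_majority, kappa_perfect = kappa_perfect))
```

Accuracy rewards the constant prediction with 0.5, while Kappa gives it 0 because that agreement is no better than chance.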

Do I include alcohol? - drinks clustering

We are going to cluster cocktails and find out if it is possible to distinguish alcoholic drinks from non-alcoholic ones.

Firstly, let’s prepare a hierarchical clustering dendrogram using the Ward method.

hc <- hclust(cdist, "ward.D2") # the labels are unreadable on this plot
clustering <- cutree(hc, k=3)
plot(hc, hang = -1, cex = 0.6, main = "Hierarchical clustering",
     ylab = "", xlab = "", yaxt = "n")

rect.hclust(hc, 3, border = "cyan")

The labels are hardly legible, so we will present the tree in a different way.

hc_phylo <- as.phylo(hc)
plot(hc_phylo, type = "unrooted", cex = 0.6,no.margin = TRUE, x.lim=c(-1,5), y.lim=c(0,4))

Having so many labels doesn’t help the analysis. Still, we can observe that one cluster is the biggest, the second is medium-small and the last one is tiny. Even without further analysis we can assume that records labeled ‘Alcoholic’ are included in every cluster.

To investigate the clusters further we should look inside them. Below we can check the top 10 words in every cluster, i.e. the words whose share within the cluster most exceeds their share in the whole corpus.
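
Whether a label really appears in every cluster can be checked directly by cross-tabulating cluster membership against the labels. The sketch below shows the idea on made-up 2-D points and labels. In the study, the vector returned by `cutree()` and the ‘strAlcoholic’ column would play these roles.

```r
# Made-up points forming three well-separated groups
pts <- matrix(c(0,   0,
                0.1, 0,
                0,   0.1,
                5,   5,
                5.1, 5,
                10,  0),
              ncol = 2, byrow = TRUE)

# Made-up labels for the six points
labels_toy <- c("Alcoholic", "Alcoholic", "Non Alcoholic",
                "Alcoholic", "Optional alcohol", "Alcoholic")

# Ward clustering cut into 3 clusters, as in the analysis above
hc_toy <- hclust(dist(pts), method = "ward.D2")
clusters_toy <- cutree(hc_toy, k = 3)

# Rows are clusters, columns are labels
crosstab <- table(Cluster = clusters_toy, Label = labels_toy)
print(crosstab)
```

A column with non-zero counts in every row means that label is spread across all clusters.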

p_words <- colSums(dtm_cocktails) / sum(dtm_cocktails)

cluster_words <- lapply(unique(clustering), function(x){
  rows <- dtm_cocktails[ clustering == x , ]
  
  rows <- rows[ , colSums(rows) > 0 ]
  
  colSums(rows) / sum(rows) - p_words[ colnames(rows) ]
})

# create a summary table of the top 10 words defining each cluster
cluster_summary <- data.frame(cluster = unique(clustering),
                              size = as.numeric(table(clustering)),
                              top_words = sapply(cluster_words, function(d){
                                paste(
                                  names(d)[ order(d, decreasing = TRUE) ][ 1:10 ], 
                                  collapse = ", ")
                              }),
                              stringsAsFactors = FALSE)

cluster_summary
##   cluster size
## 1       1  453
## 2       2   62
## 3       3   31
##                                                                            top_words
## 1                     add, pour, fill, mix, water, top, coffee, vodka, shot, blender
## 2 cocktail, strain, ingredients, shake, glass, serve, ice, chilled, shaker, contents
## 3   halffilled, combine, cubes, strain, shaker, glass, garnish, shake, ice, cocktail

As we can see, the first cluster is clearly the ‘Alcoholic’ cluster - “vodka” is one of its top words. The second and third clusters are harder to characterize, but their top words include nothing connected with alcohol, which is a good sign.
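
The ranking behind the summary table is the difference between a word's share inside the cluster and its share in all documents. A small worked sketch with a made-up 3 x 3 count matrix and a made-up cluster assignment:

```r
# Made-up term counts: 3 documents x 3 terms
dtm_toy <- matrix(c(2, 1, 0,
                    1, 1, 0,
                    0, 1, 3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("d1", "d2", "d3"), c("vodka", "ice", "milk")))
cluster_toy <- c(1, 1, 2)  # made-up cluster assignment

# Share of each word in the whole corpus
p_all <- colSums(dtm_toy) / sum(dtm_toy)

# Share of each word inside cluster 1
rows <- dtm_toy[cluster_toy == 1, , drop = FALSE]
p_cluster <- colSums(rows) / sum(rows)

# Positive score: the word is over-represented in the cluster
score <- p_cluster - p_all
print(round(sort(score, decreasing = TRUE), 3))
```

Here "ice" occurs in every document, so its score stays close to zero, while "vodka" is concentrated in cluster 1 and ranks first. A word can therefore appear among the top words without being the most frequent word in the cluster.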

wordcloud::wordcloud(words = names(cluster_words[[ 1 ]]), 
                     freq = cluster_words[[ 1 ]], 
                     max.words = 50, 
                     random.order = FALSE, 
                     colors = c("red", "green", "blue"),
                     main = "Top words in cluster 1")

As mentioned before, the first cluster is clearly the “Alcoholic” cluster. We can spot key words like “vodka”, “rum” or “beer” there. However, the most common words in that cluster are “add”, “fill”, “mix” and “pour”, verbs used in the preparation of every drink.

wordcloud::wordcloud(words = names(cluster_words[[ 2 ]]), 
                     freq = cluster_words[[ 2 ]], 
                     max.words = 50, 
                     random.order = FALSE, 
                     colors = c("red", "green", "blue"),
                     main = "Top words in cluster 2")

The second cluster could be the “Non-Alcoholic” cluster or the “Optional alcohol” cluster, but it contains no words connected with alcohol, so we bet it is the alcohol-free option. Most of the displayed words are connected to the process of cocktail preparation.

wordcloud::wordcloud(words = names(cluster_words[[ 3 ]]), 
                     freq = cluster_words[[ 3 ]], 
                     max.words = 50, 
                     random.order = FALSE, 
                     colors = c("red", "green", "blue"),
                     main = "Top words in cluster 3")

The last cluster may be the “Optional alcohol” one. The most used words are not connected with alcohol. Despite this, we can find some spirits in the word cloud: “vermouth” (which can be alcohol-free), but also “scotch”, “gin” and “whiskey”. It could also be an “Alcoholic” cluster, but it is much smaller than the first one, so it should be the cluster representing the “Optional alcohol” class.

Now we will try to cluster our cocktails using the K-Means method.

kfit <- kmeans(cdist, 3, iter.max=10,nstart=100)
kfit$betweenss/kfit$totss*100
## [1] 43.68033
kfit2 <- kmeans(cdist, 3, iter.max=500,nstart=100, algorithm="MacQueen")
kfit2$betweenss/kfit2$totss*100
## [1] 43.68033

Different approaches gave us the same result. Only ca. 44% of the total variance is explained by the separation between clusters (between_SS / total_SS) - not so good.
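
The ratio `betweenss / totss` printed above is not a classification accuracy. It is the share of the total sum of squares that lies between cluster centres rather than within clusters. The sketch below illustrates this on simulated 2-D data. The data, the seed and the object names are assumptions made for the example.

```r
# Simulated data: three groups of 20 points each (illustrative only)
set.seed(1)
x_toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
               matrix(rnorm(40, mean = 4), ncol = 2),
               matrix(rnorm(40, mean = 8), ncol = 2))

fit_toy <- kmeans(x_toy, centers = 3, nstart = 20)

# Share of total variation explained by the cluster structure
explained <- fit_toy$betweenss / fit_toy$totss

# Equivalent form: one minus the share left within the clusters
explained_check <- 1 - fit_toy$tot.withinss / fit_toy$totss

print(c(explained = explained, check = explained_check))
```

A value near 1 means tight, well-separated clusters, and a value near 0 means the clusters explain little. The ratio says nothing about whether the clusters match the ‘strAlcoholic’ labels.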

clusplot(as.matrix(cdist), kfit$cluster, color=TRUE, shade=TRUE, labels=3, lines=0, span=TRUE)

The plot of the clusters is not very clear but can give us helpful hints. We can observe that the most common “Alcoholic” class occurs in all of the clusters. Moreover, we can spot 1 big and 2 smaller clusters.

Summary

To sum up, cocktail recipes vary in terms of ingredients, but most preparation steps are repeated. There are also basic products that are used in most cocktails, drinks or beverages. Sorting cocktails into types such as “Ordinary Drink”, “Cocktail”, “Soft Drink / Soda” and so on can be done with text mining and machine learning methods with pretty good accuracy. We achieved ca. 68-69% accuracy on the test data even though we had 11 different target classes. Clustering cocktails into 3 categories was a harder task. The majority of cocktails are alcoholic, but they can mix with members of the optional-alcohol cluster. The “Non-alcoholic” cluster should be the purest one, but sometimes only one word changes a cocktail from alcohol-free to alcoholic, which can be confusing.

Despite many adversities, text mining methods can be helpful in analyzing recipes and formulas. Word analysis gives us powerful capabilities that help us understand and, above all, summarize long and intricate texts.