Mankind has experimented for centuries in the kitchen and behind the bar to prepare the best possible versions of dishes and drinks, producing hundreds of thousands of combinations, recipes and flavor variations. In trying to create delicious combinations of ingredients, have we found a method in this madness? Perhaps an algorithm? Can we judge the type and category of a drink from its recipe alone?
We will try to answer these questions in the following study. Using data from https://www.kaggle.com/datasets/ai-first/cocktail-ingredients we are going to examine some of the most famous cocktail recipes. We will check whether text mining methods can help us identify cocktail categories, and we will also try to distinguish non-alcoholic cocktails from those containing alcohol.
The data set consists of 546 observations and 41 variables. For the purposes of our analysis we are going to use only 3 of them.
## spc_tbl_ [546 × 41] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:546] 0 1 2 3 4 5 6 7 8 9 ...
## $ strDrink : chr [1:546] "'57 Chevy with a White License Plate" "1-900-FUK-MEUP" "110 in the shade" "151 Florida Bushwacker" ...
## $ dateModified : POSIXct[1:546], format: "2016-07-18 22:49:04" "2016-07-18 22:27:04" ...
## $ idDrink : num [1:546] 14029 15395 15423 14588 15346 ...
## $ strAlcoholic : chr [1:546] "Alcoholic" "Alcoholic" "Alcoholic" "Alcoholic" ...
## $ strCategory : chr [1:546] "Cocktail" "Shot" "Beer" "Milk / Float / Shake" ...
## $ strDrinkThumb : chr [1:546] "http://www.thecocktaildb.com/images/media/drink/qyyvtu1468878544.jpg" "http://www.thecocktaildb.com/images/media/drink/uxywyw1468877224.jpg" "http://www.thecocktaildb.com/images/media/drink/xxyywq1454511117.jpg" "http://www.thecocktaildb.com/images/media/drink/rvwrvv1468877323.jpg" ...
## $ strGlass : chr [1:546] "Highball glass" "Old-fashioned glass" "Beer Glass" "Beer mug" ...
## $ strIBA : chr [1:546] NA NA NA NA ...
## $ strIngredient1 : chr [1:546] "Creme de Cacao" "Absolut Kurant" "Lager" "Malibu rum" ...
## $ strIngredient10: chr [1:546] NA NA NA NA ...
## $ strIngredient11: chr [1:546] NA NA NA NA ...
## $ strIngredient12: chr [1:546] NA NA NA NA ...
## $ strIngredient13: logi [1:546] NA NA NA NA NA NA ...
## $ strIngredient14: logi [1:546] NA NA NA NA NA NA ...
## $ strIngredient15: logi [1:546] NA NA NA NA NA NA ...
## $ strIngredient2 : chr [1:546] "Vodka" "Grand Marnier" "Tequila" "Light rum" ...
## $ strIngredient3 : chr [1:546] NA "Chambord raspberry liqueur" NA "151 proof rum" ...
## $ strIngredient4 : chr [1:546] NA "Midori melon liqueur" NA "Dark Creme de Cacao" ...
## $ strIngredient5 : chr [1:546] NA "Malibu rum" NA "Cointreau" ...
## $ strIngredient6 : chr [1:546] NA "Amaretto" NA "Milk" ...
## $ strIngredient7 : chr [1:546] NA "Cranberry juice" NA "Coconut liqueur" ...
## $ strIngredient8 : chr [1:546] NA "Pineapple juice" NA "Vanilla ice-cream" ...
## $ strIngredient9 : chr [1:546] NA NA NA NA ...
## $ strInstructions: chr [1:546] "1. Fill a rocks glass with ice 2.add white creme de cacao and vodka 3.stir" "Shake ingredients in a mixing tin filled with ice cubes. Strain into a rocks glass." "Drop shooter in glass. Fill with beer" "Combine all ingredients. Blend until smooth. Garnish with chocolate shavings if desired." ...
## $ strMeasure1 : chr [1:546] "1 oz white" "1/2 oz" "16 oz" "1/2 oz" ...
## $ strMeasure10 : chr [1:546] NA NA NA NA ...
## $ strMeasure11 : chr [1:546] NA NA NA NA ...
## $ strMeasure12 : chr [1:546] NA NA NA NA ...
## $ strMeasure13 : logi [1:546] NA NA NA NA NA NA ...
## $ strMeasure14 : logi [1:546] NA NA NA NA NA NA ...
## $ strMeasure15 : logi [1:546] NA NA NA NA NA NA ...
## $ strMeasure2 : chr [1:546] "1 oz" "1/4 oz" "1.5 oz" "1/2 oz" ...
## $ strMeasure3 : chr [1:546] NA "1/4 oz" NA "1/2 oz Bacardi" ...
## $ strMeasure4 : chr [1:546] NA "1/4 oz" NA "1 oz" ...
## $ strMeasure5 : chr [1:546] NA "1/4 oz" NA "1 oz" ...
## $ strMeasure6 : chr [1:546] NA "1/4 oz" NA "3 oz" ...
## $ strMeasure7 : chr [1:546] NA "1/2 oz" NA "1 oz" ...
## $ strMeasure8 : chr [1:546] NA "1/4 oz" NA "1 cup" ...
## $ strMeasure9 : chr [1:546] NA NA NA NA ...
## $ strVideo : logi [1:546] NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. strDrink = col_character(),
## .. dateModified = col_datetime(format = ""),
## .. idDrink = col_double(),
## .. strAlcoholic = col_character(),
## .. strCategory = col_character(),
## .. strDrinkThumb = col_character(),
## .. strGlass = col_character(),
## .. strIBA = col_character(),
## .. strIngredient1 = col_character(),
## .. strIngredient10 = col_character(),
## .. strIngredient11 = col_character(),
## .. strIngredient12 = col_character(),
## .. strIngredient13 = col_logical(),
## .. strIngredient14 = col_logical(),
## .. strIngredient15 = col_logical(),
## .. strIngredient2 = col_character(),
## .. strIngredient3 = col_character(),
## .. strIngredient4 = col_character(),
## .. strIngredient5 = col_character(),
## .. strIngredient6 = col_character(),
## .. strIngredient7 = col_character(),
## .. strIngredient8 = col_character(),
## .. strIngredient9 = col_character(),
## .. strInstructions = col_character(),
## .. strMeasure1 = col_character(),
## .. strMeasure10 = col_character(),
## .. strMeasure11 = col_character(),
## .. strMeasure12 = col_character(),
## .. strMeasure13 = col_logical(),
## .. strMeasure14 = col_logical(),
## .. strMeasure15 = col_logical(),
## .. strMeasure2 = col_character(),
## .. strMeasure3 = col_character(),
## .. strMeasure4 = col_character(),
## .. strMeasure5 = col_character(),
## .. strMeasure6 = col_character(),
## .. strMeasure7 = col_character(),
## .. strMeasure8 = col_character(),
## .. strMeasure9 = col_character(),
## .. strVideo = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
We are interested in ‘strAlcoholic’, ‘strCategory’ and ‘strInstructions’.
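For orientation, we can take a quick peek at just those three columns (a throwaway base-R inspection, not reused later):
head(cocktails_ds[, c("strAlcoholic", "strCategory", "strInstructions")], 3)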
table(cocktails_ds$strAlcoholic)
##
## Alcoholic Non alcoholic Non Alcoholic Optional alcohol
## 478 57 1 9
table(cocktails_ds$strCategory)
##
## Beer Cocktail Cocoa
## 13 64 9
## Coffee / Tea Homemade Liqueur Milk / Float / Shake
## 25 12 17
## Ordinary Drink Other/Unknown Punch / Party Drink
## 275 34 37
## Shot Soft Drink / Soda
## 49 11
The ‘strAlcoholic’ feature indicates whether a cocktail contains alcohol, contains optional alcohol, or is non-alcoholic. The ‘strCategory’ feature describes the type of drink, such as "Beer" or "Coffee / Tea". The ‘strInstructions’ feature is a string containing the cocktail recipe; it will be the main source for our text mining process.
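Since we will later ask whether the alcohol label can be recovered from the recipes, it may also be worth cross-tabulating the two labels; a minimal sketch (output omitted):
table(cocktails_ds$strCategory, cocktails_ds$strAlcoholic)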
library(tm)
library(SnowballC)
library(pander)
library(tidyverse)
library(caret)
library(insight)
library(wordcloud)
library(LiblineaR)
library(e1071)
library(textmineR)
library(cvms)
library(ape)
library(cluster)
table(cocktails_ds$strAlcoholic)
##
## Alcoholic Non alcoholic Non Alcoholic Optional alcohol
## 478 57 1 9
#these two categories mean the same thing, so we merge them into one
cocktails_ds$strAlcoholic <- ifelse(cocktails_ds$strAlcoholic=="Non alcoholic","Non Alcoholic",cocktails_ds$strAlcoholic)
#we keep only alphanumeric characters and spaces (this also strips quotes and other punctuation)
cocktails_ds$strInstructions <- noquote(cocktails_ds$strInstructions) #noquote only affects printing; the actual cleaning is done by the gsub below
cocktails_ds$strInstructions <- gsub('[^[:alnum:] ]','',cocktails_ds$strInstructions)
#finally, the prepared data is partitioned into train and test sets
training_obs <- createDataPartition(cocktails_ds$strCategory,
p = 0.7,
list = FALSE)
cocktails_ds.train <- cocktails_ds[training_obs, ]
cocktails_ds.test <- cocktails_ds[-training_obs, ]
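Note that createDataPartition draws a random stratified split, so the exact accuracies reported below will vary from run to run. If reproducibility is desired, a seed could be fixed before the call above, for example:
set.seed(123) #arbitrary seed, chosen only to make the split reproducible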
To perform text categorization on our cocktail data we prepare a document-term matrix using the ‘tm’ library, removing punctuation and numbers from the recipes.
corpus_cock <- Corpus(VectorSource(cocktails_ds.train$strInstructions))
tdm_cock <- DocumentTermMatrix(corpus_cock, list(removePunctuation = TRUE,
removeNumbers = TRUE))
str(tdm_cock)
## List of 6
## $ i : int [1:6273] 1 1 1 1 1 1 1 1 1 1 ...
## $ j : int [1:6273] 1 2 3 4 5 6 7 8 9 10 ...
## $ v : num [1:6273] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 387
## $ ncol : int 991
## $ dimnames:List of 2
## ..$ Docs : chr [1:387] "1" "2" "3" "4" ...
## ..$ Terms: chr [1:991] "add" "and" "cacao" "creme" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
As we can see, we have produced 991 unique terms from the training data.
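As a possible refinement (not applied here), very rare terms could be dropped with tm's removeSparseTerms, which removes terms whose sparsity exceeds a given threshold; a hedged sketch (tdm_cock_small is our own name):
tdm_cock_small <- removeSparseTerms(tdm_cock, 0.99) #keeps only terms appearing in at least roughly 1% of recipes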
In the next step we combine the produced document-term matrix with the dependent variable and convert the variable types. The test data is prepared as well, using the training dictionary so that both matrices share the same terms.
training_set_cock <- as.matrix(tdm_cock)
training_set_cock <- cbind(training_set_cock, cocktails_ds.train$strCategory)
colnames(training_set_cock)[ncol(training_set_cock)] <- "y"
training_set_cock <- as.data.frame(training_set_cock)
training_set_cock$y <- as.factor(training_set_cock$y)
training_set_cock[sapply(training_set_cock, is.character)] <- lapply(training_set_cock[sapply(training_set_cock, is.character)], as.numeric)
#data for testing
test_cock_corpus <- Corpus(VectorSource(cocktails_ds.test$strInstructions))
test_cock_tdm <- DocumentTermMatrix(test_cock_corpus, control=list(dictionary = Terms(tdm_cock)))
test_cock_tdm <- as.matrix(test_cock_tdm)
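Before modelling, it is worth verifying that the test matrix was indeed restricted to the training vocabulary; a quick check (these expressions are ours):
dim(test_cock_tdm) #test documents x terms from the training dictionary
length(setdiff(colnames(test_cock_tdm), Terms(tdm_cock))) #expected to be 0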
The data is ready for categorization.
In the next step we create the input data for the clustering part of the study.
#Clustering preparation
dtm_cocktails <- CreateDtm(doc_vec = cocktails_ds$strInstructions, # character vector of documents
doc_names = cocktails_ds$strAlcoholic, # document names
ngram_window = c(1,1), # minimum and maximum n-gram length
stopword_vec = c(stopwords::stopwords("en"), # English stopwords from the 'stopwords' package
stopwords::stopwords(source = "smart")),
lower = TRUE,
remove_punctuation = TRUE,
remove_numbers = TRUE,
verbose = FALSE,
cpus = 2)
# developing the matrix of term counts to get the IDF vector
tf_cocktails <- TermDocFreq(dtm_cocktails)
tfidf <- t(dtm_cocktails[ , tf_cocktails$term ]) * tf_cocktails$idf
tfidf <- t(tfidf)
#cosine similarity: normalize each document's tf-idf vector to unit length, then cross-multiply
csim <- tfidf / sqrt(rowSums(tfidf * tfidf))
csim <- csim %*% t(csim)
#convert similarity into a distance for clustering
cdist <- as.dist(1 - csim)
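To make the cosine-distance step concrete, here is a tiny toy example (ours) of the same computation on two three-element vectors:
a <- c(1, 2, 0)
b <- c(2, 1, 1)
1 - sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b))) #cosine distance = 1 - 4/sqrt(30), about 0.27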
Now we have all the objects ready for modelling.
Using ML methods we are going to categorize the cocktails into 11 classes.
To begin with - a Support Vector Machine model.
modelSVM <- train(y ~., data = training_set_cock, method = 'svmLinear3')
modelSVM
## L2 Regularized Support Vector Machine (dual) with Linear Kernel
##
## 387 samples
## 991 predictors
## 11 classes: 'Beer', 'Cocktail', 'Cocoa', 'Coffee / Tea', 'Homemade Liqueur', 'Milk / Float / Shake', 'Ordinary Drink', 'Other/Unknown', 'Punch / Party Drink', 'Shot', 'Soft Drink / Soda'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 387, 387, 387, 387, 387, 387, ...
## Resampling results across tuning parameters:
##
## cost Loss Accuracy Kappa
## 0.25 L1 0.5971650 0.4135591
## 0.25 L2 0.5963030 0.4182925
## 0.50 L1 0.5814070 0.4021432
## 0.50 L2 0.5804802 0.4016740
## 1.00 L1 0.5617666 0.3821602
## 1.00 L2 0.5676438 0.3904007
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25 and Loss = L1.
As we can observe, we obtained ca. 60% resampled accuracy on the training data for cocktail categorization with 11 classes. Let's try the model out on the test data.
model_test_pred_result <- predict(modelSVM, newdata = test_cock_tdm)
accuracy <- mean(model_test_pred_result == cocktails_ds.test$strCategory)*100
accuracy
## [1] 67.92453
The accuracy on the test set is even better - ca. 68%.
conf_mat <- confusion_matrix(targets =as.factor(cocktails_ds.test$strCategory),
predictions = model_test_pred_result)
plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]], rotate_y_text = FALSE)
The confusion matrix helps us better understand the quality of the predictions. "Ordinary Drink" is the most common class and also one of the best predicted. One class - "Cocoa" - was predicted with 100% accuracy - job well done! On the other hand, "Cocktail" is the second most poorly predicted class. Interestingly, the worst predicted class is "Soft Drink/Soda": all cases of that class were predicted as "Ordinary Drink". Probably the only difference between "Ordinary Drink" and "Soft Drink/Soda" is the presence of alcohol, which our model could not pick up.
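For a purely numeric view of the same thing, per-class recall could also be computed with caret's confusionMatrix; a hedged sketch (the object name cm is ours):
cm <- caret::confusionMatrix(
data = model_test_pred_result,
reference = factor(cocktails_ds.test$strCategory,
levels = levels(model_test_pred_result)))
round(cm$byClass[, "Sensitivity"], 2) #recall for each of the 11 classes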
Now we are going to try a different model - a random forest.
modelRf <- train(y ~., data = training_set_cock, method = 'rf')
modelRf
## Random Forest
##
## 387 samples
## 991 predictors
## 11 classes: 'Beer', 'Cocktail', 'Cocoa', 'Coffee / Tea', 'Homemade Liqueur', 'Milk / Float / Shake', 'Ordinary Drink', 'Other/Unknown', 'Punch / Party Drink', 'Shot', 'Soft Drink / Soda'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 387, 387, 387, 387, 387, 387, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.4936899 0.0000000
## 44 0.6139069 0.3465717
## 990 0.6004534 0.3832486
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 44.
#test
modelRf_test_pred_result <- predict(modelRf, newdata = test_cock_tdm)
accuracy <- mean(modelRf_test_pred_result == cocktails_ds.test$strCategory)*100
accuracy
## [1] 69.18239
The resampled accuracy on the training data is ca. 61%, slightly worse than for the SVM. However, the accuracy on the test set is slightly better than for the SVM - ca. 69%.
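Since both models were fit with caret's default 25 bootstrap resamples, their resampling distributions can be put side by side with resamples(); a sketch (for a strictly paired comparison one would reuse identical resampling indices via a shared trainControl and seed, which we did not do here):
resamps <- resamples(list(SVM = modelSVM, RF = modelRf)) #resamps is our own name
summary(resamps)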
We are now going to cluster the cocktails and find out whether it is possible to distinguish alcoholic drinks from non-alcoholic ones.
First, let's prepare a hierarchical clustering dendrogram using Ward's method.
hc <- hclust(cdist, "ward.D2") #with this many leaves the labels are hard to read
clustering <- cutree(hc, k=3)
plot(hc, hang = -1, cex = 0.6, main = "Hierarchical clustering",
ylab = "", xlab = "", yaxt = "n")
rect.hclust(hc, 3, border = "cyan")
The labels are hardly visible, so we will present the dendrogram in a different way.
hc_phylo <- as.phylo(hc)
plot(hc_phylo, type = "unrooted", cex = 0.6,no.margin = TRUE, x.lim=c(-1,5), y.lim=c(0,4))
The large number of labels does not help much with the analysis. Still, we can observe that one cluster is the biggest, the second is medium-sized and the last one is tiny. Even without further analysis we can assume that records labelled ‘Alcoholic’ are present in every cluster, which can be checked with the cross-tabulation below.
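A quick sketch of that check (output omitted), cross-tabulating the cluster assignments with the alcohol label:
table(cluster = clustering, alcohol = cocktails_ds$strAlcoholic)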
To investigate the clusters in more detail we should look inside them. Below we can check the top 10 words in each cluster.
p_words <- colSums(dtm_cocktails) / sum(dtm_cocktails)
cluster_words <- lapply(unique(clustering), function(x){
rows <- dtm_cocktails[ clustering == x , ]
rows <- rows[ , colSums(rows) > 0 ]
colSums(rows) / sum(rows) - p_words[ colnames(rows) ]
})
# create a summary table of the top 10 words defining each cluster
cluster_summary <- data.frame(cluster = unique(clustering),
size = as.numeric(table(clustering)),
top_words = sapply(cluster_words, function(d){
paste(
names(d)[ order(d, decreasing = TRUE) ][ 1:10 ],
collapse = ", ")
}),
stringsAsFactors = FALSE)
cluster_summary
## cluster size
## 1 1 453
## 2 2 62
## 3 3 31
## top_words
## 1 add, pour, fill, mix, water, top, coffee, vodka, shot, blender
## 2 cocktail, strain, ingredients, shake, glass, serve, ice, chilled, shaker, contents
## 3 halffilled, combine, cubes, strain, shaker, glass, garnish, shake, ice, cocktail
As we can see, the first cluster is clearly the ‘Alcoholic’ cluster - "vodka" appears among its top words. The second and third clusters are harder to identify, but their top words contain no terms connected with alcohol, which is a good sign.
wordcloud::wordcloud(words = names(cluster_words[[ 1 ]]),
freq = cluster_words[[ 1 ]],
max.words = 50,
random.order = FALSE,
colors = c("red", "green", "blue"),
main = "Top words in cluster 1")
As mentioned before, the first cluster is clearly the "Alcoholic" cluster. We can spot key words like "vodka", "rum" or "beer" there, although the most common words in that cluster are "add", "fill", "mix" or "pour" - verbs used in the preparation of almost every drink.
wordcloud::wordcloud(words = names(cluster_words[[ 2 ]]),
freq = cluster_words[[ 2 ]],
max.words = 50,
random.order = FALSE,
colors = c("red", "green", "blue"),
main = "Top words in cluster 2")
The second cluster could be the "Non-Alcoholic" cluster or the "Optional alcohol" cluster, but since it contains no words connected with alcohol, we bet it is the alcohol-free option. Most of the displayed words are connected to the process of cocktail preparation.
wordcloud::wordcloud(words = names(cluster_words[[ 3 ]]),
freq = cluster_words[[ 3 ]],
max.words = 50,
random.order = FALSE,
colors = c("red", "green", "blue"),
main = "Top words in cluster 3")
The last cluster may be the "Optional alcohol" one. Most of its frequent words are not connected with alcohol, yet we can still find some spirits in the word cloud: "vermouth" (which can be alcohol-free) but also "scotch", "gin" and "whiskey". It could also be an "Alcoholic" cluster, but since it is much smaller than the first one, it more likely represents the "Optional alcohol" class.
Now we will try to cluster the cocktails using the k-means method.
kfit <- kmeans(cdist, 3, iter.max=10,nstart=100)
kfit$betweenss/kfit$totss*100
## [1] 43.68033
kfit2 <- kmeans(cdist, 3, iter.max=500,nstart=100, algorithm="MacQueen")
kfit2$betweenss/kfit2$totss*100
## [1] 43.68033
Both approaches give the same result: the between-cluster sum of squares accounts for only about 44% of the total sum of squares, so the separation is not particularly good.
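We use k = 3 to match the three alcohol labels; as a sanity check, one could also inspect an elbow plot of the total within-cluster sum of squares over a range of k (a hedged sketch, with an arbitrary seed):
set.seed(123) #arbitrary seed
wss <- sapply(2:8, function(k) kmeans(cdist, k, iter.max = 10, nstart = 25)$tot.withinss)
plot(2:8, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS")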
clusplot(as.matrix(cdist), kfit$cluster, color=T, shade=T, labels=3, lines=0,span=TRUE)
The cluster plot does not look clear, but it can still give us some helpful hints. We can observe that the most common "Alcoholic" class occurs in all of the clusters, and that there is one big cluster and two smaller ones.
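As with the hierarchical clusters, the k-means partition can be cross-tabulated with the alcohol label to back up this observation; a quick sketch (output omitted):
table(cluster = kfit$cluster, alcohol = cocktails_ds$strAlcoholic)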
To sum up, cocktail recipes are varied in terms of ingredients, but most preparation steps are repeated across recipes, and there are basic products used in most cocktails, drinks and beverages. Distinguishing cocktails into types such as "Ordinary Drink", "Cocktail" or "Soft Drink/Soda" can be done with text mining and machine learning methods with reasonably good accuracy: we achieved roughly 68-69% accuracy on the test data even though there were 11 target classes. Clustering the cocktails into 3 categories was a harder task. The majority of cocktails are alcoholic, but they mix with members of the optional-alcohol cluster. The "Non-Alcoholic" cluster should be the purest one, yet in practice a single word can change a cocktail from alcohol-free to alcoholic, which is confusing for the models.
Despite these difficulties, text mining methods can be helpful in analyzing recipes and formulas. Word analysis gives us powerful capabilities that help us understand and, above all, summarize long and intricate texts.