Association rules are a fundamental concept in data mining and machine learning, serving as a powerful analytical tool to discover meaningful relationships and patterns within datasets. The primary objective of association rule mining is to reveal associations or correlations between items in a dataset based on their co-occurrence in transactions. This technique is particularly valuable in scenarios where understanding the inherent relationships between different items or variables can provide actionable insights.
The central notion revolves around identifying frequent itemsets, which are subsets of items that frequently appear together in the dataset. These itemsets are used to generate association rules that express relationships between items in the form of “if-then” statements. The rules consist of an antecedent (the “if” part) and a consequent (the “then” part). For example, if item X is present in a transaction, there is a high likelihood that item Y will also be present. Association rules are quantified by metrics such as support, confidence, and lift, which help evaluate the strength and significance of the identified associations.
This study applies two association rule mining algorithms, Apriori and Eclat, to a large food recipe dataset. The focus is on associations among ingredients, that is, on which ingredients commonly occur together in a recipe. The full dataset contains more than 520,000 recipes; to keep computation manageable, the analysis uses only the first 20,000.
The overarching goal of this study is to extract actionable insights for culinary and retail industries. By identifying associations among ingredients, the study aims to provide valuable information for various applications, including recipe recommendation systems, inventory management, and strategic product placement. Additionally, the visualization of discovered association rules using the arulesViz package enhances the interpretability of the findings, facilitating a more intuitive understanding of the complex relationships within the culinary landscape. Through these analytical endeavors, the study aims to contribute to the enhancement of decision-making processes in culinary and retail domains.
# loading packages
library(tidyverse)
library(plyr)
library(arules)
library(arulesViz)
library(arulesCBA)
library(ggplot2)
The Food.com recipes dataset contains 522,517 recipes from 312 different categories. This dataset provides information about each recipe like cooking times, servings, ingredients, nutrition, instructions, and more.
The reviews dataset contains 1,401,982 reviews from 271,907 different users. This dataset provides information about the author, rating, review text, and more.
# loading dataset
df <- read.csv("recipes.csv", header = TRUE, sep=',')
# let's see first 2 rows of dataset
head(df, 2)
## RecipeId Name AuthorId AuthorName CookTime
## 1 38 Low-Fat Berry Blue Frozen Dessert 1533 Dancer PT24H
## 2 39 Biryani 1567 elly9812 PT25M
## PrepTime TotalTime DatePublished
## 1 PT45M PT24H45M 1999-08-09T21:46:00Z
## 2 PT4H PT4H25M 1999-08-29T13:12:00Z
## Description
## 1 Make and share this Low-Fat Berry Blue Frozen Dessert recipe from Food.com.
## 2 Make and share this Biryani recipe from Food.com.
## Images
## 1 c("https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/38/YUeirxMLQaeE1h3v3qnM_229%20berry%20blue%20frzn%20dess.jpg", "https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/38/AFPDDHATWzQ0b1CDpDAT_255%20berry%20blue%20frzn%20dess.jpg", "https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/38/UYgf9nwMT2SGGJCuzILO_228%20berry%20blue%20frzn%20dess.jpg", "https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/38/PeBMJN2TGSaYks2759BA_20140722_202142.jpg", \n"https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/38/picuaETeN.jpg", "https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/38/pictzvxW5.jpg")
## 2 c("https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/39/picM9Mhnw.jpg", "https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/39/picHv4Ocr.jpg")
## RecipeCategory
## 1 Frozen Desserts
## 2 Chicken Breast
## Keywords
## 1 c("Dessert", "Low Protein", "Low Cholesterol", "Healthy", "Free Of...", "Summer", "Weeknight", "Freezer", "Easy")
## 2 c("Chicken Thigh & Leg", "Chicken", "Poultry", "Meat", "Asian", "Indian", "Weeknight", "Stove Top")
## RecipeIngredientQuantities
## 1 c("4", "1/4", "1", "1")
## 2 c("1", "4", "2", "2", "8", "1/4", "8", "1/2", "1", "1", "1/4", "1/4", "1/2", "1/4", "2", "3", NA, "2", "1", "1", "8", "2", "1/3", "1/3", "1/3", "6")
## RecipeIngredientParts
## 1 c("blueberries", "granulated sugar", "vanilla yogurt", "lemon juice")
## 2 c("saffron", "milk", "hot green chili peppers", "onions", "garlic", "clove", "peppercorns", "cardamom seed", "cumin seed", "poppy seed", "mace", "cilantro", "mint leaf", "fresh lemon juice", "plain yogurt", "boneless chicken", "salt", "ghee", "onion", "tomatoes", "basmati rice", "long-grain rice", "raisins", "cashews", "eggs")
## AggregatedRating ReviewCount Calories FatContent SaturatedFatContent
## 1 4.5 4 170.9 2.5 1.3
## 2 3.0 1 1110.7 58.8 16.6
## CholesterolContent SodiumContent CarbohydrateContent FiberContent
## 1 8.0 29.8 37.1 3.6
## 2 372.8 368.4 84.4 9.0
## SugarContent ProteinContent RecipeServings RecipeYield
## 1 30.2 3.2 4 <NA>
## 2 20.4 63.4 6 <NA>
## RecipeInstructions
## 1 c("Toss 2 cups berries with sugar.", "Let stand for 45 minutes, stirring occasionally.", "Transfer berry-sugar mixture to food processor.", "Add yogurt and process until smooth.", "Strain through fine sieve. Pour into baking pan (or transfer to ice cream maker and process according to manufacturers' directions). Freeze uncovered until edges are solid but centre is soft. Transfer to processor and blend until smooth again.", "Return to pan and freeze until edges are solid.", "Transfer to processor and blend until smooth again.", \n"Fold in remaining 2 cups of blueberries.", "Pour into plastic mold and freeze overnight. Let soften slightly to serve.")
## 2 c("Soak saffron in warm milk for 5 minutes and puree in blender.", "Add chiles, onions, ginger, garlic, cloves, peppercorns, cardamom seeds, cinnamon, coriander and cumin seeds, poppy seeds, nutmeg, mace, cilantro or mint leaves and lemon juice. Blend into smooth paste. Put paste into large bowl, add yogurt and mix well.", "Marinate chicken in yogurt mixture with salt, covered for at least 2 - 6 hours in refrigerator.", "In skillet. heat oil over medium heat for 1 minute. Add ghee and 15 seconds later add onion and fry for about8 minutes.", \n"Reserve for garnish.", "In same skillet, cook chicken with its marinade with tomatoes for about 10 minutes over medium heat, uncovered.", "Remove chicken pieces from the sauce and set aside. Add rice to sauce, bring to boil, and cook, covered over low heat for 15 minutes.", "Return chicken and add raisins, cashews and almonds; mix well.", "Simmer, covered for 5 minutes.", "Place chicken, eggs and rice in large serving dish in such a way that yellow of the eggs, the saffron-colored rice, the nuts and the chicken make a colorful display.", \n"Add reserved onion as garnish.")
# check the data structure
str(df)
## 'data.frame': 522517 obs. of 28 variables:
## $ RecipeId : int 38 39 40 41 42 43 44 45 46 47 ...
## $ Name : chr "Low-Fat Berry Blue Frozen Dessert" "Biryani" "Best Lemonade" "Carina's Tofu-Vegetable Kebabs" ...
## $ AuthorId : int 1533 1567 1566 1586 1538 34879 1596 1580 1533 1573 ...
## $ AuthorName : chr "Dancer" "elly9812" "Stephen Little" "Cyclopz" ...
## $ CookTime : chr "PT24H" "PT25M" "PT5M" "PT20M" ...
## $ PrepTime : chr "PT45M" "PT4H" "PT30M" "PT24H" ...
## $ TotalTime : chr "PT24H45M" "PT4H25M" "PT35M" "PT24H20M" ...
## $ DatePublished : chr "1999-08-09T21:46:00Z" "1999-08-29T13:12:00Z" "1999-09-05T19:52:00Z" "1999-09-03T14:54:00Z" ...
## $ Description : chr "Make and share this Low-Fat Berry Blue Frozen Dessert recipe from Food.com." "Make and share this Biryani recipe from Food.com." "This is from one of my first Good House Keeping cookbooks. You must use a *zester* in order to avoid getting "| __truncated__ "This dish is best prepared a day in advance to allow the ingredients to soak in the marinade overnight." ...
## $ Images : chr "c(\"https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/38/YUeirxMLQa"| __truncated__ "c(\"https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/39/picM9Mhnw."| __truncated__ "c(\"https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/40/picJ4Sz3N."| __truncated__ "c(\"https://img.sndimg.com/food/image/upload/w_555,h_416,c_fit,fl_progressive,q_95/v1/img/recipes/41/picmbLig8."| __truncated__ ...
## $ RecipeCategory : chr "Frozen Desserts" "Chicken Breast" "Beverages" "Soy/Tofu" ...
## $ Keywords : chr "c(\"Dessert\", \"Low Protein\", \"Low Cholesterol\", \"Healthy\", \"Free Of...\", \"Summer\", \"Weeknight\", \""| __truncated__ "c(\"Chicken Thigh & Leg\", \"Chicken\", \"Poultry\", \"Meat\", \"Asian\", \"Indian\", \"Weeknight\", \"Stove Top\")" "c(\"Low Protein\", \"Low Cholesterol\", \"Healthy\", \"Summer\", \"< 60 Mins\")" "c(\"Beans\", \"Vegetable\", \"Low Cholesterol\", \"Weeknight\", \"Broil/Grill\", \"Oven\")" ...
## $ RecipeIngredientQuantities: chr "c(\"4\", \"1/4\", \"1\", \"1\")" "c(\"1\", \"4\", \"2\", \"2\", \"8\", \"1/4\", \"8\", \"1/2\", \"1\", \"1\", \"1/4\", \"1/4\", \"1/2\", \"1/4\","| __truncated__ "c(\"1 1/2\", \"1\", NA, \"1 1/2\", NA, \"3/4\")" "c(\"12\", \"1\", \"2\", \"1\", \"10\", \"1\", \"3\", \"2\", \"2\", \"2\", \"1\", \"2\", \"1/2\", \"1/4\", \"4\")" ...
## $ RecipeIngredientParts : chr "c(\"blueberries\", \"granulated sugar\", \"vanilla yogurt\", \"lemon juice\")" "c(\"saffron\", \"milk\", \"hot green chili peppers\", \"onions\", \"garlic\", \"clove\", \"peppercorns\", \"car"| __truncated__ "c(\"sugar\", \"lemons, rind of\", \"lemon, zest of\", \"fresh water\", \"fresh lemon juice\")" "c(\"extra firm tofu\", \"eggplant\", \"zucchini\", \"mushrooms\", \"soy sauce\", \"low sodium soy sauce\", \"ol"| __truncated__ ...
## $ AggregatedRating : num 4.5 3 4.5 4.5 4.5 1 5 4 5 4 ...
## $ ReviewCount : int 4 1 10 2 11 1 23 3 2 2 ...
## $ Calories : num 171 1111 311 536 104 ...
## $ FatContent : num 2.5 58.8 0.2 24 0.4 19.3 66.8 7.1 0 5.6 ...
## $ SaturatedFatContent : num 1.3 16.6 0 3.8 0.1 10.9 31.9 1.7 0 1.4 ...
## $ CholesterolContent : num 8 373 0 0 0 ...
## $ SodiumContent : num 29.8 368.4 1.8 1558.6 959.3 ...
## $ CarbohydrateContent : num 37.1 84.4 81.5 64.2 25.1 58 29.1 37.5 1.1 4.5 ...
## $ FiberContent : num 3.6 9 0.4 17.3 4.8 1.8 3.1 0.5 0.2 0.6 ...
## $ SugarContent : num 30.2 20.4 77.2 32.1 17.7 42.5 5 24.7 0.2 1.6 ...
## $ ProteinContent : num 3.2 63.4 0.3 29.3 4.3 7 45.3 4.2 0.1 0.8 ...
## $ RecipeServings : int 4 6 4 2 4 8 2 8 NA NA ...
## $ RecipeYield : chr NA NA NA "4 kebabs" ...
## $ RecipeInstructions : chr "c(\"Toss 2 cups berries with sugar.\", \"Let stand for 45 minutes, stirring occasionally.\", \"Transfer berry-s"| __truncated__ "c(\"Soak saffron in warm milk for 5 minutes and puree in blender.\", \"Add chiles, onions, ginger, garlic, clov"| __truncated__ "c(\"Into a 1 quart Jar with tight fitting lid, put sugar and lemon peel, or zest; add 1 1/2 cups very hot wate"| __truncated__ "c(\"Drain the tofu, carefully squeezing out excess water, and pat dry with paper towels.\", \"Cut tofu into on"| __truncated__ ...
Our analysis focuses on the “RecipeIngredientParts” column, because each entry holds the full list of ingredients for one recipe. An entry therefore plays the same role as a basket in market basket analysis: it is one transaction, and the ingredients are its items. Treating each recipe as a transaction lets us apply association rule mining to uncover relationships among ingredients.
# select first 20000 rows
df1 <- data.frame(recipe = df$RecipeIngredientParts[1:20000])
head(df1)
## recipe
## 1 c("blueberries", "granulated sugar", "vanilla yogurt", "lemon juice")
## 2 c("saffron", "milk", "hot green chili peppers", "onions", "garlic", "clove", "peppercorns", "cardamom seed", "cumin seed", "poppy seed", "mace", "cilantro", "mint leaf", "fresh lemon juice", "plain yogurt", "boneless chicken", "salt", "ghee", "onion", "tomatoes", "basmati rice", "long-grain rice", "raisins", "cashews", "eggs")
## 3 c("sugar", "lemons, rind of", "lemon, zest of", "fresh water", "fresh lemon juice")
## 4 c("extra firm tofu", "eggplant", "zucchini", "mushrooms", "soy sauce", "low sodium soy sauce", "olive oil", "maple syrup", "honey", "red wine vinegar", "lemon juice", "garlic cloves", "mustard powder", "black pepper")
## 5 c("plain tomato juice", "cabbage", "onion", "carrots", "celery")
## 6 c("graham cracker crumbs", "sugar", "butter", "sugar", "cornstarch", "salt", "milk", "vanilla extract", "water", "gelatin", "rum", "cream of tartar", "sugar")
As the output above shows, each entry is stored as the text of an R vector, for example c("blueberries", "granulated sugar"), so it must be cleaned before it can be converted to transactional data. The cleaning step strips the leading ‘c(’, the closing parenthesis, and the double quotes from every row.
clean_ingredients <- function(entry) {
  # Drop the first three characters, i.e. the leading c(" prefix
  ingredients <- substr(entry, 4, nchar(entry))
  # Remove the remaining double quotes and parentheses
  cleaned_ingredients <- gsub('["()]', '', ingredients)
  return(cleaned_ingredients)
}
# Apply the function to each row
df1$cleaned_ingredients <- lapply(df1$recipe, clean_ingredients)
# creating new data frame with the cleaned column
new_df <- data.frame(CleanedIngredientsColumn = sapply(df1$cleaned_ingredients, paste, collapse = ", "))
head(new_df, 3)
## CleanedIngredientsColumn
## 1 blueberries, granulated sugar, vanilla yogurt, lemon juice
## 2 saffron, milk, hot green chili peppers, onions, garlic, clove, peppercorns, cardamom seed, cumin seed, poppy seed, mace, cilantro, mint leaf, fresh lemon juice, plain yogurt, boneless chicken, salt, ghee, onion, tomatoes, basmati rice, long-grain rice, raisins, cashews, eggs
## 3 sugar, lemons, rind of, lemon, zest of, fresh water, fresh lemon juice
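Row 3 above shows a side effect of this cleaning: once the quotes are gone, an ingredient such as "lemons, rind of" can no longer be told apart from two separate ingredients. A possible alternative, sketched here in base R, extracts each double-quoted string instead of deleting the quotes. The helper name parse_ingredients is hypothetical and is not part of the original analysis, whose results below are based on the cleaning function above.

```r
# Hypothetical helper (not used in the original analysis): keep every
# double-quoted string intact so commas inside an ingredient name survive.
parse_ingredients <- function(entry) {
  # Find all substrings of the form "..." in the raw text
  quoted <- regmatches(entry, gregexpr('"[^"]*"', entry))[[1]]
  # Strip the surrounding quotes and any stray whitespace
  trimws(gsub('"', "", quoted, fixed = TRUE))
}

entry <- 'c("sugar", "lemons, rind of", "lemon, zest of", "fresh water", "fresh lemon juice")'
parse_ingredients(entry)
```

Applied to the third recipe, this sketch yields five ingredients rather than the seven fragments produced by splitting on ", ". Unquoted NA entries are simply skipped.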
At this stage, our dataset is prepared for conversion into transactional format. Subsequently, we can leverage this transactional data to implement association rule mining algorithms.
# transferring data into transactional format
trans <- strsplit(new_df$CleanedIngredientsColumn, ", ")
# Convert the data to a transaction format
trans <- as(trans, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
# check first 5 transactions
inspect(trans[1:5])
## items
## [1] {blueberries,
## granulated sugar,
## lemon juice,
## vanilla yogurt}
## [2] {basmati rice,
## boneless chicken,
## cardamom seed,
## cashews,
## cilantro,
## clove,
## cumin seed,
## eggs,
## fresh lemon juice,
## garlic,
## ghee,
## hot green chili peppers,
## long-grain rice,
## mace,
## milk,
## mint leaf,
## onion,
## onions,
## peppercorns,
## plain yogurt,
## poppy seed,
## raisins,
## saffron,
## salt,
## tomatoes}
## [3] {fresh lemon juice,
## fresh water,
## lemon,
## lemons,
## rind of,
## sugar,
## zest of}
## [4] {black pepper,
## eggplant,
## extra firm tofu,
## garlic cloves,
## honey,
## lemon juice,
## low sodium soy sauce,
## maple syrup,
## mushrooms,
## mustard powder,
## olive oil,
## red wine vinegar,
## soy sauce,
## zucchini}
## [5] {cabbage,
## carrots,
## celery,
## onion,
## plain tomato juice}
Which ingredients appear most often across the 20,000 recipes?
# Plotting the Most Frequent Ingredients in Foods
itemFrequencyPlot(trans, topN = 15, type = "relative",
main = "Frequent Ingredients in Foods - Item Frequency",
col = rainbow(20))
Salt is the most common ingredient, appearing in 7,768 of the 20,000 recipes (about 39%). Butter and sugar follow in second and third place, while eggs and baking powder are used less often. Next, we use the summary function to examine the transactions in more detail.
summary(trans)
## transactions as itemMatrix in sparse format with
## 20000 rows (elements/itemsets/transactions) and
## 4254 columns (items) and a density of 0.001770346
##
## most frequent items:
## salt butter sugar onion water (Other)
## 7768 5351 4857 3770 3627 125248
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 353 696 1093 1703 2127 2401 2352 2143 1907 1560 1164 821 583 429 247 174
## 17 18 19 20 21 22 23 24 25 26 27 28 30 39
## 90 57 40 27 7 9 5 1 3 4 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 7.000 7.531 10.000 39.000
##
## includes extended item information - examples:
## labels
## 1 low-fat milk
## 2 \nyogurt
## 3 1% fat buttermilk
The dataset comprises 20,000 transactions and 4,254 distinct item labels, with a density of 0.0018, meaning that an average recipe uses about 0.18% of all available items. Salt and butter are the most frequent items, with 7,768 and 5,351 occurrences respectively. The most common transaction size is 6 ingredients (2,401 recipes), the median is 7, and the mean is 7.5. The largest transaction contains 39 ingredients and occurs only once. Some item labels, such as “\nyogurt”, are parsing artifacts, so 4,254 slightly overstates the number of distinct ingredients.
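The mean transaction size and the density reported by summary() can be reproduced from the item counts in the output above. The following is a small arithmetic check in base R; the variable names are illustrative.

```r
# Item counts copied from the "most frequent items" line of summary(trans)
item_counts <- c(salt = 7768, butter = 5351, sugar = 4857,
                 onion = 3770, water = 3627, other = 125248)
n_transactions <- 20000
n_items <- 4254

# Total number of filled cells in the sparse transaction matrix
total_entries <- sum(item_counts)
# Average number of ingredients per recipe
mean_size <- total_entries / n_transactions
# Density: filled cells divided by all possible cells
density <- total_entries / (n_transactions * n_items)
# Share of recipes that contain salt
salt_share <- item_counts[["salt"]] / n_transactions

round(c(mean_size = mean_size, density = density, salt_share = salt_share), 5)
```

The results agree with the reported mean of 7.531 and density of 0.00177, and they put salt in roughly 39% of the recipes.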
The Apriori algorithm is a classic technique for mining frequent itemsets and association rules from transactional data. It relies on the downward-closure property: every subset of a frequent itemset must itself be frequent, so any itemset with an infrequent subset can be discarded without counting it. The algorithm searches breadth-first, level by level: it builds candidate itemsets of size k from the frequent itemsets of size k-1, prunes candidates that fall below the minimum support, and finally derives rules from the surviving itemsets.
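The level-wise search can be sketched in a few lines of base R. This is a toy illustration on four made-up transactions, not the implementation used by the arules package; all names in it are illustrative.

```r
# Toy transactions (made up for illustration)
transactions <- list(
  c("salt", "butter", "flour"),
  c("salt", "butter"),
  c("salt", "sugar"),
  c("butter", "flour")
)
min_support <- 0.5

# Share of transactions that contain every item of the itemset
support <- function(itemset) {
  mean(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# TRUE if itemset s is one of the itemsets in 'sets'
contains_set <- function(sets, s) {
  any(vapply(sets, function(f) setequal(f, s), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(transactions)))
level <- Filter(function(s) support(s) >= min_support, as.list(items))
frequent <- list()

while (length(level) > 0) {
  frequent <- c(frequent, level)
  k <- length(level[[1]]) + 1
  pool <- sort(unique(unlist(level)))
  if (length(pool) < k) break
  # Candidate k-itemsets built from items that are still frequent
  candidates <- combn(pool, k, simplify = FALSE)
  # Downward closure: every (k-1)-subset must be frequent
  candidates <- Filter(function(cand) {
    all(vapply(combn(cand, k - 1, simplify = FALSE),
               function(sub) contains_set(level, sub), logical(1)))
  }, candidates)
  # Count support only for the candidates that survived pruning
  level <- Filter(function(s) support(s) >= min_support, candidates)
}

vapply(frequent, paste, character(1), collapse = ", ")
```

On this toy data the sketch finds five frequent itemsets. The candidate {butter, flour, salt} is pruned without counting its support, because its subset {flour, salt} is not frequent.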
Support: Support quantifies how often an itemset occurs in the dataset. It is the ratio of the number of transactions containing the itemset to the total number of transactions. A high support value means the itemset appears in a large share of the transactions.
Confidence: Confidence measures the reliability of an association rule. It is the ratio of the support of the antecedent and consequent combined to the support of the antecedent alone. A high confidence means that transactions containing the antecedent are very likely to contain the consequent as well.
Lift: Lift is the ratio of a rule's confidence to the support of its consequent. A lift above 1 means the antecedent and consequent occur together more often than would be expected if they were independent.
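For reference, the three measures can be written compactly as follows. The notation is an assumption introduced here: T is the set of transactions, and X and Y are disjoint itemsets forming the rule X ⇒ Y.

```latex
\begin{align}
\operatorname{supp}(X) &= \frac{\lvert \{\, t \in T : X \subseteq t \,\} \rvert}{\lvert T \rvert} \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\end{align}
```

In the arules output, the column "coverage" corresponds to supp(X), the support of the antecedent.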
# apply Apriori algorithm to discover association rules
rules <- apriori(trans, parameter = list(support = 0.006,
confidence = 0.4, minlen=2, maxlen = 15, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.006 2
## maxlen target ext
## 15 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 120
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4254 item(s), 20000 transaction(s)] done [0.08s].
## sorting and recoding items ... [196 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [1532 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# show summary of rules
summary(rules)
## set of 1532 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 165 595 550 204 18
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.553 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.00600 Min. :0.4000 Min. :0.00635 Min. : 1.030
## 1st Qu.:0.00720 1st Qu.:0.4773 1st Qu.:0.01229 1st Qu.: 1.869
## Median :0.00895 Median :0.5567 Median :0.01655 Median : 2.539
## Mean :0.01230 Mean :0.5804 Mean :0.02244 Mean : 3.165
## 3rd Qu.:0.01315 3rd Qu.:0.6794 3rd Qu.:0.02460 3rd Qu.: 3.765
## Max. :0.11970 Max. :0.9449 Max. :0.26755 Max. :30.916
## count
## Min. : 120
## 1st Qu.: 144
## Median : 179
## Mean : 246
## 3rd Qu.: 263
## Max. :2394
##
## mining info:
## data ntransactions support confidence
## trans 20000 0.006 0.4
## call
## apriori(data = trans, parameter = list(support = 0.006, confidence = 0.4, minlen = 2, maxlen = 15, target = "rules"))
The Apriori algorithm generated 1,532 rules. The median rule length is 4 and the mean is 3.55, so a typical rule involves three or four ingredients. The maximum lift is nearly 31, far above the third quartile (Q3) of about 3.8, which points to a small number of outlying rules.
inspect(head(rules, n = 10, by = "lift"))
## lhs rhs support confidence
## [1] {rind of} => {lemon} 0.00850 0.7296137
## [2] {buttermilk, sugar} => {baking soda} 0.00780 0.8571429
## [3] {buttermilk, flour} => {baking soda} 0.00640 0.8311688
## [4] {buttermilk, eggs} => {baking soda} 0.00635 0.8089172
## [5] {clove} => {cinnamon} 0.00725 0.7795699
## [6] {buttermilk, salt} => {baking soda} 0.00900 0.7627119
## [7] {cinnamon, eggs, flour, salt} => {baking soda} 0.00675 0.7258065
## [8] {cinnamon, eggs, flour, sugar} => {baking soda} 0.00665 0.7150538
## [9] {nutmeg, salt, sugar} => {cinnamon} 0.00600 0.7185629
## [10] {cinnamon, eggs, flour} => {baking soda} 0.00845 0.6897959
## coverage lift count
## [1] 0.01165 30.91584 170
## [2] 0.00910 13.59465 156
## [3] 0.00770 13.18269 128
## [4] 0.00785 12.82977 127
## [5] 0.00930 12.25739 145
## [6] 0.01180 12.09694 180
## [7] 0.00930 11.51160 135
## [8] 0.00930 11.34106 133
## [9] 0.00835 11.29816 120
## [10] 0.01225 10.94046 169
The top rule, {rind of} => {lemon}, has a confidence of 0.73 and a lift of about 31. However, “rind of” is not an ingredient: it is a fragment of entries such as “lemons, rind of” that were split at the internal comma during preprocessing, so this rule is a data-cleaning artifact rather than a culinary finding. More informative are the rules linking “buttermilk” combined with “sugar,” “flour,” “eggs,” or “salt” to “baking soda.” For example, the rule {buttermilk, sugar} => {baking soda} has a confidence of 0.8571: about 86% of the recipes that contain both buttermilk and sugar also contain baking soda.
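The quality measures in the table are tied together by simple arithmetic, which can be checked for the rule {buttermilk, sugar} => {baking soda}. The following base R sketch uses the values printed above; the variable names are illustrative.

```r
# Values copied from the inspect() output for {buttermilk, sugar} => {baking soda}
support_rule  <- 0.00780   # share of recipes with all three ingredients
coverage_lhs  <- 0.00910   # share of recipes with buttermilk and sugar
lift_reported <- 13.59465

# confidence = support(lhs and rhs) / support(lhs)
confidence <- support_rule / coverage_lhs

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
support_rhs <- confidence / lift_reported

round(c(confidence = confidence, support_rhs = support_rhs), 4)
```

The computed confidence matches the reported 0.8571, and the implied support of baking soda is about 0.063. In other words, baking soda appears in roughly 6% of all recipes but in about 86% of those containing buttermilk and sugar, which is what a lift of 13.6 expresses.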
Let’s visualize the rules with the arulesViz package.
plot(rules, measure=c("support","lift"), shading="confidence",
main="Ingredient transactions rules")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
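Most rules cluster at low support, with lift roughly between 2 and 10. According to the rule summary, three quarters of the rules have a confidence below 0.68, so the rules with confidence between 0.7 and 0.9 form a minority.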
plot(rules, method = "graph", measure = "support",
     shading = "lift")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
In the graph of the 100 rules with the highest lift, baking soda, eggs, flour, and salt sit at the center, meaning that they take part in many of the strongest rules. This combination is characteristic of baked goods such as cakes, muffins, and quick breads.
plot(rules, method = "two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
In the two-key plot, the order is the number of items in a rule, counting both the left-hand and the right-hand side. Rules of order 2 are less numerous than rules of order 3 or 4 (165 of the 1,532 rules), but they tend to have higher support.
plot(rules, method = "matrix", engine = "htmlwidget")
## Warning: Too many rules supplied. Only plotting the best 1000 using 'lift'
## (change control parameter max if needed).
Parallel coordinates plots visualize multidimensional data: each dimension gets its own position on the x-axis, all dimensions share the y-axis, and each data point is drawn as a line connecting its values across the dimensions. We plot the top 50 rules in descending order of lift.
# top 50 rules in descending order of lift
rules.top50 <- sort(rules, by="lift", decreasing=TRUE)[1:50]
plot(rules.top50, method="paracoord")
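Next, we explore how the appearance and control parameters of apriori() affect the results.
We raise the minimum support to 0.009 and first mine frequent itemsets, passing an appearance constraint for “butter”. Two caveats apply: with target = "frequent itemsets" the confidence threshold of 0.7 is not used (the parameter printout reports it as NA), and itemsets have no left- or right-hand side, so the rhs constraint only becomes meaningful once the target is switched to "rules".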
itemset <- apriori(trans, parameter=list(support=0.009,confidence=0.7,
minlen=2, maxlen=10, target="frequent itemsets"),
appearance=list(default="lhs", rhs="butter"),
control=list(sort=0, verbose=TRUE))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.009 2
## maxlen target ext
## 10 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 0 TRUE
##
## Absolute minimum support count: 180
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[4254 item(s), 20000 transaction(s)] done [0.05s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.02s].
## sorting transactions ... done [0.00s].
## writing ... [598 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(itemset)
## set of 598 itemsets
##
## most frequent items:
## salt sugar butter flour eggs (Other)
## 224 150 145 127 124 837
##
## element (itemset/transaction) length distribution:sizes
## 2 3 4 5
## 287 218 86 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.687 3.000 5.000
##
## summary of quality measures:
## support count
## Min. :0.00900 Min. : 180.0
## 1st Qu.:0.01080 1st Qu.: 216.0
## Median :0.01362 Median : 272.5
## Mean :0.01839 Mean : 367.9
## 3rd Qu.:0.01974 3rd Qu.: 394.8
## Max. :0.11970 Max. :2394.0
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support confidence
## trans 20000 0.009 1
## call
## apriori(data = trans, parameter = list(support = 0.009, confidence = 0.7, minlen = 2, maxlen = 10, target = "frequent itemsets"), appearance = list(default = "lhs", rhs = "butter"), control = list(sort = 0, verbose = TRUE))
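The Apriori algorithm found 598 frequent itemsets. Because itemsets have no right-hand side, the appearance constraint does not limit them to butter: the summary shows butter in only 145 of the 598 itemsets. The first 10 itemsets are shown below.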
inspect(head(itemset, n=10))
## items support count
## [1] {butter, lemon juice} 0.01865 373
## [2] {butter, eggs} 0.07335 1467
## [3] {butter, garlic} 0.02100 420
## [4] {butter, milk} 0.06305 1261
## [5] {butter, onion} 0.04315 863
## [6] {butter, onions} 0.01330 266
## [7] {butter, raisins} 0.00915 183
## [8] {butter, salt} 0.11970 2394
## [9] {butter, sugar} 0.08495 1699
## [10] {black pepper, butter} 0.01105 221
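To restrict a set of frequent itemsets to those that contain a particular item, one option is to filter after mining with subset() from arules. The sketch below uses four made-up transactions; the object names toy, sets, and butter_sets are illustrative and do not appear in the original analysis.

```r
library(arules)

# Four made-up transactions, converted to the arules "transactions" class
toy <- as(list(
  c("butter", "salt"),
  c("butter", "sugar", "salt"),
  c("salt", "sugar"),
  c("butter", "salt")
), "transactions")

# Mine frequent itemsets of at least two items
sets <- apriori(toy,
                parameter = list(support = 0.5, minlen = 2,
                                 target = "frequent itemsets"),
                control = list(verbose = FALSE))

# Keep only the itemsets that contain butter
butter_sets <- subset(sets, items %in% "butter")
inspect(butter_sets)
```

The same filter could in principle be applied to the itemset object above to focus on butter.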
# graph plot of the frequent itemsets
plot(itemset, method = "graph")
## Warning: Too many itemsets supplied. Only plotting the best 100 using 'support'
## (change control parameter max if needed).
Salt again emerges as the dominant ingredient, and sugar and butter also appear in many of the frequent itemsets. In contrast, pungent ingredients such as garlic, pepper, and onion occur in comparatively few of them.
# now mine rules with butter on the right-hand side by setting target = "rules"
rules_butter <- apriori(trans, parameter=list(support=0.009,confidence=0.5,
minlen=2, maxlen=15, target="rules"),
appearance=list(default="lhs", rhs="butter"),
control=list(sort=0, verbose=TRUE))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.009 2
## maxlen target ext
## 15 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 0 TRUE
##
## Absolute minimum support count: 180
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[4254 item(s), 20000 transaction(s)] done [0.05s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.02s].
## writing ... [65 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules_butter)
## set of 65 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 6 38 20 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.246 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.00900 Min. :0.5000 Min. :0.01300 Min. :1.869
## 1st Qu.:0.01060 1st Qu.:0.5160 1st Qu.:0.01890 1st Qu.:1.929
## Median :0.01305 Median :0.5318 Median :0.02345 Median :1.988
## Mean :0.01705 Mean :0.5610 Mean :0.03098 Mean :2.097
## 3rd Qu.:0.01810 3rd Qu.:0.5961 3rd Qu.:0.03220 3rd Qu.:2.228
## Max. :0.07695 Max. :0.7654 Max. :0.14885 Max. :2.861
## count
## Min. : 180.0
## 1st Qu.: 212.0
## Median : 261.0
## Mean : 341.1
## 3rd Qu.: 362.0
## Max. :1539.0
##
## mining info:
## data ntransactions support confidence
## trans 20000 0.009 0.5
## call
## apriori(data = trans, parameter = list(support = 0.009, confidence = 0.5, minlen = 2, maxlen = 15, target = "rules"), appearance = list(default = "lhs", rhs = "butter"), control = list(sort = 0, verbose = TRUE))
Apriori returns 65 rules with butter on the right-hand side. Their confidence ranges from 0.50 to 0.77 and their lift from about 1.9 to 2.9.
# graph plot of the rules with butter on the right-hand side
plot(rules_butter, method = "graph")
# interactive version of the graph plot
plot(rules_butter, method = "graph", engine = "htmlwidget")
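The Eclat (Equivalence Class Clustering and bottom-up Lattice Traversal) algorithm mines frequent itemsets from transactional data. Unlike Apriori, which works on a horizontal layout (one row per transaction), Eclat uses a vertical layout: each item is stored with the list of IDs of the transactions that contain it (its tid-list). The support of a candidate itemset is obtained by intersecting the tid-lists of its items, so the search does not need to rescan the full dataset for every candidate. Eclat returns frequent itemsets only; association rules are induced from them in a separate step.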
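To make the vertical layout concrete, here is a minimal base R sketch of tid-list intersection. The five toy transactions and the helper `itemset_support()` are hypothetical and are not part of the recipe data or of arules; the sketch illustrates the idea only, not how `eclat()` is implemented internally.

```r
# Toy transactions (hypothetical, not taken from the recipe dataset)
transactions <- list(
  c("butter", "flour", "salt"),
  c("flour", "salt", "sugar"),
  c("butter", "salt"),
  c("eggs", "flour", "salt"),
  c("butter", "eggs", "sugar")
)
n <- length(transactions)

# Vertical layout: for each item, the IDs of the transactions that contain it
items <- sort(unique(unlist(transactions)))
tid_lists <- lapply(items, function(item) {
  which(vapply(transactions, function(tr) item %in% tr, logical(1)))
})
names(tid_lists) <- items

# Support of an itemset = size of the intersection of its tid-lists / n
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tid_lists[itemset])
  length(tids) / n
}

itemset_support(c("flour", "salt"))            # flour and salt share IDs 1, 2, 4
itemset_support(c("butter", "flour", "salt"))  # only ID 1 contains all three
```

Extending an itemset by one item only requires one more intersection with an existing tid-list, which is why the vertical layout avoids repeated passes over the transactions.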
support: The minimum support an itemset needs to be considered frequent, given as a proportion between 0 and 1.
minlen: The minimum number of items per itemset. Setting it to 2 excludes single-item sets.
maxlen: The maximum number of items per itemset. Lowering it limits the search space.
target: The type of itemsets to mine: “frequent itemsets”, “maximally frequent itemsets” or “closed frequent itemsets”.
appearance: Item appearance constraints, as used in the apriori() call above (rhs = "butter"). The eclat() call below does not use them.
# apply Eclat algorithm
freq_items <- eclat(trans, parameter=list(support=0.009, minlen=2, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.009 2 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 180
##
## create itemset ...
## set transactions ...[4254 item(s), 20000 transaction(s)] done [0.07s].
## sorting and recoding items ... [138 item(s)] done [0.00s].
## creating sparse bit matrix ... [138 row(s), 20000 column(s)] done [0.00s].
## writing ... [598 set(s)] done [0.04s].
## Creating S4 object ... done [0.00s].
# show summary statistics of frequent itemset
summary(freq_items)
## set of 598 itemsets
##
## most frequent items:
## salt sugar butter flour eggs (Other)
## 224 150 145 127 124 837
##
## element (itemset/transaction) length distribution:sizes
## 2 3 4 5
## 287 218 86 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.687 3.000 5.000
##
## summary of quality measures:
## support count
## Min. :0.00900 Min. : 180.0
## 1st Qu.:0.01080 1st Qu.: 216.0
## Median :0.01362 Median : 272.5
## Mean :0.01839 Mean : 367.9
## 3rd Qu.:0.01974 3rd Qu.: 394.8
## Max. :0.11970 Max. :2394.0
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support
## trans 20000 0.009
## call
## eclat(data = trans, parameter = list(support = 0.009, minlen = 2, maxlen = 15))
Eclat identifies 598 frequent itemsets containing two to five items; pairs are the most common size (287 itemsets). Because Eclat returns itemsets rather than rules, we next induce association rules from these itemsets with ruleInduction().
# create rules from the itemsets
rules_eclat <- ruleInduction(freq_items, trans, confidence = .6)
summary(rules_eclat)
## set of 289 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 17 122 128 22
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.536 4.000 5.000
##
## summary of quality measures:
## support confidence lift itemset
## Min. :0.00900 Min. :0.6011 Min. : 1.549 Min. : 1.0
## 1st Qu.:0.01040 1st Qu.:0.6426 1st Qu.: 1.937 1st Qu.:227.0
## Median :0.01285 Median :0.6908 Median : 2.554 Median :323.0
## Mean :0.01619 Mean :0.6990 Mean : 2.805 Mean :315.3
## 3rd Qu.:0.01735 3rd Qu.:0.7506 3rd Qu.: 3.203 3rd Qu.:422.0
## Max. :0.09060 Max. :0.8511 Max. :12.097 Max. :583.0
##
## mining info:
## data ntransactions support
## trans 20000 0.009
## call
## eclat(data = trans, parameter = list(support = 0.009, minlen = 2, maxlen = 15))
## confidence
## 0.6
inspect(tail(rules_eclat))
## lhs rhs support confidence lift itemset
## [1] {flour, sugar, water} => {salt} 0.00975 0.7358491 1.894565 544
## [2] {flour, water} => {salt} 0.01870 0.6666667 1.716444 545
## [3] {butter, flour, sugar} => {salt} 0.02265 0.6088710 1.567639 550
## [4] {flour, sugar} => {salt} 0.04685 0.6697641 1.724418 551
## [5] {flour} => {salt} 0.09060 0.6086664 1.567112 554
## [6] {butter, sugar, water} => {salt} 0.00920 0.6133333 1.579128 583
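The quality measures in this table can be checked by hand. The sketch below recomputes support, confidence and lift for the rule {flour} => {salt}. The item counts for flour and salt are not printed anywhere in the output; they are back-calculated from the reported support, confidence and lift, so treat them as assumptions.

```r
# Counts for the rule {flour} => {salt}
n_total <- 20000  # transactions in the sample (reported by eclat)
n_both  <- 1812   # transactions with flour and salt: 0.09060 * 20000
n_flour <- 2977   # assumption: back-calculated as n_both / confidence
n_salt  <- 7768   # assumption: back-calculated as (confidence / lift) * n_total

# support: share of all transactions that contain both sides of the rule
support <- n_both / n_total

# confidence: share of flour transactions that also contain salt
confidence <- n_both / n_flour

# lift: confidence relative to the baseline frequency of salt
lift <- confidence / (n_salt / n_total)

round(c(support = support, confidence = confidence, lift = lift), 4)
```

Under these assumed counts the values should agree with row [5] up to rounding. A lift above 1 means salt appears more often in recipes that contain flour than in recipes overall.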
At a minimum confidence of 0.6, rule induction yields 289 rules from the frequent itemsets. The parallel coordinates plot below shows the first 50 of them.
plot(rules_eclat[1:50], method="paracoord")
The ingredients that appear most often on the right-hand side are butter, flour, eggs and salt, whereas seasonings such as garlic and onion each appear only once.
plot(rules_eclat, method = "graph")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
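The weclat algorithm (Weighted Eclat) extends Eclat to weighted transaction data. It performs weighted association rule mining (WARM): each transaction carries a weight, and the support of an itemset is computed from the weights of the transactions that contain it rather than from a simple count. In arules, weclat() reads these weights from the weight column of transactionInfo(). The recipe data has no natural weights, so random weights are assigned here for illustration.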
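The difference between counted and weighted support can be sketched in a few lines of base R. The toy transactions, the weights and the helper functions below are hypothetical. The sketch assumes that weighted support is the total weight of the transactions containing the itemset divided by the total weight of all transactions; how weclat() normalizes weights internally is not shown here.

```r
# Toy transactions with hypothetical weights (illustration only)
transactions <- list(
  c("butter", "flour", "salt"),
  c("flour", "salt", "sugar"),
  c("butter", "salt"),
  c("eggs", "flour", "salt")
)
weights <- c(10, 30, 20, 40)

# TRUE for each transaction that contains every item of the itemset
contains <- function(itemset) {
  vapply(transactions, function(tr) all(itemset %in% tr), logical(1))
}

# Unweighted support: share of transactions that contain the itemset
plain_support <- function(itemset) mean(contains(itemset))

# Weighted support: share of the total weight carried by those transactions
weighted_support <- function(itemset) {
  sum(weights[contains(itemset)]) / sum(weights)
}

plain_support(c("butter", "salt"))     # 2 of 4 transactions
weighted_support(c("butter", "salt"))  # (10 + 20) / 100
```

Here the itemset {butter, salt} occurs in half of the transactions but carries less than a third of the total weight, so a support threshold can admit or reject it depending on which measure is used.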
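# synthetic weights for illustration; 1:40 + 1:10 recycles 1:10, giving even values from 2 to 50
# no seed is set, so the weights (and the results below) vary between runs
weight <- sample(1:40 + 1:10, size = 20000, replace = TRUE)
# add weight information
transactionInfo(trans) <- data.frame(weight = weight)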
inspect(trans[1:5])
## items weight
## [1] {blueberries,
## granulated sugar,
## lemon juice,
## vanilla yogurt} 10
## [2] {basmati rice,
## boneless chicken,
## cardamom seed,
## cashews,
## cilantro,
## clove,
## cumin seed,
## eggs,
## fresh lemon juice,
## garlic,
## ghee,
## hot green chili peppers,
## long-grain rice,
## mace,
## milk,
## mint leaf,
## onion,
## onions,
## peppercorns,
## plain yogurt,
## poppy seed,
## raisins,
## saffron,
## salt,
## tomatoes} 18
## [3] {fresh lemon juice,
## fresh water,
## lemon,
## lemons,
## rind of,
## sugar,
## zest of} 38
## [4] {black pepper,
## eggplant,
## extra firm tofu,
## garlic cloves,
## honey,
## lemon juice,
## low sodium soy sauce,
## maple syrup,
## mushrooms,
## mustard powder,
## olive oil,
## red wine vinegar,
## soy sauce,
## zucchini} 10
## [5] {cabbage,
## carrots,
## celery,
## onion,
## plain tomato juice} 16
# apply weclat algorithm with same Eclat parameters
weclat_itemset <- weclat(trans,
                         parameter = list(support = 0.009, minlen = 2, maxlen = 15),
                         control = list(verbose = TRUE))
## Weighted Eclat (WEclat)
##
## parameter specification:
## support minlen maxlen target ext
## 0.009 2 15 <NA> NA
##
## algorithmic control:
## sort verbose
## NA TRUE
summary(weclat_itemset)
## set of 594 itemsets
##
## most frequent items:
## salt sugar butter eggs flour (Other)
## 216 148 142 125 123 832
##
## element (itemset/transaction) length distribution:sizes
## 2 3 4 5
## 290 216 82 6
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 3.00 2.67 3.00 5.00
##
## summary of quality measures:
## support
## Min. :0.009007
## 1st Qu.:0.010885
## Median :0.013637
## Mean :0.018440
## 3rd Qu.:0.019825
## Max. :0.118469
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support
## trans 20000 0.009
Weclat finds 594 frequent itemsets, slightly fewer than the 598 found by Eclat, because the weights shift a few itemsets across the 0.009 support threshold.
# create association rules
weclat_rules <- ruleInduction(weclat_itemset, trans, confidence = .6)
summary(weclat_rules)
## set of 277 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 17 120 122 18
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.509 4.000 5.000
##
## summary of quality measures:
## support confidence lift itemset
## Min. :0.00895 Min. :0.6011 Min. : 1.549 Min. : 1.0
## 1st Qu.:0.01080 1st Qu.:0.6453 1st Qu.: 1.937 1st Qu.:222.0
## Median :0.01305 Median :0.6891 Median : 2.554 Median :316.0
## Mean :0.01650 Mean :0.6987 Mean : 2.807 Mean :309.8
## 3rd Qu.:0.01855 3rd Qu.:0.7506 3rd Qu.: 3.203 3rd Qu.:396.0
## Max. :0.09060 Max. :0.8511 Max. :12.097 Max. :582.0
##
## mining info:
## data ntransactions support confidence
## trans 20000 0.009 0.6
inspect(weclat_rules[1:10])
## lhs rhs support confidence lift
## [1] {white pepper} => {salt} 0.00990 0.6947368 1.788715
## [2] {buttermilk} => {baking soda} 0.01135 0.6878788 10.910052
## [3] {buttermilk, salt} => {baking soda} 0.00900 0.7627119 12.096937
## [4] {baking soda, buttermilk} => {salt} 0.00900 0.7929515 2.041585
## [5] {buttermilk} => {salt} 0.01180 0.7151515 1.841276
## [6] {salt, shortening} => {flour} 0.01040 0.6246246 4.196336
## [7] {flour, shortening} => {salt} 0.01040 0.8320000 2.142122
## [8] {shortening, sugar} => {salt} 0.01005 0.8375000 2.156282
## [9] {salt, shortening} => {sugar} 0.01005 0.6036036 2.485500
## [10] {shortening} => {salt} 0.01665 0.7985612 2.056028
## itemset
## [1] 1
## [2] 3
## [3] 4
## [4] 4
## [5] 6
## [6] 13
## [7] 13
## [8] 15
## [9] 15
## [10] 16
plot(weclat_rules[1:50], method="paracoord")
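The first 50 Weclat rules show high confidence and lift among salt, eggs, cinnamon and baking soda. The maximum lift (12.097) is the same in the Eclat and Weclat rule summaries, however, so the strongest association is present in both rule sets.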
plot(weclat_rules, method = "graph")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
In summary, we applied the Apriori, Eclat and Weclat algorithms to a sample of 20,000 recipes and found meaningful patterns in ingredient combinations. Apriori produced rules directly, including 65 rules with butter as the consequent. Eclat mined 598 frequent itemsets, from which 289 rules were induced at a minimum confidence of 0.6. Weclat extended the analysis to weighted transactions, yielding 594 itemsets and 277 rules. In the Eclat and Weclat results, baking staples such as salt, sugar, butter, flour and eggs dominate the associations, which illustrates how association rule mining can uncover structure in large food-related datasets.