In this notebook I will explore the basic association rule and frequent itemset mining methods in R.

Loading the necessary libraries

library(glue)
library(arules)
library(arulesViz)

Loading the dataset and examining what it looks like

df=read.transactions('dataset.csv',
                     format='single',
                     header=T,
                     cols=c("Member_number","itemDescription"),
                     sep=",")
print(as(df,"data.frame")[1:3,"items"])
## [1] "{canned beer,hygiene articles,misc. beverages,pastry,pickled vegetables,salty snack,sausage,semi-finished bread,soda,whole milk,yogurt}"
## [2] "{beef,curd,frankfurter,rolls/buns,sausage,soda,whipped/sour cream,white bread,whole milk}"                                              
## [3] "{butter,butter milk,frozen vegetables,other vegetables,specialty chocolate,sugar,tropical fruit,whole milk}"

The dataset consists of transactions. Each transaction represents a single basket: all of the items bought by one customer (grouped by Member_number).

How many items have been a part of each transaction?

# first pass: compute the counts without plotting
h=hist(size(df),
       breaks=6,
       plot=F)

# bar labels as percentages of all transactions
labels=round(h$counts/nrow(df)*100,1)

hist(size(df),
     col='lightgreen',
     breaks=6,
     main="Histogram of basket sizes",
     sub=glue("max = {max(size(df))}, min = {min(size(df))}"),
     ylim=c(0,2000),
     ylab="Counts",
     xlim=c(0,30),
     xlab="Basket size",
     labels = paste0(labels,"%"),
     yaxt='n'
     )

axis(side=2,at=seq(0,2000,1000),labels=seq(0,2000,1000))

How many distinct items are there?

length(itemFrequency(df))
## [1] 167

How often has each item been bought?

# sort items by relative frequency, then print one "item: share%" pair per line
freq=sort(itemFrequency(df),decreasing=T)
rounded_freq=sprintf("%s: %.1f%%",
                     names(freq),
                     100*freq)
cat("",paste("",rounded_freq,"\n"))
##   whole milk: 45.8% 
##   other vegetables: 37.7% 
##   rolls/buns: 35.0% 
##   soda: 31.3% 
##   yogurt: 28.3% 
##   tropical fruit: 23.4% 
##   root vegetables: 23.1% 
##   bottled water: 21.4% 
##   sausage: 20.6% 
##   citrus fruit: 18.5% 
##   pastry: 17.8% 
##   pip fruit: 17.1% 
##   shopping bags: 16.8% 
##   canned beer: 16.5% 
##   bottled beer: 15.9% 
##   whipped/sour cream: 15.5% 
##   newspapers: 14.0% 
##   frankfurter: 13.8% 
##   brown bread: 13.6% 
##   domestic eggs: 13.3% 
##   pork: 13.2% 
##   butter: 12.6% 
##   fruit/vegetable juice: 12.5% 
##   curd: 12.1% 
##   beef: 12.0% 
##   margarine: 11.7% 
##   coffee: 11.5% 
##   frozen vegetables: 10.3% 
##   chicken: 10.1% 
##   white bread: 8.9% 
##   cream cheese : 8.9% 
##   chocolate: 8.6% 
##   dessert: 8.6% 
##   napkins: 8.1% 
##   hamburger meat: 8.0% 
##   berries: 8.0% 
##   UHT-milk: 7.9% 
##   onions: 7.6% 
##   salty snack: 6.9% 
##   waffles: 6.9% 
##   sugar: 6.6% 
##   long life bakery product: 6.5% 
##   butter milk: 6.5% 
##   meat: 6.4% 
##   ham: 6.3% 
##   frozen meals: 6.3% 
##   beverages: 6.2% 
##   misc. beverages: 5.9% 
##   specialty chocolate: 5.8% 
##   ice cream: 5.6% 
##   oil: 5.6% 
##   grapes: 5.5% 
##   candy: 5.4% 
##   hard cheese: 5.3% 
##   specialty bar: 5.3% 
##   hygiene articles: 5.2% 
##   sliced cheese: 5.2% 
##   chewing gum: 4.5% 
##   white wine: 4.4% 
##   cat food: 4.4% 
##   red/blush wine: 4.0% 
##   herbs: 3.9% 
##   processed cheese: 3.8% 
##   soft cheese: 3.8% 
##   flour: 3.6% 
##   semi-finished bread: 3.6% 
##   dishes: 3.4% 
##   pickled vegetables: 3.3% 
##   detergent: 3.3% 
##   packaged fruit/vegetables: 3.2% 
##   baking powder: 3.1% 
##   pasta: 3.0% 
##   pot plants: 3.0% 
##   canned fish: 3.0% 
##   liquor: 2.6% 
##   frozen fish: 2.6% 
##   seasonal products: 2.6% 
##   spread cheese: 2.5% 
##   condensed milk: 2.4% 
##   mustard: 2.3% 
##   frozen dessert: 2.3% 
##   cake bar: 2.3% 
##   salt: 2.3% 
##   pet care: 2.2% 
##   canned vegetables: 2.1% 
##   roll products : 2.1% 
##   turkey: 2.0% 
##   photo/film: 2.0% 
##   mayonnaise: 1.9% 
##   cling film/bags: 1.9% 
##   dish cleaner: 1.9% 
##   frozen potato products: 1.8% 
##   specialty cheese: 1.8% 
##   sweet spreads: 1.7% 
##   dog food: 1.7% 
##   flower (seeds): 1.7% 
##   liquor (appetizer): 1.7% 
##   candles: 1.7% 
##   finished products: 1.6% 
##   chocolate marshmallow: 1.5% 
##   Instant food products: 1.5% 
##   zwieback: 1.5% 
##   instant coffee: 1.5% 
##   vinegar: 1.3% 
##   rice: 1.3% 
##   liver loaf: 1.2% 
##   soups: 1.2% 
##   popcorn: 1.2% 
##   curd cheese: 1.2% 
##   sparkling wine: 1.2% 
##   house keeping products: 1.2% 
##   sauces: 1.1% 
##   cereals: 1.1% 
##   softener: 1.1% 
##   female sanitary products: 1.0% 
##   spices: 1.0% 
##   brandy: 1.0% 
##   male cosmetics: 0.9% 
##   meat spreads: 0.9% 
##   jam: 0.9% 
##   dental care: 0.8% 
##   nuts/prunes: 0.8% 
##   ketchup: 0.8% 
##   rum: 0.8% 
##   cleaner: 0.8% 
##   kitchen towels: 0.8% 
##   artif. sweetener: 0.7% 
##   fish: 0.7% 
##   specialty fat: 0.7% 
##   light bulbs: 0.7% 
##   snack products: 0.7% 
##   tea: 0.7% 
##   abrasive cleaner: 0.6% 
##   nut snack: 0.6% 
##   organic sausage: 0.6% 
##   potato products: 0.6% 
##   tidbits: 0.6% 
##   canned fruit: 0.5% 
##   syrup: 0.5% 
##   skin care: 0.5% 
##   soap: 0.5% 
##   prosecco: 0.5% 
##   bathroom cleaner: 0.4% 
##   cookware: 0.4% 
##   cocoa drinks: 0.4% 
##   flower soil/fertilizer: 0.4% 
##   pudding powder: 0.4% 
##   cooking chocolate: 0.4% 
##   ready soups: 0.4% 
##   honey: 0.3% 
##   cream: 0.3% 
##   frozen fruits: 0.3% 
##   specialty vegetables: 0.3% 
##   organic products: 0.3% 
##   decalcifier: 0.2% 
##   hair spray: 0.2% 
##   liqueur: 0.2% 
##   whisky: 0.2% 
##   salad dressing: 0.2% 
##   frozen chicken: 0.1% 
##   make up remover: 0.1% 
##   rubbing alcohol: 0.1% 
##   toilet cleaner: 0.1% 
##   bags: 0.1% 
##   baby cosmetics: 0.1% 
##   kitchen utensil: 0.0% 
##   preservation products: 0.0%

Which items have been bought most frequently?

top10=100*sort(itemFrequency(df, type="relative"), decreasing=T)[1:10]
par(mar=c(5, 8, 4, 2))
barplot(top10,
        horiz=T,
        main="10 most frequently bought items",
        xlab="Frequency [%]",
        xlim=c(0, 60),
        las=2,
        col='lightgreen',
        xaxt='n')
axis(1,las=1)

Frequent itemsets

How do we determine whether a given itemset is frequent? To assess this, we calculate its SUPPORT, which is the fraction of all transactions that contain the itemset. It can also be interpreted as the probability that a transaction drawn at random from the dataset contains the itemset.
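To make the definition concrete, here is a quick hand computation (a sketch; %ain% is the arules operator matching transactions that contain all of the listed items, and the item pair is chosen purely for illustration):

# relative support of {other vegetables, whole milk}:
# the fraction of all baskets containing both items
mean(df %ain% c("other vegetables","whole milk"))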

In order to mine these itemsets, I will use the ECLAT algorithm, which is explained in great detail in this paper. I’m using ECLAT instead of the apriori algorithm because it is much less computationally expensive. I’ll mine the most frequent itemsets of sizes ranging from 2 to 5.

frequent_itemsets=eclat(df,
                        parameter=list(support=0.01, minlen=2, maxlen=5))

Here are the most frequent itemsets of each size. Each larger itemset turns out to be a superset of its smaller predecessor.

for (i in 2:5)
{
  inspect(sort(subset(frequent_itemsets,size(frequent_itemsets)==i), by='support')[1])
}
##     items                          support   count
## [1] {other vegetables, whole milk} 0.1913802 746  
##     items                                      support    count
## [1] {other vegetables, rolls/buns, whole milk} 0.08209338 320  
##     items                                              support   count
## [1] {other vegetables, rolls/buns, whole milk, yogurt} 0.0343766 134  
##     items                  support count
## [1] {other vegetables,                  
##      rolls/buns,                        
##      sausage,                           
##      whole milk,                        
##      yogurt}            0.01359672    53

Now I’ll begin mining the association rules, which essentially say which itemsets are likely to be included with a given itemset in the same basket. The algorithm in use here is the popular apriori, which is showcased in this paper. It mines association rules of a specified length, support and confidence. Confidence is the ratio of the number of baskets containing both the LHS and RHS itemsets to the number of baskets containing just the LHS itemset. For a rule A => B, it is essentially the probability of finding itemset B in a basket, given that the basket is already known to contain itemset A.
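As a quick illustration of the definition (a sketch; the candidate rule {other vegetables} => {whole milk} is picked arbitrarily), confidence can be computed by hand before running apriori:

# confidence(A => B) = supp(A and B) / supp(A)
n_lhs=sum(df %ain% "other vegetables")                   # baskets containing the LHS
n_both=sum(df %ain% c("other vegetables","whole milk"))  # baskets containing LHS and RHS
n_both/n_lhs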

rules=apriori(df,
              parameter=list(support=0.1, 
                             confidence=0.5, 
                             minlen=2))

With such rigorous support and confidence requirements the algorithm mined only 5 simple association rules. Besides support and confidence, each rule can also be characterized by its lift and coverage.

Lift is the support of the combined itemset divided by the product of the LHS and RHS support values. Lift values greater than 1 indicate that the LHS and RHS are bought together more often than would be expected if they were independent.

Coverage is simply the support of the LHS itemset. It represents how often the LHS itemset occurs in the dataset.
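Both definitions can be checked directly on the mined rules (a sketch; coverage() and support() are arules functions that recompute these measures from the rule set and transactions):

# lift(A => B) should equal supp(A and B)/(supp(A)*supp(B))
q=quality(rules)
lhs_supp=coverage(rules)          # supp(LHS), i.e. the coverage measure
rhs_supp=support(rhs(rules), df)  # supp(RHS)
all.equal(q$lift, q$support/(lhs_supp*rhs_supp))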

inspectDT(rules)

Here is a way of visualizing mined association rules on a graph. Arrows are pointing from the LHS itemset to the RHS itemset. The rule is represented by a dot between the arrows.

The graph shows that all generated association rules are about items being bought together with whole milk.

plot(rules, method="graph")

Now I’ll mine longer association rules (involving at least 3 items in total). To do so I had to lower the support and confidence requirements and specify a minimum length.

more_rules=apriori(df,
                   parameter=list(support=0.025, confidence=0.25, minlen=3))

With that many rules created, there is a need to check whether some of them are there purely by chance. Fisher’s exact test based on contingency tables is a quick and easy way of doing this. How the test works is described in great detail in this lecture.
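The count below was likely produced along these lines (a sketch; is.significant() runs Fisher’s exact test for each rule):

insignificant=!is.significant(more_rules, df, alpha=0.05, adjust='none')
print(glue("There are {sum(insignificant)} insignificant rules."))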

## There are 35 insignificant rules.

Removing insignificant rules.

more_rules=more_rules[is.significant(more_rules, 
                                     df, 
                                     alpha=0.05, 
                                     adjust='none')]

Now I need to check whether the created rules are in fact maximal, i.e. that none of them is a subset of another, longer rule in the set.
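The count below was presumably obtained along these lines (a sketch using is.maximal()):

not_maximal=!is.maximal(more_rules)
print(glue("{sum(not_maximal)} sets were not maximal."))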

## 36 sets were not maximal.

Removing rules which are not maximal.

more_rules=more_rules[is.maximal(more_rules)]

After removing all of the unnecessary rules, I can move on to displaying the remaining ones.

inspectDT(more_rules)

Now we can see what the relationships among more than 2 items look like on a graph. Here are the top 10 rules sorted by their confidence level.
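The plot was presumably generated along these lines (a sketch; sort() and head() are arules methods for rule sets):

plot(head(sort(more_rules, by='confidence'), 10), method="graph")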

Yet again, most of the rules here are in some way connected to whole milk. This has to do with the fact that whole milk was part of over 45% of the transactions.