Overview

Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a representation of the stuff that went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

Market Basket Analysis

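library(arules)     # read.transactions(), apriori(), dissimilarity()
library(arulesViz)  # plot() methods for rules
library(magrittr)   # %>% pipe used in the plotting code (any package exporting it works)

df <- read.transactions("GroceryDataSet.csv", header=FALSE, sep=",")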
summary(df)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Whole milk is the most frequently bought item (2,513 transactions), followed by other vegetables (1,903).

Top 20 items by frequency.
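A chart of this kind can be drawn with itemFrequencyPlot from arules. The following is a minimal sketch, not necessarily the call used for this report. It assumes the Groceries data set bundled with arules as a stand-in for GroceryDataSet.csv, and the names freq and top20 are introduced only for illustration.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries")

# Absolute item counts, sorted so the most frequent items come first
freq  <- itemFrequency(Groceries, type = "absolute")
top20 <- sort(freq, decreasing = TRUE)[1:20]
print(top20)

# Bar chart of the 20 most frequent items
itemFrequencyPlot(Groceries, topN = 20, type = "absolute",
                  main = "Top 20 items by frequency")
```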

Association Rules

The apriori function of the arules package is used to generate the association rules.

The parameter support is “defined as the proportion of transactions in the data set which contain the item set. For example the item set {milk, bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions).”

The parameter confidence sets the minimum confidence a rule must reach; the model below uses 0.3. For a rule such as {milk, bread} => {butter}, a confidence of 0.3 means that 30% of the transactions containing milk and bread also contain butter.
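Support and confidence can be worked out by hand on a small example. The sketch below uses five made-up baskets (hypothetical data, chosen so that {milk, bread} occurs in 2 of 5 transactions as in the quoted definition) and a helper function support() that is written here for illustration and is not part of arules.

```r
# Hypothetical toy data: five transactions
baskets <- list(
  c("milk", "bread"),
  c("butter"),
  c("beer", "diapers"),
  c("milk", "bread", "butter"),
  c("bread")
)

# Illustrative helper: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# support({milk, bread}) = 2 / 5
supp_milk_bread <- support(c("milk", "bread"), baskets)

# confidence({milk, bread} => {butter})
#   = support({milk, bread, butter}) / support({milk, bread})
conf_rule <- support(c("milk", "bread", "butter"), baskets) / supp_milk_bread

print(supp_milk_bread)
print(conf_rule)
```

In this toy data, one of the two baskets with milk and bread also contains butter, so the rule's confidence is one half.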

Use minlen = 2 so that the LHS (antecedent) is not empty. By default, apriori has a minlen value of 1, which allows rules with an empty LHS.
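The effect of minlen can be seen with a small experiment. This is a sketch under two assumptions: the Groceries data set bundled with arules stands in for GroceryDataSet.csv, and the confidence threshold is lowered to 0.25 purely for illustration, because whole milk appears in 2513 / 9835 (about 25.6%) of transactions and an empty-LHS rule only passes when the item's own frequency reaches the threshold.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries")

# minlen = 1 permits rules of the form {} => {item}
rules_len1 <- apriori(Groceries,
                      parameter = list(support = 0.02, confidence = 0.25,
                                       minlen = 1),
                      control = list(verbose = FALSE))

# minlen = 2 forces at least one item on each side of the rule
rules_len2 <- apriori(Groceries,
                      parameter = list(support = 0.02, confidence = 0.25,
                                       minlen = 2),
                      control = list(verbose = FALSE))

# Rules whose antecedent is empty
empty_lhs <- rules_len1[size(lhs(rules_len1)) == 0]
inspect(empty_lhs)
```

A rule with an empty LHS only restates how common the RHS item is, which is why it is filtered out in the analysis.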

The association rules need to meet the minimum support and confidence values. The model below, with a support of 0.02 and a confidence of 0.3, generated 37 association rules.

basket_model <- apriori(df, parameter = list(support = 0.02, confidence = 0.3 , minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.02      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 196 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Summary of model

Below is a summary of the base model, with a support value of 0.02 and a confidence value of 0.3, applied to the data set.

summary(basket_model)
## set of 37 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 32  5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.135   2.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02003   Min.   :0.3079   Min.   :0.04342   Min.   :1.205  
##  1st Qu.:0.02318   1st Qu.:0.3483   1st Qu.:0.05765   1st Qu.:1.514  
##  Median :0.02755   Median :0.3868   Median :0.07229   Median :1.756  
##  Mean   :0.03134   Mean   :0.3915   Mean   :0.08227   Mean   :1.730  
##  3rd Qu.:0.03325   3rd Qu.:0.4249   3rd Qu.:0.09395   3rd Qu.:1.915  
##  Max.   :0.07483   Max.   :0.5129   Max.   :0.19349   Max.   :2.842  
##      count      
##  Min.   :197.0  
##  1st Qu.:228.0  
##  Median :271.0  
##  Mean   :308.3  
##  3rd Qu.:327.0  
##  Max.   :736.0  
## 
## mining info:
##  data ntransactions support confidence
##    df          9835    0.02        0.3
##                                                                                call
##  apriori(data = df, parameter = list(support = 0.02, confidence = 0.3, minlen = 2))

Rules sorted by Lift

When there are too many association rules that satisfy the support and confidence constraints, lift can be used to further filter or rank the rules. Greater lift values indicate a stronger association.

The strongest association, with the highest lift value (2.84), is {other vegetables, whole milk} => {root vegetables}. At the bottom of the top 10 is {other vegetables, root vegetables} => {whole milk}, with a lift of 1.91.

inspect(sort(basket_model, by="lift")[1:10])
##      lhs                                    rhs                support   
## [1]  {other vegetables, whole milk}      => {root vegetables}  0.02318251
## [2]  {root vegetables, whole milk}       => {other vegetables} 0.02318251
## [3]  {root vegetables}                   => {other vegetables} 0.04738180
## [4]  {whipped/sour cream}                => {other vegetables} 0.02887646
## [5]  {whole milk, yogurt}                => {other vegetables} 0.02226741
## [6]  {other vegetables, yogurt}          => {whole milk}       0.02226741
## [7]  {butter}                            => {whole milk}       0.02755465
## [8]  {pork}                              => {other vegetables} 0.02165735
## [9]  {curd}                              => {whole milk}       0.02613116
## [10] {other vegetables, root vegetables} => {whole milk}       0.02318251
##      confidence coverage   lift     count
## [1]  0.3097826  0.07483477 2.842082 228  
## [2]  0.4740125  0.04890696 2.449770 228  
## [3]  0.4347015  0.10899847 2.246605 466  
## [4]  0.4028369  0.07168277 2.081924 284  
## [5]  0.3974592  0.05602440 2.054131 219  
## [6]  0.5128806  0.04341637 2.007235 219  
## [7]  0.4972477  0.05541434 1.946053 271  
## [8]  0.3756614  0.05765125 1.941476 213  
## [9]  0.4904580  0.05327911 1.919481 257  
## [10] 0.4892704  0.04738180 1.914833 228
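The lift values in the table can be checked by hand, since lift is the rule's confidence divided by the support of its RHS. The sketch below does this for rule [3], {root vegetables} => {other vegetables}, using only numbers printed above; the variable names are introduced here for illustration.

```r
# Values taken from the output above
n_transactions   <- 9835       # number of transactions, from summary(df)
count_other_veg  <- 1903       # "other vegetables" count, from summary(df)
confidence_rule3 <- 0.4347015  # confidence of {root vegetables} => {other vegetables}

# Support of the RHS: share of all transactions containing "other vegetables"
support_rhs <- count_other_veg / n_transactions

# lift = confidence(LHS => RHS) / support(RHS)
lift_rule3 <- confidence_rule3 / support_rhs
print(lift_rule3)
```

The result agrees with the 2.246605 reported by inspect() up to rounding: customers who buy root vegetables are a little more than twice as likely to buy other vegetables as a customer picked at random.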

Graph of Rules

Plot of the 37 rules.

plot(basket_model, jitter = 10)

Plot of the top 10 rules by confidence using the paracoord method.

head(basket_model, n = 10, by = "confidence") %>%  
    plot(method = "paracoord")

Graph method for the top 10 rules by confidence

head(basket_model, n = 10, by = "confidence") %>% 
  plot(method = "graph", engine = "htmlwidget")

Cluster analysis

Hierarchical cluster analysis is used to cluster the items (the dissimilarity call below uses which = "items"), restricted to items that appear in more than 5% of transactions. Hierarchical clustering produces dendrograms, and the hclust function supports several linkage methods. We will look at the complete method and the Ward method.
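The idea behind clustering items can be illustrated in base R on a tiny, made-up transaction matrix. This sketch assumes Jaccard distance, which is what dist(method = "binary") computes and which, to my understanding, is also the default measure of arules::dissimilarity; the matrix m and its item names are hypothetical.

```r
# Hypothetical data: 5 transactions (rows) x 4 items (columns), 1 = item bought
m <- matrix(c(1, 1, 0, 0,
              1, 1, 0, 0,
              1, 0, 1, 0,
              0, 0, 1, 1,
              0, 0, 1, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(NULL, c("milk", "bread", "beer", "chips")))

# Jaccard distance between items: transpose so that items become rows
d <- dist(t(m), method = "binary")
print(round(d, 2))

# Complete-linkage clustering, then cut the tree into two groups
hc <- hclust(d, method = "complete")
groups <- stats::cutree(hc, k = 2)
print(groups)
```

Items that tend to appear in the same baskets (milk with bread, beer with chips) have a small distance and end up in the same cluster.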

Complete Method

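# keep only items that appear in more than 5% of transactions
trans2 <- df[ , itemFrequency(df) > 0.05]
# pairwise dissimilarity between items
glist <- dissimilarity(trans2, which = "items")
# hierarchical clustering with complete linkage
hc1 <- hclust(glist, method = "complete" )

# plot dendrogram
plot(hc1, cex = 0.8, hang = 2)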

Ward Method

The dendrogram is separated into 7 clusters, each outlined with a border.

hc2 <- hclust(glist, method = "ward.D2" )

plot(hc2, cex = 0.7)
rect.hclust(hc2, k = 7, border = 2:5)
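To see which items fall into each of the 7 clusters, the tree can be cut with cutree. This is a minimal sketch, assuming the Groceries data set bundled with arules as a stand-in for GroceryDataSet.csv; the names d, hc_ward, groups and clusters are introduced here for illustration.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries")

# Same preparation as above: frequent items only, item-to-item dissimilarity
trans2  <- Groceries[, itemFrequency(Groceries) > 0.05]
d       <- dissimilarity(trans2, which = "items")
hc_ward <- hclust(d, method = "ward.D2")

# Cut the tree into 7 groups; stats:: avoids the cutree masked by dendextend
groups <- stats::cutree(hc_ward, k = 7)

# List the items in each cluster (cutree keeps the original item order)
clusters <- split(itemLabels(trans2), groups)
print(clusters)
```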

Tanglegram compares the two dendrograms.

library(dendextend)
# Compute 2 hierarchical clusterings
hc3 <- hclust(glist, method = "complete")
hc4 <- hclust(glist, method = "ward.D2")

# Create two dendrograms
dend1 <- as.dendrogram (hc3)
dend2 <- as.dendrogram (hc4)

tanglegram(dend1, dend2)
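The visual comparison can be complemented with numbers. The sketch below uses two dendextend functions, entanglement (0 means the leaf orders line up perfectly, 1 means fully tangled) and cor_cophenetic (correlation between the two trees' cophenetic distances). It assumes the Groceries data set bundled with arules as a stand-in for GroceryDataSet.csv and is an optional addition, not part of the original analysis.

```r
library(arules)
suppressPackageStartupMessages(library(dendextend))

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries")
trans2 <- Groceries[, itemFrequency(Groceries) > 0.05]
d      <- dissimilarity(trans2, which = "items")

# The two trees compared in the tanglegram
dend_complete <- as.dendrogram(hclust(d, method = "complete"))
dend_ward     <- as.dendrogram(hclust(d, method = "ward.D2"))

# Alignment quality of the leaf orders: between 0 and 1, lower is better
ent <- entanglement(dend_complete, dend_ward)

# Agreement of the merge heights: correlation between -1 and 1
coph <- cor_cophenetic(dend_complete, dend_ward)

print(c(entanglement = ent, cophenetic_correlation = coph))
```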