Market basket analysis

Association rules

Association rule mining is a machine learning technique that helps uncover relationships between items in a database. One example of its application is market basket analysis, a data mining procedure used by retailers to discover relationships between the items people buy, identify customer purchasing patterns, and then use those patterns to increase sales.

Libraries

To implement the market basket analysis algorithms, the arules and arulesViz packages were used. These packages provide an environment for representing, manipulating, measuring, visualizing and analyzing transaction data using association rules.

Dataset

The Groceries dataset used in this paper is a built-in dataset from the arules package. It consists of 30 days of real-world transaction data from a local grocery outlet.

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)
data("Groceries")
transactions = Groceries
head(Groceries)

## transactions in sparse format with
##  6 transactions (rows) and
##  169 items (columns)

From the summary below we can derive the following information:

the dataset consists of 9835 transactions, with the items aggregated into 169 categories
the most frequently purchased grocery products are whole milk, other vegetables, rolls/buns, soda and yogurt
an average transaction contains 4.409 products
the minimum number of items in a transaction is 1, and there are 2159 transactions of this size
the maximum number of items in a transaction is 32, and there is only one transaction of this size
2.6% of the cells in the matrix are non-zero

summary(transactions)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
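
The density and the mean basket size reported in this summary can be cross-checked from the item counts shown in the same output. The snippet below is only a sanity check in base R. The variable names are illustrative, and the numbers are copied by hand from the summary rather than read from the transactions object.

```r
# Values copied from the summary(transactions) output above
n_transactions <- 9835
n_items <- 169
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five items + (Other)

# Every item occurrence is one non-zero cell of the sparse matrix
total_occurrences <- sum(item_counts)

density <- total_occurrences / (n_transactions * n_items)
mean_basket_size <- total_occurrences / n_transactions

round(density, 4)           # share of non-zero cells, about 2.6%
round(mean_basket_size, 3)  # average number of items per transaction
```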

LIST(head(transactions))

## [[1]]
## [1] "citrus fruit"        "semi-finished bread" "margarine"          
## [4] "ready soups"        
## 
## [[2]]
## [1] "tropical fruit" "yogurt"         "coffee"        
## 
## [[3]]
## [1] "whole milk"
## 
## [[4]]
## [1] "pip fruit"     "yogurt"        "cream cheese " "meat spreads" 
## 
## [[5]]
## [1] "other vegetables"         "whole milk"              
## [3] "condensed milk"           "long life bakery product"
## 
## [[6]]
## [1] "whole milk"       "butter"           "yogurt"           "rice"            
## [5] "abrasive cleaner"

inspect(head(transactions))

##     items                     
## [1] {citrus fruit,            
##      semi-finished bread,     
##      margarine,               
##      ready soups}             
## [2] {tropical fruit,          
##      yogurt,                  
##      coffee}                  
## [3] {whole milk}              
## [4] {pip fruit,               
##      yogurt,                  
##      cream cheese ,           
##      meat spreads}            
## [5] {other vegetables,        
##      whole milk,              
##      condensed milk,          
##      long life bakery product}
## [6] {whole milk,              
##      butter,                  
##      yogurt,                  
##      rice,                    
##      abrasive cleaner}

size(head(transactions))

## [1] 4 3 1 4 4 5

length(head(transactions))

## [1] 6

Item frequency

The plot below shows the 25 items that appear most frequently in this dataset. The five most frequent items are, in order, whole milk, other vegetables, rolls/buns, soda and yogurt.

itemFrequencyPlot(
  transactions,
  topN = 25,
  type = "absolute",
  main = "Item frequency",
  cex.names = 0.85
)

Sparsity of the data

The plot below shows how sparse the matrix is for the first 10 transactions.

image(transactions[1:10])

The plot below shows how sparse the matrix is for a random sample of 80 transactions.

image(sample(transactions, 80))

Data statistics

One-dimensional tables

The first table below shows the fraction of all transactions in which a given product occurs. The second shows the corresponding absolute counts.

head(round(itemFrequency(transactions),3))

##       frankfurter           sausage        liver loaf               ham 
##             0.059             0.094             0.005             0.026 
##              meat finished products 
##             0.026             0.007

head(itemFrequency(transactions, type="absolute"))

##       frankfurter           sausage        liver loaf               ham 
##               580               924                50               256 
##              meat finished products 
##               254                64

Two-dimensional tables

The symmetric n x n matrices below contain co-occurrence measures for pairs of items.

The count measure indicates the number of transactions in which both items occur together.

cctab<-crossTable(transactions, measure="count", sort=TRUE) 
head(round(cctab,2))

The support measure shows how frequently a pair of items appears together, as a fraction of the total number of transactions.

stab<-crossTable(transactions, measure="support", sort=TRUE) 
head(round(stab, 3))

The lift measure shows how much more often two given products are bought together than would be expected if they were bought independently.

ltab<-crossTable(transactions, measure="lift", sort=TRUE) 
head(ltab)
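
To make the three pairwise measures concrete, the sketch below rebuilds them in base R for a handful of made-up baskets. The toy transactions and all variable names are hypothetical, and this is not how crossTable is implemented internally. It only illustrates how the count, support and lift tables relate to each other.

```r
# Hypothetical toy transactions (not taken from the Groceries data)
toy <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter", "jam")
)
items <- sort(unique(unlist(toy)))

# Binary incidence matrix: one row per transaction, one column per item
incidence <- t(vapply(
  toy,
  function(basket) as.numeric(items %in% basket),
  numeric(length(items))
))
colnames(incidence) <- items

# Co-occurrence counts; the diagonal holds the single-item counts
count_tab <- crossprod(incidence)

# Support: counts as a fraction of all transactions
support_tab <- count_tab / length(toy)

# Lift: joint support divided by the product of the single-item supports
item_support <- diag(support_tab)
lift_tab <- support_tab / outer(item_support, item_support)

count_tab
round(lift_tab, 3)
```

A lift below 1 for a pair means that the two items co-occur less often in the toy data than independence would suggest, while a value above 1 means they co-occur more often.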

Apriori algorithm

The Apriori algorithm is used to determine association rules between items. It identifies frequently occurring sets of items and generates rules based on them. It first finds the frequent single items in the database and then extends them with further items as long as the resulting sets appear together sufficiently often in the database.
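
The level-wise search described above can be sketched in a few lines of base R. This is a simplified illustration and not the arules implementation. The toy transactions and the helper names support_of and apriori_sketch are hypothetical, and the candidate generation leaves out the full subset pruning of the complete algorithm.

```r
# Hypothetical toy transactions (not taken from the Groceries data)
toy <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter", "jam")
)

# Fraction of transactions that contain every item of an itemset
support_of <- function(itemset, transactions) {
  mean(vapply(transactions, function(basket) all(itemset %in% basket), logical(1)))
}

apriori_sketch <- function(transactions, min_support) {
  items <- sort(unique(unlist(transactions)))
  is_frequent <- function(s) support_of(s, transactions) >= min_support

  # Level 1: frequent single items
  current <- Filter(is_frequent, as.list(items))
  frequent_items <- unlist(current)
  result <- current

  # Level k: extend each frequent (k-1)-itemset by one frequent item.
  # Only items sorted after the last one are added, so each set appears once.
  while (length(current) > 0) {
    candidates <- list()
    for (s in current) {
      for (i in frequent_items[frequent_items > max(s)]) {
        candidates[[length(candidates) + 1]] <- c(s, i)
      }
    }
    current <- Filter(is_frequent, candidates)
    result <- c(result, current)
  }
  result
}

frequent <- apriori_sketch(toy, min_support = 0.5)
vapply(frequent, paste, character(1), collapse = ", ")
```

With a minimum support of 0.5, the three common single items and their three pairs are kept, while the triple and every set containing jam fall below the threshold and are not extended further.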

Creating rules

The thresholds of the support and confidence parameters were set to 0.006 and 0.25 respectively, so the algorithm returned only rules with a support of at least 0.6% and a confidence of at least 25%. Setting minlen = 2 excludes rules with an empty left-hand side.

rules.transactions <- apriori(
  transactions,
  parameter = list(support = 0.006, confidence = 0.25, minlen = 2)
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Sorting rules by confidence

The interpretation of the first row:

butter, whipped/sour cream and whole milk appear together in about 0.67% of transactions
the conditional probability that a transaction contains whole milk, given that it also contains butter and whipped/sour cream, is 0.66
the items appear together in transactions at 2.58 times the rate we would expect if the itemsets {butter, whipped/sour cream} and {whole milk} were independent. Lift describes how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. A lift of 1 would imply no association between the items.
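
The three quality measures of this first rule can be reproduced by hand from counts that appear in this document. The snippet below is a small base R check with illustrative variable names. The count of 100 transactions for the left-hand side is an inference from the reported coverage (0.010167768 x 9835), not a value printed in the output.

```r
# Counts taken from the outputs in this document
n_transactions <- 9835
count_rule <- 66     # transactions with butter, whipped/sour cream and whole milk
count_lhs <- 100     # transactions with butter and whipped/sour cream (from coverage)
count_rhs <- 2513    # transactions with whole milk

support <- count_rule / n_transactions                 # P(lhs and rhs)
confidence <- count_rule / count_lhs                   # P(rhs | lhs)
lift <- confidence / (count_rhs / n_transactions)      # P(rhs | lhs) / P(rhs)

round(c(support = support, confidence = confidence, lift = lift), 4)
```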

rules.by.conf<-sort(rules.transactions, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))

##     lhs                     rhs              support confidence    coverage     lift count
## [1] {butter,                                                                              
##      whipped/sour cream} => {whole milk} 0.006710727  0.6600000 0.010167768 2.583008    66
## [2] {butter,                                                                              
##      yogurt}             => {whole milk} 0.009354347  0.6388889 0.014641586 2.500387    92
## [3] {root vegetables,                                                                     
##      butter}             => {whole milk} 0.008235892  0.6377953 0.012913066 2.496107    81
## [4] {tropical fruit,                                                                      
##      curd}               => {whole milk} 0.006507372  0.6336634 0.010269446 2.479936    64
## [5] {tropical fruit,                                                                      
##      butter}             => {whole milk} 0.006202339  0.6224490 0.009964413 2.436047    61
## [6] {tropical fruit,                                                                      
##      other vegetables,                                                                    
##      yogurt}             => {whole milk} 0.007625826  0.6198347 0.012302999 2.425816    75

Sorting by lift

rules.by.lift<-sort(rules.transactions, by="lift", decreasing=TRUE) 
inspect(head(rules.by.lift))

##     lhs                    rhs                      support confidence   coverage     lift count
## [1] {herbs}             => {root vegetables}    0.007015760  0.4312500 0.01626843 3.956477    69
## [2] {berries}           => {whipped/sour cream} 0.009049314  0.2721713 0.03324860 3.796886    89
## [3] {tropical fruit,                                                                            
##      other vegetables,                                                                          
##      whole milk}        => {root vegetables}    0.007015760  0.4107143 0.01708185 3.768074    69
## [4] {beef,                                                                                      
##      other vegetables}  => {root vegetables}    0.007930859  0.4020619 0.01972547 3.688692    78
## [5] {tropical fruit,                                                                            
##      other vegetables}  => {pip fruit}          0.009456024  0.2634561 0.03589222 3.482649    93
## [6] {beef,                                                                                      
##      whole milk}        => {root vegetables}    0.008032537  0.3779904 0.02125064 3.467851    79

Sorting by count

rules.by.count<- sort(rules.transactions, by="count", decreasing=TRUE)
inspect(head(rules.by.count))

##     lhs                   rhs                support    confidence coverage 
## [1] {other vegetables} => {whole milk}       0.07483477 0.3867578  0.1934926
## [2] {whole milk}       => {other vegetables} 0.07483477 0.2928770  0.2555160
## [3] {rolls/buns}       => {whole milk}       0.05663447 0.3079049  0.1839349
## [4] {yogurt}           => {whole milk}       0.05602440 0.4016035  0.1395018
## [5] {root vegetables}  => {whole milk}       0.04890696 0.4486940  0.1089985
## [6] {root vegetables}  => {other vegetables} 0.04738180 0.4347015  0.1089985
##     lift     count
## [1] 1.513634 736  
## [2] 1.513634 736  
## [3] 1.205032 557  
## [4] 1.571735 551  
## [5] 1.756031 481  
## [6] 2.246605 466

# Mine only rules with butter on the right-hand side
rules_butter <-
  apriori(
    data = transactions,
    parameter = list(supp = 0.001, conf = 0.15),
    appearance = list(default = "lhs", rhs = "butter"),
    control = list(verbose = FALSE)
  )
inspect(rules_butter[1:5], linebreak = FALSE)

##     lhs                        rhs      support     confidence coverage   
## [1] {jam}                   => {butter} 0.001220132 0.2264151  0.005388917
## [2] {Instant food products} => {butter} 0.001220132 0.1518987  0.008032537
## [3] {flower (seeds)}        => {butter} 0.001626843 0.1568627  0.010371124
## [4] {turkey}                => {butter} 0.001525165 0.1875000  0.008134215
## [5] {rice}                  => {butter} 0.001830198 0.2400000  0.007625826
##     lift     count
## [1] 4.085858 12   
## [2] 2.741145 12   
## [3] 2.830725 16   
## [4] 3.383601 15   
## [5] 4.331009 18

inspectDT(rules.transactions)

Visualization of the results

plot(rules.transactions, shading = "order",engine = "html")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(rules.transactions, method = "graph")

## Warning: Too many rules supplied. Only plotting the best 100 using
## 'lift' (change control parameter max if needed).

plot(rules.transactions, method = "matrix", engine = "html")

plot(rules.transactions, method="paracoord", control=list(reorder=TRUE))

plot(rules.transactions, method = "graph", limit = 20, engine = "html" )

Equivalence Class Clustering and bottom-up Lattice Traversal (ECLAT) algorithm

The Eclat algorithm finds frequent itemsets and provides measures for them. It was introduced to address a weakness of the aforementioned Apriori algorithm. Because at each stage it uses the most recently generated data to find larger frequent itemsets, it is more efficient than Apriori, which scans the original database repeatedly. Eclat is also faster because it computes fewer metrics: unlike Apriori, it does not report lift or confidence.
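
The idea of reusing previously generated data can be illustrated with transaction-ID lists in base R. This is a minimal sketch of the vertical data layout on made-up baskets, not the arules implementation. The toy transactions and variable names are hypothetical.

```r
# Hypothetical toy transactions (not taken from the Groceries data)
toy <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter", "jam")
)
n <- length(toy)

# Vertical layout: for each item, the IDs of the transactions that contain it
tid <- rep(seq_along(toy), lengths(toy))
tidlists <- split(tid, unlist(toy))

# The support count of an itemset is the size of the intersection of tid-lists
milk_bread <- intersect(tidlists[["milk"]], tidlists[["bread"]])

# The larger itemset reuses the intersection computed in the previous step
milk_bread_butter <- intersect(milk_bread, tidlists[["butter"]])

length(milk_bread) / n          # support of {milk, bread}
length(milk_bread_butter) / n   # support of {milk, bread, butter}
```

Once the tid-lists are built, no further pass over the original transactions is needed. Each larger itemset is obtained by intersecting lists that already exist.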

inspect(transactions[1:5])

##     items                     
## [1] {citrus fruit,            
##      semi-finished bread,     
##      margarine,               
##      ready soups}             
## [2] {tropical fruit,          
##      yogurt,                  
##      coffee}                  
## [3] {whole milk}              
## [4] {pip fruit,               
##      yogurt,                  
##      cream cheese ,           
##      meat spreads}            
## [5] {other vegetables,        
##      whole milk,              
##      condensed milk,          
##      long life bakery product}

The Eclat implementation

freq.items<-eclat(transactions, parameter=list(supp=0.01, maxlen=15))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.01      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 98 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating sparse bit matrix ... [88 row(s), 9835 column(s)] done [0.00s].
## writing  ... [333 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].

The results and interpretation

From the first row of the results we can see that whole milk and hard cheese occur together in about 1% of transactions (99 out of 9835).

inspect(freq.items[1:5])

##     items                           support    count
## [1] {whole milk, hard cheese}       0.01006609  99  
## [2] {whole milk, butter milk}       0.01159126 114  
## [3] {other vegetables, butter milk} 0.01037112 102  
## [4] {ham, whole milk}               0.01148958 113  
## [5] {whole milk, sliced cheese}     0.01077783 106

The vector of support values

round(support(items(freq.items), transactions) , 2)

##   [1] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
##  [16] 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.01
##  [31] 0.02 0.02 0.01 0.02 0.01 0.01 0.02 0.01 0.01 0.01 0.02 0.01 0.01 0.02 0.02
##  [46] 0.01 0.01 0.01 0.02 0.02 0.01 0.01 0.02 0.01 0.03 0.02 0.01 0.02 0.01 0.01
##  [61] 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.01 0.01 0.02 0.02
##  [76] 0.02 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.01 0.01 0.01
##  [91] 0.01 0.01 0.01 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.01 0.01
## [106] 0.01 0.01 0.03 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.02 0.01
## [121] 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.01 0.02 0.02 0.01 0.01 0.01 0.01 0.01
## [136] 0.01 0.01 0.01 0.01 0.03 0.03 0.01 0.02 0.01 0.02 0.01 0.01 0.01 0.03 0.03
## [151] 0.01 0.02 0.01 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.02 0.02 0.02
## [166] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.03 0.02 0.02 0.01 0.02 0.02 0.01
## [181] 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.01 0.01 0.02 0.01 0.03 0.03 0.03 0.02
## [196] 0.02 0.01 0.01 0.01 0.01 0.03 0.02 0.02 0.02 0.03 0.02 0.02 0.01 0.01 0.02
## [211] 0.01 0.01 0.02 0.04 0.04 0.02 0.03 0.02 0.02 0.01 0.01 0.01 0.01 0.02 0.05
## [226] 0.05 0.02 0.03 0.02 0.01 0.01 0.04 0.03 0.04 0.03 0.02 0.01 0.02 0.06 0.04
## [241] 0.03 0.02 0.06 0.04 0.07 0.26 0.19 0.18 0.14 0.17 0.11 0.10 0.11 0.09 0.10
## [256] 0.08 0.09 0.08 0.07 0.07 0.06 0.08 0.06 0.06 0.06 0.08 0.06 0.06 0.05 0.05
## [271] 0.05 0.05 0.06 0.05 0.04 0.04 0.04 0.08 0.04 0.04 0.04 0.03 0.04 0.03 0.03
## [286] 0.03 0.03 0.03 0.03 0.02 0.03 0.03 0.03 0.02 0.03 0.03 0.03 0.02 0.02 0.03
## [301] 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.02
## [316] 0.02 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.01 0.01 0.01
## [331] 0.01 0.01 0.01

freq.rules<-ruleInduction(freq.items, transactions, confidence=0.9) 
freq.rules

## set of 0 rules

inspect(freq.rules) # screening the rules

summary(rules.transactions)

## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.009964   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:0.018709   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :0.024809   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :0.032608   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:0.035892   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :0.255516   Max.   :3.9565  
##      count      
##  Min.   : 60.0  
##  1st Qu.: 70.0  
##  Median : 86.0  
##  Mean   :113.5  
##  3rd Qu.:121.0  
##  Max.   :736.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          9835   0.006       0.25
##                                                                                            call
##  apriori(data = transactions, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))

Results

A total of 463 rules were found. Most of them (297) consist of 3 items.

Summary

In this paper, market basket analysis was performed in R. We explored the Groceries dataset, applied the Apriori algorithm to create association rules and the ECLAT algorithm to discover the most frequent itemsets, and interpreted the obtained results. Such an analysis can help to better understand customer buying patterns and can be used by retailers to boost sales.