Market Basket Analysis

Introduction

The main purpose of this project is to compute all association rules in a set of sales transactions using Market Basket Analysis. In addition, I will analyse, for marketing purposes, what drives the supermarket’s customers to buy croissants.

The analysis uses the dataset “sales”, which contains 8000 records of supermarket itemsets. The methodology consists of: exploratory data analysis of the dataset with respect to item purchase frequency; using the Eclat algorithm to find the most frequent itemsets; finding all association rules for the sales transactions; computing several rule quality measures; and extracting the rules that involve croissant. The analysis is supported by the necessary visualisations. The outcome of the project is the number of rules found and a description of the croissant rules for further marketing analysis.

Data Processing

First, the required libraries are loaded:

arules - consists of data structures, mining algorithms and interest measures.
arulesViz - enables visualization of association rules.
arulesCBA - provides classification algorithms based on association rules.

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)
library(arulesCBA)

## 
## Attaching package: 'arulesCBA'

## The following object is masked from 'package:arules':
## 
##     rules

sales <-read.csv("//Users//slawek//Downloads//sales.csv")
tail(sales)

##               Item.1        Item.2            Item.3                 Item.4
## 7994            beef          curd frozen vegetables         frozen dessert
## 7995 root vegetables      UHT-milk         ice cream frozen potato products
## 7996     butter milk specialty bar           napkins                       
## 7997         sausage          pork    tropical fruit        root vegetables
## 7998  tropical fruit        pastry              soda              chocolate
## 7999     canned beer                                                       
##                Item.5       Item.6       Item.7        Item.8  Item.9
## 7994        croissant       coffee   newspapers shopping bags        
## 7995    bottled water         soda bottled beer specialty bar napkins
## 7996                                                                 
## 7997 other vegetables orange juice         curd     croissant    soda
## 7998                                                                 
## 7999                                                                 
##            Item.10 Item.11 Item.12 Item.13 Item.14 Item.15
## 7994                                                      
## 7995 shopping bags                                        
## 7996                                                      
## 7997                                                      
## 7998                                                      
## 7999

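As read below, the dataset comprises 8000 supermarket transactions over 185 distinct items, with between 1 and 15 items per transaction. Note that read.csv() above numbers the last data row 7999, so the 8000th transaction is probably the header line, which read.transactions() reads as a basket when skip=0; passing skip=1 would exclude it.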

trans<-read.transactions("sales.csv", format="basket", sep=",", skip=0)

## Warning in readLines(file, encoding = encoding): incomplete final line
## found on 'sales.csv'

trans

## transactions in sparse format with
##  8000 transactions (rows) and
##  185 items (columns)
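
To make the "basket" format concrete, here is a minimal base R sketch of how such a file can be turned into item lists. The toy lines and the helper `parse_baskets()` are hypothetical and are not how arules reads the file internally; the sketch only illustrates that each line is one transaction and that a header line is counted as a basket unless it is skipped.

```r
# Toy basket-format text: one transaction per line, items separated by commas.
# The first line is a header, as in a file exported with column names.
lines <- c(
  "Item.1,Item.2,Item.3",
  "beef,curd,croissant",
  "canned beer,,",
  "soda,croissant,"
)

# Hypothetical helper: split each line into its non-empty items.
parse_baskets <- function(lines, skip = 0) {
  if (skip > 0) lines <- lines[-seq_len(skip)]
  lapply(strsplit(lines, ",", fixed = TRUE), function(items) {
    items <- trimws(items)
    unique(items[nzchar(items)])
  })
}

with_header    <- parse_baskets(lines, skip = 0)
without_header <- parse_baskets(lines, skip = 1)

# Number of baskets and of distinct items, with and without the header line.
c(baskets = length(with_header),    items = length(unique(unlist(with_header))))
c(baskets = length(without_header), items = length(unique(unlist(without_header))))
```

In this toy case the header adds one basket and three spurious "items" (the column labels), which is why the skip argument matters when a file carries column names.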

Exploratory Data Analysis

According to the transaction summary, the most frequently bought items are:

Orange juice;
Other vegetables;
Croissant;
Soda;
Bottled water.

summary(trans)

## transactions as itemMatrix in sparse format with
##  8000 rows (elements/itemsets/transactions) and
##  185 columns (items) and a density of 0.02350135 
## 
## most frequent items:
##     orange juice other vegetables        croissant             soda 
##             2046             1527             1478             1406 
##    bottled water          (Other) 
##              882            27443 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1761 1344 1038  819  701  528  439  380  288  188  139   96   61   59  159 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.348   6.000  15.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The frequency plot below shows the same pattern. For example, orange juice, the most frequently purchased item, was bought more than 2000 times.

itemFrequencyPlot(trans,topN=15,type="absolute")
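
The headline figures of the summary can be cross-checked by hand. The sketch below uses only numbers copied from the summary(trans) output in this report and plain base R arithmetic; the variable names are mine.

```r
# Basket-size distribution copied from the summary(trans) output.
sizes  <- 1:15
counts <- c(1761, 1344, 1038, 819, 701, 528, 439, 380, 288, 188, 139, 96, 61, 59, 159)

n_trans <- sum(counts)            # number of transactions
n_items <- 185                    # number of distinct items (columns)
total   <- sum(sizes * counts)    # total number of item occurrences

mean_size   <- total / n_trans                # average basket size
density     <- total / (n_trans * n_items)    # share of non-zero cells in the item matrix
median_size <- sizes[which(cumsum(counts) >= n_trans / 2)[1]]

# Relative frequency of the top item (orange juice, 2046 purchases).
oj_share <- 2046 / n_trans

c(n_trans = n_trans, total = total, mean_size = mean_size,
  density = density, median_size = median_size, oj_share = oj_share)
```

The density is simply the total number of purchased items divided by the number of cells in the 8000 x 185 matrix, and the mean basket size is the same total divided by the number of transactions.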

Eclat Algorithm

Before turning to association rules, the Eclat algorithm is used to find the most frequent itemsets. Eclat mines itemsets using a minimum support threshold only; confidence, the other standard measure, comes into play later, when rules are generated. Although the default support value is 0.1, I set it to 0.05 so that the results also include itemsets with more than one item.

freq_items<-eclat(trans, parameter=list(supp=0.05, maxlen=10))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.05      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 400 
## 
## create itemset ... 
## set transactions ...[185 item(s), 8000 transaction(s)] done [0.00s].
## sorting and recoding items ... [28 item(s)] done [0.00s].
## creating sparse bit matrix ... [28 row(s), 8000 column(s)] done [0.00s].
## writing  ... [31 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

inspect(freq_items)

##      items                            support  count
## [1]  {orange juice, root vegetables}  0.050250  402 
## [2]  {croissant, orange juice}        0.057250  458 
## [3]  {orange juice, other vegetables} 0.073625  589 
## [4]  {orange juice}                   0.255750 2046 
## [5]  {other vegetables}               0.190875 1527 
## [6]  {croissant}                      0.184750 1478 
## [7]  {soda}                           0.175750 1406 
## [8]  {root vegetables}                0.110250  882 
## [9]  {tropical fruit}                 0.103750  830 
## [10] {bottled water}                  0.110250  882 
## [11] {sausage}                        0.092375  739 
## [12] {rasberry jam}                   0.060000  480 
## [13] {citrus fruit}                   0.082750  662 
## [14] {shopping bags}                  0.095750  766 
## [15] {pastry}                         0.087500  700 
## [16] {whipped/sour cream}             0.072250  578 
## [17] {pip fruit}                      0.075625  605 
## [18] {coke}                           0.070000  560 
## [19] {domestic eggs}                  0.063625  509 
## [20] {newspapers}                     0.078000  624 
## [21] {butter}                         0.055250  442 
## [22] {margarine}                      0.058875  471 
## [23] {brown bread}                    0.065000  520 
## [24] {bottled beer}                   0.080250  642 
## [25] {frankfurter}                    0.058125  465 
## [26] {curd}                           0.053125  425 
## [27] {pork}                           0.056625  453 
## [28] {coffee}                         0.059000  472 
## [29] {raspberry jam}                  0.077000  616 
## [30] {beef}                           0.051625  413 
## [31] {canned beer}                    0.081125  649
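
Eclat finds frequent itemsets by storing, for each item, the ids of the transactions that contain it, and then intersecting these id lists depth-first. The following base R sketch illustrates that idea on hypothetical toy baskets; the function `eclat_sketch()` and the data are mine, and this is not the arules implementation.

```r
# Toy transactions (hypothetical data, not the sales file).
baskets <- list(
  c("orange juice", "croissant"),
  c("orange juice", "other vegetables"),
  c("orange juice", "croissant", "soda"),
  c("soda"),
  c("orange juice", "other vegetables", "croissant")
)
min_count <- 2  # absolute minimum support count

# Vertical layout: for every item, the ids of the transactions that contain it.
items <- sort(unique(unlist(baskets)))
tidlists <- lapply(items, function(it) {
  which(vapply(baskets, function(b) it %in% b, logical(1)))
})
names(tidlists) <- items
tidlists <- tidlists[lengths(tidlists) >= min_count]

# Depth-first search: extend an itemset with each later item and intersect id lists.
eclat_sketch <- function(prefix, prefix_tids, candidates) {
  found <- list()
  for (i in seq_along(candidates)) {
    tids <- if (is.null(prefix_tids)) candidates[[i]] else intersect(prefix_tids, candidates[[i]])
    if (length(tids) >= min_count) {
      itemset <- c(prefix, names(candidates)[i])
      found[[paste(itemset, collapse = ", ")]] <- length(tids)
      # Only items after position i are tried, so each itemset is visited once.
      found <- c(found, eclat_sketch(itemset, tids, candidates[-seq_len(i)]))
    }
  }
  found
}

frequent <- eclat_sketch(character(0), NULL, tidlists)
str(frequent)
```

Because the support count of an itemset is just the length of an intersection, a branch can be abandoned as soon as that length drops below the threshold. In the real output above, the same threshold logic with a minimum count of 400 leaves 28 single items and 3 pairs.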

Methodology - Association Rules

Association rules are generated with the Apriori algorithm. To find suitable thresholds, three scenarios are considered:

supp=0.01, conf=0.5;
supp=0.01, conf=0.6;
supp=0.05, conf=0.5.

rules<-apriori(trans, parameter=list(supp=0.01, conf=0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 80 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[185 item(s), 8000 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules2<-apriori(trans, parameter=list(supp=0.01, conf=0.6))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 80 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[185 item(s), 8000 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [6 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules3<-apriori(trans, parameter=list(supp=0.05, conf=0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 400 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[185 item(s), 8000 transaction(s)] done [0.00s].
## sorting and recoding items ... [28 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

With support 0.01 and confidence 0.5, Apriori finds 17 association rules. Support 0.01 with confidence 0.6 yields only 6 rules, whereas the third scenario (supp=0.05, conf=0.5) yields no rules at all. Therefore, the analysis continues with supp=0.01, conf=0.5 and its 17 rules.
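
The three rule counts are consistent with each other, because the stricter scenarios can only keep a subset of the 17 rules. The sketch below checks this in base R using the count and confidence columns copied from the rule listing in the next section; it is a cross-check of the reported output, not a re-run of Apriori.

```r
n_trans <- 8000

# Absolute minimum support counts implied by the two relative support thresholds.
min_count <- c(0.01, 0.05) * n_trans

# Count and confidence of the 17 rules, copied from the inspect() listing.
count <- c(312, 250, 158, 158, 117, 116, 116, 103, 103, 102, 102, 101, 96, 91, 89, 87, 84)
confidence <- c(0.6500000, 0.5208333, 0.6320000, 0.5064103, 0.6685714, 0.5110132,
                0.7116564, 0.5819209, 0.5819209, 0.5177665, 0.5284974, 0.5906433,
                0.5485714, 0.5909091, 0.6641791, 0.5337423, 0.6131387)

# Number of rules that survive each scenario.
scenarios <- c(
  "supp=0.01, conf=0.5" = sum(count >= min_count[1] & confidence >= 0.5),
  "supp=0.01, conf=0.6" = sum(count >= min_count[1] & confidence >= 0.6),
  "supp=0.05, conf=0.5" = sum(count >= min_count[2] & confidence >= 0.5)
)
scenarios
```

The largest rule count is 312, well below the 400 transactions required at supp=0.05, which is why the third scenario returns nothing.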

Methodology - Quality Measures

Apart from support and confidence, two more standard measures are reported for association rules: lift and support count. These metrics are often used to judge how interesting an association pattern is. Many other quality measures exist that assess further characteristics of the generated rules. For this analysis I also consider conviction, the Jaccard index and hyperLift, which the authors of the paper “Analysing the quality of Association Rules by Computing an Interestingness Measures” report as performing well.

The meanings of the mentioned quality measures:

support - the share of transactions that contain all items of the rule; it is anti-monotone (an itemset can never have higher support than any of its subsets), which is why it is used extensively in efficient pattern-mining algorithms;
confidence - the share of transactions containing the antecedent A that also contain the consequent B, i.e. an estimate of P(B|A); it is used extensively for assessing cross-selling strategies;
lift - ranges over [0, +∞) and, for a rule A => B, compares how often A and B are bought together with what would be expected if they were independent; a value of 1 means the items are independent, so no inference can be made, a value > 1 means A and B are bought together more often than expected, and a value < 1 means less often;
support count - the absolute number of transactions that contain all items of the rule, i.e. support multiplied by the number of transactions; rules that satisfy both the minimum support and the minimum confidence threshold are sometimes called “strong” rules;
conviction - compares the frequency with which A would appear without B if the two were independent with the actual frequency of A appearing without B; it was developed as an alternative to confidence and behaves similarly to lift, but unlike lift it is directed, because it also uses the information about the absence of the consequent; a value of 1 indicates independence;
Jaccard coefficient - a null-invariant measure of dependence based on the Jaccard similarity between the set of transactions that contain A and the set of transactions that contain B; it ranges from 0 to 1, and higher values mean that the two are bought together more often relative to being bought separately;
hyperLift - an adaptation of the lift measure where, instead of dividing by the expected count under independence, a higher quantile of the hypergeometric count distribution is used; hyperLift is more robust for low counts and results in fewer false positives when used for rule filtering.

The aforementioned Interest Measures’ results are presented below.

#rules analysis with respect to support
rules.supp<-sort(rules, by="support", decreasing=TRUE) 
inspect(rules.supp)

##      lhs                     rhs                 support confidence coverage     lift count
## [1]  {rasberry jam}       => {orange juice}     0.039000  0.6500000 0.060000 2.541544   312
## [2]  {rasberry jam}       => {other vegetables} 0.031250  0.5208333 0.060000 2.728662   250
## [3]  {other vegetables,                                                                    
##       rasberry jam}       => {orange juice}     0.019750  0.6320000 0.031250 2.471163   158
## [4]  {orange juice,                                                                        
##       rasberry jam}       => {other vegetables} 0.019750  0.5064103 0.039000 2.653099   158
## [5]  {rasberry jam,                                                                        
##       root vegetables}    => {orange juice}     0.014625  0.6685714 0.021875 2.614160   117
## [6]  {other vegetables,                                                                    
##       whipped/sour cream} => {orange juice}     0.014500  0.5110132 0.028375 1.998097   116
## [7]  {rasberry jam,                                                                        
##       tropical fruit}     => {orange juice}     0.014500  0.7116564 0.020375 2.782625   116
## [8]  {root vegetables,                                                                     
##       tropical fruit}     => {other vegetables} 0.012875  0.5819209 0.022125 3.048702   103
## [9]  {root vegetables,                                                                     
##       tropical fruit}     => {orange juice}     0.012875  0.5819209 0.022125 2.275351   103
## [10] {other vegetables,                                                                    
##       pip fruit}          => {orange juice}     0.012750  0.5177665 0.024625 2.024502   102
## [11] {croissant,                                                                           
##       root vegetables}    => {orange juice}     0.012750  0.5284974 0.024125 2.066461   102
## [12] {domestic eggs,                                                                       
##       other vegetables}   => {orange juice}     0.012625  0.5906433 0.021375 2.309456   101
## [13] {rasberry jam,                                                                        
##       root vegetables}    => {other vegetables} 0.012000  0.5485714 0.021875 2.873983    96
## [14] {butter,                                                                              
##       other vegetables}   => {orange juice}     0.011375  0.5909091 0.019250 2.310495    91
## [15] {croissant,                                                                           
##       rasberry jam}       => {orange juice}     0.011125  0.6641791 0.016750 2.596986    89
## [16] {rasberry jam,                                                                        
##       tropical fruit}     => {other vegetables} 0.010875  0.5337423 0.020375 2.796293    87
## [17] {citrus fruit,                                                                        
##       root vegetables}    => {other vegetables} 0.010500  0.6131387 0.017125 3.212252    84

#rules analysis with respect to confidence
rules.conf<-sort(rules, by="confidence", decreasing=TRUE) 
inspect(rules.conf)

##      lhs                     rhs                 support confidence coverage     lift count
## [1]  {rasberry jam,                                                                        
##       tropical fruit}     => {orange juice}     0.014500  0.7116564 0.020375 2.782625   116
## [2]  {rasberry jam,                                                                        
##       root vegetables}    => {orange juice}     0.014625  0.6685714 0.021875 2.614160   117
## [3]  {croissant,                                                                           
##       rasberry jam}       => {orange juice}     0.011125  0.6641791 0.016750 2.596986    89
## [4]  {rasberry jam}       => {orange juice}     0.039000  0.6500000 0.060000 2.541544   312
## [5]  {other vegetables,                                                                    
##       rasberry jam}       => {orange juice}     0.019750  0.6320000 0.031250 2.471163   158
## [6]  {citrus fruit,                                                                        
##       root vegetables}    => {other vegetables} 0.010500  0.6131387 0.017125 3.212252    84
## [7]  {butter,                                                                              
##       other vegetables}   => {orange juice}     0.011375  0.5909091 0.019250 2.310495    91
## [8]  {domestic eggs,                                                                       
##       other vegetables}   => {orange juice}     0.012625  0.5906433 0.021375 2.309456   101
## [9]  {root vegetables,                                                                     
##       tropical fruit}     => {other vegetables} 0.012875  0.5819209 0.022125 3.048702   103
## [10] {root vegetables,                                                                     
##       tropical fruit}     => {orange juice}     0.012875  0.5819209 0.022125 2.275351   103
## [11] {rasberry jam,                                                                        
##       root vegetables}    => {other vegetables} 0.012000  0.5485714 0.021875 2.873983    96
## [12] {rasberry jam,                                                                        
##       tropical fruit}     => {other vegetables} 0.010875  0.5337423 0.020375 2.796293    87
## [13] {croissant,                                                                           
##       root vegetables}    => {orange juice}     0.012750  0.5284974 0.024125 2.066461   102
## [14] {rasberry jam}       => {other vegetables} 0.031250  0.5208333 0.060000 2.728662   250
## [15] {other vegetables,                                                                    
##       pip fruit}          => {orange juice}     0.012750  0.5177665 0.024625 2.024502   102
## [16] {other vegetables,                                                                    
##       whipped/sour cream} => {orange juice}     0.014500  0.5110132 0.028375 1.998097   116
## [17] {orange juice,                                                                        
##       rasberry jam}       => {other vegetables} 0.019750  0.5064103 0.039000 2.653099   158

#rules analysis with respect to lift
rules.lift<-sort(rules, by="lift", decreasing=TRUE) 
inspect(rules.lift)

##      lhs                     rhs                 support confidence coverage     lift count
## [1]  {citrus fruit,                                                                        
##       root vegetables}    => {other vegetables} 0.010500  0.6131387 0.017125 3.212252    84
## [2]  {root vegetables,                                                                     
##       tropical fruit}     => {other vegetables} 0.012875  0.5819209 0.022125 3.048702   103
## [3]  {rasberry jam,                                                                        
##       root vegetables}    => {other vegetables} 0.012000  0.5485714 0.021875 2.873983    96
## [4]  {rasberry jam,                                                                        
##       tropical fruit}     => {other vegetables} 0.010875  0.5337423 0.020375 2.796293    87
## [5]  {rasberry jam,                                                                        
##       tropical fruit}     => {orange juice}     0.014500  0.7116564 0.020375 2.782625   116
## [6]  {rasberry jam}       => {other vegetables} 0.031250  0.5208333 0.060000 2.728662   250
## [7]  {orange juice,                                                                        
##       rasberry jam}       => {other vegetables} 0.019750  0.5064103 0.039000 2.653099   158
## [8]  {rasberry jam,                                                                        
##       root vegetables}    => {orange juice}     0.014625  0.6685714 0.021875 2.614160   117
## [9]  {croissant,                                                                           
##       rasberry jam}       => {orange juice}     0.011125  0.6641791 0.016750 2.596986    89
## [10] {rasberry jam}       => {orange juice}     0.039000  0.6500000 0.060000 2.541544   312
## [11] {other vegetables,                                                                    
##       rasberry jam}       => {orange juice}     0.019750  0.6320000 0.031250 2.471163   158
## [12] {butter,                                                                              
##       other vegetables}   => {orange juice}     0.011375  0.5909091 0.019250 2.310495    91
## [13] {domestic eggs,                                                                       
##       other vegetables}   => {orange juice}     0.012625  0.5906433 0.021375 2.309456   101
## [14] {root vegetables,                                                                     
##       tropical fruit}     => {orange juice}     0.012875  0.5819209 0.022125 2.275351   103
## [15] {croissant,                                                                           
##       root vegetables}    => {orange juice}     0.012750  0.5284974 0.024125 2.066461   102
## [16] {other vegetables,                                                                    
##       pip fruit}          => {orange juice}     0.012750  0.5177665 0.024625 2.024502   102
## [17] {other vegetables,                                                                    
##       whipped/sour cream} => {orange juice}     0.014500  0.5110132 0.028375 1.998097   116

#rules analysis with respect to count
rules.count<-sort(rules, by="count", decreasing=TRUE) 
inspect(rules.count)

##      lhs                     rhs                 support confidence coverage     lift count
## [1]  {rasberry jam}       => {orange juice}     0.039000  0.6500000 0.060000 2.541544   312
## [2]  {rasberry jam}       => {other vegetables} 0.031250  0.5208333 0.060000 2.728662   250
## [3]  {other vegetables,                                                                    
##       rasberry jam}       => {orange juice}     0.019750  0.6320000 0.031250 2.471163   158
## [4]  {orange juice,                                                                        
##       rasberry jam}       => {other vegetables} 0.019750  0.5064103 0.039000 2.653099   158
## [5]  {rasberry jam,                                                                        
##       root vegetables}    => {orange juice}     0.014625  0.6685714 0.021875 2.614160   117
## [6]  {other vegetables,                                                                    
##       whipped/sour cream} => {orange juice}     0.014500  0.5110132 0.028375 1.998097   116
## [7]  {rasberry jam,                                                                        
##       tropical fruit}     => {orange juice}     0.014500  0.7116564 0.020375 2.782625   116
## [8]  {root vegetables,                                                                     
##       tropical fruit}     => {other vegetables} 0.012875  0.5819209 0.022125 3.048702   103
## [9]  {root vegetables,                                                                     
##       tropical fruit}     => {orange juice}     0.012875  0.5819209 0.022125 2.275351   103
## [10] {other vegetables,                                                                    
##       pip fruit}          => {orange juice}     0.012750  0.5177665 0.024625 2.024502   102
## [11] {croissant,                                                                           
##       root vegetables}    => {orange juice}     0.012750  0.5284974 0.024125 2.066461   102
## [12] {domestic eggs,                                                                       
##       other vegetables}   => {orange juice}     0.012625  0.5906433 0.021375 2.309456   101
## [13] {rasberry jam,                                                                        
##       root vegetables}    => {other vegetables} 0.012000  0.5485714 0.021875 2.873983    96
## [14] {butter,                                                                              
##       other vegetables}   => {orange juice}     0.011375  0.5909091 0.019250 2.310495    91
## [15] {croissant,                                                                           
##       rasberry jam}       => {orange juice}     0.011125  0.6641791 0.016750 2.596986    89
## [16] {rasberry jam,                                                                        
##       tropical fruit}     => {other vegetables} 0.010875  0.5337423 0.020375 2.796293    87
## [17] {citrus fruit,                                                                        
##       root vegetables}    => {other vegetables} 0.010500  0.6131387 0.017125 3.212252    84

IMs <- interestMeasure(rules, c("support", "confidence", "lift", "count", "conviction", "jaccard", "hyperLift"),
    transactions = trans)
inspect(head(rules))

##     lhs                     rhs                 support confidence coverage     lift count
## [1] {rasberry jam}       => {other vegetables} 0.031250  0.5208333 0.060000 2.728662   250
## [2] {rasberry jam}       => {orange juice}     0.039000  0.6500000 0.060000 2.541544   312
## [3] {butter,                                                                              
##      other vegetables}   => {orange juice}     0.011375  0.5909091 0.019250 2.310495    91
## [4] {domestic eggs,                                                                       
##      other vegetables}   => {orange juice}     0.012625  0.5906433 0.021375 2.309456   101
## [5] {other vegetables,                                                                    
##      pip fruit}          => {orange juice}     0.012750  0.5177665 0.024625 2.024502   102
## [6] {other vegetables,                                                                    
##      whipped/sour cream} => {orange juice}     0.014500  0.5110132 0.028375 1.998097   116

head(IMs)

##    support confidence     lift count conviction    jaccard hyperLift
## 1 0.031250  0.5208333 2.728662   250   1.688609 0.14228799  2.252252
## 2 0.039000  0.6500000 2.541544   312   2.126429 0.14092141  2.151724
## 3 0.011375  0.5909091 2.310495    91   1.819278 0.04314841  1.750000
## 4 0.012625  0.5906433 2.309456   101   1.818096 0.04773157  1.771930
## 5 0.012750  0.5177665 2.024502   102   1.543339 0.04764129  1.569231
## 6 0.014500  0.5110132 1.998097   116   1.522025 0.05377840  1.589041
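
To show where these numbers come from, the sketch below recomputes the measures for row 2, {rasberry jam} => {orange juice}, from the counts reported earlier (480 and 2046 transactions in the Eclat output, 312 joint transactions in the rule listing). The formulas are the textbook definitions; for hyperLift I assume a 0.99 quantile of the hypergeometric distribution, which is an assumption on my part rather than something stated in this report.

```r
n    <- 8000   # transactions
c_x  <- 480    # transactions containing {rasberry jam}
c_y  <- 2046   # transactions containing {orange juice}
c_xy <- 312    # transactions containing both

support    <- c_xy / n
confidence <- c_xy / c_x
lift       <- confidence / (c_y / n)
conviction <- (1 - c_y / n) / (1 - confidence)
jaccard    <- c_xy / (c_x + c_y - c_xy)

# hyperLift: observed joint count divided by a high quantile of the count
# expected under independence (0.99 is an assumed quantile level).
q99       <- qhyper(0.99, m = c_y, n = n - c_y, k = c_x)
hyperLift <- c_xy / q99

round(c(support = support, confidence = confidence, lift = lift,
        conviction = conviction, jaccard = jaccard, hyperLift = hyperLift), 6)
```

Support, confidence, lift, conviction and jaccard should agree with row 2 of the table above; hyperLift agrees only if the quantile assumption matches the setting used by interestMeasure().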

Based on the results above, support varies between 1% and 4%. This is rather low, but because the dataset contains 8000 transactions, every rule is still backed by at least 80 baskets, so it is not a cause for concern. Confidence reaches 0.71 for the top rule, meaning that 71% of the transactions containing the antecedent also contain the consequent. Lift shows that the items in the best rules appear together around 2 to 3 times more often than they would if they were independent. Support count indicates that rules involving orange juice, raspberry jam (labelled “rasberry jam” in the data) and other vegetables occur most frequently. Conviction lies between roughly 1.5 and 2.1 for the rules shown; since a value of 1 corresponds to independence, these rules would be wrong about 1.5 to 2 times as often if antecedent and consequent were unrelated. Jaccard coefficients are low (about 0.04 to 0.14), which means that the antecedent and the consequent are each bought far more often separately than together. Finally, hyperLift ranges from about 1.6 to 2.3 for the rules shown; values above 1 indicate that the observed counts exceed even a high quantile of what independence would produce, so these associations are unlikely to be due to chance. Overall, the majority of the best rules have orange juice or other vegetables as the consequent.

Methodology - Rules Visualisation

The computed rules are now visualised for further analysis. First, consider the parallel coordinates plot for the 17 rules. Each line traces one rule from the items in its antecedent to its consequent. For instance, a basket that contains root vegetables and raspberry jam is likely to contain other vegetables as well. Likewise, a basket with tropical fruit and raspberry jam will probably also contain orange juice.

plot(rules, method="paracoord", control=list(reorder=TRUE))

The scatter plot presents the 17 rules with regard to support, confidence and lift. It shows that the rules with the highest lift (above 3.00) have support of about 0.01 and confidence between 55% and 65%.

plot(rules, measure=c("support", "confidence"), shading="lift", jitter=0)

The matrix plot presents the 17 rules with lift as the shading measure. The x axis denotes the LHS (antecedent) of a rule and the y axis the RHS (consequent). The antecedent itemsets combine different types of vegetables and fruit as well as jam and croissant. The consequent itemsets are orange juice and other vegetables.

The rules listed below illustrate that when a basket contains various vegetables or fruit, it is also likely to contain either other vegetables or orange juice.

inspect(head(sort(sort(rules, by ="confidence"),by="support"),17))

##      lhs                     rhs                 support confidence coverage     lift count
## [1]  {rasberry jam}       => {orange juice}     0.039000  0.6500000 0.060000 2.541544   312
## [2]  {rasberry jam}       => {other vegetables} 0.031250  0.5208333 0.060000 2.728662   250
## [3]  {other vegetables,                                                                    
##       rasberry jam}       => {orange juice}     0.019750  0.6320000 0.031250 2.471163   158
## [4]  {orange juice,                                                                        
##       rasberry jam}       => {other vegetables} 0.019750  0.5064103 0.039000 2.653099   158
## [5]  {rasberry jam,                                                                        
##       root vegetables}    => {orange juice}     0.014625  0.6685714 0.021875 2.614160   117
## [6]  {rasberry jam,                                                                        
##       tropical fruit}     => {orange juice}     0.014500  0.7116564 0.020375 2.782625   116
## [7]  {other vegetables,                                                                    
##       whipped/sour cream} => {orange juice}     0.014500  0.5110132 0.028375 1.998097   116
## [8]  {root vegetables,                                                                     
##       tropical fruit}     => {other vegetables} 0.012875  0.5819209 0.022125 3.048702   103
## [9]  {root vegetables,                                                                     
##       tropical fruit}     => {orange juice}     0.012875  0.5819209 0.022125 2.275351   103
## [10] {croissant,                                                                           
##       root vegetables}    => {orange juice}     0.012750  0.5284974 0.024125 2.066461   102
## [11] {other vegetables,                                                                    
##       pip fruit}          => {orange juice}     0.012750  0.5177665 0.024625 2.024502   102
## [12] {domestic eggs,                                                                       
##       other vegetables}   => {orange juice}     0.012625  0.5906433 0.021375 2.309456   101
## [13] {rasberry jam,                                                                        
##       root vegetables}    => {other vegetables} 0.012000  0.5485714 0.021875 2.873983    96
## [14] {butter,                                                                              
##       other vegetables}   => {orange juice}     0.011375  0.5909091 0.019250 2.310495    91
## [15] {croissant,                                                                           
##       rasberry jam}       => {orange juice}     0.011125  0.6641791 0.016750 2.596986    89
## [16] {rasberry jam,                                                                        
##       tropical fruit}     => {other vegetables} 0.010875  0.5337423 0.020375 2.796293    87
## [17] {citrus fruit,                                                                        
##       root vegetables}    => {other vegetables} 0.010500  0.6131387 0.017125 3.212252    84
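
The quality measures in the table are linked by simple arithmetic. As a minimal sketch, the values for rule [1] can be recomputed in base R from the printed count and coverage, assuming the 8000 transactions stated in the introduction. The variable names are illustrative, and the support of the RHS is not printed in the table; it is derived here from confidence and lift.

```r
# Recompute the measures of rule [1] {rasberry jam} => {orange juice}.
n_trans  <- 8000      # dataset size stated in the introduction
count    <- 312       # baskets containing both LHS and RHS (column "count")
coverage <- 0.060000  # support of the LHS alone (column "coverage")
lift_tab <- 2.541544  # lift as printed in the table

support     <- count / n_trans        # share of all baskets matching the whole rule
confidence  <- support / coverage     # share of LHS baskets that also contain the RHS
rhs_support <- confidence / lift_tab  # implied support of {orange juice} (derived, not printed)

print(c(support = support, confidence = confidence, rhs_support = rhs_support))
```

Read this way, a lift of about 2.5 means that orange juice appears roughly 2.5 times more often in baskets with raspberry jam than in baskets overall.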

matrix_rules = head(sort(sort(rules, by ="confidence"),by="support"),17)
plot(matrix_rules, method="matrix", measure="lift")

## Itemsets in Antecedent (LHS)
##  [1] "{citrus fruit,root vegetables}"       
##  [2] "{rasberry jam,tropical fruit}"        
##  [3] "{rasberry jam,root vegetables}"       
##  [4] "{root vegetables,tropical fruit}"     
##  [5] "{orange juice,rasberry jam}"          
##  [6] "{rasberry jam}"                       
##  [7] "{croissant,rasberry jam}"             
##  [8] "{other vegetables,rasberry jam}"      
##  [9] "{butter,other vegetables}"            
## [10] "{domestic eggs,other vegetables}"     
## [11] "{croissant,root vegetables}"          
## [12] "{other vegetables,pip fruit}"         
## [13] "{other vegetables,whipped/sour cream}"
## Itemsets in Consequent (RHS)
## [1] "{orange juice}"     "{other vegetables}"
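
To make the layout of the matrix plot concrete, the following base R sketch arranges the lift of a few rules as an RHS-by-LHS matrix. It is only an illustration of the idea, not how arulesViz builds the plot internally; the data frame `rules_df` is a hypothetical helper holding five rules copied from the table above.

```r
# A few rules copied from the inspect() output (LHS, RHS, lift).
rules_df <- data.frame(
  lhs  = c("{rasberry jam}", "{rasberry jam}",
           "{root vegetables,tropical fruit}", "{root vegetables,tropical fruit}",
           "{citrus fruit,root vegetables}"),
  rhs  = c("{orange juice}", "{other vegetables}",
           "{other vegetables}", "{orange juice}",
           "{other vegetables}"),
  lift = c(2.541544, 2.728662, 3.048702, 2.275351, 3.212252),
  stringsAsFactors = FALSE
)

# One cell per (RHS, LHS) pair; pairs without a rule stay NA.
lift_matrix <- tapply(rules_df$lift, list(rules_df$rhs, rules_df$lhs), sum)
print(round(lift_matrix, 2))
```

Each filled cell corresponds to one coloured square in the matrix plot, and the NA cells correspond to LHS and RHS combinations for which no rule was found.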

A further visualization shows the 17 association rules as a network graph.

# 'main' is not a recognised control parameter for this plot method, so it is omitted.
plot(rules, method="graph", engine="htmlwidget", measure="support", shading="lift")

Finally, a grouped matrix for the 17 rules is introduced. Taking into consideration the LHS and RHS groups of itemsets, as well as the relationship between lift and support, the rules with LHS = raspberry jam and RHS = orange juice or other vegetables have the highest support, at 3.9% and 3.1% respectively.

# 'main' is not a recognised control parameter for this plot method, so it is omitted.
plot(rules, method="grouped")

Methodology - Croissant Rules

One of the aims of the project is to conduct initial marketing research by examining the rules for one specific product: croissant. The resulting rules can inform not only marketing strategies but also sales and cross-selling initiatives aimed at achieving a sufficient ROI.

For that purpose, a new set of rules is mined with croissant as the consequent. Running the Apriori algorithm and sorting the rules by confidence makes it possible to assess which products customers buy together with a croissant. This is a valuable insight for marketing managers, because it associates croissant with other goods commonly purchased by potential customers, e.g. {orange juice, other vegetables, soda, waffles}.
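
To illustrate what fixing the consequent to a single item means, here is a minimal base R sketch on hypothetical toy baskets (not taken from the sales dataset). It is not the Apriori algorithm, only a brute-force check of single-item antecedents, and all names in it are illustrative.

```r
# Toy baskets (hypothetical data).
baskets <- list(
  c("croissant", "orange juice", "soda"),
  c("croissant", "orange juice"),
  c("orange juice", "butter"),
  c("croissant", "butter"),
  c("soda", "butter")
)
target <- "croissant"

has_target <- vapply(baskets, function(b) target %in% b, logical(1))
lhs_items  <- setdiff(unique(unlist(baskets)), target)

# For each single-item LHS, compute support, confidence and lift of {item} => {croissant}.
rule_table <- do.call(rbind, lapply(lhs_items, function(item) {
  has_lhs <- vapply(baskets, function(b) item %in% b, logical(1))
  conf    <- sum(has_lhs & has_target) / sum(has_lhs)
  data.frame(
    lhs        = item,
    support    = mean(has_lhs & has_target),
    confidence = conf,
    lift       = conf / mean(has_target),
    stringsAsFactors = FALSE
  )
}))

# Sort by confidence, highest first, as done for the real rules.
rule_table <- rule_table[order(-rule_table$confidence), ]
print(rule_table)
```

In this toy data only the orange juice rule has a lift above 1, which is the kind of pattern the real search looks for at scale and with multi-item antecedents.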

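# Mine rules whose consequent (RHS) is fixed to "croissant"; all other items may appear in the LHS.
rules.cr <- apriori(data=trans, parameter=list(supp=0.001, conf=0.08),
                    appearance=list(default="lhs", rhs="croissant"), control=list(verbose=FALSE))

# Sort by confidence and show the top six rules.
rules.cr.byconf <- sort(rules.cr, by="confidence", decreasing=TRUE)
inspect(head(rules.cr.byconf))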

##     lhs                     rhs          support confidence coverage     lift count
## [1] {orange juice,                                                                 
##      other vegetables,                                                             
##      soda,                                                                         
##      waffles}            => {croissant} 0.001000  0.8888889 0.001125 4.811307     8
## [2] {butter,                                                                       
##      pastry,                                                                       
##      pork}               => {croissant} 0.001000  0.8000000 0.001250 4.330176     8
## [3] {newspapers,                                                                   
##      spread cheese}      => {croissant} 0.001250  0.7692308 0.001625 4.163631    10
## [4] {sausage,                                                                      
##      specialty bar}      => {croissant} 0.001375  0.7333333 0.001875 3.969328    11
## [5] {hamburger meat,                                                               
##      root vegetables,                                                              
##      tropical fruit}     => {croissant} 0.001000  0.7272727 0.001375 3.936524     8
## [6] {citrus fruit,                                                                 
##      orange juice,                                                                 
##      pastry,                                                                       
##      whipped/sour cream} => {croissant} 0.001250  0.7142857 0.001750 3.866228    10
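
When reading these rules it helps to check how many baskets they rest on. The sketch below recomputes this from the printed columns, assuming the 8000 transactions stated in the introduction; the support of {croissant} is not printed and is derived here from confidence and lift.

```r
# Values copied from the six croissant rules printed above.
n_trans    <- 8000  # dataset size stated in the introduction
count      <- c(8, 8, 10, 11, 8, 10)
coverage   <- c(0.001125, 0.001250, 0.001625, 0.001875, 0.001375, 0.001750)
confidence <- c(0.8888889, 0.8000000, 0.7692308, 0.7333333, 0.7272727, 0.7142857)
lift       <- c(4.811307, 4.330176, 4.163631, 3.969328, 3.936524, 3.866228)

lhs_count      <- round(coverage * n_trans)  # baskets containing the whole LHS
croissant_supp <- confidence / lift          # implied support of {croissant} (derived)

print(data.frame(count, lhs_count, croissant_supp = round(croissant_supp, 4)))
```

Each of these rules is backed by only 8 to 11 baskets out of 8000, so the high confidence values are best treated as leads for further marketing research rather than firm conclusions.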

Although croissant is not the most frequently bought item, it is often bought by customers who prefer salty snacks, beverages and fruit. The rules above can be plotted as a graph.

# 'cex' is not a recognised control parameter for this plot method, so it is omitted.
plot(rules.cr, method="graph")

## Warning: Too many rules supplied. Only plotting the best 100 using
## 'lift' (change control parameter max if needed).

Final results and analysis

The purpose of this project was to compute association rules for the sales transactions of a supermarket. In addition, preliminary marketing research was carried out for the croissant product. The Market Basket Analysis was conducted on the sales transactions, using the Eclat and Apriori algorithms to create sets of rules. Moreover, a variety of quality measures were computed for the association rules. The overall analysis was supported by a matrix, a grouped matrix, a parallel coordinates plot and graphs that illustrate customer purchase patterns for all products and for croissant only.

Unsupervised Learning

Karolina Szczęsna

25 02 2022