The main purpose of the project is to compute all association rules within the sales transactions using Market Basket Analysis. In addition, I will conduct preliminary marketing research to analyse what motivates the supermarket’s customers to buy croissants.
In my analysis, I will use the dataset “sales”, which contains 8000 records of supermarket itemsets. The methodology of the project will concentrate on: exploratory data analysis of the dataset with respect to the frequency of bought items; using the Eclat algorithm to find the most frequent itemsets; finding all association rules for the sales transactions; computing various rule quality measures; extracting the rules for croissant. The analysis will be supported by the necessary visualisations. The outcome of the project is the number of rules found and the details of the croissant rules, for further marketing analysis.
Firstly, the needed libraries are introduced:
arules - consists of data structures, mining algorithms and interest measures.
arulesViz - enables visualization of association rules.
arulesCBA - provides classification algorithms based on association rules.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(arulesCBA)
##
## Attaching package: 'arulesCBA'
## The following object is masked from 'package:arules':
##
## rules
sales <-read.csv("//Users//slawek//Downloads//sales.csv")
tail(sales)
## Item.1 Item.2 Item.3 Item.4
## 7994 beef curd frozen vegetables frozen dessert
## 7995 root vegetables UHT-milk ice cream frozen potato products
## 7996 butter milk specialty bar napkins
## 7997 sausage pork tropical fruit root vegetables
## 7998 tropical fruit pastry soda chocolate
## 7999 canned beer
## Item.5 Item.6 Item.7 Item.8 Item.9
## 7994 croissant coffee newspapers shopping bags
## 7995 bottled water soda bottled beer specialty bar napkins
## 7996
## 7997 other vegetables orange juice curd croissant soda
## 7998
## 7999
## Item.10 Item.11 Item.12 Item.13 Item.14 Item.15
## 7994
## 7995 shopping bags
## 7996
## 7997
## 7998
## 7999
The dataset contains 8000 supermarket transactions over 185 distinct items, with between 1 and 15 items per itemset.
trans<-read.transactions("sales.csv", format="basket", sep=",", skip=0)
## Warning in readLines(file, encoding = encoding): incomplete final line
## found on 'sales.csv'
trans
## transactions in sparse format with
## 8000 transactions (rows) and
## 185 items (columns)
According to the transaction summary, the most frequently bought items are:
Orange juice;
Other vegetables;
Croissant;
Soda;
Bottled water.
summary(trans)
## transactions as itemMatrix in sparse format with
## 8000 rows (elements/itemsets/transactions) and
## 185 columns (items) and a density of 0.02350135
##
## most frequent items:
## orange juice other vegetables croissant soda
## 2046 1527 1478 1406
## bottled water (Other)
## 882 27443
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1761 1344 1038 819 701 528 439 380 288 188 139 96 61 59 159
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.348 6.000 15.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
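The figures in this summary can be cross-checked by hand. The sketch below uses base R only and copies the length distribution from the output above; the variable names are mine and are not part of arules. It recovers the number of transactions, the mean basket size and the density of the sparse matrix.

```r
# Transaction length distribution copied from the summary(trans) output
sizes  <- 1:15
counts <- c(1761, 1344, 1038, 819, 701, 528, 439, 380,
            288, 188, 139, 96, 61, 59, 159)

n_trans <- sum(counts)            # number of transactions
n_cells <- sum(sizes * counts)    # total number of purchased items (non-empty cells)
n_items <- 185                    # distinct items, as reported by summary(trans)

mean_len <- n_cells / n_trans               # mean basket size
density  <- n_cells / (n_trans * n_items)   # share of non-empty cells

round(c(mean_len = mean_len, density = density), 6)
```

The mean of about 4.35 items per basket and the density of about 0.0235 agree with the values printed by summary(trans).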
This is also visible in the frequency plot below: orange juice, the most frequently purchased item, was bought more than 2000 times.
itemFrequencyPlot(trans,topN=15,type="absolute")
Before mining association rules, the Eclat algorithm is used to find the most frequent itemsets. Eclat filters itemsets by support alone; confidence comes into play later, when rules are generated. Although the default support value is 0.1, I will lower it to 0.05 so that the result also includes itemsets with more than one item.
freq_items<-eclat(trans, parameter=list(supp=0.05, maxlen=10))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 400
##
## create itemset ...
## set transactions ...[185 item(s), 8000 transaction(s)] done [0.00s].
## sorting and recoding items ... [28 item(s)] done [0.00s].
## creating sparse bit matrix ... [28 row(s), 8000 column(s)] done [0.00s].
## writing ... [31 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(freq_items)
## items support count
## [1] {orange juice, root vegetables} 0.050250 402
## [2] {croissant, orange juice} 0.057250 458
## [3] {orange juice, other vegetables} 0.073625 589
## [4] {orange juice} 0.255750 2046
## [5] {other vegetables} 0.190875 1527
## [6] {croissant} 0.184750 1478
## [7] {soda} 0.175750 1406
## [8] {root vegetables} 0.110250 882
## [9] {tropical fruit} 0.103750 830
## [10] {bottled water} 0.110250 882
## [11] {sausage} 0.092375 739
## [12] {rasberry jam} 0.060000 480
## [13] {citrus fruit} 0.082750 662
## [14] {shopping bags} 0.095750 766
## [15] {pastry} 0.087500 700
## [16] {whipped/sour cream} 0.072250 578
## [17] {pip fruit} 0.075625 605
## [18] {coke} 0.070000 560
## [19] {domestic eggs} 0.063625 509
## [20] {newspapers} 0.078000 624
## [21] {butter} 0.055250 442
## [22] {margarine} 0.058875 471
## [23] {brown bread} 0.065000 520
## [24] {bottled beer} 0.080250 642
## [25] {frankfurter} 0.058125 465
## [26] {curd} 0.053125 425
## [27] {pork} 0.056625 453
## [28] {coffee} 0.059000 472
## [29] {raspberry jam} 0.077000 616
## [30] {beef} 0.051625 413
## [31] {canned beer} 0.081125 649
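The support column is simply the count divided by the number of transactions. As a quick check in base R, using counts copied from the output above (the variable names are illustrative), the pair support can also be compared with the value it would take if the two items were bought independently:

```r
n_trans   <- 8000             # transactions in `trans`
min_count <- 0.05 * n_trans   # reported as "Absolute minimum support count: 400"

# Counts copied from inspect(freq_items)
count <- c("croissant, orange juice" = 458,
           "orange juice"            = 2046,
           "croissant"               = 1478)
support <- count / n_trans

# Support the pair would have if both items were bought independently
expected_pair <- unname(support["orange juice"] * support["croissant"])

support
expected_pair
```

The observed pair support (0.05725) is only moderately above the independence value (about 0.047), so clearing the support threshold does not by itself imply a strong association.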
The association rules are generated with the Apriori algorithm. To find suitable thresholds, three scenarios are considered:
supp=0.01, conf=0.5;
supp=0.01, conf=0.6;
supp=0.05, conf=0.5.
rules<-apriori(trans, parameter=list(supp=0.01, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 80
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[185 item(s), 8000 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules2<-apriori(trans, parameter=list(supp=0.01, conf=0.6))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 80
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[185 item(s), 8000 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [6 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules3<-apriori(trans, parameter=list(supp=0.05, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.05 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 400
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[185 item(s), 8000 transaction(s)] done [0.00s].
## sorting and recoding items ... [28 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
With support 0.01 and confidence 0.5, there are 17 association rules. Support 0.01 with confidence 0.6 yields only 6 rules, whereas the third scenario (supp=0.05, conf=0.5) yields none. Therefore, I will conduct my analysis with supp=0.01, conf=0.5, i.e. on 17 rules.
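The empty result of the third scenario can be verified from the Eclat output: at supp=0.05 only three two-item itemsets are frequent, so only six candidate rules exist. A sketch in base R, with counts copied from that output (the data frame and its column names are illustrative):

```r
# Item counts and pair counts copied from inspect(freq_items)
item_count <- c("orange juice" = 2046, "other vegetables" = 1527,
                "croissant" = 1478, "root vegetables" = 882)

pairs <- data.frame(
  a     = c("root vegetables", "croissant", "other vegetables"),
  b     = rep("orange juice", 3),
  count = c(402, 458, 589),
  stringsAsFactors = FALSE
)

# Confidence of a => b is count(a, b) / count(a); evaluate both directions
pairs$conf_a_to_b <- unname(pairs$count / item_count[pairs$a])
pairs$conf_b_to_a <- unname(pairs$count / item_count[pairs$b])

pairs
max(pairs$conf_a_to_b, pairs$conf_b_to_a)
```

The highest confidence, for {root vegetables} => {orange juice}, is about 0.46, which is below the 0.5 threshold and therefore consistent with zero rules.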
Apart from support and confidence, two more standard measures are reported for association rules: lift and support count. These metrics are often used to determine the interestingness of association patterns. In addition, many other quality measures exist to assess various characteristics of the generated rules. For the purpose of my analysis, I will also take into account conviction, the Jaccard index and hyperLift, which the authors of the paper “Analysing the quality of Association Rules by Computing an Interestingness Measures” report as performing well.
The meanings of the mentioned quality measures:
support - the share of transactions that contain all items of the rule; it is widely used as an indicator of how significant a pattern is, and because it is anti-monotone (a superset can never have higher support than its subsets) it is extensively used in the development of efficient pattern-mining algorithms;
confidence - for a rule A => B, the estimated conditional probability that B is purchased given that A is purchased, which is why it is extensively used for assessing cross-selling strategies;
lift - takes values in [0, +∞) for a rule A => B; it says how much more likely B is to be purchased when A is purchased, while controlling for the popularity of B; a value of 1 means the items are independent, so no inference can be made, a value > 1 means that A and B are bought together more often than expected, and a value < 1 means less often than expected;
support count - the absolute number of transactions that contain all items of the rule (support multiplied by the number of transactions); rules that meet the minimum support (count) and minimum confidence thresholds are sometimes called “strong” rules;
conviction - compares the probability that A appears without B if they were independent with the actual frequency of the appearance of A without B; it was developed as an alternative to confidence and shares some characteristics with lift (a value of 1 indicates independence), but unlike lift it is directed, since it also uses the information about the absence of the consequent;
Jaccard coefficient - a null-invariant measure of dependence based on the Jaccard similarity between the two sets of transactions that contain the items in A and B, respectively; it ranges from 0 to 1 and indicates how often the two itemsets are bought together relative to how often at least one of them is bought;
hyperLift - an adaptation of the lift measure in which, instead of dividing by the expected count under independence, a higher quantile of the hypergeometric count distribution is used; hyperLift is more robust for low counts and results in fewer false positives when used for rule filtering.
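To make these definitions concrete, the measures can be recomputed by hand for a single rule. The sketch below uses base R and the counts reported by Eclat for {rasberry jam} => {orange juice}; the variable names are mine. The hyperLift line assumes the 0.99 quantile of the hypergeometric distribution, which to my knowledge is the arules default but is not stated in this report.

```r
n    <- 8000   # transactions
c_a  <- 480    # count of {rasberry jam} (antecedent A)
c_b  <- 2046   # count of {orange juice} (consequent B)
c_ab <- 312    # count of transactions containing both

support    <- c_ab / n
confidence <- c_ab / c_a                        # estimate of P(B | A)
lift       <- confidence / (c_b / n)            # 1 means independence
conviction <- (1 - c_b / n) / (1 - confidence)  # 1 means independence
jaccard    <- c_ab / (c_a + c_b - c_ab)         # overlap of the two transaction sets

# hyperLift: observed count divided by a high quantile of the count expected
# under independence (hypergeometric model); 0.99 is an assumed quantile
q99       <- qhyper(0.99, m = c_b, n = n - c_b, k = c_a)
hyperlift <- c_ab / q99

round(c(support = support, confidence = confidence, lift = lift,
        conviction = conviction, jaccard = jaccard, hyperLift = hyperlift), 6)
```

Support, confidence, lift, conviction and the Jaccard coefficient computed this way agree with the row for this rule in the interestMeasure() output further below (0.039, 0.65, 2.54, 2.13 and 0.141).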
The results for the aforementioned interest measures are presented below.
#rules analysis with respect to support
rules.supp<-sort(rules, by="support", decreasing=TRUE)
inspect(rules.supp)
## lhs rhs support confidence coverage lift count
## [1] {rasberry jam} => {orange juice} 0.039000 0.6500000 0.060000 2.541544 312
## [2] {rasberry jam} => {other vegetables} 0.031250 0.5208333 0.060000 2.728662 250
## [3] {other vegetables,
## rasberry jam} => {orange juice} 0.019750 0.6320000 0.031250 2.471163 158
## [4] {orange juice,
## rasberry jam} => {other vegetables} 0.019750 0.5064103 0.039000 2.653099 158
## [5] {rasberry jam,
## root vegetables} => {orange juice} 0.014625 0.6685714 0.021875 2.614160 117
## [6] {other vegetables,
## whipped/sour cream} => {orange juice} 0.014500 0.5110132 0.028375 1.998097 116
## [7] {rasberry jam,
## tropical fruit} => {orange juice} 0.014500 0.7116564 0.020375 2.782625 116
## [8] {root vegetables,
## tropical fruit} => {other vegetables} 0.012875 0.5819209 0.022125 3.048702 103
## [9] {root vegetables,
## tropical fruit} => {orange juice} 0.012875 0.5819209 0.022125 2.275351 103
## [10] {other vegetables,
## pip fruit} => {orange juice} 0.012750 0.5177665 0.024625 2.024502 102
## [11] {croissant,
## root vegetables} => {orange juice} 0.012750 0.5284974 0.024125 2.066461 102
## [12] {domestic eggs,
## other vegetables} => {orange juice} 0.012625 0.5906433 0.021375 2.309456 101
## [13] {rasberry jam,
## root vegetables} => {other vegetables} 0.012000 0.5485714 0.021875 2.873983 96
## [14] {butter,
## other vegetables} => {orange juice} 0.011375 0.5909091 0.019250 2.310495 91
## [15] {croissant,
## rasberry jam} => {orange juice} 0.011125 0.6641791 0.016750 2.596986 89
## [16] {rasberry jam,
## tropical fruit} => {other vegetables} 0.010875 0.5337423 0.020375 2.796293 87
## [17] {citrus fruit,
## root vegetables} => {other vegetables} 0.010500 0.6131387 0.017125 3.212252 84
#rules analysis with respect to confidence
rules.conf<-sort(rules, by="confidence", decreasing=TRUE)
inspect(rules.conf)
## lhs rhs support confidence coverage lift count
## [1] {rasberry jam,
## tropical fruit} => {orange juice} 0.014500 0.7116564 0.020375 2.782625 116
## [2] {rasberry jam,
## root vegetables} => {orange juice} 0.014625 0.6685714 0.021875 2.614160 117
## [3] {croissant,
## rasberry jam} => {orange juice} 0.011125 0.6641791 0.016750 2.596986 89
## [4] {rasberry jam} => {orange juice} 0.039000 0.6500000 0.060000 2.541544 312
## [5] {other vegetables,
## rasberry jam} => {orange juice} 0.019750 0.6320000 0.031250 2.471163 158
## [6] {citrus fruit,
## root vegetables} => {other vegetables} 0.010500 0.6131387 0.017125 3.212252 84
## [7] {butter,
## other vegetables} => {orange juice} 0.011375 0.5909091 0.019250 2.310495 91
## [8] {domestic eggs,
## other vegetables} => {orange juice} 0.012625 0.5906433 0.021375 2.309456 101
## [9] {root vegetables,
## tropical fruit} => {other vegetables} 0.012875 0.5819209 0.022125 3.048702 103
## [10] {root vegetables,
## tropical fruit} => {orange juice} 0.012875 0.5819209 0.022125 2.275351 103
## [11] {rasberry jam,
## root vegetables} => {other vegetables} 0.012000 0.5485714 0.021875 2.873983 96
## [12] {rasberry jam,
## tropical fruit} => {other vegetables} 0.010875 0.5337423 0.020375 2.796293 87
## [13] {croissant,
## root vegetables} => {orange juice} 0.012750 0.5284974 0.024125 2.066461 102
## [14] {rasberry jam} => {other vegetables} 0.031250 0.5208333 0.060000 2.728662 250
## [15] {other vegetables,
## pip fruit} => {orange juice} 0.012750 0.5177665 0.024625 2.024502 102
## [16] {other vegetables,
## whipped/sour cream} => {orange juice} 0.014500 0.5110132 0.028375 1.998097 116
## [17] {orange juice,
## rasberry jam} => {other vegetables} 0.019750 0.5064103 0.039000 2.653099 158
#rules analysis with respect to lift
rules.lift<-sort(rules, by="lift", decreasing=TRUE)
inspect(rules.lift)
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## root vegetables} => {other vegetables} 0.010500 0.6131387 0.017125 3.212252 84
## [2] {root vegetables,
## tropical fruit} => {other vegetables} 0.012875 0.5819209 0.022125 3.048702 103
## [3] {rasberry jam,
## root vegetables} => {other vegetables} 0.012000 0.5485714 0.021875 2.873983 96
## [4] {rasberry jam,
## tropical fruit} => {other vegetables} 0.010875 0.5337423 0.020375 2.796293 87
## [5] {rasberry jam,
## tropical fruit} => {orange juice} 0.014500 0.7116564 0.020375 2.782625 116
## [6] {rasberry jam} => {other vegetables} 0.031250 0.5208333 0.060000 2.728662 250
## [7] {orange juice,
## rasberry jam} => {other vegetables} 0.019750 0.5064103 0.039000 2.653099 158
## [8] {rasberry jam,
## root vegetables} => {orange juice} 0.014625 0.6685714 0.021875 2.614160 117
## [9] {croissant,
## rasberry jam} => {orange juice} 0.011125 0.6641791 0.016750 2.596986 89
## [10] {rasberry jam} => {orange juice} 0.039000 0.6500000 0.060000 2.541544 312
## [11] {other vegetables,
## rasberry jam} => {orange juice} 0.019750 0.6320000 0.031250 2.471163 158
## [12] {butter,
## other vegetables} => {orange juice} 0.011375 0.5909091 0.019250 2.310495 91
## [13] {domestic eggs,
## other vegetables} => {orange juice} 0.012625 0.5906433 0.021375 2.309456 101
## [14] {root vegetables,
## tropical fruit} => {orange juice} 0.012875 0.5819209 0.022125 2.275351 103
## [15] {croissant,
## root vegetables} => {orange juice} 0.012750 0.5284974 0.024125 2.066461 102
## [16] {other vegetables,
## pip fruit} => {orange juice} 0.012750 0.5177665 0.024625 2.024502 102
## [17] {other vegetables,
## whipped/sour cream} => {orange juice} 0.014500 0.5110132 0.028375 1.998097 116
#rules analysis with respect to count
rules.count<-sort(rules, by="count", decreasing=TRUE)
inspect(rules.count)
## lhs rhs support confidence coverage lift count
## [1] {rasberry jam} => {orange juice} 0.039000 0.6500000 0.060000 2.541544 312
## [2] {rasberry jam} => {other vegetables} 0.031250 0.5208333 0.060000 2.728662 250
## [3] {other vegetables,
## rasberry jam} => {orange juice} 0.019750 0.6320000 0.031250 2.471163 158
## [4] {orange juice,
## rasberry jam} => {other vegetables} 0.019750 0.5064103 0.039000 2.653099 158
## [5] {rasberry jam,
## root vegetables} => {orange juice} 0.014625 0.6685714 0.021875 2.614160 117
## [6] {other vegetables,
## whipped/sour cream} => {orange juice} 0.014500 0.5110132 0.028375 1.998097 116
## [7] {rasberry jam,
## tropical fruit} => {orange juice} 0.014500 0.7116564 0.020375 2.782625 116
## [8] {root vegetables,
## tropical fruit} => {other vegetables} 0.012875 0.5819209 0.022125 3.048702 103
## [9] {root vegetables,
## tropical fruit} => {orange juice} 0.012875 0.5819209 0.022125 2.275351 103
## [10] {other vegetables,
## pip fruit} => {orange juice} 0.012750 0.5177665 0.024625 2.024502 102
## [11] {croissant,
## root vegetables} => {orange juice} 0.012750 0.5284974 0.024125 2.066461 102
## [12] {domestic eggs,
## other vegetables} => {orange juice} 0.012625 0.5906433 0.021375 2.309456 101
## [13] {rasberry jam,
## root vegetables} => {other vegetables} 0.012000 0.5485714 0.021875 2.873983 96
## [14] {butter,
## other vegetables} => {orange juice} 0.011375 0.5909091 0.019250 2.310495 91
## [15] {croissant,
## rasberry jam} => {orange juice} 0.011125 0.6641791 0.016750 2.596986 89
## [16] {rasberry jam,
## tropical fruit} => {other vegetables} 0.010875 0.5337423 0.020375 2.796293 87
## [17] {citrus fruit,
## root vegetables} => {other vegetables} 0.010500 0.6131387 0.017125 3.212252 84
IMs <- interestMeasure(rules, c("support", "confidence", "lift", "count", "conviction", "jaccard", "hyperLift"),
transactions = trans)
inspect(head(rules))
## lhs rhs support confidence coverage lift count
## [1] {rasberry jam} => {other vegetables} 0.031250 0.5208333 0.060000 2.728662 250
## [2] {rasberry jam} => {orange juice} 0.039000 0.6500000 0.060000 2.541544 312
## [3] {butter,
## other vegetables} => {orange juice} 0.011375 0.5909091 0.019250 2.310495 91
## [4] {domestic eggs,
## other vegetables} => {orange juice} 0.012625 0.5906433 0.021375 2.309456 101
## [5] {other vegetables,
## pip fruit} => {orange juice} 0.012750 0.5177665 0.024625 2.024502 102
## [6] {other vegetables,
## whipped/sour cream} => {orange juice} 0.014500 0.5110132 0.028375 1.998097 116
head(IMs)
## support confidence lift count conviction jaccard hyperLift
## 1 0.031250 0.5208333 2.728662 250 1.688609 0.14228799 2.252252
## 2 0.039000 0.6500000 2.541544 312 2.126429 0.14092141 2.151724
## 3 0.011375 0.5909091 2.310495 91 1.819278 0.04314841 1.750000
## 4 0.012625 0.5906433 2.309456 101 1.818096 0.04773157 1.771930
## 5 0.012750 0.5177665 2.024502 102 1.543339 0.04764129 1.569231
## 6 0.014500 0.5110132 1.998097 116 1.522025 0.05377840 1.589041
Based on the results above, support varies between about 1% and 4%, which is rather low; however, with 8000 transactions even the weakest rule is backed by more than 80 baskets, so this is not a cause for concern. As for confidence, the top rule reaches 0.71, meaning that the consequent appears in 71% of the transactions that contain the antecedent. Lift shows that the items in the best rules appear together around 2 to 3 times more often than would be expected if they were independent. Support count indicates that rules involving orange juice, rasberry jam (as the item is spelled in the data) and other vegetables appear the most frequently. For the six rules shown, conviction ranges from about 1.5 to 2.1; since a value of 1 corresponds to independence, these rules would fail 1.5 to 2.1 times more often if antecedent and consequent were unrelated. The Jaccard coefficient is low (about 0.04 to 0.14), which means that the antecedent and consequent appear together in only a small share of the baskets that contain either of them. Finally, hyperLift takes values from about 1.6 to 2.3, i.e. above 1, which confirms a positive association even under this more conservative measure. Overall, orange juice and other vegetables are the consequents of all 17 rules.
The computed results will now be visualized for further analysis. Firstly, consider a parallel coordinates plot for the 17 rules. Each line runs through the items of a rule's antecedent and ends at its consequent; since basket data carry no purchase order, the positions on the x-axis should not be read as a sequence of decisions. For instance, a customer who buys root vegetables and rasberry jam is likely to purchase other vegetables as well. Similarly, a customer who buys tropical fruit and rasberry jam will probably buy orange juice too.
plot(rules, method="paracoord", control=list(reorder=TRUE))
The scatter plot presents the 17 rules with regard to support, confidence and lift. It shows that the rules with the highest lift (above 3.00) have support of about 0.01 and confidence of roughly 0.58 to 0.61.
plot(rules, measure=c("support", "confidence"), shading="lift", jitter=0)
The matrix plot for the 17 rules, sorted by confidence and support, arranges them by antecedent and consequent. The x-axis denotes the LHS of a rule and the y-axis the RHS. The antecedent itemsets combine different types of vegetables and fruit, as well as jam and croissant. The consequent itemsets are orange juice and other vegetables.
The rules listed below illustrate that when a customer purchases various vegetables or fruit, the basket tends to contain either other vegetables or orange juice as well.
inspect(head(sort(sort(rules, by ="confidence"),by="support"),17))
## lhs rhs support confidence coverage lift count
## [1] {rasberry jam} => {orange juice} 0.039000 0.6500000 0.060000 2.541544 312
## [2] {rasberry jam} => {other vegetables} 0.031250 0.5208333 0.060000 2.728662 250
## [3] {other vegetables,
## rasberry jam} => {orange juice} 0.019750 0.6320000 0.031250 2.471163 158
## [4] {orange juice,
## rasberry jam} => {other vegetables} 0.019750 0.5064103 0.039000 2.653099 158
## [5] {rasberry jam,
## root vegetables} => {orange juice} 0.014625 0.6685714 0.021875 2.614160 117
## [6] {rasberry jam,
## tropical fruit} => {orange juice} 0.014500 0.7116564 0.020375 2.782625 116
## [7] {other vegetables,
## whipped/sour cream} => {orange juice} 0.014500 0.5110132 0.028375 1.998097 116
## [8] {root vegetables,
## tropical fruit} => {other vegetables} 0.012875 0.5819209 0.022125 3.048702 103
## [9] {root vegetables,
## tropical fruit} => {orange juice} 0.012875 0.5819209 0.022125 2.275351 103
## [10] {croissant,
## root vegetables} => {orange juice} 0.012750 0.5284974 0.024125 2.066461 102
## [11] {other vegetables,
## pip fruit} => {orange juice} 0.012750 0.5177665 0.024625 2.024502 102
## [12] {domestic eggs,
## other vegetables} => {orange juice} 0.012625 0.5906433 0.021375 2.309456 101
## [13] {rasberry jam,
## root vegetables} => {other vegetables} 0.012000 0.5485714 0.021875 2.873983 96
## [14] {butter,
## other vegetables} => {orange juice} 0.011375 0.5909091 0.019250 2.310495 91
## [15] {croissant,
## rasberry jam} => {orange juice} 0.011125 0.6641791 0.016750 2.596986 89
## [16] {rasberry jam,
## tropical fruit} => {other vegetables} 0.010875 0.5337423 0.020375 2.796293 87
## [17] {citrus fruit,
## root vegetables} => {other vegetables} 0.010500 0.6131387 0.017125 3.212252 84
matrix_rules = head(sort(sort(rules, by ="confidence"),by="support"),17)
plot(matrix_rules, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{citrus fruit,root vegetables}"
## [2] "{rasberry jam,tropical fruit}"
## [3] "{rasberry jam,root vegetables}"
## [4] "{root vegetables,tropical fruit}"
## [5] "{orange juice,rasberry jam}"
## [6] "{rasberry jam}"
## [7] "{croissant,rasberry jam}"
## [8] "{other vegetables,rasberry jam}"
## [9] "{butter,other vegetables}"
## [10] "{domestic eggs,other vegetables}"
## [11] "{croissant,root vegetables}"
## [12] "{other vegetables,pip fruit}"
## [13] "{other vegetables,whipped/sour cream}"
## Itemsets in Consequent (RHS)
## [1] "{orange juice}" "{other vegetables}"
An additional visualization is a network graph of the 17 association rules.
plot(rules, method="graph", engine="htmlwidget", measure="support", shading="lift")
Finally, a grouped matrix for the 17 rules is introduced. Looking at the LHS and RHS groups of itemsets together with the lift and support measures, the rules with LHS = rasberry jam and RHS = other vegetables or orange juice have the highest support, at about 3% and 4% respectively.
plot(rules, method="grouped")
One of the aims of the project is to conduct preliminary marketing research by examining the rules for one particular product: croissant. The results can be used for establishing not only marketing strategies, but also sales and cross-selling initiatives in order to achieve a sufficient ROI.
For that purpose, a new set of rules is generated with croissant as the consequent. Running the Apriori algorithm and sorting the rules by confidence shows which products appear in baskets that also contain croissant. Such information is a valuable insight for marketing managers, as it associates croissant with other goods commonly purchased by potential customers, e.g. {orange juice, other vegetables, soda, waffles}.
rules.cr<-apriori(data=trans, parameter=list(supp=0.001, conf=0.08),
appearance=list(default="lhs", rhs="croissant"), control=list(verbose=FALSE))
rules.cr.byconf<-sort(rules.cr, by="confidence", decreasing=TRUE)
inspect(head(rules.cr.byconf))
## lhs rhs support confidence coverage lift count
## [1] {orange juice,
## other vegetables,
## soda,
## waffles} => {croissant} 0.001000 0.8888889 0.001125 4.811307 8
## [2] {butter,
## pastry,
## pork} => {croissant} 0.001000 0.8000000 0.001250 4.330176 8
## [3] {newspapers,
## spread cheese} => {croissant} 0.001250 0.7692308 0.001625 4.163631 10
## [4] {sausage,
## specialty bar} => {croissant} 0.001375 0.7333333 0.001875 3.969328 11
## [5] {hamburger meat,
## root vegetables,
## tropical fruit} => {croissant} 0.001000 0.7272727 0.001375 3.936524 8
## [6] {citrus fruit,
## orange juice,
## pastry,
## whipped/sour cream} => {croissant} 0.001250 0.7142857 0.001750 3.866228 10
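Before reading too much into these rules, it helps to translate the top rule back into transaction counts. A small base R check using the values from the first row above (the variable names are illustrative):

```r
n_trans  <- 8000
count    <- 8                # baskets containing the whole rule
coverage <- 0.001125         # support of the antecedent (lhs)
supp_rhs <- 1478 / n_trans   # support of {croissant}, from summary(trans)

lhs_count  <- coverage * n_trans   # baskets containing the antecedent
confidence <- count / lhs_count
lift       <- confidence / supp_rhs

c(lhs_count = lhs_count, confidence = confidence, lift = lift)
```

The confidence of about 0.89 therefore rests on 8 of 9 baskets, so the rule is suggestive rather than conclusive.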
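Although croissant is not the most frequently bought item, it is often bought by customers who also buy salty snacks, beverages and fruit. Note that each of the top rules rests on only 8 to 11 transactions, so they are better treated as hypotheses for further research than as firm conclusions. The above rules can be plotted as a graph.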
plot(rules.cr, method="graph")
## Warning: Too many rules supplied. Only plotting the best 100 using
## 'lift' (change control parameter max if needed).
The purpose of this project was to find association rules in the sales transactions of a supermarket. In addition, preliminary marketing research was carried out for the croissant product. Market Basket Analysis was conducted on the sales transactions, using the Eclat algorithm to find frequent itemsets and the Apriori algorithm to create the set of rules. Moreover, a variety of quality measures were computed for the association rules. The overall analysis was supported by a scatter plot, matrix, grouped matrix, parallel coordinates plot and graphs to illustrate purchase patterns for all products and for croissant only.
www.r-project.org
www.mhahsler.github.io/arules/docs/measures
www.towardsdatascience.com
S.A. Alvarez. Chi-squared computation for association rules: preliminary results. Technical Report BC-CS-2003-01, July 2003.
P. Tan, V. Kumar, J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. Association for Computing Machinery, New York, NY, USA, 32–41, 2002.
J. Manimaran, T. Velmurugan. Analysing the quality of Association Rules by Computing an Interestingness Measures. Indian Journal of Science and Technology, Vol 8(15), DOI:10.17485/ijst/2015/v8i15/76693, July 2015.