Association rule mining is an unsupervised learning technique which aims to describe and discover regularities between items in transaction data.
It is often used in market basket analysis to check whether there are general patterns in customers' behaviour:
If a customer buys X, they also tend to buy Y.
Statements of this form help the sales department improve its knowledge of customers' behaviour.
The main goal of this analysis is to apply the most common algorithm of this kind, Apriori, to uncover interesting patterns in consumers' purchases.
The data used in this project contains information about customers buying different grocery items at a mall, and you can find it on Kaggle: https://www.kaggle.com/roshansharma/market-basket-optimization/version/1.
As the summary output below shows, there are 7500 transactions and 119 products.
library(arules)   # for read.transactions(), apriori(), inspect(), dissimilarity(), affinity()

# one transaction per row, items separated by commas; the first row is a header
data <- read.transactions("Market_Basket_Optimisation.csv",
                          format = "basket", sep = ",", header = T)
summary(data)
## transactions as itemMatrix in sparse format with
## 7500 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03287171
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1787 1348 1306 1282 1229
## (Other)
## 22386
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19
## 1 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.912 5.000 19.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
The output above already reveals the most frequent items in the data; let's try to present them on a graph.
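One way to draw this plot (my choice of function; the original code is not shown here) is itemFrequencyPlot() from arules:
# plot the 10 most frequent items by raw count rather than relative support
itemFrequencyPlot(data, topN = 10, type = "absolute",
                  main = "Most frequent items")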
First of all, I have to create the rules using the Apriori algorithm.
There are three main indicators used to assess the quality of rules: support, confidence and lift.
In order to obtain any results to analyse at all, the thresholds had to be lowered, so I set them to 0.01 (support) and 0.4 (confidence).
17 rules were found.
rules = apriori(data, parameter = list(supp = 0.01, conf = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 75
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7500 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Support is the fraction of all transactions in which a given itemset appears; in other words, it is the probability of observing a transaction that contains all of its items together. (The "Absolute minimum support count: 75" in the log above is simply 0.01 * 7500 transactions.)
\[Support(x) = \frac{Count(x)}{N}\] where x represents an itemset and N represents the total number of transactions.
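As a quick sanity check (my own verification, not part of the original analysis), the support of a single item should equal its count from the summary divided by the number of transactions:
# itemFrequency() returns relative support by default,
# so these two numbers should match
1787 / 7500                            # 0.2382667
itemFrequency(data)["mineral water"]   # ~0.238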
Analysing the rules with the highest support (ranging from about 1.6% up to 4.1%):
rules_support = sort(rules, by = "support", decreasing = TRUE)
inspect(head(rules_support))
## lhs rhs support confidence
## [1] {ground beef} => {mineral water} 0.04093333 0.4165536
## [2] {olive oil} => {mineral water} 0.02746667 0.4178499
## [3] {soup} => {mineral water} 0.02306667 0.4564644
## [4] {ground beef,spaghetti} => {mineral water} 0.01706667 0.4353741
## [5] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [6] {chocolate,spaghetti} => {mineral water} 0.01586667 0.4047619
## coverage lift count
## [1] 0.09826667 1.748266 307
## [2] 0.06573333 1.753707 206
## [3] 0.05053333 1.915771 173
## [4] 0.03920000 1.827256 128
## [5] 0.04093333 2.394361 128
## [6] 0.03920000 1.698777 119
Confidence indicates the strength of a rule: how often the rule turns out to be true.
Its maximum value is 1, which is reached when customers who buy item A always buy item B as well.
\[Confidence(x \rightarrow y) = \frac{Support(x, y)}{Support(x)}\] It is calculated as the support of items x and y together divided by the support of item x.
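We can hand-check this against the first rule listed above, {ground beef} => {mineral water} (a verification of my own, using the support and coverage columns from the output):
# confidence = support(lhs and rhs) / support(lhs), i.e. support / coverage
0.04093333 / 0.09826667   # = 0.4165536, matching the confidence column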
rules_confidence = sort(rules, by = "confidence", decreasing = TRUE)
inspect(head(rules_confidence))
## lhs rhs support confidence
## [1] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [2] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [3] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [4] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [5] {soup} => {mineral water} 0.02306667 0.4564644
## [6] {pancakes,spaghetti} => {mineral water} 0.01146667 0.4550265
## coverage lift count
## [1] 0.02000000 2.126469 76
## [2] 0.02200000 2.111207 83
## [3] 0.02306667 1.989319 82
## [4] 0.02360000 1.968075 83
## [5] 0.05053333 1.915771 173
## [6] 0.02520000 1.909736 86
Lift can be seen as a measure of correlation of sorts: it indicates how strongly the items are linked.
It can also be defined as a measure of how much more likely one item is to be purchased, relative to its typical purchase rate, given that another item has been purchased.
\[Lift(x \rightarrow y) = \frac{Confidence(x \rightarrow y)}{Support(y)}\]
In this case, the top rule tells us that a customer who buys ground beef and mineral water is about 2.4 times more likely to also buy spaghetti than a randomly chosen customer.
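As before, a quick verification of my own: spaghetti appears in 1306 of the 7500 transactions, so for {ground beef, mineral water} => {spaghetti}:
# lift = confidence / support(rhs)
0.4169381 / (1306 / 7500)   # = 2.394361, matching the lift column below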
rules_lift = sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_lift))
## lhs rhs support confidence
## [1] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [2] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [3] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [4] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [5] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [6] {soup} => {mineral water} 0.02306667 0.4564644
## coverage lift count
## [1] 0.04093333 2.394361 128
## [2] 0.02000000 2.126469 76
## [3] 0.02200000 2.111207 83
## [4] 0.02306667 1.989319 82
## [5] 0.02360000 1.968075 83
## [6] 0.05053333 1.915771 173
To go deeper into my analysis, I decided to focus on a typical product of my home country, Italy: spaghetti.
In other words, I want to find out which products are usually bought together with this famous type of pasta.
The output below shows that the strongest rule is the combination of frozen vegetables, olive oil and tomatoes with spaghetti.
We could say that a classic pasta flavour combination has been identified: it has the highest lift (4.835980).
On the other hand, the most frequent of these rules, {olive oil, tomatoes} => {spaghetti}, appears in 33 transactions in total.
rules_spaghetti = apriori(data,
parameter = list(supp = 0.002, conf = 0.6),
appearance = list(default = "lhs", rhs = "spaghetti"),
control = list(verbose = F)
)
inspect(rules_spaghetti, linebreak = FALSE)
## lhs rhs support
## [1] {french wine,ground beef} => {spaghetti} 0.002400000
## [2] {cereals,olive oil} => {spaghetti} 0.002000000
## [3] {cereals,ground beef} => {spaghetti} 0.003066667
## [4] {olive oil,tomatoes} => {spaghetti} 0.004400000
## [5] {cooking oil,ground beef,mineral water} => {spaghetti} 0.002133333
## [6] {frozen vegetables,olive oil,tomatoes} => {spaghetti} 0.002133333
## [7] {frozen vegetables,ground beef,tomatoes} => {spaghetti} 0.002000000
## [8] {mineral water,olive oil,pancakes} => {spaghetti} 0.002800000
## [9] {frozen vegetables,ground beef,olive oil} => {spaghetti} 0.002133333
## [10] {frozen vegetables,ground beef,shrimp} => {spaghetti} 0.002400000
## confidence coverage lift count
## [1] 0.6206897 0.003866667 3.564451 18
## [2] 0.6818182 0.002933333 3.915495 15
## [3] 0.6764706 0.004533333 3.884785 23
## [4] 0.6111111 0.007200000 3.509444 33
## [5] 0.6666667 0.003200000 3.828484 16
## [6] 0.8421053 0.002533333 4.835980 16
## [7] 0.6250000 0.003200000 3.589204 15
## [8] 0.6000000 0.004666667 3.445636 21
## [9] 0.6400000 0.003333333 3.675345 16
## [10] 0.7500000 0.003200000 4.307044 18
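If we only want the strongest of these associations, we can filter the rule set before plotting (a small sketch of my own using arules' subset(); based on the output above it should keep rules [6] and [10], the two with lift above 4):
# keep only the rules with lift greater than 4
strong_spaghetti <- subset(rules_spaghetti, lift > 4)
inspect(strong_spaghetti)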
Let's try to plot the 10 rules created above.
library(arulesViz)   # supplies the plot() methods for rules objects

plot(rules_spaghetti, method = "graph", cex = 0.7)
plot(rules_spaghetti, method = "paracoord", cex = 0.7)
In addition to the basic measures (support, confidence, lift), there are other measures that can be computed to gain a deeper knowledge of the data: the Jaccard dissimilarity and the affinity (similarity) measure.
Those two measures will be calculated on the most frequent items (those with support above 10%).
We can also calculate the dissimilarity of items using the Jaccard index, which is based on probability calculus.
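For two items x and y, the Jaccard dissimilarity can be written as (my formulation of the standard definition, not spelled out in the original post):
\[d_{Jaccard}(x, y) = 1 - \frac{Support(x, y)}{Support(x) + Support(y) - Support(x, y)}\]
that is, one minus the ratio of transactions containing both items to transactions containing at least one of them.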
Checking the product dissimilarity below, the most dissimilar pairs are green tea and milk (0.928), chocolate and green tea (0.914), and french fries and milk (0.914).
df <- data[, itemFrequency(data) > 0.1]   # keep only items with support > 10%
J_index <- dissimilarity(df, which = "items")
round(J_index, digits = 3)
## chocolate eggs french fries green tea milk mineral water
## eggs 0.893
## french fries 0.885 0.884
## green tea 0.914 0.911 0.896
## milk 0.877 0.889 0.914 0.928
## mineral water 0.849 0.861 0.910 0.909 0.850
## spaghetti 0.869 0.885 0.913 0.905 0.868 0.831
plot(hclust(J_index, method = "ward.D2"), main = "Dendrogram for items")
In contrast to the Jaccard index, let's use the affinity measure to discover the similarity of items.
The lowest affinity, and therefore the least likely pairing, is green tea with milk (0.0721); the strongest is mineral water with spaghetti (0.1694).
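Note (my own observation, confirmed by comparing the two matrices): for items, affinity is simply the Jaccard similarity,
\[Affinity(x, y) = 1 - d_{Jaccard}(x, y)\]
For example, chocolate and eggs have a dissimilarity of 0.893 and an affinity of 0.107 = 1 - 0.893.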
sim <- affinity(df)
round(sim, digits = 4)
## An object of class "ar_similarity"
## chocolate eggs french fries green tea milk mineral water
## chocolate 0.0000 0.1070 0.1145 0.0861 0.1230 0.1507
## eggs 0.1070 0.0000 0.1158 0.0890 0.1106 0.1388
## french fries 0.1145 0.1158 0.0000 0.1040 0.0857 0.0898
## green tea 0.0861 0.0890 0.1040 0.0000 0.0721 0.0912
## milk 0.1230 0.1106 0.0857 0.0721 0.0000 0.1501
## mineral water 0.1507 0.1388 0.0898 0.0912 0.1501 0.0000
## spaghetti 0.1312 0.1151 0.0869 0.0949 0.1322 0.1694
## spaghetti
## chocolate 0.1312
## eggs 0.1151
## french fries 0.0869
## green tea 0.0949
## milk 0.1322
## mineral water 0.1694
## spaghetti 0.0000
## Slot "method":
## [1] "Affinity"