Association rule mining is an unsupervised learning technique which aims to describe and discover regularities between items in transaction data.
It is often used in market basket analysis to check whether there are general patterns in customers' behaviour:
If a customer buys X, they also tend to buy Y.
Statements of this form help the sales department improve its knowledge of customers' behaviour.
The main goal of this analysis is to apply the most common algorithm of this kind, Apriori, to uncover interesting patterns in consumers' purchases.
The data used in this project contains information about customers buying different grocery items at a mall, and you can find it on Kaggle: https://www.kaggle.com/roshansharma/market-basket-optimization/version/1.
As the summary output below shows, there are 7500 transactions and 119 products.
library(arules)   # for read.transactions(), apriori(), inspect(), dissimilarity(), affinity()

# one transaction per row, items separated by commas; the first row is a header
data <- read.transactions("Market_Basket_Optimisation.csv",
                          format = "basket", sep = ",", header = T)
summary(data)
## transactions as itemMatrix in sparse format with
## 7500 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03287171
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1787 1348 1306 1282 1229
## (Other)
## 22386
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19
## 1 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.912 5.000 19.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
The output above already reveals the most frequent items in the data; let's try to present them on a graph.
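One way to draw this plot (my choice of function; the original code is not shown here) is itemFrequencyPlot() from arules:
# plot the 10 most frequent items by raw count rather than relative support
itemFrequencyPlot(data, topN = 10, type = "absolute",
                  main = "Most frequent items")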
First of all, I have to create the rules using the Apriori algorithm.
There are three main indicators used to assess the quality of rules: support, confidence and lift.
In order to obtain any results to analyse at all, the thresholds had to be lowered, so I set them to 0.01 (support) and 0.4 (confidence).
17 rules were found.
rules = apriori(data, parameter = list(supp = 0.01, conf = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 75
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7500 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Support is the fraction of all transactions in which a given itemset appears; in other words, it is the probability of observing a transaction that contains all of its items together. (The "Absolute minimum support count: 75" in the log above is simply 0.01 * 7500 transactions.)
\[Support(x) = \frac{Count(x)}{N}\] where x represents an itemset and N represents the total number of transactions.
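As a quick sanity check (my own verification, not part of the original analysis), the support of a single item should equal its count from the summary divided by the number of transactions:
# itemFrequency() returns relative support by default,
# so these two numbers should match
1787 / 7500                            # 0.2382667
itemFrequency(data)["mineral water"]   # ~0.238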
Analysing the rules with the highest support (ranging from about 1.6% up to 4.1%):
rules_support = sort(rules, by = "support", decreasing = TRUE)
inspect(head(rules_support))
## lhs rhs support confidence
## [1] {ground beef} => {mineral water} 0.04093333 0.4165536
## [2] {olive oil} => {mineral water} 0.02746667 0.4178499
## [3] {soup} => {mineral water} 0.02306667 0.4564644
## [4] {ground beef,spaghetti} => {mineral water} 0.01706667 0.4353741
## [5] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [6] {chocolate,spaghetti} => {mineral water} 0.01586667 0.4047619
## coverage lift count
## [1] 0.09826667 1.748266 307
## [2] 0.06573333 1.753707 206
## [3] 0.05053333 1.915771 173
## [4] 0.03920000 1.827256 128
## [5] 0.04093333 2.394361 128
## [6] 0.03920000 1.698777 119
Confidence indicates the strength of a rule: how often the rule turns out to be true.
Its maximum value is 1, which is reached when customers who buy item A always buy item B as well.
\[Confidence(x \rightarrow y) = \frac{Support(x, y)}{Support(x)}\] It is calculated as the support of items x and y together divided by the support of item x.
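We can hand-check this against the first rule listed above, {ground beef} => {mineral water} (a verification of my own, using the support and coverage columns from the output):
# confidence = support(lhs and rhs) / support(lhs), i.e. support / coverage
0.04093333 / 0.09826667   # = 0.4165536, matching the confidence column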
rules_confidence = sort(rules, by = "confidence", decreasing = TRUE)
inspect(head(rules_confidence))
## lhs rhs support confidence
## [1] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [2] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [3] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [4] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [5] {soup} => {mineral water} 0.02306667 0.4564644
## [6] {pancakes,spaghetti} => {mineral water} 0.01146667 0.4550265
## coverage lift count
## [1] 0.02000000 2.126469 76
## [2] 0.02200000 2.111207 83
## [3] 0.02306667 1.989319 82
## [4] 0.02360000 1.968075 83
## [5] 0.05053333 1.915771 173
## [6] 0.02520000 1.909736 86
Lift can be seen as a measure of correlation of sorts: it indicates how strongly the items are linked.
It can also be defined as a measure of how much more likely one item is to be purchased, relative to its typical purchase rate, given that another item has been purchased.
\[Lift(x \rightarrow y) = \frac{Confidence(x \rightarrow y)}{Support(y)}\]
In this case, the top rule tells us that a customer who buys ground beef and mineral water is about 2.4 times more likely to also buy spaghetti than a randomly chosen customer.
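As before, a quick verification of my own: spaghetti appears in 1306 of the 7500 transactions, so for {ground beef, mineral water} => {spaghetti}:
# lift = confidence / support(rhs)
0.4169381 / (1306 / 7500)   # = 2.394361, matching the lift column below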
rules_lift = sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_lift))
## lhs rhs support confidence
## [1] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [2] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [3] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [4] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [5] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [6] {soup} => {mineral water} 0.02306667 0.4564644
## coverage lift count
## [1] 0.04093333 2.394361 128
## [2] 0.02000000 2.126469 76
## [3] 0.02200000 2.111207 83
## [4] 0.02306667 1.989319 82
## [5] 0.02360000 1.968075 83
## [6] 0.05053333 1.915771 173
To go deeper into my analysis, I decided to focus on a typical product of my home country, Italy: spaghetti.
In other words, I want to find out which products are usually bought together with this famous type of pasta.
The output below shows that the strongest rule is the combination of frozen vegetables, olive oil and tomatoes with spaghetti.
We could say that a classic pasta flavour combination has been identified: it has the highest lift (4.835980).
On the other hand, the most frequent of these rules, {olive oil, tomatoes} => {spaghetti}, appears in 33 transactions in total.
rules_spaghetti = apriori(data,
parameter = list(supp = 0.002, conf = 0.6),
appearance = list(default = "lhs", rhs = "spaghetti"),
control = list(verbose = F)
)
inspect(rules_spaghetti, linebreak = FALSE)
## lhs rhs support
## [1] {french wine,ground beef} => {spaghetti} 0.002400000
## [2] {cereals,olive oil} => {spaghetti} 0.002000000
## [3] {cereals,ground beef} => {spaghetti} 0.003066667
## [4] {olive oil,tomatoes} => {spaghetti} 0.004400000
## [5] {cooking oil,ground beef,mineral water} => {spaghetti} 0.002133333
## [6] {frozen vegetables,olive oil,tomatoes} => {spaghetti} 0.002133333
## [7] {frozen vegetables,ground beef,tomatoes} => {spaghetti} 0.002000000
## [8] {mineral water,olive oil,pancakes} => {spaghetti} 0.002800000
## [9] {frozen vegetables,ground beef,olive oil} => {spaghetti} 0.002133333
## [10] {frozen vegetables,ground beef,shrimp} => {spaghetti} 0.002400000
## confidence coverage lift count
## [1] 0.6206897 0.003866667 3.564451 18
## [2] 0.6818182 0.002933333 3.915495 15
## [3] 0.6764706 0.004533333 3.884785 23
## [4] 0.6111111 0.007200000 3.509444 33
## [5] 0.6666667 0.003200000 3.828484 16
## [6] 0.8421053 0.002533333 4.835980 16
## [7] 0.6250000 0.003200000 3.589204 15
## [8] 0.6000000 0.004666667 3.445636 21
## [9] 0.6400000 0.003333333 3.675345 16
## [10] 0.7500000 0.003200000 4.307044 18
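If we only want the strongest of these associations, we can filter the rule set before plotting (a small sketch of my own using arules' subset(); based on the output above it should keep rules [6] and [10], the two with lift above 4):
# keep only the rules with lift greater than 4
strong_spaghetti <- subset(rules_spaghetti, lift > 4)
inspect(strong_spaghetti)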
Let's try to plot the 10 rules created above.
library(arulesViz)   # supplies the plot() methods for rules objects

plot(rules_spaghetti, method = "graph", cex = 0.7)
plot(rules_spaghetti, method = "paracoord", cex = 0.7)
In addition to the basic measures (support, confidence, lift), there are other measures that can be computed to gain a deeper knowledge of the data: the Jaccard dissimilarity and the affinity (similarity) measure.
Those two measures will be calculated on the most frequent items (those with support above 10%).
We can also calculate the dissimilarity of items using the Jaccard index, which is based on probability calculus.
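For two items x and y, the Jaccard dissimilarity can be written as (my formulation of the standard definition, not spelled out in the original post):
\[d_{Jaccard}(x, y) = 1 - \frac{Support(x, y)}{Support(x) + Support(y) - Support(x, y)}\]
that is, one minus the ratio of transactions containing both items to transactions containing at least one of them.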
Checking the product dissimilarity below, the most dissimilar pairs are green tea and milk (0.928), chocolate and green tea (0.914), and french fries and milk (0.914).
df <- data[, itemFrequency(data) > 0.1]   # keep only items with support > 10%
J_index <- dissimilarity(df, which = "items")
round(J_index, digits = 3)
## chocolate eggs french fries green tea milk mineral water
## eggs 0.893
## french fries 0.885 0.884
## green tea 0.914 0.911 0.896
## milk 0.877 0.889 0.914 0.928
## mineral water 0.849 0.861 0.910 0.909 0.850
## spaghetti 0.869 0.885 0.913 0.905 0.868 0.831
plot(hclust(J_index, method = "ward.D2"), main = "Dendrogram for items")
In contrast to the Jaccard index, let's use the affinity measure to discover the similarity of items.
The lowest affinity, and therefore the least likely pairing, is green tea with milk (0.0721); the strongest is mineral water with spaghetti (0.1694).
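Note (my own observation, confirmed by comparing the two matrices): for items, affinity is simply the Jaccard similarity,
\[Affinity(x, y) = 1 - d_{Jaccard}(x, y)\]
For example, chocolate and eggs have a dissimilarity of 0.893 and an affinity of 0.107 = 1 - 0.893.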
sim <- affinity(df)
round(sim, digits = 4)
## An object of class "ar_similarity"
## chocolate eggs french fries green tea milk mineral water
## chocolate 0.0000 0.1070 0.1145 0.0861 0.1230 0.1507
## eggs 0.1070 0.0000 0.1158 0.0890 0.1106 0.1388
## french fries 0.1145 0.1158 0.0000 0.1040 0.0857 0.0898
## green tea 0.0861 0.0890 0.1040 0.0000 0.0721 0.0912
## milk 0.1230 0.1106 0.0857 0.0721 0.0000 0.1501
## mineral water 0.1507 0.1388 0.0898 0.0912 0.1501 0.0000
## spaghetti 0.1312 0.1151 0.0869 0.0949 0.1322 0.1694
## spaghetti
## chocolate 0.1312
## eggs 0.1151
## french fries 0.0869
## green tea 0.0949
## milk 0.1322
## mineral water 0.1694
## spaghetti 0.0000
## Slot "method":
## [1] "Affinity"