Market Basket Analysis

INTRODUCTION

In the past, stores recorded their sales with pen and paper, and in some countries this still happens. Nowadays, software gives us detailed reports of how many goods a supermarket sold, which ones, to whom, and with what type of payment. But what if a supermarket wants to know whether the sale of one product is directly related to the sale of another? Intuitive examples are coffee with sugar, notebooks with pens, or bread with marmalade.

This information opens up new possibilities for stores, such as offering discounts or identifying the least purchased products so that special offers can boost their rotation. The scope of this tool is quite broad, as the following analysis shows.

This paper tries to show how useful an Unsupervised Learning method called Association Rules is for Market Basket Analysis in a supermarket or any other retail or wholesale store.

Association Rules is a method where we can find relationships or dependencies between variables in datasets. Finding these relationships between variables will provide useful information for decision making in any business.

Now we will proceed with our analysis.

LIBRARIES NEEDED

To proceed with our analysis we need the following packages.

library("arules")
library("arulesViz")
library("plotly")

DATA SET OVERVIEW

The dataset is composed of 20 columns and 7501 rows. Each row represents one customer and the goods that they bought in the supermarket in a single transaction.

nrow(shop)

## [1] 7501

ncol(shop)

## [1] 20

MORE ABOUT THE DATASET…

For instance, the first row indicates that in one transaction customer number one bought shrimp, almonds, avocado, vegetables mix, green grapes, whole weat flour (typo in the dataset), yams, cottage cheese, energy drink, tomato juice, low fat yogurt, green tea, honey, salad, mineral water, salmon, antioxydant juice, frozen smoothie, spinach and olive oil. The second row shows the goods bought in one transaction by the second customer: burgers, meatballs and eggs. The columns only hold the different goods that a customer purchased in a single transaction.

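# header = TRUE treats the first line of the file as a header, so 7500 of the 7501 rows are read as transactions
trans <- read.transactions("/Users/ayaxdiaz/Desktop/UL/Market Basket Analysis/MBA.csv", format = "basket", sep = ",", header = TRUE)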

## Warning in asMethod(object): removing duplicated items in transactions

trans

## transactions in sparse format with
##  7500 transactions (rows) and
##  119 items (columns)
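
To illustrate what the basket format means, here is a minimal sketch in base R on three made-up lines (the data and the names raw_lines, baskets, sizes and items are hypothetical, not taken from MBA.csv). It only mimics the idea of splitting each line on the separator and keeping each item once per basket, which is what the warning about duplicated items refers to; it is not how read.transactions() is implemented.

```r
# Three made-up basket lines (hypothetical data, not from MBA.csv)
raw_lines <- c("bread,milk,bread", "eggs", "milk,eggs,coffee")

# Split each line on the separator, one character vector per basket
baskets <- strsplit(raw_lines, ",", fixed = TRUE)

# Keep each item only once per basket (the first basket lists bread twice)
baskets <- lapply(baskets, unique)

# Number of distinct items per basket and the overall item list
sizes <- lengths(baskets)
items <- sort(unique(unlist(baskets)))

print(sizes)
print(items)
```

Each basket becomes a set of items, so the number of columns in the original file no longer matters.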

DATA EXPLORATION

summary(trans)

## transactions as itemMatrix in sparse format with
##  7500 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03287171 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1787          1348          1306          1282          1229 
##       (Other) 
##         22386 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19 
##    1    2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.912   5.000  19.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus
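
As a quick check of these figures, and assuming that density is the share of filled cells in the transaction-by-item matrix, the item counts printed above add up as follows:

\(\text{Total items} = 1787 + 1348 + 1306 + 1282 + 1229 + 22386 = 29338\)

\(\text{Density} = \frac{29338}{7500 \times 119} \approx 0.0329 \qquad \text{Mean basket size} = \frac{29338}{7500} \approx 3.91\)

Both values agree with the summary output.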

SOME RELEVANT INFORMATION…

itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")

Clearly, this supermarket sold a lot of mineral water, eggs, spaghetti, french fries and chocolate, to mention only the most frequent items. So it would not be surprising to find other goods related to these.

We can also see that the least purchased products were water spray, napkins, cream, bramble, tea, mashed potato and so on, so it would not be surprising to find few direct relations between these products and others.

tail(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=10)

##         ketchup         oatmeal chocolate bread         chutney   mashed potato 
##              33              33              32              31              31 
##             tea         bramble           cream         napkins     water spray 
##              29              14               7               5               3
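
Expressed as a share of the 7500 transactions, these counts are very small. A short worked example for the first and last items in the list:

\(\frac{33}{7500} = 0.0044 \qquad \frac{3}{7500} = 0.0004\)

Both fall well below the minimum support of 0.01 used in the next section, so none of these items can appear in the rules found there.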

THE ASSOCIATION RULES

As mentioned before, Association Rules is a method for finding relationships or dependencies between variables in datasets, which provides useful information for decision making in any business.

Market basket analysis is one of the most intuitive applications of association rules. It aims to analyze customer buying patterns by finding associations between the items that customers put into their baskets. Below we mine rules with a minimum support of 0.01 and a minimum confidence of 0.40.

rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.40))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 75 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7500 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
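
The line "Absolute minimum support count: 75" in this output follows directly from the support threshold and the number of transactions:

\(0.01 \times 7500 = 75\)

So an itemset has to appear in at least 75 transactions to be considered. The line "sorting and recoding items ... [75 item(s)]" appears to report something different, namely that 75 of the 119 items reach this threshold on their own; as far as we can tell, the fact that both numbers are 75 is a coincidence.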

SUPPORT (DEFINITION)

We can define support as the number of transactions in which the two items are purchased in the same basket, expressed as a fraction of the total number of transactions. The higher the support, the more often the itemset occurs.

This can mathematically be expressed as the following:

\(\text{Support} = \frac{\text{Number of transactions with both A and B items}}{\text{Total number of transactions}}\)

library("DT")
support_rules <- sort(rules, by = "support", decreasing = TRUE)
support_table <- inspect(support_rules)

##      lhs                              rhs             support    confidence
## [1]  {ground beef}                 => {mineral water} 0.04093333 0.4165536 
## [2]  {olive oil}                   => {mineral water} 0.02746667 0.4178499 
## [3]  {soup}                        => {mineral water} 0.02306667 0.4564644 
## [4]  {ground beef,spaghetti}       => {mineral water} 0.01706667 0.4353741 
## [5]  {ground beef,mineral water}   => {spaghetti}     0.01706667 0.4169381 
## [6]  {chocolate,spaghetti}         => {mineral water} 0.01586667 0.4047619 
## [7]  {milk,spaghetti}              => {mineral water} 0.01573333 0.4436090 
## [8]  {chocolate,milk}              => {mineral water} 0.01400000 0.4356846 
## [9]  {chocolate,eggs}              => {mineral water} 0.01346667 0.4056225 
## [10] {eggs,milk}                   => {mineral water} 0.01306667 0.4242424 
## [11] {frozen vegetables,spaghetti} => {mineral water} 0.01200000 0.4306220 
## [12] {pancakes,spaghetti}          => {mineral water} 0.01146667 0.4550265 
## [13] {frozen vegetables,milk}      => {mineral water} 0.01106667 0.4689266 
## [14] {ground beef,milk}            => {mineral water} 0.01106667 0.5030303 
## [15] {chocolate,ground beef}       => {mineral water} 0.01093333 0.4739884 
## [16] {olive oil,spaghetti}         => {mineral water} 0.01026667 0.4476744 
## [17] {eggs,ground beef}            => {mineral water} 0.01013333 0.5066667 
##      coverage   lift     count
## [1]  0.09826667 1.748266 307  
## [2]  0.06573333 1.753707 206  
## [3]  0.05053333 1.915771 173  
## [4]  0.03920000 1.827256 128  
## [5]  0.04093333 2.394361 128  
## [6]  0.03920000 1.698777 119  
## [7]  0.03546667 1.861817 118  
## [8]  0.03213333 1.828559 105  
## [9]  0.03320000 1.702389 101  
## [10] 0.03080000 1.780536  98  
## [11] 0.02786667 1.807311  90  
## [12] 0.02520000 1.909736  86  
## [13] 0.02360000 1.968075  83  
## [14] 0.02200000 2.111207  83  
## [15] 0.02306667 1.989319  82  
## [16] 0.02293333 1.878880  77  
## [17] 0.02000000 2.126469  76

datatable(support_table)

Sorting the rules by support, we see that ground beef bought together with mineral water is the most frequent combination among the rules (307 transactions). The least frequent is eggs and ground beef bought together with mineral water, which appears in 76 transactions.
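
As a worked example, the support of the first and last rules can be recovered from the count column and the 7500 transactions:

\(\frac{307}{7500} \approx 0.0409 \qquad \frac{76}{7500} \approx 0.0101\)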

CONFIDENCE (DEFINITION)

We can define confidence as the probability that a transaction containing the items on the left hand side of the rule also contains the item on the right hand side. The higher the confidence, the greater the likelihood that the item on the right hand side will be purchased.

This can mathematically be expressed as the following:

\(\text{Confidence} = \frac{\text{Number of transactions with both A and B items}}{\text{Total number of transactions with A}}\)

confidence_rules <- sort(rules, by = "confidence", decreasing = TRUE)
confidence_table <- inspect(confidence_rules)

##      lhs                              rhs             support    confidence
## [1]  {eggs,ground beef}            => {mineral water} 0.01013333 0.5066667 
## [2]  {ground beef,milk}            => {mineral water} 0.01106667 0.5030303 
## [3]  {chocolate,ground beef}       => {mineral water} 0.01093333 0.4739884 
## [4]  {frozen vegetables,milk}      => {mineral water} 0.01106667 0.4689266 
## [5]  {soup}                        => {mineral water} 0.02306667 0.4564644 
## [6]  {pancakes,spaghetti}          => {mineral water} 0.01146667 0.4550265 
## [7]  {olive oil,spaghetti}         => {mineral water} 0.01026667 0.4476744 
## [8]  {milk,spaghetti}              => {mineral water} 0.01573333 0.4436090 
## [9]  {chocolate,milk}              => {mineral water} 0.01400000 0.4356846 
## [10] {ground beef,spaghetti}       => {mineral water} 0.01706667 0.4353741 
## [11] {frozen vegetables,spaghetti} => {mineral water} 0.01200000 0.4306220 
## [12] {eggs,milk}                   => {mineral water} 0.01306667 0.4242424 
## [13] {olive oil}                   => {mineral water} 0.02746667 0.4178499 
## [14] {ground beef,mineral water}   => {spaghetti}     0.01706667 0.4169381 
## [15] {ground beef}                 => {mineral water} 0.04093333 0.4165536 
## [16] {chocolate,eggs}              => {mineral water} 0.01346667 0.4056225 
## [17] {chocolate,spaghetti}         => {mineral water} 0.01586667 0.4047619 
##      coverage   lift     count
## [1]  0.02000000 2.126469  76  
## [2]  0.02200000 2.111207  83  
## [3]  0.02306667 1.989319  82  
## [4]  0.02360000 1.968075  83  
## [5]  0.05053333 1.915771 173  
## [6]  0.02520000 1.909736  86  
## [7]  0.02293333 1.878880  77  
## [8]  0.03546667 1.861817 118  
## [9]  0.03213333 1.828559 105  
## [10] 0.03920000 1.827256 128  
## [11] 0.02786667 1.807311  90  
## [12] 0.03080000 1.780536  98  
## [13] 0.06573333 1.753707 206  
## [14] 0.04093333 2.394361 128  
## [15] 0.09826667 1.748266 307  
## [16] 0.03320000 1.702389 101  
## [17] 0.03920000 1.698777 119

datatable(confidence_table)

Sorting by confidence, we see that the rule {eggs, ground beef} => {mineral water} has the highest confidence, approximately 0.51. This means that about 51% of the transactions that contain eggs and ground beef also contain mineral water.
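
As a worked example for the top rule, the coverage column gives the share of transactions that contain eggs and ground beef, and the count column gives how many of those also contain mineral water:

\(0.02 \times 7500 = 150 \qquad \text{Confidence} = \frac{76}{150} \approx 0.507\)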

LIFT (DEFINITION)

We can define lift as the probability of all the items in the rule occurring together divided by the product of the probabilities of the items on the left and right hand side occurring as if there were no association between them. Lift summarizes the strength of association between the products on the left and the right hand side of the rule. A lift above 1 means that the items are bought together more often than expected under independence, so the larger the lift, the stronger the link between the two sides of the rule.

This can mathematically be expressed as the following:

\(\text{Lift} = \frac{\text{Confidence}}{\text{Expected Confidence}}\)

Expected Confidence can in turn be defined as:

\(\text{Expected Confidence} = \frac{\text{Number of transactions with B}}{\text{Total number of transactions}}\)

lift_rules <- sort(rules, by = "lift", decreasing = TRUE)
lift_table <- inspect(lift_rules)

##      lhs                              rhs             support    confidence
## [1]  {ground beef,mineral water}   => {spaghetti}     0.01706667 0.4169381 
## [2]  {eggs,ground beef}            => {mineral water} 0.01013333 0.5066667 
## [3]  {ground beef,milk}            => {mineral water} 0.01106667 0.5030303 
## [4]  {chocolate,ground beef}       => {mineral water} 0.01093333 0.4739884 
## [5]  {frozen vegetables,milk}      => {mineral water} 0.01106667 0.4689266 
## [6]  {soup}                        => {mineral water} 0.02306667 0.4564644 
## [7]  {pancakes,spaghetti}          => {mineral water} 0.01146667 0.4550265 
## [8]  {olive oil,spaghetti}         => {mineral water} 0.01026667 0.4476744 
## [9]  {milk,spaghetti}              => {mineral water} 0.01573333 0.4436090 
## [10] {chocolate,milk}              => {mineral water} 0.01400000 0.4356846 
## [11] {ground beef,spaghetti}       => {mineral water} 0.01706667 0.4353741 
## [12] {frozen vegetables,spaghetti} => {mineral water} 0.01200000 0.4306220 
## [13] {eggs,milk}                   => {mineral water} 0.01306667 0.4242424 
## [14] {olive oil}                   => {mineral water} 0.02746667 0.4178499 
## [15] {ground beef}                 => {mineral water} 0.04093333 0.4165536 
## [16] {chocolate,eggs}              => {mineral water} 0.01346667 0.4056225 
## [17] {chocolate,spaghetti}         => {mineral water} 0.01586667 0.4047619 
##      coverage   lift     count
## [1]  0.04093333 2.394361 128  
## [2]  0.02000000 2.126469  76  
## [3]  0.02200000 2.111207  83  
## [4]  0.02306667 1.989319  82  
## [5]  0.02360000 1.968075  83  
## [6]  0.05053333 1.915771 173  
## [7]  0.02520000 1.909736  86  
## [8]  0.02293333 1.878880  77  
## [9]  0.03546667 1.861817 118  
## [10] 0.03213333 1.828559 105  
## [11] 0.03920000 1.827256 128  
## [12] 0.02786667 1.807311  90  
## [13] 0.03080000 1.780536  98  
## [14] 0.06573333 1.753707 206  
## [15] 0.09826667 1.748266 307  
## [16] 0.03320000 1.702389 101  
## [17] 0.03920000 1.698777 119

datatable(lift_table)

In our case, the rule {ground beef, mineral water} => {spaghetti} has the highest lift, about 2.39. This suggests that the items on the left hand side and the right hand side are purchased together about 2.4 times as often as we would expect if they were unrelated.
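
The three definitions can be tied together in a few lines of base R. The following is only a sketch on five made-up baskets (hypothetical data; the helper names support, confidence and lift are ours and are not the arules functions), followed by a cross-check of the top lift value using the confidence reported above and the 1306 transactions with spaghetti from the summary output.

```r
# Five made-up baskets (hypothetical; not the supermarket data)
baskets <- list(
  c("ground beef", "mineral water", "spaghetti"),
  c("ground beef", "mineral water"),
  c("mineral water", "eggs"),
  c("spaghetti", "eggs"),
  c("ground beef", "spaghetti")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# confidence(lhs => rhs) = support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

# lift(lhs => rhs) = confidence(lhs => rhs) / support(rhs)
lift <- function(lhs, rhs, baskets) {
  confidence(lhs, rhs, baskets) / support(rhs, baskets)
}

s    <- support(c("ground beef", "mineral water"), baskets)    # 2 of 5 baskets
conf <- confidence("ground beef", "mineral water", baskets)    # (2/5) / (3/5)
l    <- lift("ground beef", "mineral water", baskets)          # (2/3) / (3/5)

# Cross-check with the report: confidence of the top rule divided by
# the share of transactions that contain spaghetti (1306 of 7500)
reported_lift <- 0.4169381 / (1306 / 7500)

print(c(support = s, confidence = conf, lift = l))
print(reported_lift)
```

On the toy data the lift is only slightly above 1, while the cross-check reproduces the value in the lift column for {ground beef, mineral water} => {spaghetti}.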

Next we will visualize these rules (support, confidence and lift) together.

plot(rules, engine="plotly")

WHAT IF THE SUPERMARKET WANTS TO ANALYZE A SPECIFIC PRODUCT?…

Now we will analyze mineral water and its relation to other products. With this information we could design a business strategy to boost sales and help the supermarket get the best out of it. Here we restrict the right hand side of the rules to mineral water, lower the minimum support to 0.001 and raise the minimum confidence to 0.9.

water_rules <- apriori(
  data = trans,
  parameter = list(supp = 0.001, conf = 0.9),
  appearance = list(default = "lhs", rhs = "mineral water"),
  control = list(verbose = FALSE)
)
water_rules_table <- inspect(water_rules, linebreak = FALSE)

##     lhs                                               rhs            
## [1] {red wine,soup}                                => {mineral water}
## [2] {ground beef,light cream,olive oil}            => {mineral water}
## [3] {ground beef,pancakes,whole wheat rice}        => {mineral water}
## [4] {cake,olive oil,shrimp}                        => {mineral water}
## [5] {frozen vegetables,milk,spaghetti,turkey}      => {mineral water}
## [6] {chocolate,frozen vegetables,olive oil,shrimp} => {mineral water}
##     support     confidence coverage    lift     count
## [1] 0.001866667 0.9333333  0.002000000 3.917180 14   
## [2] 0.001200000 1.0000000  0.001200000 4.196978  9   
## [3] 0.001333333 0.9090909  0.001466667 3.815435 10   
## [4] 0.001200000 1.0000000  0.001200000 4.196978  9   
## [5] 0.001200000 0.9000000  0.001333333 3.777280  9   
## [6] 0.001200000 0.9000000  0.001333333 3.777280  9

datatable(water_rules_table)

HOW MANY RULES DID WE FIND RELATED TO MINERAL WATER?

plot(water_rules, method="graph")

How can we know how many rules lead to mineral water? We can count the arrows pointing to mineral water in the graph: there are 6, which matches the six rules in the table above (length(water_rules) returns the same number). Let's analyze them.

Dots in a stronger red indicate a stronger relationship to the product we are analyzing, in this case mineral water, while lighter dots indicate a weaker one.

For example, customers who purchase shrimp, cake and olive oil are about 4.20 times as likely to buy mineral water as the average customer (lift 4.20). The same holds for customers who buy olive oil, light cream and ground beef. The table shows that both rules have a confidence of 100%, although each rests on only 9 transactions.

On the other hand, customers who buy milk, spaghetti, frozen vegetables and turkey are also likely to purchase mineral water, with a lift of about 3.78. The same happens with frozen vegetables, chocolate, shrimp and olive oil, whose lift is also about 3.78. Confidence here is 90%.
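
These lift values can be reproduced from the tables above, taking the 1787 transactions with mineral water from the summary output as the basis for the expected confidence:

\(\text{Expected Confidence} = \frac{1787}{7500} \approx 0.2383\)

\(\text{Lift} = \frac{9}{9} \times \frac{7500}{1787} \approx 4.20 \qquad \text{Lift} = \frac{9}{10} \times \frac{7500}{1787} \approx 3.78\)

In the first case all 9 transactions that contain the left hand side also contain mineral water; in the second case 9 out of 10 do.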

SUMMARY

In this case study we showed how a supermarket can see which goods are purchased most often and how they relate to others.

For example, the analysis suggests that this supermarket could offer a slight discount on mineral water to a customer who hesitates to buy it after buying ground beef, light cream and olive oil; the same applies to cake, olive oil and shrimp. However, we should not discard the other relations with mineral water, such as chocolate, frozen vegetables, olive oil and shrimp, or frozen vegetables, milk, spaghetti and turkey.

This information could also be used to increase customer visits. Say a customer did not buy these articles together last time: we could run an individualized marketing campaign telling the customer that buying this bundle within the next 3 days earns a discount. That would bring the customer back to the store to claim the discount, or at least to make some other purchase, which the store could not have encouraged without this information.

In other words, Association Rules, and specifically Market Basket Analysis, is a powerful tool that shows us how to bundle goods or even services in different industries to boost sales and rotate inventory, among other uses. It could also help us profile individual customers by learning their habits and offering them special discounts.

REFERENCES

Lecture materials

Author: Ayax Fabian Diaz Noriega

Date: 03/03/2022