Market Basket Optimization

Introduction

Business Goal

The objective of this project is to identify which products customers buy together and, based on that, suggest the next product likely to interest that customer or similar customers. Using Market Basket Analysis techniques such as Apriori and Eclat, we want to achieve the following business goals for our online supermarket:

  • Improve the online shopping experience of customers by suggesting what they should add next to their cart. This reduces the time needed to search for a desired product and gives customers an idea of which products to purchase together, especially when they are buying something for the first time, for example, the ingredients to bake a cake.

  • Develop an effective pricing, cross-sell, and upsell strategy by identifying customer behavior across different demographic and socio-economic groups.

  • Improve inventory control by identifying demand for products and predicting sales in specific locations or at specific times of the year.


What is MBA?

Market Basket Analysis (MBA) is a modelling technique that helps us analyze transactional datasets and derive association rules from them. For example, if we apply MBA to the transactional data of a retail store, it would uncover rules telling us that if customers purchase a certain set of products, they are more (or less) likely to purchase another product or set of products. For instance, if you are shopping online and buy a mobile phone but have not yet bought a screen protector, you are more likely to buy a screen protector at the same time than somebody who did not buy a mobile phone. The set of products a customer purchases is described as an itemset, and the MBA process tries to derive links between purchases.

Typically, the link (or relationship) takes the form of a rule such as:

  • {mobile phone} => {screen protector} which represents {X} => {Y}

Suppose we have a total of 10 transactions: the mobile phone (X) occurred in 8 of those 10 transactions, the screen protector (Y) occurred in 6 of 10, and the mobile phone and screen protector (X and Y) were bought together in 6 of those 10 transactions. Then:

  • Support is \[\mathrm{Supp}(X \cup Y) = \frac{\mathrm{freq}(X, Y)}{\text{Total Transactions}} = \frac{6}{10} = 0.6\]

  • Confidence is \[\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)} = \frac{0.6}{8/10} = 0.75\] which shows that the rule holds for 75% of the transactions containing a mobile phone.

  • Lift is a measure used to filter or rank the discovered rules. It is defined as \[\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)\,\mathrm{Supp}(Y)} = \frac{0.6}{(8/10) \times (6/10)} = 1.25\] The higher the lift value, the stronger the association between the LHS and RHS items. A small sketch computing these three measures follows.
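
To make these definitions concrete, here is a minimal sketch in base R (hypothetical variable names) that reproduces the three measures for the toy example above:

total_txns <- 10  # total number of transactions
freq_x     <- 8   # transactions containing the mobile phone (X)
freq_y     <- 6   # transactions containing the screen protector (Y)
freq_xy    <- 6   # transactions containing both X and Y

support    <- freq_xy / total_txns                                       # 0.6
confidence <- support / (freq_x / total_txns)                            # 0.75
lift       <- support / ((freq_x / total_txns) * (freq_y / total_txns))  # 1.25
c(support = support, confidence = confidence, lift = lift)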

A major problem is that many of the derived association rules could be trivial for the business. Even after the amount of data has been reduced, we may still be mining rules with very high confidence across many product categories. One solution is to set a high bar for support and confidence when deriving the rules; we can then adjust these thresholds during the analysis to match our requirements. Other complications primarily arise from exhaustive categorization, combinatorial explosion (a retail store may stock thousands of different products), and the sheer number of transactions.
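
Once rules have been mined (as in the Apriori section below), the arules package also offers helpers for exactly this kind of pruning. The following is an illustrative sketch, not part of our analysis, and the thresholds here are examples only:

# Drop rules implied by simpler ones, then keep only rules that pass a
# significance test (Fisher's exact test) against the transaction data.
pruned <- rules[!is.redundant(rules)]
pruned <- pruned[is.significant(pruned, dataset)]
inspect(head(sort(pruned, by = 'lift'), 5))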

Grocery Store

How is it used?

Nothing in a supermarket appears in front of a customer by accident. Every item on the shelf is planned. The aim of retail stores is to increase revenue by blurring the line between what a customer needs and what they want or would buy anyway. It is no secret that many products are purchased on impulse. Market basket analysis tries to answer questions such as what a customer might have purchased if they had been given an idea or a recommendation.

Market basket analysis helps retailers understand customer behavior and answer questions that blur the line between need and desire. Using these techniques, they can decide the shelf location and promotion of products inside a store or on their website. For example, a housewife might buy healthy ingredients for a family dinner, while a bachelor might buy beer and chips. Understanding these requirements can help increase sales in several ways; besides, when you go to the store, would you not want the aisles to be arranged in a way that reduces your effort?

This is done by finding associations between two items, say X and Y, which can then be placed on the same shelf so that buyers of one item are prompted to buy the other. Moreover, promotional discounts could be applied to such items, advertisements on X could be targeted to buyers who purchase both X and Y, and X and Y could be combined into a new product to attract customers.

But this is only the basic analysis. Different ways of performing market basket analysis can uncover interesting outcomes and can also reduce the problem of a high volume of insignificant association rules.

Analysts can slice their data by store, by customer demographic group, or by time of year, such as the different seasons. Analyzing the transactions in these slices can reveal interesting patterns.

If we observe that an association rule has higher confidence in one store than in another, we can investigate the root cause and use the differences to improve the performance of the other stores as well. Similarly, if we find an unobvious purchase pattern during a specific period of the year, we can make our supply chain more efficient and plan promotions accordingly to improve sales.

Other Application Areas

Although Market Basket Analysis conjures up pictures of shopping carts and supermarket shoppers, it is important to realize that there are many other areas in which it can be applied. These include:

  • Analysis of credit card purchases.
  • Analysis of telephone calling patterns.
  • Identification of fraudulent medical insurance claims (consider cases where common rules are broken).
  • Analysis of telecom service purchases.

Note that despite the terminology, there is no requirement for all the items to be purchased at the same time. The algorithms can be adapted to look at a sequence of purchases (or events) spread out over time. A predictive market basket analysis can be used to identify sets of item purchases (or events) that generally occur in sequence, something of interest to direct marketers, criminologists and many others.
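
As a pointer, this kind of sequence mining is available in R through the arulesSequences package (assumed installed; it is not used elsewhere in this project). A minimal sketch on the toy dataset bundled with that package:

# cSPADE sequence mining on the small 'zaki' example dataset
library(arulesSequences)
data(zaki)                                          # toy sequence data shipped with the package
freq_seqs <- cspade(zaki, parameter = list(support = 0.4))
inspect(freq_seqs)                                  # frequent purchase sequences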

EDA

Packages Used

The packages that we have used are as follows:

  • arules: Provides the infrastructure to represent, manipulate and analyze transactional data and the resulting patterns (association rules and frequent itemsets). It also provides implementations of association mining algorithms such as Apriori and Eclat.

  • arulesViz: Provides various interactive visualizations for exploring association rules and itemsets.

library(arules)
library(arulesViz)

Importing the data

# Read the raw CSV once to inspect it (one transaction per row, no header)
dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)
# Re-read as an arules transactions object, removing items duplicated within a transaction
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## 1 
## 5
Observation:
  • The dataset consisted of 7501 rows. Each row corresponds to a specific customer who bought a specific basket of products. Across these baskets, we found 119 unique products in our dataset.

  • The data had to be transformed into a sparse matrix: all 119 products were taken as columns, and a cell was marked 1 if the transaction contained that product and 0 otherwise. A small sketch for verifying this representation follows.
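
The following uses standard arules accessors; the output comments reflect the summary below:

dim(dataset)              # 7501 transactions x 119 items
itemLabels(dataset)[1:5]  # the first few product (column) labels
inspect(dataset[1:3])     # the items in the first three baskets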

Summarizing the data

summary(dataset)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus
Observation:
  • The summary function shows the basket-size distribution: there were 1754 baskets containing only one product, then 1358 baskets containing two products, 1044 baskets containing three products, and so on, with 20 being the maximum basket size.

  • We also have the quantiles of this distribution, with the minimum and maximum values. The minimum is a basket of one product, the maximum is a basket of 20 products, and on average people put about four products in their basket when they shop.

Plotting the Frequency distribution

itemFrequencyPlot(dataset, topN = 10, col = rainbow(2))

Observation:
  • The top ten selling items were mineral water, eggs, spaghetti, French fries, chocolate, and so on.
  • This shows the items frequently bought by French customers; the counts behind the plot can also be pulled directly, as sketched below.
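
A small sketch using itemFrequency, a standard accessor in the arules package:

# Absolute purchase counts for the ten most frequent items
head(sort(itemFrequency(dataset, type = 'absolute'), decreasing = TRUE), 10)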

Apriori

Apriori Algorithm

Apriori was introduced in 1994 and is one of the earliest algorithms for mining frequent patterns from a dataset. It works on data in a horizontal layout and mines Boolean association rules using a breadth-first, generate-and-test approach. The Apriori algorithm uses frequent itemsets of size k to find larger itemsets of size k+1. Each item is given a support count: the algorithm first scans the dataset to identify all frequent items based on their support, where the support of an item is the fraction of transactions in which it occurs. For example, if an item appears 60 times in a dataset of 100 transactions, its support is 0.6. All items with support below the minimum threshold are dropped from further processing.

The Apriori algorithm consists of 4 steps:

  1. Setting a minimum threshold for support and confidence.
  2. Selecting all itemsets in the transactions with support above the minimum support.
  3. Selecting all rules over these itemsets with confidence above the minimum confidence.
  4. Sorting the rules by decreasing lift.
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.4, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.003      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 22 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [281 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Visualising the results

inspect(sort(rules, by = 'lift')[1:10])
##      lhs                                            rhs                
## [1]  {mineral water,whole wheat pasta}           => {olive oil}        
## [2]  {spaghetti,tomato sauce}                    => {ground beef}      
## [3]  {french fries,herb & pepper}                => {ground beef}      
## [4]  {cereals,spaghetti}                         => {ground beef}      
## [5]  {frozen vegetables,mineral water,soup}      => {milk}             
## [6]  {chocolate,herb & pepper}                   => {ground beef}      
## [7]  {chocolate,mineral water,shrimp}            => {frozen vegetables}
## [8]  {frozen vegetables,mineral water,olive oil} => {milk}             
## [9]  {cereals,ground beef}                       => {spaghetti}        
## [10] {frozen vegetables,soup}                    => {milk}             
##      support     confidence lift     count
## [1]  0.003866151 0.4027778  6.115863 29   
## [2]  0.003066258 0.4893617  4.980600 23   
## [3]  0.003199573 0.4615385  4.697422 24   
## [4]  0.003066258 0.4600000  4.681764 23   
## [5]  0.003066258 0.6052632  4.670863 23   
## [6]  0.003999467 0.4411765  4.490183 30   
## [7]  0.003199573 0.4210526  4.417225 24   
## [8]  0.003332889 0.5102041  3.937285 25   
## [9]  0.003066258 0.6764706  3.885303 23   
## [10] 0.003999467 0.5000000  3.858539 30

Here we inspect the ten rules with the highest lift, sorted in decreasing order. In the graph visualization further below, arrows pointing from items to rule vertices indicate the LHS (grouped) items, and arrows from rule vertices to items indicate the RHS item. We could modify some of the parameters of the Apriori algorithm and compare the results; however, this process would be more intuitive if we had a key for each corresponding Product_ID, so we will only run the algorithm once more.

Results:

  • The first rule we obtained says that if people buy mineral water and whole wheat pasta, they will also buy olive oil in about 40 percent of the cases.
  • The second rule says that if people buy spaghetti and tomato sauce, they will also buy ground beef in about 49 percent of the cases.
  • If people buy French fries and herb & pepper, they will also buy ground beef in about 46 percent of the cases.

Plotting the graph distribution

plot(head(sort(rules, by="lift"), 10), method = "graph")

Observation:

The above graph is plotted for the 10 rules with the highest lift. The size of the bubbles indicates the support, with larger bubbles representing higher support values. The fill color represents the lift, with darker colors representing higher lift.

Plotting the two-key plot

plotly_arules(head(sort(rules, by="lift"), 10), method = "two-key plot")
Observation:

The x- and y-axes represent support and confidence, respectively; in a two-key plot, points are additionally colored by the order (number of items) of each rule.

Forming specific rules

With what products will the customer buy milk?
dataset.rhs <- subset(rules, subset = rhs %in% "milk" & lift > 3.5)
inspect((sort(dataset.rhs, by = 'lift')))
##     lhs                                            rhs    support    
## [1] {frozen vegetables,mineral water,soup}      => {milk} 0.003066258
## [2] {frozen vegetables,mineral water,olive oil} => {milk} 0.003332889
## [3] {frozen vegetables,soup}                    => {milk} 0.003999467
## [4] {chicken,olive oil}                         => {milk} 0.003599520
## [5] {frozen smoothie,mineral water,spaghetti}   => {milk} 0.003199573
## [6] {spaghetti,whole wheat pasta}               => {milk} 0.003999467
##     confidence lift     count
## [1] 0.6052632  4.670863 23   
## [2] 0.5102041  3.937285 25   
## [3] 0.5000000  3.858539 30   
## [4] 0.5000000  3.858539 27   
## [5] 0.4705882  3.631566 24   
## [6] 0.4545455  3.507763 30
Conclusion:

When the lift was greater than 3.5, six rules were formed. Based on the previous transaction history, a customer is likely to buy milk if their basket contains the items on the left-hand side of any of these rules.

Rules when the customer buys frozen vegetables and ground beef
dataset.lhs <- subset(rules, subset = lhs %in% "frozen vegetables" &  lhs %in% "ground beef" & !(rhs %in% "mineral water"))
inspect((sort(dataset.lhs, by = 'lift')))
##     lhs                    rhs             support confidence     lift count
## [1] {frozen vegetables,                                                     
##      ground beef,                                                           
##      mineral water}     => {milk}      0.003732836  0.4057971 3.131568    28
## [2] {frozen vegetables,                                                     
##      ground beef,                                                           
##      milk}              => {spaghetti} 0.003066258  0.5348837 3.072100    23
## [3] {chocolate,                                                             
##      frozen vegetables,                                                     
##      ground beef}       => {spaghetti} 0.003066258  0.5348837 3.072100    23
## [4] {frozen vegetables,                                                     
##      ground beef}       => {spaghetti} 0.008665511  0.5118110 2.939582    65
## [5] {frozen vegetables,                                                     
##      ground beef,                                                           
##      mineral water}     => {spaghetti} 0.004399413  0.4782609 2.746887    33
Conclusion:

Based on the previous transaction history, when a customer buys frozen vegetables and ground beef, they are likely to also buy milk or spaghetti. This can be seen from the five rules above.

Eclat

Eclat

Eclat is an algorithm for mining frequent itemsets from a vertical database layout, using a depth-first search. In the first step of the Eclat algorithm, the data is represented as a bit matrix: if an item occurs in a specific transaction, the corresponding bit is set to 1, otherwise 0. After transforming the data into a bit matrix, a prefix tree is constructed from it: the first level combines the first item's row with all other rows, the second level combines the second row with the remaining rows, and so on until the prefix tree is complete.

Insignificant rows are filtered out of subsequent calculations. To identify frequent itemsets, a depth-first search with backtracking is run over the prefix tree, and the common patterns found are stored in the bit matrix layout. Because of its compact prefix-tree representation, Eclat is memory-efficient and highly scalable.

Steps for building the Eclat model:

  1. Setting a minimum threshold for support.
  2. Selecting all itemsets in the transactions with support above the minimum support.
  3. Sorting the itemsets by decreasing support.
start_time_e_1 <- Sys.time()
rules_e_1 = eclat(data = dataset, parameter = list(support = 0.003, minlen = 3))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE   0.003      3     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 22 
## 
## create itemset ... 
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating sparse bit matrix ... [115 row(s), 7501 column(s)] done [0.00s].
## writing  ... [542 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].
end_time_e_1 <- Sys.time()

diff_e_1 <- end_time_e_1 - start_time_e_1

Visualising the results

inspect(sort(rules_e_1, by = 'support')[1:10])
##      items                                       support    count
## [1]  {ground beef,mineral water,spaghetti}       0.01706439 128  
## [2]  {chocolate,mineral water,spaghetti}         0.01586455 119  
## [3]  {milk,mineral water,spaghetti}              0.01573124 118  
## [4]  {eggs,mineral water,spaghetti}              0.01426476 107  
## [5]  {chocolate,milk,mineral water}              0.01399813 105  
## [6]  {chocolate,eggs,mineral water}              0.01346487 101  
## [7]  {eggs,milk,mineral water}                   0.01306492  98  
## [8]  {frozen vegetables,mineral water,spaghetti} 0.01199840  90  
## [9]  {mineral water,pancakes,spaghetti}          0.01146514  86  
## [10] {frozen vegetables,milk,mineral water}      0.01106519  83

Observation:

The sets that we obtained were:

  1. Ground beef, mineral water and spaghetti form the set of items most frequently bought together.
  2. Chocolate, mineral water and spaghetti are the next most frequent set.
  3. Milk, mineral water and spaghetti are the third most popular set.
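
Because Eclat returns frequent itemsets rather than rules, association rules can still be induced from them afterwards. A short sketch using ruleInduction from arules (the confidence threshold here is illustrative):

# Induce rules from the Eclat itemsets and rank them by lift
rules_from_eclat <- ruleInduction(rules_e_1, dataset, confidence = 0.4)
inspect(head(sort(rules_from_eclat, by = 'lift'), 5))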

Comparison

Finding the factor for comparison:

All these algorithms share the goal of finding frequently occurring patterns in a transactional dataset, but they differ in execution strategy and in performance under different conditions, such as different support/confidence thresholds and numbers of records in the dataset. Since support is the only parameter common to the two algorithms, we compare Apriori and Eclat on this factor.

Comparing the performance of the two algorithms
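
The vectors support_values and time_difference (and their counterparts support_values1 and time_difference1 for Eclat) are not defined in the code shown in this report; the following is a sketch of how they could have been collected, with hypothetical names chosen to match the plot calls below:

# Time apriori and eclat at several minimum-support thresholds (assumed approach)
support_values <- c(0.0003, 0.003, 0.01)
time_difference <- sapply(support_values, function(s) {
  start <- Sys.time()
  apriori(dataset, parameter = list(support = s, confidence = 0.4, minlen = 2),
          control = list(verbose = FALSE))
  as.numeric(difftime(Sys.time(), start, units = 'secs'))
})
support_values1 <- support_values
time_difference1 <- sapply(support_values1, function(s) {
  start <- Sys.time()
  eclat(dataset, parameter = list(support = s, minlen = 3),
        control = list(verbose = FALSE))
  as.numeric(difftime(Sys.time(), start, units = 'secs'))
})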

par(mfrow=c(1,2))
plot(support_values, time_difference, xlab = "Support", ylab = "Time Taken", main = "A priori")
lines(support_values, time_difference)  
plot(support_values1, time_difference1, xlab = "Support", ylab = "Time Taken", main = "Eclat")
lines(support_values1, time_difference1)  

We compared the execution time of both Apriori and Eclat for different support values: 0.0003, 0.003 and 0.01.

Performance of Apriori and Eclat for different Support values

Performance

To check the performance of both algorithms, we compared their execution times for different values of support on two different datasets. The first was the original dataset with 7501 transactions; the second was manually constructed and consisted of over 9 million transactions. The support values tested were 0.003 and 0.01. We then plotted two graphs to assess the execution time of the code.

Conclusion:

  • If there is a need to optimize sales and revenue for the business in the long run, one should go for the Apriori algorithm to create added value for the business.
  • But if someone is looking for simple information for short-term discounts or quick results, one should adopt the Eclat method. Eclat is feasible for small data, which fits our case perfectly; if the data had more than a million or so transactions, Apriori would have been a better choice.
  • The Eclat algorithm forms different sets of products, which helps the store manager decide the discounts on various products and their position on the aisle, and most importantly helps in assessing product requirements to stock up the inventory.
  • Based on these factors and the execution time, we can observe that Eclat performed better, and it is therefore our ‘Champion’ in this business scenario.