This is a market basket analysis based on a sample dataset from Kaggle.
Source: https://www.kaggle.com/carrie1/ecommerce-data#data.csv
#load packages
library(tidyverse)
library(arules)
library(arulesViz)
library(htmlwidgets)
library(plotly)
#load dataset
library(readr)
data <- read_csv("~/Downloads/data.csv")
#view glimpse
glimpse(data)
## Rows: 541,909
## Columns: 8
## $ InvoiceNo <chr> "536365", "536365", "536365", "536365", "536365", "5363...
## $ StockCode <chr> "85123A", "71053", "84406B", "84029G", "84029E", "22752...
## $ Description <chr> "WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANT...
## $ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, ...
## $ InvoiceDate <chr> "12/1/2010 8:26", "12/1/2010 8:26", "12/1/2010 8:26", "...
## $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1...
## $ CustomerID <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850,...
## $ Country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "...
#create a transactional object and view the first few rows
df_list = split(data$Description,
data$InvoiceNo)
df_trx = as(df_list, "transactions")
inspect(head(df_trx))
## items transactionID
## [1] {CREAM CUPID HEARTS COAT HANGER,
## GLASS STAR FROSTED T-LIGHT HOLDER,
## KNITTED UNION FLAG HOT WATER BOTTLE,
## RED WOOLLY HOTTIE WHITE HEART.,
## SET 7 BABUSHKA NESTING BOXES,
## WHITE HANGING HEART T-LIGHT HOLDER,
## WHITE METAL LANTERN} 536365
## [2] {HAND WARMER RED POLKA DOT,
## HAND WARMER UNION JACK} 536366
## [3] {ASSORTED COLOUR BIRD ORNAMENT,
## BOX OF 6 ASSORTED COLOUR TEASPOONS,
## BOX OF VINTAGE ALPHABET BLOCKS,
## BOX OF VINTAGE JIGSAW BLOCKS,
## DOORMAT NEW ENGLAND,
## FELTCRAFT PRINCESS CHARLOTTE DOLL,
## HOME BUILDING BLOCK WORD,
## IVORY KNITTED MUG COSY,
## LOVE BUILDING BLOCK WORD,
## POPPY'S PLAYHOUSE BEDROOM,
## POPPY'S PLAYHOUSE KITCHEN,
## RECIPE BOX WITH METAL HEART} 536367
## [4] {BLUE COAT RACK PARIS FASHION,
## JAM MAKING SET WITH JARS,
## RED COAT RACK PARIS FASHION,
## YELLOW COAT RACK PARIS FASHION} 536368
## [5] {BATH BUILDING BLOCK WORD} 536369
## [6] {ALARM CLOCK BAKELIKE GREEN,
## ALARM CLOCK BAKELIKE PINK,
## ALARM CLOCK BAKELIKE RED,
## CHARLOTTE BAG DOLLY GIRL DESIGN,
## CIRCUS PARADE LUNCH BOX,
## INFLATABLE POLITICAL GLOBE,
## LUNCH BOX I LOVE LONDON,
## MINI JIGSAW CIRCUS PARADE,
## MINI JIGSAW SPACEBOY,
## MINI PAINT SET VINTAGE,
## PANDA AND BUNNIES STICKER SHEET,
## POSTAGE,
## RED TOADSTOOL LED NIGHT LIGHT,
## ROUND SNACK BOXES SET OF4 WOODLAND,
## SET 2 TEA TOWELS I LOVE LONDON,
## SET/2 RED RETROSPOT TEA TOWELS,
## SPACEBOY LUNCH BOX,
## STARS GIFT TAPE,
## VINTAGE HEADS AND TAILS CARD GAME,
## VINTAGE SEASIDE JIGSAW PUZZLES} 536370
#most popular
itemFrequencyPlot(df_trx,
type = 'absolute',
topN = 15,
horiz = TRUE,
col = 'indianred4',
main = 'Most popular items (absolute)')
itemFrequencyPlot(df_trx,
type = 'relative',
topN = 15,
horiz = TRUE,
col = 'indianred4',
main = 'Most popular items (relative)')
To find appropriate parameters for rule mining, I will test support levels of 1%, 2% and 3%, loop through a range of confidence levels for each, and visualize the results. (The roles could equally be reversed: fix a few confidence levels and loop through support levels instead.)
# Set of confidence levels
confidenceLevels = seq(from=0.95, to=0.5, by=-0.05)
# Create empty vector
rules_sup1 = NULL
# Apriori algorithm with a support level of 1%
for (i in 1:length(confidenceLevels)) {
rules_sup1[i] =
length(apriori(df_trx,
parameter=list(sup=0.01,
conf=confidenceLevels[i],
target="rules")))
}
# Create empty vector
rules_sup2 = NULL
# Apriori algorithm with a support level of 2%
for (i in 1:length(confidenceLevels)) {
rules_sup2[i] =
length(apriori(df_trx,
parameter=list(sup=0.02,
conf=confidenceLevels[i],
target="rules")))
}
# Create empty vector
rules_sup3 = NULL
# Apriori algorithm with a support level of 3%
for (i in 1:length(confidenceLevels)) {
rules_sup3[i] =
length(apriori(df_trx,
parameter=list(sup=0.03,
conf=confidenceLevels[i],
target="rules")))
}
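The three loops above are nearly identical. As a more compact alternative (a sketch; count_rules is a helper name introduced here, not part of the arules package), the counting can be wrapped in a function and applied over the confidence levels with sapply, producing the same vectors with less repetition:
# helper: count rules mined at a given support level across all confidence levels
count_rules = function(sup) {
  sapply(confidenceLevels, function(conf) {
    length(apriori(df_trx,
                   parameter = list(sup = sup, conf = conf, target = "rules"),
                   control = list(verbose = FALSE)))
  })
}
# e.g. rules_sup1 = count_rules(0.01); rules_sup2 = count_rules(0.02); rules_sup3 = count_rules(0.03)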
# Create data frame with all metrics to be plotted
nb_rules = data.frame(rules_sup1, rules_sup2, rules_sup3,
confidenceLevels)
# Number of rules found with a support level of 1% and 2%
rules_plot <- ggplot(data=nb_rules, aes(x=confidenceLevels)) +
# Lines and points for rules_sup1
geom_line(aes(y=rules_sup1, colour="Support level of 1%")) +
geom_point(aes(y=rules_sup1,colour="Support level of 1%")) +
# Lines and points for rules_sup2
geom_line(aes(y=rules_sup2, colour="Support level of 2%")) +
geom_point(aes(y=rules_sup2,colour="Support level of 2%")) +
# Lines and points for rules_sup3
geom_line(aes(y=rules_sup3, colour="Support level of 3%")) +
geom_point(aes(y=rules_sup3,colour="Support level of 3%")) +
# Polishing the graph
theme_bw() + ylab("") +
ggtitle("Number of extracted rules with apriori")
ggplotly(rules_plot)
Based on this graph, a minimum support of 1% with a confidence level of 80% results in 25 rules, whereas a minimum support of 2% with a confidence level of 60% results in 18 rules. Lowering the support and confidence thresholds produces more rules: the fewer transactions an itemset must appear in, and the less often the rule must hold, the more associations are found.
Therefore, when choosing parameters for the apriori algorithm, a few different approaches may be considered. For instance, if I want at least 100 rules, I might choose a minimum support of 1% and a confidence level of 65%. If, instead, I want at least 25 rules at the highest possible support and confidence levels, I might choose a support level of 1% and a confidence level of 80%.
Here I will choose a support level of 1% and a confidence level of 60%, which results in 179 rules.
#Apply the apriori algorithm to find rules:
#minimum support: 0.01 (must occur in at least 1% of all transactions)
#minimum confidence: 0.6 (rule holds in at least 60% of the transactions containing the left-hand side)
#minlen = 2 (at least two items per rule, i.e. no rules with an empty left-hand side)
rules = apriori(df_trx,
parameter = list(supp = 0.01,
conf = 0.6,
minlen = 2,
target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 259
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4211 item(s), 25900 transaction(s)] done [0.19s].
## sorting and recoding items ... [590 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.03s].
## writing ... [179 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
# find redundant rules
redundant_rules = is.redundant(rules)
# leave out redundant rules
rules1 = rules[!redundant_rules]
summary(rules1)
## set of 179 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 62 106 11
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.715 3.000 4.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01000 Min. :0.6013 Min. : 7.404 Min. :259.0
## 1st Qu.:0.01064 1st Qu.:0.6458 1st Qu.:10.371 1st Qu.:275.5
## Median :0.01174 Median :0.6880 Median :17.844 Median :304.0
## Mean :0.01316 Mean :0.7048 Mean :19.428 Mean :340.7
## 3rd Qu.:0.01301 3rd Qu.:0.7558 3rd Qu.:22.041 3rd Qu.:337.0
## Max. :0.03216 Max. :0.8991 Max. :60.260 Max. :833.0
##
## mining info:
## data ntransactions support confidence
## df_trx 25900 0.01 0.6
After removing redundant rules, the same number of rules (179) remains, i.e. none were flagged as redundant. The minimum support and confidence values match the parameters specified for the apriori algorithm above.
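As a quick check, the rules flagged as redundant can be counted directly; since all 179 rules remain, this should be zero:
# number of rules flagged by is.redundant()
sum(redundant_rules)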
The median and maximum values provide interesting insights into the retrieved rules. On average, the itemsets appear in about 1% of all transactions, with a maximum of about 3%. This suggests that a variety of products are often bought from the retailer in combination with each other, but even the most frequently co-purchased itemsets appear in no more than 3% of all transactions.
However, it must be noted that market basket analysis is not concerned with the quantities purchased, only with whether an item appears at least once in a transaction.
rules_df <- as(rules1, 'data.frame')
head(rules_df)
## rules
## 1 {PAINTED METAL PEARS ASSORTED} => {ASSORTED COLOUR BIRD ORNAMENT}
## 2 {PINK HAPPY BIRTHDAY BUNTING} => {BLUE HAPPY BIRTHDAY BUNTING}
## 3 {BLUE HAPPY BIRTHDAY BUNTING} => {PINK HAPPY BIRTHDAY BUNTING}
## 4 {CANDLEHOLDER PINK HANGING HEART} => {WHITE HANGING HEART T-LIGHT HOLDER}
## 5 {ALARM CLOCK BAKELIKE ORANGE} => {ALARM CLOCK BAKELIKE RED}
## 6 {SET OF 6 TEA TIME BAKING CASES} => {SET OF 12 FAIRY CAKE BAKING CASES}
## support confidence lift count
## 1 0.01003861 0.6989247 12.339571 260
## 2 0.01019305 0.6502463 42.853382 264
## 3 0.01019305 0.6717557 42.853382 264
## 4 0.01096525 0.7047146 7.928805 284
## 5 0.01196911 0.6709957 16.076585 310
## 6 0.01007722 0.6796875 31.604859 261
Here we can see that PAINTED METAL PEARS ASSORTED and ASSORTED COLOUR BIRD ORNAMENT appear together in about 1% of all transactions, and roughly 69% of customers who bought the first item also bought the second. A lift well above 1 (about 12 here) means the two items appear together far more often than would be expected if they were purchased independently, indicating a strong positive association.
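As an illustrative sanity check (not part of the original output): lift is the rule's confidence divided by the support of the consequent, so dividing the reported confidence by the item frequency of the consequent should reproduce a lift of roughly 12.
# lift = confidence / support(rhs); the item label must match exactly
0.6989247 / itemFrequency(df_trx)["ASSORTED COLOUR BIRD ORNAMENT"]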
plot(rules1, engine = 'plotly')
The rule with the highest lift involves REGENCY TEA PLATE PINK and REGENCY TEA PLATE GREEN.
The two items appear together in about 1% of all baskets, and roughly 90% of customers who purchased REGENCY TEA PLATE PINK also purchased REGENCY TEA PLATE GREEN. The high lift value indicates that the two purchases are strongly associated: someone who buys the first item is far more likely than average to buy the second.
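To see these top rules in text form rather than only in the interactive plot, the rules can be sorted by lift and inspected; the REGENCY TEA PLATE rules should appear at the top:
inspect(head(sort(rules1, by = "lift"), 5))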
plot(head(sort(rules1, by= "support"), 5),
method = "graph",
engine = "htmlwidget")
plot(head(sort(rules1, by= "confidence"), 5),
method = "graph",
engine = "htmlwidget")
plot(head(sort(rules1, by= "lift"), 5),
method = "graph",
engine = "htmlwidget")
For further analysis, we could restrict the rules to a specific item or itemset on the left-hand side to understand which other items people also bought when they bought it. Likewise, fixing a specific item or itemset on the right-hand side gives us the rules describing which purchases led to buying that particular item, as sketched below.
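For example (a sketch using items that appear in the rules above), subset() restricts the mined rules to those containing a given item on the left- or right-hand side, and the appearance argument of apriori() mines rules with a fixed right-hand side directly:
# rules with a specific item in the antecedent (left-hand side)
inspect(subset(rules1, lhs %in% "REGENCY TEA PLATE PINK"))
# rules leading to the purchase of a specific item (right-hand side)
inspect(subset(rules1, rhs %in% "WHITE HANGING HEART T-LIGHT HOLDER"))
# mine only rules with that item as the consequent
rules_rhs = apriori(df_trx,
                    parameter = list(supp = 0.01, conf = 0.6, minlen = 2),
                    appearance = list(default = "lhs",
                                      rhs = "WHITE HANGING HEART T-LIGHT HOLDER"))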
This information has many practical applications, for instance arranging product placement in a retail setting, clustering or segmenting customers based on their shopping patterns, or sending personalized offers based on what customers have bought in the past and might be interested in purchasing in the future.
Additionally, we may find products that are popular on their own, so that whether they are purchased does not depend on other items. On the other hand, we may also discover items that tend to be purchased mostly in combination with other products, which can help explain why a particular item sells better or worse than another.
Let’s imagine two hypothetical scenarios in a supermarket setting with products A and B.
In the first scenario, the factory that produces product A announces that, due to an unexpected logistical issue, it will not be able to send new stock to the supermarket for two weeks. The supermarket's data analyst finds that product B is often bought together with product A, with B the consequent of purchasing A. Since product B is perishable with a very short shelf life and is likely to sell far less while product A is unavailable, the supermarket can plan ahead to reduce the cost of waste: for example, it may run a campaign to promote product B on its own, or order less stock while product A has reduced availability.
In the second scenario, the sales of product B drop unexpectedly at the beginning of the week and we want to know why. The data analyst finds that product B is often bought in combination with product A, which currently has reduced availability; because customers cannot buy product A, the sales of product B drop as well. Again, the supermarket may want to plan ahead so the product does not go to waste.
Finally, an interactive table is a convenient way to explore all rules based on their metrics.
inspectDT(rules1)
If there is a very large number of products that are classified into several different product categories, we may want to find out how items in different categories influence each other.
One approach would be to run the apriori algorithm separately for each category, but then the retrieved rules would be limited to items within the same category, ignoring the fact that products in different categories may also be strongly linked.
The second approach would be to run the apriori algorithm on product categories rather than individual items in order to find connections between different types of products. This would probably not work well with very broad categories such as food vs. housewares, as the resulting rules would be too general to be useful. However, it could work well with smaller, more specific categories or sub-categories, for example to discover whether dairy products are often purchased together with meat products, or whether someone who buys a magazine is also likely to buy stationery. A rough sketch of this approach is shown below.
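As a sketch of the second approach (item_categories is a hypothetical lookup table mapping each Description to a Category; no such table exists in this dataset), the item descriptions could be replaced with their categories before building the transactions:
# join a hypothetical Description-to-Category lookup table onto the data
data_cat = data %>%
  left_join(item_categories, by = "Description")
# build transactions at the category level and mine rules as before
cat_trx = as(split(data_cat$Category, data_cat$InvoiceNo), "transactions")
cat_rules = apriori(cat_trx,
                    parameter = list(supp = 0.01, conf = 0.6, minlen = 2))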