Association rules

Market Basket Analysis

Market basket analysis is a specific application of association rules that focuses on analyzing transactions in a retail environment with the objective of identifying products that are frequently bought together. This information can be used by retailers to boost their sales and profits. By analyzing transaction data, retailers can identify patterns of customer behavior and make data-driven decisions about product placement, promotions, and pricing.

Dataset

For this analysis, I have decided to use the Online Retail Dataset, which is available through the University of California, Irvine Machine Learning Repository at [https://archive.ics.uci.edu/ml/datasets/Online+Retail].

The Online Retail Dataset is a valuable source of transnational data, as it contains detailed information on all the transactions that took place between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail company. The retailer specializes in unique all-occasion gifts and has a diverse customer base, with many of them being wholesalers.

The objective of the analysis is to identify which products are frequently purchased together in order to develop marketing strategies, optimize product placement, and improve customer satisfaction.

Importing required libraries

library(readxl)

## Warning: package 'readxl' was built under R version 4.2.2

library(arules)

## Warning: package 'arules' was built under R version 4.2.2

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(plyr)

## Warning: package 'plyr' was built under R version 4.2.2

library(arulesViz)

## Warning: package 'arulesViz' was built under R version 4.2.2

library(data.table)

Loading dataset

mydataset <- read_excel("C:/Users/rahil/Desktop/UL papers/Association rules/Online Retail.xlsx")
mydataset<- data.frame(mydataset)
head(mydataset)

##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
##           InvoiceDate UnitPrice CustomerID        Country
## 1 2010-12-01 08:26:00      2.55      17850 United Kingdom
## 2 2010-12-01 08:26:00      3.39      17850 United Kingdom
## 3 2010-12-01 08:26:00      2.75      17850 United Kingdom
## 4 2010-12-01 08:26:00      3.39      17850 United Kingdom
## 5 2010-12-01 08:26:00      3.39      17850 United Kingdom
## 6 2010-12-01 08:26:00      7.65      17850 United Kingdom

Dataset description

InvoiceNo: Invoice number - integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.

StockCode: Product code - integral number uniquely assigned to each distinct product.

Description: Product name.

Quantity: The quantities of each product per transaction.

InvoiceDate: Invoice Date and time - the day and time when each transaction was generated.

UnitPrice: Unit price - Product price per unit in sterling.

CustomerID: Customer number - integral number uniquely assigned to each customer.

Country: Country name - the name of the country where each customer resides.

Dataset preprocessing

Following codes are used to filter out irrelevant transactions and clean up the dataset in preparation for market basket analysis.

#Adjustments for stock on hand by selecting transactions with UnitPrice > 0.
mydataset2 <- subset(mydataset, UnitPrice > 0)
#Stock codes not related to actual purchases by excluding them from the subset of data.
excluded_codes <- c("D", "DOT", "S", "POST", "M", "C2", "AMAZONFEE", "B", "BANK CHARGES", "CRUK", "m", "PADS")
mydataset2 <- mydataset2[!mydataset2$StockCode %in% excluded_codes, ]
#Gift cards by identifying and removing rows with a Stock Code column that starts with "gift".
giftcardrows <- mydataset2$InvoiceNo[mydataset2$StockCode %like% "gift*"]
mydataset2 <- mydataset2[!mydataset2$InvoiceNo %in% giftcardrows,]
#Commas in the Description column to avoid CSV file issues.
mydataset2$Description <- gsub(",", "", mydataset2$Description, fixed = TRUE)
#Columns not needed for market basket analysis
mydataset2 <- subset(mydataset2, select = -c(5, 6, 7, 8))

Transforming into basket format:

basket <- aggregate(Description ~ InvoiceNo, mydataset2, function(x) paste(x, collapse = ","))
write.csv(basket, "Products.csv", quote = FALSE, row.names = FALSE)

The data is merged by grouping it according to the invoice number, whereby all items bought together on the same invoice are concatenated in a single cell and separated by a comma. The resulting modified data is then exported as a CSV file named ‘Products’.

Finding association rules

suppressWarnings({
  transactions <- read.transactions(file = "Products.csv", format = "basket", sep = ",", cols = 1)
  transactions@itemInfo$labels <- gsub("\"", "", transactions@itemInfo$labels)
})
summary(transactions)

## transactions as itemMatrix in sparse format with
##  23169 rows (elements/itemsets/transactions) and
##  8689 columns (items) and a density of 0.002037598 
## 
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
##                               1953                               1871 
##            JUMBO BAG RED RETROSPOT                      PARTY BUNTING 
##                               1721                               1448 
##            LUNCH BAG RED RETROSPOT                            (Other) 
##                               1373                             401834 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 3510 1662 1199  962  884  789  731  701  688  632  630  544  550  545  557  534 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  476  468  497  430  383  324  335  264  251  244  238  236  237  211  165  166 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  133  151  144  104  116  114  119   94  100   89   83   85   64   76   64   70 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##   53   58   59   51   52   54   54   32   42   37   30   30   29   15   28   35 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##   24   27   27   26   19   17   20   20   17   17   10   16   12   15   16   11 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##   16   12    9   14   15   16   12    8   12    8   10   12    5    7    5    8 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##    9    7    7    6    3    3    4    6    3    6    9    7    4    4    8    1 
##  113  114  115  116  117  118  119  120  121  122  123  124  125  126  128  129 
##    3    1    4    5    2    4    3    7    3    2    4    1    7    3    3    6 
##  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144  145 
##    2    2    4    1    5    1    2    2    3    5    2    4    1    5    4    5 
##  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160  161 
##    5    1    3    1    5    4    2    7    2    1    5    3    3    1    1    1 
##  162  163  164  165  166  167  168  169  170  172  173  174  175  176  177  178 
##    1    1    2    2    2    2    2    4    3    4    1    2    2    4    1    1 
##  179  180  182  183  185  187  189  191  192  193  194  195  196  197  198  200 
##    2    3    3    1    2    1    2    2    2    1    3    1    1    1    1    1 
##  201  202  203  204  205  206  207  208  214  215  216  217  219  222  223  224 
##    1    1    2    1    1    2    2    1    4    1    1    1    1    2    1    1 
##  225  226  227  230  231  233  235  236  237  240  246  247  249  253  254  255 
##    2    1    1    2    1    1    1    1    1    1    1    2    2    2    1    1 
##  257  262  265  267  270  272  276  281  289  296  297  299  300  307  308  311 
##    1    1    1    1    1    1    1    1    1    1    1    1    2    1    1    1 
##  312  316  319  322  323  327  332  333  338  341  345  353  358  361  362  367 
##    1    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  373  375  388  389  392  394  396  409  414  416  423  429  438  442  455  463 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1    1 
##  469  505  509  527 
##    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0    10.0    17.7    22.0   527.0 
## 
## includes extended item information - examples:
##                    labels
## 1   *Boombox Ipod Classic
## 2 *USB Office Mirror Ball
## 3  10 COLOUR SPACEBOY PEN
## 
## includes extended transaction information - examples:
##   transactionID
## 1     InvoiceNo
## 2        536365
## 3        536366

The most frequent items in the dataset are WHITE HANGING HEART T-LIGHT HOLDER, which appears in 1965 transactions, followed by REGENCY CAKESTAND 3 TIER, JUMBO BAG RED RETROSPOT, PARTY BUNTING and LUNCH BAG RED RETROSPOT

library(RColorBrewer)
itemFrequencyPlot(transactions,topN=10,type="relative",col=brewer.pal(n=5, name="Set3"), main="Most 8 Frequently Purchased Items")

According to the plot, the items with the highest sales are ‘WHITE HANGING HEART T-LIGHT HOLDER’ and ‘REGENCY CAKESTAND 3 TIER’. To boost the sales of ‘SET OF 3 CAKE TINS PANTRY DESIGN’, the retailer could place it near the ‘REGENCY CAKESTAND 3 TIER’.

Apriori algorithm

rules <- apriori(transactions,parameter = list(supp=0.001, conf=0.8))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 23 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8689 item(s), 23169 transaction(s)] done [0.10s].
## sorting and recoding items ... [2504 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10

## Warning in apriori(transactions, parameter = list(supp = 0.001, conf = 0.8)):
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!

##  done [0.82s].
## writing ... [699037 rule(s)] done [0.13s].
## creating S4 object  ... done [0.18s].

summary(rules)

## set of 699037 rules
## 
## rule length distribution (lhs + rhs):sizes
##      2      3      4      5      6      7      8      9     10 
##     54   7008  56541 140241 193178 166811  93634  33984   7586 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   6.000   6.303   7.000  10.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001036   Min.   :0.8000   Min.   :0.001036   Min.   :  9.491  
##  1st Qu.:0.001036   1st Qu.:0.8750   1st Qu.:0.001079   1st Qu.: 23.285  
##  Median :0.001079   Median :0.9355   Median :0.001165   Median : 33.197  
##  Mean   :0.001161   Mean   :0.9291   Mean   :0.001259   Mean   : 38.982  
##  3rd Qu.:0.001165   3rd Qu.:1.0000   3rd Qu.:0.001295   3rd Qu.: 46.146  
##  Max.   :0.019336   Max.   :1.0000   Max.   :0.022660   Max.   :532.621  
##      count      
##  Min.   : 24.0  
##  1st Qu.: 24.0  
##  Median : 25.0  
##  Mean   : 26.9  
##  3rd Qu.: 27.0  
##  Max.   :448.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions         23169   0.001        0.8
##                                                                      call
##  apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8))

inspect(rules[1:5])

##     lhs                                rhs                                support confidence    coverage     lift count
## [1] {MIRRORED WALL ART LADIES}      => {MIRRORED WALL ART GENTS}      0.001035867  0.8275862 0.001251672 532.6207    24
## [2] {BLUE FELT EASTER EGG BASKET}   => {CREAM FELT EASTER EGG BASKET} 0.002417023  0.8615385 0.002805473 184.8239    56
## [3] {VINTAGE RED ENAMEL TRIM PLATE} => {VINTAGE RED TRIM ENAMEL BOWL} 0.002589667  0.8000000 0.003237084 197.1830    60
## [4] {BLUE POLKADOT BEAKER}          => {RED POLKADOT BEAKER}          0.003021278  0.8333333 0.003625534 156.9715    70
## [5] {CHILDS GARDEN RAKE BLUE}       => {CHILDS GARDEN SPADE BLUE}     0.001942250  0.8035714 0.002417023 211.5676    45

min(rules@quality$lift)

## [1] 9.49063

length(rules)

## [1] 699037

Output rules: 699027

According Apriori, if a customer buys BLUE FELT EASTER EGG BASKET, there is a high probability of 86% that they will also buy CREAM FELT EASTER EGG BASKET.

conf <- sort (rules, by="confidence", decreasing=TRUE)
inspect(head(conf))

##     lhs                                      rhs                                      support confidence    coverage      lift count
## [1] {CHRISTMAS TREE PAINTED ZINC,                                                                                                   
##      WOODEN STAR CHRISTMAS SCANDINAVIAN}  => {WOODEN TREE CHRISTMAS SCANDINAVIAN} 0.001035867          1 0.001035867  93.42339    24
## [2] {CHRISTMAS TREE PAINTED ZINC,                                                                                                   
##      WOODEN HEART CHRISTMAS SCANDINAVIAN} => {WOODEN TREE CHRISTMAS SCANDINAVIAN} 0.001079028          1 0.001079028  93.42339    25
## [3] {CHILDS GARDEN RAKE BLUE,                                                                                                       
##      CHILDS GARDEN SPADE PINK}            => {CHILDS GARDEN SPADE BLUE}           0.001208511          1 0.001208511 263.28409    28
## [4] {MIXED NUTS LIGHT GREEN BOWL,                                                                                                   
##      SMALL CHOCOLATES PINK BOWL}          => {SMALL DOLLY MIX DESIGN ORANGE BOWL} 0.001165350          1 0.001165350  64.00276    27
## [5] {HERB MARKER ROSEMARY,                                                                                                          
##      IVORY GIANT GARDEN THERMOMETER}      => {HERB MARKER MINT}                   0.001165350          1 0.001165350 114.13300    27
## [6] {FELTCRAFT DOLL ROSIE,                                                                                                          
##      FELTCRAFT GIRL NICOLE KIT}           => {FELTCRAFT GIRL AMELIE KIT}          0.001079028          1 0.001079028 101.17467    25

summary(conf)

## set of 699037 rules
## 
## rule length distribution (lhs + rhs):sizes
##      2      3      4      5      6      7      8      9     10 
##     54   7008  56541 140241 193178 166811  93634  33984   7586 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   6.000   6.303   7.000  10.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001036   Min.   :0.8000   Min.   :0.001036   Min.   :  9.491  
##  1st Qu.:0.001036   1st Qu.:0.8750   1st Qu.:0.001079   1st Qu.: 23.285  
##  Median :0.001079   Median :0.9355   Median :0.001165   Median : 33.197  
##  Mean   :0.001161   Mean   :0.9291   Mean   :0.001259   Mean   : 38.982  
##  3rd Qu.:0.001165   3rd Qu.:1.0000   3rd Qu.:0.001295   3rd Qu.: 46.146  
##  Max.   :0.019336   Max.   :1.0000   Max.   :0.022660   Max.   :532.621  
##      count      
##  Min.   : 24.0  
##  1st Qu.: 24.0  
##  Median : 25.0  
##  Mean   : 26.9  
##  3rd Qu.: 27.0  
##  Max.   :448.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions         23169   0.001        0.8
##                                                                      call
##  apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8))

When the confidence is 1, it means that whenever the items on the left-hand side (LHS) are purchased, the items on the right-hand side (RHS) are also purchased 100% of the time. According to the infomration provided in the output, we can perform the following analyisis: All customers who purchased ‘CHRISTMAS TREE PAINTED ZINC’ and ‘WOODEN STAR CHRISTMAS SCANDINAVIAN’ also purchased also purchased ‘WOODEN TREE CHRISTMAS SCANDINAVIAN’ The lift value in the rule 1 is significantly high, indicating that the occurrence of the initial three items has a substantial influence on the confidence value.

Rules that have a high lift value suggest that the presence of certain items in a customer’s purchase history. This information can be beneficial to create a recommender system for online retail transactions.

What were the other items purchased by customers who bought the green Regency tea plate?

params <- list(support = 0.001, confidence = 0.8)
rules_tea <- apriori(data = transactions, parameter = params, appearance = list(lhs = "REGENCY TEA PLATE GREEN"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 23 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[8689 item(s), 23169 transaction(s)] done [0.11s].
## sorting and recoding items ... [2504 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.01s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(rules_tea)

##     lhs                          rhs                       support   confidence
## [1] {REGENCY TEA PLATE GREEN} => {REGENCY TEA PLATE ROSES} 0.0108766 0.8289474 
##     coverage   lift     count
## [1] 0.01312098 52.61885 252

82.9% of the time, customers who purchased REGENCY TEA PLATE GREEN also bought REGENCY TEA PLATE ROSES.

Results visualisation

plot(rules, method = "two-key plot")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The horizontal and vertical axes on the graph indicate support and confidence correspondingly, while the shading in this instance reflects the number of items in the rule.

top10 <- head(rules, n = 10, by = "confidence")
plot(top10, method="paracoord")

The plot shows thatm the customer is likely to purchase the CHILDS GARDEN SPADE BLUE, while CHILDS GARDEN SPADE PINK and CHILDS GARDEN RAKE BLUE in his shopping basket.

ECLAT algorithm

eclat<- eclat(transactions, parameter=list(support=0.01, minlen = 2))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.01      2     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 231 
## 
## create itemset ... 
## set transactions ...[8689 item(s), 23169 transaction(s)] done [0.16s].
## sorting and recoding items ... [479 item(s)] done [0.01s].
## creating sparse bit matrix ... [479 row(s), 23169 column(s)] done [0.01s].
## writing  ... [192 set(s)] done [0.43s].
## Creating S4 object  ... done [0.00s].

summary(eclat)

## set of 192 itemsets
## 
## most frequent items:
## LUNCH BAG RED RETROSPOT JUMBO BAG RED RETROSPOT LUNCH BAG  BLACK SKULL. 
##                      30                      25                      16 
##     LUNCH BAG CARS BLUE LUNCH BAG PINK POLKADOT                 (Other) 
##                      15                      15                     308 
## 
## element (itemset/transaction) length distribution:sizes
##   2   3   4 
## 168  23   1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.00    2.00    2.13    2.00    4.00 
## 
## summary of quality measures:
##     support            count      
##  Min.   :0.01001   Min.   :232.0  
##  1st Qu.:0.01081   1st Qu.:250.5  
##  Median :0.01221   Median :283.0  
##  Mean   :0.01340   Mean   :310.5  
##  3rd Qu.:0.01498   3rd Qu.:347.0  
##  Max.   :0.02762   Max.   :640.0  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##          data ntransactions support
##  transactions         23169    0.01
##                                                                      call
##  eclat(data = transactions, parameter = list(support = 0.01, minlen = 2))

inspect(sort(eclat, by='support', descending=TRUE)[1:6])

##     items                                 support count
## [1] {GREEN REGENCY TEACUP AND SAUCER,                  
##      ROSES REGENCY TEACUP AND SAUCER}  0.02762312   640
## [2] {JUMBO BAG PINK POLKADOT,                          
##      JUMBO BAG RED RETROSPOT}          0.02615564   606
## [3] {LUNCH BAG  BLACK SKULL.,                          
##      LUNCH BAG RED RETROSPOT}          0.02360913   547
## [4] {GREEN REGENCY TEACUP AND SAUCER,                  
##      PINK REGENCY TEACUP AND SAUCER}   0.02265959   525
## [5] {JUMBO BAG RED RETROSPOT,                          
##      JUMBO STORAGE BAG SUKI}           0.02227114   516
## [6] {LUNCH BAG PINK POLKADOT,                          
##      LUNCH BAG RED RETROSPOT}          0.02218482   514

Eclat measures the support of a set rather than an individual item, unlike apriori. It solely depends on the minimum support level, and does not consider lift or confidence. The algorithm outputs subsets or itemsets, rather than rules. In this case, 192 subsets or itemsets were identified. The most commonly occurring combination across all transactions was GREEN REGENCY TEACUP AND SAUCER and ROSES REGENCY TEACUP AND SAUCER, with a support of 0.02762312.

plot(sort(eclat, by='support', decreasing=TRUE)[1:6], method='paracoord')

The Eclat algorithm was applied to a dataset consisting of 23169 transactions with 192 unique items. The most frequent items in the dataset were identified as LUNCH BAG RED RETROSPOT, JUMBO BAG RED RETROSPOT, and LUNCH BAG BLACK SKULL. The output also shows the distribution of itemset sizes and summary statistics of support and count measures. The top six itemsets by support were identified and listed, which could be used to derive insights for marketing or inventory management purposes.

Refernces

[https://towardsdatascience.com/association-rule-mining-in-r-ddf2d044ae50]

[https://www.kirenz.com/post/2020-05-14-r-association-rule-mining/]