Market basket analysis is a specific application of association rules that focuses on analyzing transactions in a retail environment with the objective of identifying products that are frequently bought together. This information can be used by retailers to boost their sales and profits. By analyzing transaction data, retailers can identify patterns of customer behavior and make data-driven decisions about product placement, promotions, and pricing.
For this analysis, I have decided to use the Online Retail Dataset, which is available through the University of California, Irvine Machine Learning Repository at [https://archive.ics.uci.edu/ml/datasets/Online+Retail].
The Online Retail Dataset is a valuable source of transnational data, as it contains detailed information on all the transactions that took place between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail company. The retailer specializes in unique all-occasion gifts and has a diverse customer base, with many of them being wholesalers.
The objective of the analysis is to identify which products are frequently purchased together in order to develop marketing strategies, optimize product placement, and improve customer satisfaction.
library(readxl)
## Warning: package 'readxl' was built under R version 4.2.2
library(arules)
## Warning: package 'arules' was built under R version 4.2.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(plyr)
## Warning: package 'plyr' was built under R version 4.2.2
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 4.2.2
library(data.table)
mydataset <- read_excel("C:/Users/rahil/Desktop/UL papers/Association rules/Online Retail.xlsx")
mydataset<- data.frame(mydataset)
head(mydataset)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 2010-12-01 08:26:00 2.55 17850 United Kingdom
## 2 2010-12-01 08:26:00 3.39 17850 United Kingdom
## 3 2010-12-01 08:26:00 2.75 17850 United Kingdom
## 4 2010-12-01 08:26:00 3.39 17850 United Kingdom
## 5 2010-12-01 08:26:00 3.39 17850 United Kingdom
## 6 2010-12-01 08:26:00 7.65 17850 United Kingdom
InvoiceNo: Invoice number - integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
StockCode: Product code - integral number uniquely assigned to each distinct product.
Description: Product name.
Quantity: The quantities of each product per transaction.
InvoiceDate: Invoice Date and time - the day and time when each transaction was generated.
UnitPrice: Unit price - Product price per unit in sterling.
CustomerID: Customer number - integral number uniquely assigned to each customer.
Country: Country name - the name of the country where each customer resides.
Following codes are used to filter out irrelevant transactions and clean up the dataset in preparation for market basket analysis.
#Adjustments for stock on hand by selecting transactions with UnitPrice > 0.
mydataset2 <- subset(mydataset, UnitPrice > 0)
#Stock codes not related to actual purchases by excluding them from the subset of data.
excluded_codes <- c("D", "DOT", "S", "POST", "M", "C2", "AMAZONFEE", "B", "BANK CHARGES", "CRUK", "m", "PADS")
mydataset2 <- mydataset2[!mydataset2$StockCode %in% excluded_codes, ]
#Gift cards by identifying and removing rows with a Stock Code column that starts with "gift".
giftcardrows <- mydataset2$InvoiceNo[mydataset2$StockCode %like% "gift*"]
mydataset2 <- mydataset2[!mydataset2$InvoiceNo %in% giftcardrows,]
#Commas in the Description column to avoid CSV file issues.
mydataset2$Description <- gsub(",", "", mydataset2$Description, fixed = TRUE)
#Columns not needed for market basket analysis
mydataset2 <- subset(mydataset2, select = -c(5, 6, 7, 8))
Transforming into basket format:
basket <- aggregate(Description ~ InvoiceNo, mydataset2, function(x) paste(x, collapse = ","))
write.csv(basket, "Products.csv", quote = FALSE, row.names = FALSE)
The data is merged by grouping it according to the invoice number, whereby all items bought together on the same invoice are concatenated in a single cell and separated by a comma. The resulting modified data is then exported as a CSV file named ‘Products’.
suppressWarnings({
transactions <- read.transactions(file = "Products.csv", format = "basket", sep = ",", cols = 1)
transactions@itemInfo$labels <- gsub("\"", "", transactions@itemInfo$labels)
})
summary(transactions)
## transactions as itemMatrix in sparse format with
## 23169 rows (elements/itemsets/transactions) and
## 8689 columns (items) and a density of 0.002037598
##
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
## 1953 1871
## JUMBO BAG RED RETROSPOT PARTY BUNTING
## 1721 1448
## LUNCH BAG RED RETROSPOT (Other)
## 1373 401834
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 3510 1662 1199 962 884 789 731 701 688 632 630 544 550 545 557 534
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 476 468 497 430 383 324 335 264 251 244 238 236 237 211 165 166
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 133 151 144 104 116 114 119 94 100 89 83 85 64 76 64 70
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 53 58 59 51 52 54 54 32 42 37 30 30 29 15 28 35
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 24 27 27 26 19 17 20 20 17 17 10 16 12 15 16 11
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 16 12 9 14 15 16 12 8 12 8 10 12 5 7 5 8
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 9 7 7 6 3 3 4 6 3 6 9 7 4 4 8 1
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 128 129
## 3 1 4 5 2 4 3 7 3 2 4 1 7 3 3 6
## 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
## 2 2 4 1 5 1 2 2 3 5 2 4 1 5 4 5
## 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
## 5 1 3 1 5 4 2 7 2 1 5 3 3 1 1 1
## 162 163 164 165 166 167 168 169 170 172 173 174 175 176 177 178
## 1 1 2 2 2 2 2 4 3 4 1 2 2 4 1 1
## 179 180 182 183 185 187 189 191 192 193 194 195 196 197 198 200
## 2 3 3 1 2 1 2 2 2 1 3 1 1 1 1 1
## 201 202 203 204 205 206 207 208 214 215 216 217 219 222 223 224
## 1 1 2 1 1 2 2 1 4 1 1 1 1 2 1 1
## 225 226 227 230 231 233 235 236 237 240 246 247 249 253 254 255
## 2 1 1 2 1 1 1 1 1 1 1 2 2 2 1 1
## 257 262 265 267 270 272 276 281 289 296 297 299 300 307 308 311
## 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
## 312 316 319 322 323 327 332 333 338 341 345 353 358 361 362 367
## 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 373 375 388 389 392 394 396 409 414 416 423 429 438 442 455 463
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## 469 505 509 527
## 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 10.0 17.7 22.0 527.0
##
## includes extended item information - examples:
## labels
## 1 *Boombox Ipod Classic
## 2 *USB Office Mirror Ball
## 3 10 COLOUR SPACEBOY PEN
##
## includes extended transaction information - examples:
## transactionID
## 1 InvoiceNo
## 2 536365
## 3 536366
The most frequent items in the dataset are WHITE HANGING HEART T-LIGHT HOLDER, which appears in 1965 transactions, followed by REGENCY CAKESTAND 3 TIER, JUMBO BAG RED RETROSPOT, PARTY BUNTING and LUNCH BAG RED RETROSPOT
library(RColorBrewer)
itemFrequencyPlot(transactions,topN=10,type="relative",col=brewer.pal(n=5, name="Set3"), main="Most 8 Frequently Purchased Items")
According to the plot, the items with the highest sales are ‘WHITE HANGING HEART T-LIGHT HOLDER’ and ‘REGENCY CAKESTAND 3 TIER’. To boost the sales of ‘SET OF 3 CAKE TINS PANTRY DESIGN’, the retailer could place it near the ‘REGENCY CAKESTAND 3 TIER’.
rules <- apriori(transactions,parameter = list(supp=0.001, conf=0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 23
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8689 item(s), 23169 transaction(s)] done [0.10s].
## sorting and recoding items ... [2504 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(transactions, parameter = list(supp = 0.001, conf = 0.8)):
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.82s].
## writing ... [699037 rule(s)] done [0.13s].
## creating S4 object ... done [0.18s].
summary(rules)
## set of 699037 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 54 7008 56541 140241 193178 166811 93634 33984 7586
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.303 7.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001036 Min. :0.8000 Min. :0.001036 Min. : 9.491
## 1st Qu.:0.001036 1st Qu.:0.8750 1st Qu.:0.001079 1st Qu.: 23.285
## Median :0.001079 Median :0.9355 Median :0.001165 Median : 33.197
## Mean :0.001161 Mean :0.9291 Mean :0.001259 Mean : 38.982
## 3rd Qu.:0.001165 3rd Qu.:1.0000 3rd Qu.:0.001295 3rd Qu.: 46.146
## Max. :0.019336 Max. :1.0000 Max. :0.022660 Max. :532.621
## count
## Min. : 24.0
## 1st Qu.: 24.0
## Median : 25.0
## Mean : 26.9
## 3rd Qu.: 27.0
## Max. :448.0
##
## mining info:
## data ntransactions support confidence
## transactions 23169 0.001 0.8
## call
## apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8))
inspect(rules[1:5])
## lhs rhs support confidence coverage lift count
## [1] {MIRRORED WALL ART LADIES} => {MIRRORED WALL ART GENTS} 0.001035867 0.8275862 0.001251672 532.6207 24
## [2] {BLUE FELT EASTER EGG BASKET} => {CREAM FELT EASTER EGG BASKET} 0.002417023 0.8615385 0.002805473 184.8239 56
## [3] {VINTAGE RED ENAMEL TRIM PLATE} => {VINTAGE RED TRIM ENAMEL BOWL} 0.002589667 0.8000000 0.003237084 197.1830 60
## [4] {BLUE POLKADOT BEAKER} => {RED POLKADOT BEAKER} 0.003021278 0.8333333 0.003625534 156.9715 70
## [5] {CHILDS GARDEN RAKE BLUE} => {CHILDS GARDEN SPADE BLUE} 0.001942250 0.8035714 0.002417023 211.5676 45
min(rules@quality$lift)
## [1] 9.49063
length(rules)
## [1] 699037
Output rules: 699027
According Apriori, if a customer buys BLUE FELT EASTER EGG BASKET, there is a high probability of 86% that they will also buy CREAM FELT EASTER EGG BASKET.
conf <- sort (rules, by="confidence", decreasing=TRUE)
inspect(head(conf))
## lhs rhs support confidence coverage lift count
## [1] {CHRISTMAS TREE PAINTED ZINC,
## WOODEN STAR CHRISTMAS SCANDINAVIAN} => {WOODEN TREE CHRISTMAS SCANDINAVIAN} 0.001035867 1 0.001035867 93.42339 24
## [2] {CHRISTMAS TREE PAINTED ZINC,
## WOODEN HEART CHRISTMAS SCANDINAVIAN} => {WOODEN TREE CHRISTMAS SCANDINAVIAN} 0.001079028 1 0.001079028 93.42339 25
## [3] {CHILDS GARDEN RAKE BLUE,
## CHILDS GARDEN SPADE PINK} => {CHILDS GARDEN SPADE BLUE} 0.001208511 1 0.001208511 263.28409 28
## [4] {MIXED NUTS LIGHT GREEN BOWL,
## SMALL CHOCOLATES PINK BOWL} => {SMALL DOLLY MIX DESIGN ORANGE BOWL} 0.001165350 1 0.001165350 64.00276 27
## [5] {HERB MARKER ROSEMARY,
## IVORY GIANT GARDEN THERMOMETER} => {HERB MARKER MINT} 0.001165350 1 0.001165350 114.13300 27
## [6] {FELTCRAFT DOLL ROSIE,
## FELTCRAFT GIRL NICOLE KIT} => {FELTCRAFT GIRL AMELIE KIT} 0.001079028 1 0.001079028 101.17467 25
summary(conf)
## set of 699037 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 54 7008 56541 140241 193178 166811 93634 33984 7586
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.303 7.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001036 Min. :0.8000 Min. :0.001036 Min. : 9.491
## 1st Qu.:0.001036 1st Qu.:0.8750 1st Qu.:0.001079 1st Qu.: 23.285
## Median :0.001079 Median :0.9355 Median :0.001165 Median : 33.197
## Mean :0.001161 Mean :0.9291 Mean :0.001259 Mean : 38.982
## 3rd Qu.:0.001165 3rd Qu.:1.0000 3rd Qu.:0.001295 3rd Qu.: 46.146
## Max. :0.019336 Max. :1.0000 Max. :0.022660 Max. :532.621
## count
## Min. : 24.0
## 1st Qu.: 24.0
## Median : 25.0
## Mean : 26.9
## 3rd Qu.: 27.0
## Max. :448.0
##
## mining info:
## data ntransactions support confidence
## transactions 23169 0.001 0.8
## call
## apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.8))
When the confidence is 1, it means that whenever the items on the left-hand side (LHS) are purchased, the items on the right-hand side (RHS) are also purchased 100% of the time. According to the infomration provided in the output, we can perform the following analyisis: All customers who purchased ‘CHRISTMAS TREE PAINTED ZINC’ and ‘WOODEN STAR CHRISTMAS SCANDINAVIAN’ also purchased also purchased ‘WOODEN TREE CHRISTMAS SCANDINAVIAN’ The lift value in the rule 1 is significantly high, indicating that the occurrence of the initial three items has a substantial influence on the confidence value.
Rules that have a high lift value suggest that the presence of certain items in a customer’s purchase history. This information can be beneficial to create a recommender system for online retail transactions.
params <- list(support = 0.001, confidence = 0.8)
rules_tea <- apriori(data = transactions, parameter = params, appearance = list(lhs = "REGENCY TEA PLATE GREEN"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 23
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[8689 item(s), 23169 transaction(s)] done [0.11s].
## sorting and recoding items ... [2504 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.01s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rules_tea)
## lhs rhs support confidence
## [1] {REGENCY TEA PLATE GREEN} => {REGENCY TEA PLATE ROSES} 0.0108766 0.8289474
## coverage lift count
## [1] 0.01312098 52.61885 252
82.9% of the time, customers who purchased REGENCY TEA PLATE GREEN also bought REGENCY TEA PLATE ROSES.
plot(rules, method = "two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The horizontal and vertical axes on the graph indicate support and confidence correspondingly, while the shading in this instance reflects the number of items in the rule.
top10 <- head(rules, n = 10, by = "confidence")
plot(top10, method="paracoord")
The plot shows thatm the customer is likely to purchase the CHILDS GARDEN SPADE BLUE, while CHILDS GARDEN SPADE PINK and CHILDS GARDEN RAKE BLUE in his shopping basket.
eclat<- eclat(transactions, parameter=list(support=0.01, minlen = 2))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 2 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 231
##
## create itemset ...
## set transactions ...[8689 item(s), 23169 transaction(s)] done [0.16s].
## sorting and recoding items ... [479 item(s)] done [0.01s].
## creating sparse bit matrix ... [479 row(s), 23169 column(s)] done [0.01s].
## writing ... [192 set(s)] done [0.43s].
## Creating S4 object ... done [0.00s].
summary(eclat)
## set of 192 itemsets
##
## most frequent items:
## LUNCH BAG RED RETROSPOT JUMBO BAG RED RETROSPOT LUNCH BAG BLACK SKULL.
## 30 25 16
## LUNCH BAG CARS BLUE LUNCH BAG PINK POLKADOT (Other)
## 15 15 308
##
## element (itemset/transaction) length distribution:sizes
## 2 3 4
## 168 23 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 2.00 2.13 2.00 4.00
##
## summary of quality measures:
## support count
## Min. :0.01001 Min. :232.0
## 1st Qu.:0.01081 1st Qu.:250.5
## Median :0.01221 Median :283.0
## Mean :0.01340 Mean :310.5
## 3rd Qu.:0.01498 3rd Qu.:347.0
## Max. :0.02762 Max. :640.0
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support
## transactions 23169 0.01
## call
## eclat(data = transactions, parameter = list(support = 0.01, minlen = 2))
inspect(sort(eclat, by='support', descending=TRUE)[1:6])
## items support count
## [1] {GREEN REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER} 0.02762312 640
## [2] {JUMBO BAG PINK POLKADOT,
## JUMBO BAG RED RETROSPOT} 0.02615564 606
## [3] {LUNCH BAG BLACK SKULL.,
## LUNCH BAG RED RETROSPOT} 0.02360913 547
## [4] {GREEN REGENCY TEACUP AND SAUCER,
## PINK REGENCY TEACUP AND SAUCER} 0.02265959 525
## [5] {JUMBO BAG RED RETROSPOT,
## JUMBO STORAGE BAG SUKI} 0.02227114 516
## [6] {LUNCH BAG PINK POLKADOT,
## LUNCH BAG RED RETROSPOT} 0.02218482 514
Eclat measures the support of a set rather than an individual item, unlike apriori. It solely depends on the minimum support level, and does not consider lift or confidence. The algorithm outputs subsets or itemsets, rather than rules. In this case, 192 subsets or itemsets were identified. The most commonly occurring combination across all transactions was GREEN REGENCY TEACUP AND SAUCER and ROSES REGENCY TEACUP AND SAUCER, with a support of 0.02762312.
plot(sort(eclat, by='support', decreasing=TRUE)[1:6], method='paracoord')
The Eclat algorithm was applied to a dataset consisting of 23169 transactions with 192 unique items. The most frequent items in the dataset were identified as LUNCH BAG RED RETROSPOT, JUMBO BAG RED RETROSPOT, and LUNCH BAG BLACK SKULL. The output also shows the distribution of itemset sizes and summary statistics of support and count measures. The top six itemsets by support were identified and listed, which could be used to derive insights for marketing or inventory management purposes.