The purpose of a Market Basket Analysis is to identify and visualize the impact of statistically significant associations between unique items across a multitude of occurrences. As its name suggests, it originated in the analysis of items frequently purchased together in supermarket transactions, and we’ll use it in an online retail example below.
In selecting our model and drawing conclusions, we’ll be evaluating associations according to the following association measures:
The support of an item or is its frequency relative to the total number of orders in the dataset. It can be measured for individual items as well as for associations between them. Simply put,
\[ support (A, B) = \frac{orders\ with\ (A, B)}{all \ orders} \]
The confidence of an association with respect to one of its items, is the association’s frequency relative to that individual item’s frequency.
\[ confidence (A \Rightarrow B) = \frac{orders\ with\ (A, B)}{orders\ with\ A} \]
A high confidence association indicates a high proportion of orders for the “Left Hand Side” item also feature the “Right Hand Side” item. Note here that
\[ confidence (A \Rightarrow B) \neq confidence (B \Rightarrow A) \]
Lift quantifies the support of the association versus the support of its individual items independently. It is the frequency of the association divided by the frequencies of the individual items assumed independent:
\[ lift (A, B) = \frac{support(A, B)}{support(A) \times support(B)} \]
In part 1, we exported to .csv the results of our SQL Query SELECT * FROM TransactionLineView.
Here, we’ll begin by importing this data from .csv and taking a look at the size and shape of this dataset. It includes 401,365 online retail transaction line items from 18,533 unique transactions, shipped over the course of 2009-2010 from a UK-based online home decor retailer.
We’re going to be relying for the “meat” of our analysis on the packages arules and arulesViz.
library(tidyverse)
library(arules)
library(arulesViz)
library(DT)
options(scipen = 999)
#import transactions from .csv (as exported from database TransactionLineView in part 1)
transactions_raw <- read_delim("https://raw.githubusercontent.com/curdferguson/CUNY607final_MarketBasket/main/transactions_raw.csv", delim= ",")
The 18,533 transactions grossed the eCommerce retailer 8,982,405 GBP, with an average order value of 484.67 GBP and an average basket size of 21.7 unique items per transaction. While the company shipped to over 35 countries throughout Europe and as far afield as Brazil, the US, Japan, and Australia, 87.68% of its revenue shipped within the UK.
#view results in a DataTable
datatable(head(transactions_raw), filter = "none")
transactions_by_transaction <- transactions_raw %>% group_by(InvoiceID) %>% summarize(InvoiceID = max(InvoiceID), Customer = max(CustomerID), Country = max(Country), `Basket Size` = max(n()), Total = round(sum(Price * Qty), 2))
#summarize transactions
transactions_summary <- transactions_by_transaction %>% summarize(`No. of Transactions` = n(), `Avg. Basket Size` = round((sum(`Basket Size`) / `No. of Transactions`), 3), `Avg. Order Value` = round((sum(Total) / `No. of Transactions`), 2), Revenue = sum(Total))
datatable(transactions_summary)
# calculate transactions by country
transactions_by_country <- transactions_by_transaction %>% group_by(Country) %>% summarize(`No. of Transactions` = n(), `Avg. Basket Size` = round((sum(`Basket Size`) / `No. of Transactions`), 3), `Avg. Order Value` = round((sum(Total) / `No. of Transactions`), 2), Revenue = sum(Total), `% of Revenue` = round((Revenue / sum(transactions_summary$Revenue)), 4)) %>% arrange(desc(`% of Revenue`))
datatable(transactions_by_country)
There are 3683 unique items, identified by their StockCode and Description fields, that appear on our 1853 items. While we could analyze at length the percentage revenue or average quantity per order of the items in this dataset, we’ve narrowed our focus here to the variable of interest to Market Basket Analysis - the number of orders on which a particular item appears (without regard to the quantity ordered).
We’re going to remove 3 miscellaneous items POSTAGE, Manual, and CARRIAGE from the dataset at this point, as one in particular comes into our top 10 and may otherwise skew the analysis.
Finally, we’re going to view absolute and relative frequency plots for the items in our dataset, to get an read on how the quantity of orders is distributed amongst unique items.
#top 10 items by # of orders shipped
transactions_by_item_orders <- transactions_raw %>% group_by(StockCode) %>% summarize(StockCode = max(StockCode), Description = max(`Description`), Orders = max(n()), Quantity = sum(Qty), `Avg. Qty` = round((Quantity / Orders), 3), Price = max(Price), Total = round(sum(Price * Qty), 2)) %>% arrange(desc(Orders))
datatable(head(transactions_by_item_orders, 10))
“Single” format is one of two input types allowed by the read.transactions function of package arules. The “single” format lists two columns - one for the transaction’s unique identifier (in this case, InvoiceID) and one for Description - the unique item name.
Market Basket Analysis is agnostic toward the quantity and price of items on each transaction line, so we can exclude those columns.
We’ll use the arules package’s Item Frequency Plot function to get a quick read on the top items in our set. Will we see these top items show up in significant associations?
#wrangle to "single" format and write to .csv
transactions_mba <- transactions_raw %>% distinct(InvoiceID, Description)
transactions_mba$Description <- transactions_mba$Description %>% str_trim(side = c("both"))
#write_delim(transactions_mba, file = "transactions_mba.csv", delim = ",", na = "", append = FALSE)
# create transaction object from transactions_mba .csv file, and view summary output
transobj_mba <- read.transactions(file = "https://raw.githubusercontent.com/curdferguson/CUNY607final_MarketBasket/main/transactions_mba.csv", format = "single", header = TRUE, sep = ",", cols = c(1, 2), rm.duplicates = TRUE)
summary(transobj_mba)
## transactions as itemMatrix in sparse format with
## 13094 rows (elements/itemsets/transactions) and
## 6925 columns (items) and a density of 0.002195047
##
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
## 992 953
## JUMBO BAG RED RETROSPOT PARTY BUNTING
## 830 715
## ASSORTED COLOUR BIRD ORNAMENT (Other)
## 696 194852
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2047 943 661 562 514 483 414 402 369 367 366 323 348 317 331 294
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 261 266 283 259 213 202 185 174 137 126 135 131 130 111 104 81
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 69 88 80 62 72 66 67 64 55 44 46 45 42 44 36 31
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 31 33 29 30 24 23 34 18 27 22 13 16 17 10 17 9
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 17 15 13 13 7 8 8 8 5 7 6 8 7 5 9 7
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 11 7 6 6 5 6 8 2 5 4 5 5 1 1 3 2
## 97 98 99 100 101 102 104 105 107 108 109 110 111 113 115 116
## 3 2 4 3 1 1 4 1 4 1 1 4 3 2 1 2
## 118 120 122 123 124 126 129 132 136 140 144 147 149 153 154 157
## 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1
## 167 169 176 178 180 185 198 199 204 226 231 282 294 319
## 1 2 1 1 1 1 1 1 1 1 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 10.0 15.2 21.0 319.0
##
## includes extended item information - examples:
## labels
## 1 "ASSORTED FLOWER COLOUR ""LEIS"""
## 2 "CHARLIE+LOLA""EXTREMELY BUSY"" SIGN"
## 3 "FLOWER GLASS GARLAND NECKL.36""BLACK"
##
## includes extended transaction information - examples:
## transactionID
## 1 536365
## 2 536366
## 3 536367
# show an item frequency plot
itemFrequencyPlot(transobj_mba, type = c("absolute"), topN = 1000, horiz = FALSE, names = FALSE, col = "lightblue", angle = 45, xlab = "Items", ylab = "Frequency", main = "Absolute Item Frequency")
# show an item frequency plot
itemFrequencyPlot(transobj_mba, type = c("absolute"), topN = 10, horiz = TRUE, names = TRUE, col = "lightblue", xlab = "Items", ylab = "", main = "TOP 10 Absolute Item Frequency")
The strength and significance of the association rules we’ll derive from our data is determined primarily by two variables, support and confidence, as discussed above. We’re going to settle on support and confidence levels for our analysis by looping the apriori algorithm over the dataset at different confidence and support levels, and counting the number of rules generated.
The sweet spot we’re looking for is the intersection of support and confidence with 10-20 rules that has the highest value for each. We’ll start by comparing support levels of 3%, 2%, 1%, and 0.5%…
# Support and confidence values
supportLevels <- c(0.03, 0.02, 0.01, 0.005)
confidenceLevels <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)
# Empty integers
rules_sup3 <- integer(length=9)
rules_sup2 <- integer(length=9)
rules_sup1 <- integer(length=9)
rules_sup0.5 <- integer(length=9)
# Apriori algorithm with a support level of 3%
for (i in 1:length(confidenceLevels)) {
rules_sup3[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels[1],
conf=confidenceLevels[i], target="rules")))
}
# Apriori algorithm with a support level of 2%
for (i in 1:length(confidenceLevels)){
rules_sup2[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels[2],
conf=confidenceLevels[i], target="rules")))
}
# Apriori algorithm with a support level of 1%
for (i in 1:length(confidenceLevels)){
rules_sup1[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels[3],
conf=confidenceLevels[i], target="rules")))
}
# Apriori algorithm with a support level of 0.5%
for (i in 1:length(confidenceLevels)){
rules_sup0.5[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels[4],
conf=confidenceLevels[i], target="rules")))
}
# code chunk citation: Garcia, Xavier Vivancos. "Market Basket Analysis." Kaggle.com. Published May 2, 2020. Accessed from: https://www.kaggle.com/xvivancos/market-basket-analysis/code.
At our lowest levels of confidence and support, the algorithm generates over 2000 rules - far too many to offer us useful insights. At a support level of 3%, by contrast, the model appears to offer us zero statistically significant association rules. So in the next step, we’ll recalibrate to look at support levels from 1% - 2.5%, and 50% or greater confidence.
rules_plot <- data.frame(
confidence = confidenceLevels,
support_3 = rules_sup3,
support_2 = rules_sup2,
support_1 = rules_sup1,
support_0.5 = rules_sup0.5) %>%
pivot_longer(cols = c(support_3, support_2, support_1, support_0.5), names_to = "support", values_to = "rules")
rules_plot %>% ggplot(aes(x = confidence, y = rules, color = support)) +
geom_point() + geom_line(legend = TRUE) + scale_fill_brewer()
# Support and confidence values
supportLevels2 <- c(0.025, 0.02, 0.015, 0.01)
confidenceLevels2 <- c(0.9, 0.8, 0.7, 0.6, 0.5)
# Empty integers
rules_sup2.5 <- integer(length=5)
rules_sup2 <- integer(length=5)
rules_sup1.5 <- integer(length=5)
rules_sup1 <- integer(length=5)
# Apriori algorithm with a support level of 2.5%
for (i in 1:length(confidenceLevels2)) {
rules_sup2.5[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels2[1],
conf=confidenceLevels2[i], target="rules")))
}
# Apriori algorithm with a support level of 2%
for (i in 1:length(confidenceLevels2)){
rules_sup2[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels2[2],
conf=confidenceLevels2[i], target="rules")))
}
# Apriori algorithm with a support level of 1.5%
for (i in 1:length(confidenceLevels2)){
rules_sup1.5[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels2[3],
conf=confidenceLevels2[i], target="rules")))
}
# Apriori algorithm with a support level of 1%
for (i in 1:length(confidenceLevels2)){
rules_sup1[i] <- length(apriori(transobj_mba, parameter=list(sup=supportLevels2[4],
conf=confidenceLevels2[i], target="rules")))
}
# code chunk citation: Garcia, Xavier Vivancos. "Market Basket Analysis." Kaggle.com. Published May 2, 2020. Accessed from: https://www.kaggle.com/xvivancos/market-basket-analysis/code.
The recalibrated algorithm helps us drill down into the region of most value to us. We can see that there are two options that meet our criteria:
Below, we will run the apriori agorithm at each of these intersections of support and confidence, and compare the results.
rules_plot1 <- data.frame(
confidence = confidenceLevels2,
support_2.5 = rules_sup2.5,
support_2 = rules_sup2,
support_1.5 = rules_sup1.5,
support_1 = rules_sup1) %>%
pivot_longer(cols = c(support_2.5, support_2, support_1.5, support_1), names_to = "support", values_to = "rules")
rules_plot1 %>% ggplot(aes(x = confidence, y = rules, color = support)) +
geom_point() + geom_line(legend = TRUE) + scale_fill_brewer()
There are a few immediate takeaways from our dataset: - Items with the greatest absolute frequency in our dataset are not necessarily the items that appear most frequently in association rules (though a few notables do re-occur). - There are a few categories that stand out for this company as driving significant association:teacups/ saucers and cake stands; lunch bags; party bunting; and hanging candle holders. - The lift amongst the associations in this group of 19 items is extraordinary - a lift of 47.615 when customers purchase pink and blue Happy Birthday bunting together versus separately indicates the probability of the items being purchased together is 47.615 times greater than the probability they are purchased separately.
(Remember formally, the definition of lift is the frequency of the association divided by the frequencies of the individual items assumed independent).
\[ lift (A, B) = \frac{support(A, B)}{support(A) \times support(B)} \]
rules1 <- apriori(transobj_mba, parameter=list(sup=0.01, conf=0.7, target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 130
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6925 item(s), 13094 transaction(s)] done [0.30s].
## sorting and recoding items ... [397 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [19 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspectDT(rules1)
summary(rules1)
## set of 19 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 9 10
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.526 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01000 Min. :0.7000 Min. :0.01298 Min. : 9.24
## 1st Qu.:0.01100 1st Qu.:0.7189 1st Qu.:0.01466 1st Qu.:23.67
## Median :0.01168 Median :0.7487 Median :0.01642 Median :25.52
## Mean :0.01351 Mean :0.7608 Mean :0.01779 Mean :28.22
## 3rd Qu.:0.01604 3rd Qu.:0.7978 3rd Qu.:0.02058 3rd Qu.:31.00
## Max. :0.02207 Max. :0.8787 Max. :0.02948 Max. :47.61
## count
## Min. :131.0
## 1st Qu.:144.0
## Median :153.0
## Mean :176.9
## 3rd Qu.:210.0
## Max. :289.0
##
## mining info:
## data ntransactions support confidence
## transobj_mba 13094 0.01 0.7
rules_scatter <- plot(rules1, measure=c("support", "lift"),
shading = "confidence",
interactive = FALSE)
In the case of this dataset, we happened to lose the 4 rules with the most significant lift by upping our support threshold by 0.5%. I would not necessarily generalize that we should air on the side of greater confidence over greater support - the relationship may vary considerably depending on the distribution of items and the nature of the observations.
rules2 <- apriori(transobj_mba, parameter=list(sup=0.015, conf=0.6, target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.015 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 196
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6925 item(s), 13094 transaction(s)] done [0.34s].
## sorting and recoding items ... [197 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 done [0.01s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rules2)
## lhs rhs support confidence coverage lift count
## [1] {PINK REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.01909271 0.7987220 0.02390408 27.094473 250
## [2] {GREEN REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER} 0.01909271 0.6476684 0.02947915 27.094473 250
## [3] {PINK REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.01825263 0.7635783 0.02390408 23.197898 239
## [4] {JUMBO BAG STRAWBERRY} => {JUMBO BAG RED RETROSPOT} 0.01580877 0.6330275 0.02497327 9.986581 207
## [5] {GARDENERS KNEELING PAD CUP OF TEA} => {GARDENERS KNEELING PAD KEEP CALM} 0.01580877 0.7040816 0.02245303 25.258205 207
## [6] {RED HANGING HEART T-LIGHT HOLDER} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.01687796 0.6443149 0.02619520 8.504696 221
## [7] {GREEN REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.02207118 0.7487047 0.02947915 22.746030 289
## [8] {ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.02207118 0.6705336 0.03291584 22.746030 289
## [9] {ALARM CLOCK BAKELIKE GREEN} => {ALARM CLOCK BAKELIKE RED} 0.01985642 0.6735751 0.02947915 19.599539 260
## [10] {JUMBO BAG PINK POLKADOT} => {JUMBO BAG RED RETROSPOT} 0.02008554 0.6232227 0.03222850 9.831902 263
## [11] {DOLLY GIRL LUNCH BOX} => {SPACEBOY LUNCH BOX} 0.01634336 0.6484848 0.02520238 23.851856 214
## [12] {SPACEBOY LUNCH BOX} => {DOLLY GIRL LUNCH BOX} 0.01634336 0.6011236 0.02718803 23.851856 214
## [13] {GREEN REGENCY TEACUP AND SAUCER,
## PINK REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.01603788 0.8400000 0.01909271 25.519629 210
## [14] {PINK REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.01603788 0.8786611 0.01825263 29.806187 210
## [15] {GREEN REGENCY TEACUP AND SAUCER,
## ROSES REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER} 0.01603788 0.7266436 0.02207118 30.398311 210
summary(rules2)
## set of 15 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 12 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 2.0 2.0 2.2 2.0 3.0
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01581 Min. :0.6011 Min. :0.01825 Min. : 8.505
## 1st Qu.:0.01604 1st Qu.:0.6460 1st Qu.:0.02318 1st Qu.:21.173
## Median :0.01688 Median :0.6736 Median :0.02520 Median :23.852
## Mean :0.01799 Mean :0.7068 Mean :0.02579 Mean :21.966
## 3rd Qu.:0.01947 3rd Qu.:0.7561 3rd Qu.:0.02948 3rd Qu.:26.307
## Max. :0.02207 Max. :0.8787 Max. :0.03292 Max. :30.398
## count
## Min. :207.0
## 1st Qu.:210.0
## Median :221.0
## Mean :235.5
## 3rd Qu.:255.0
## Max. :289.0
##
## mining info:
## data ntransactions support confidence
## transobj_mba 13094 0.015 0.6
rules_scatter2 <- plot(rules2, measure=c("support", "lift"),
shading = "confidence",
interactive = FALSE)
The analysis we performed here reveals 19 associations between specific items that suggest strong cross-promotional opportunities for this retailer in 4 categories:
Activating these opportunities in the real world would require more information about the nature of the company’s customers and their purchasing habits. While the dataset is characterized as “online retail transactions”, the high AOV and Basket Size, combined with the types of association rules uncovered which bias items within the same category, indicates this site’s customers are likely in the B2B space, and are reselling the items through online or brick-and-mortar retail.
Garcia, Xavier Vivancos. “Market Basket Analysis.” Kaggle.com. Published May 2, 2020. Accessed from: https://www.kaggle.com/xvivancos/market-basket-analysis/code.
Goel, Niharika. “Market Basket Analysis Assocication Rules.” Medium.com. 23 July 2018.https://medium.com/@niharika.goel/market-basket-analysis-association-rules-e7c27b377bd8