The aim of this lab is to become familiar with practical association rule mining for market basket analysis.
By the end of this lab session, students should be able to:

- Transform long data of market basket transactions to wide data in a sparse matrix form.
- Clean market basket data.
- Extract association rules with the Apriori algorithm.
- Visualize association rules.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer (type it right in this document); others require modifying R code (modify the code right here). In either case, you can discuss your work with the lab instructor.
The data is taken from the following source:
http://archive.ics.uci.edu/ml/datasets/online+retail
This lab is inspired by the following online tutorial (there are some changes):
https://www.datacamp.com/community/tutorials/market-basket-analysis-r
Below we load the data into R, report the data dimensions, and print a sample.
set.seed(64)
library(tidyverse)
library(janitor)
library(gt)
library(broom)
library(readxl)
library(arules)
library(arulesViz)
library(RColorBrewer)
long_data <- read_excel("online_retail.xlsx", sheet = 1)
cat("Dimensions of the dataset are", dim(long_data), "\n")
## Dimensions of the dataset are 541909 8
cat("Below is a sample of the data:\n")
## Below is a sample of the data:
long_data %>% sample_n(10) %>% gt()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|
| 579524 | 84951B | SET OF 4 BLACK LOVEBIRD COASTERS | 24 | 2011-11-30 08:17:00 | 0.79 | 16295 | United Kingdom |
| 540848 | 90214K | LETTER "K" BLING KEY RING | 1 | 2011-01-12 09:26:00 | 0.85 | NA | United Kingdom |
| 568949 | 22534 | MAGIC DRAWING SLATE SPACEBOY | 1 | 2011-09-29 15:13:00 | 0.83 | NA | United Kingdom |
| 536783 | 22492 | MINI PAINT SET VINTAGE | 144 | 2010-12-02 15:19:00 | 0.55 | 15061 | United Kingdom |
| 578757 | 85014A | BLACK/BLUE POLKADOT UMBRELLA | 1 | 2011-11-25 11:41:00 | 5.95 | 12748 | United Kingdom |
| 580384 | 23355 | HOT WATER BOTTLE KEEP CALM | 2 | 2011-12-04 10:10:00 | 4.95 | 17243 | United Kingdom |
| 540646 | 85170C | SET/6 EAU DE NIL BIRD T-LIGHTS | 1 | 2011-01-10 14:32:00 | 2.51 | NA | United Kingdom |
| 543399 | 20752 | BLUE POLKADOT WASHING UP GLOVES | 1 | 2011-02-08 11:07:00 | 4.13 | NA | United Kingdom |
| 578995 | 22265 | EASTER DECORATION NATURAL CHICK | 2 | 2011-11-27 15:59:00 | 0.65 | 15861 | United Kingdom |
| 573113 | 22940 | FELTCRAFT CHRISTMAS FAIRY | 4 | 2011-10-27 15:36:00 | 4.25 | 18261 | United Kingdom |
The distribution of the number of invoices by country:
long_data %>% tabyl(Country)
Plot the distribution of the number of invoices by country.
# The problem is that if we simply plot the raw data, then the same invoice will be repeated several times. Thus, to do this plot, we need to remove repeated invoice numbers. Actually removing them would, of course, render market basket analysis impossible. Thus, instead of removing repeated invoice numbers from the data, we filter them out on the fly and plot the non-repeated invoices only.
long_data %>%
filter(!duplicated(InvoiceNo)) %>%
ggplot(aes(x = Country)) +
geom_bar() +
scale_y_log10() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
How many invoices are there from the country that has the largest number of invoices?
The largest number of invoices comes from the United Kingdom, and it is about 23K. The exact number is
long_data %>%
filter(!duplicated(InvoiceNo)) %>%
count(Country) %>%
arrange(desc(n)) %>%
head(1)
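As a side note, the same row can be obtained with dplyr's slice_max() instead of arrange() plus head() (a small sketch):

long_data %>%
  filter(!duplicated(InvoiceNo)) %>%  # keep one row per invoice
  count(Country) %>%                  # number of invoices per country
  slice_max(n, n = 1)                 # row with the largest count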
Below is the number of missing values in each column:
long_data %>%
summarise_all(~sum(is.na(.))) %>%
gt()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1454 | 0 | 0 | 0 | 135080 | 0 |
While we don’t care about missing CustomerID values, a missing description may be a problem, so we will simply delete all transactions with a missing description.
long_data <- long_data %>%
drop_na(Description)
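A quick check (a small sketch) that the deletion worked:

# Should run silently: drop_na() removed all rows with NA descriptions
stopifnot(!anyNA(long_data$Description))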
Below is the number of unique stock codes in the data:
long_data %>%
pull(StockCode) %>%
unique %>%
length
## [1] 3958
Below is the number of unique product descriptions in the data:
long_data %>%
pull(Description) %>%
unique %>%
length
## [1] 4211
The numbers are different, even though stock codes and product descriptions are supposed to be in one-to-one correspondence. To find out the reason for this discrepancy, we will look at stock codes that correspond to more than one description.
To do it, we will group the data by the stock code and the description and look at the possible combinations. Below is the number of possible combinations.
comb_stock_description <- long_data %>%
count(StockCode, Description)
nrow(comb_stock_description)
## [1] 4789
First, we find stock codes that appear with more than one description:
duplicated_stock_codes <- comb_stock_description %>%
count(StockCode) %>%
arrange(desc(n)) %>%
filter(n > 1)
cat("The number of duplicated stock codes is",
nrow(duplicated_stock_codes), "\n")
## The number of duplicated stock codes is 647
cat("Below are most common duplicated stock codes and the number of times they appear:\n")
## Below are most common duplicated stock codes and the number of times they appear:
head(duplicated_stock_codes, 10)
And here are some duplicated stock codes together with their product descriptions.
comb_stock_description %>%
filter(StockCode %in% duplicated_stock_codes$StockCode[1:5]) %>%
arrange(StockCode)
Now it becomes clear that the column StockCode probably represents the items that customers buy, i.e., what we need to do association analysis with. However, stock codes are not convenient to work with, and it is better to get item names. A real store would, of course, have a separate spreadsheet matching stock codes with item names, but it is a good exercise to extract these names from the data that we have at hand.
There can be several ways to approach this problem. The most thorough method is to manually go through all 647 stock codes that correspond to more than one description; this would certainly take some time, but it is doable.
We will do it differently. Specifically, we will assume that most of the time, the description assigned to a stock code contains the actual item name and not some comment. For every value of StockCode, we will take the Description corresponding to the largest number of transactions with that stock code. This will be our table of item names:
table_item_names <- comb_stock_description %>%
group_by(StockCode) %>%
summarise(ItemName = Description[which.max(n)])
table_item_names
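As a quick sanity check (a small sketch), the table should contain exactly one item name per stock code:

# One row per stock code means the mapping is well defined
stopifnot(nrow(table_item_names) == n_distinct(long_data$StockCode))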
Now we will introduce a new variable, ItemName, by mapping the first column of the table above to the second column.
long_data %>%
left_join(table_item_names)
## Joining with `by = join_by(StockCode)`
A little problem with this approach is that there are rarely sold items for which the method described above does not let us uniquely identify the most common description for a given stock code.
Identify stock codes that have more than one equally common description assigned to them and print them together with their descriptions and the number of transactions under each pair (stock code, description).
Hint: you will need to explore functions such as
group_by(), max(), filter(), and
add_count().
## WRITE YOUR CODE HERE
codes_for_careful_check <- comb_stock_description %>%
group_by(StockCode) %>%
filter(n == max(n)) %>%
add_count(StockCode) %>%
filter(nn > 1)
## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.
cat("We need to pay special attention to codes:\n")
## We need to pay special attention to codes:
codes_for_careful_check %>%
pull(StockCode) %>%
unique()
## [1] "23595" "35610B" "35817P" "35833G" "79323W" "84614A"
## [7] "84743C" "84809B" "90014C" "DCGS0069"
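Note that which.max() in the code above silently breaks such ties by taking the first of the equally common descriptions. If we wanted the tie-breaking to be explicit and deterministic, one possible (purely illustrative) rule is to keep the alphabetically first description:

# Hypothetical tie-break: among the equally common descriptions of a
# stock code, keep the alphabetically first one
table_item_names_alt <- comb_stock_description %>%
  group_by(StockCode) %>%
  filter(n == max(n)) %>%                  # most common description(s) only
  summarise(ItemName = min(Description))   # alphabetical tie-break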
We will now transform the data into a wide matrix form, at the same time getting rid of the columns that we are not using for our analysis (timestamp, country, etc.). Note that we will only keep unique observations of the pair (InvoiceNo, ItemName). If we worked in an actual store, we would probably be able to figure out why the pair (InvoiceNo, ItemName) is sometimes not unique.
order_table <- long_data %>%
left_join(table_item_names) %>%
distinct(InvoiceNo, ItemName, .keep_all = TRUE) %>%
pivot_wider(id_cols = "InvoiceNo",
names_from = "ItemName",
values_from = "Quantity") %>%
replace(is.na(.) , 0)
## Joining with `by = join_by(StockCode)`
cat("Transaction data dim =", dim(order_table))
## Transaction data dim = 24446 3815
cat("Sample of the transaction data:\n")
## Sample of the transaction data:
order_table[1:10, 1:10] %>% gt()
| InvoiceNo | WHITE HANGING HEART T-LIGHT HOLDER | WHITE METAL LANTERN | CREAM CUPID HEARTS COAT HANGER | KNITTED UNION FLAG HOT WATER BOTTLE | RED WOOLLY HOTTIE WHITE HEART. | SET 7 BABUSHKA NESTING BOXES | GLASS STAR FROSTED T-LIGHT HOLDER | HAND WARMER UNION JACK | HAND WARMER RED RETROSPOT |
|---|---|---|---|---|---|---|---|---|---|
| 536365 | 6 | 6 | 8 | 6 | 6 | 2 | 6 | 0 | 0 |
| 536366 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 6 |
| 536367 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536368 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536369 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536371 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536372 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 6 |
| 536373 | 6 | 6 | 8 | 6 | 6 | 2 | 6 | 0 | 0 |
| 536374 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note that currently our transaction table is a tibble or
data.frame:
class(order_table)
## [1] "tbl_df" "tbl" "data.frame"
The first thing that we will do is convert the table of transactions into a special matrix class from the library "arules". Note that this operation will change item counts to just zeroes and ones (the Apriori algorithm does not care how many units of an item a transaction contains, only whether the item is present).
X <- order_table %>%
select(-InvoiceNo) %>%
as.matrix() %>%
as("itemMatrix")
## Warning in asMethod(object): matrix contains values other than 0 and 1! Setting
## all entries != 0 to 1.
cat("\nBelow is class of X:\n")
##
## Below is class of X:
class(X)
## [1] "itemMatrix"
## attr(,"package")
## [1] "arules"
Here is the distribution of the most frequently bought items, produced by a special function from the library arules (this plot is, actually, the only thing that we need the itemMatrix conversion for; association rules can be extracted directly from an ordinary matrix):
itemFrequencyPlot(X, topN = 20, type="relative",
col = brewer.pal(8, 'Accent'),
main = "Support of most common items")
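Before generating the rules, it is worth recalling the standard quality measures (these are textbook definitions, stated here for reference). For an itemset \(S\), its support \(\mathrm{supp}(S)\) is the fraction of the \(N\) transactions that contain it, and for a rule \(A \Rightarrow B\),

\[
\mathrm{supp}(A \Rightarrow B) = \mathrm{supp}(A \cup B), \qquad
\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}, \qquad
\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}.
\]

Confidence estimates \(P(B \mid A)\), and lift compares this estimate to what we would expect if \(A\) and \(B\) were bought independently, so a lift above 1 indicates a positive association.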
Now we generate rules with minimum support 0.01, minimum confidence 0.6 and maximum length 5.
association.rules <- apriori(X, parameter = list(supp=0.01, conf=0.6, maxlen=5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 244
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3814 item(s), 24446 transaction(s)] done [0.04s].
## sorting and recoding items ... [645 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [231 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(association.rules)
## set of 231 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 72 140 19
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.771 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01002 Min. :0.6014 Min. :0.01113 Min. : 6.988
## 1st Qu.:0.01080 1st Qu.:0.6455 1st Qu.:0.01534 1st Qu.: 9.782
## Median :0.01203 Median :0.6965 Median :0.01726 Median :16.666
## Mean :0.01331 Mean :0.7081 Mean :0.01908 Mean :18.790
## 3rd Qu.:0.01374 3rd Qu.:0.7581 3rd Qu.:0.02080 3rd Qu.:21.279
## Max. :0.03408 Max. :0.9485 Max. :0.05036 Max. :62.380
## count
## Min. :245.0
## 1st Qu.:264.0
## Median :294.0
## Mean :325.4
## 3rd Qu.:336.0
## Max. :833.0
##
## mining info:
## data ntransactions support confidence
## X 24446 0.01 0.6
## call
## apriori(data = X, parameter = list(supp = 0.01, conf = 0.6, maxlen = 5))
Note the class of association.rules:
class(association.rules)
## [1] "rules"
## attr(,"package")
## [1] "arules"
There is a function to convert association rules to a data frame. Here is how it works (we print the rules with the highest lift):
ar <- DATAFRAME(association.rules)
ar %>%
arrange(-lift) %>%
head
However, this built-in function drops some useful information, so here is an improved version written with the help of our friendly ChatGPT 5.0:
tidy_rules <- function(rules_object) {
  tibble(
    LHS = labels(lhs(rules_object)),  # left-hand side of each rule, as a string
    RHS = labels(rhs(rules_object)),  # right-hand side of each rule, as a string
    length = size(rules_object),      # total number of items in the rule
    quality(rules_object)             # support, confidence, coverage, lift, count
  )
}
association.rules %>%
tidy_rules() %>%
arrange(desc(lift)) %>%
head()
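For comparison, the same top rules can also be printed without converting to a data frame, using the native arules functions sort() and inspect() (a minimal sketch):

association.rules %>%
  sort(by = "lift") %>%  # order the rules by decreasing lift
  head(6) %>%
  inspect()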
Find rules with minimum support 0.005, minimum confidence 0.7, and maximum length 10. How many of them are there? How many of them have length 5? How many rules of confidence 1 are there? What is the rule with the largest lift among all the rules with minimum support 0.005 and minimum confidence 0.7?
First, we generate the rules
arules_2 <- X %>%
# control = list(verbose = FALSE) silences apriori's printout;
# suppressMessages() alone would not, since the printout is not an R message
apriori(parameter = list(supp = 0.005, conf = 0.7, maxlen = 10),
        control = list(verbose = FALSE)) %>%
tidy_rules()
Now the number of rules of different lengths:
arules_2 %>% tabyl(length)
Now we find the number of rules of confidence 1:
arules_2 %>%
filter(confidence == 1) %>%
nrow
## [1] 15
Now we find the rule with the largest lift:
arules_2 %>%
arrange(-lift) %>%
head(1) %>%
gt()
| LHS | RHS | length | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|
| {HERB MARKER MINT,HERB MARKER ROSEMARY,HERB MARKER BASIL,HERB MARKER PARSLEY,HERB MARKER THYME} | {HERB MARKER CHIVES} | 6 | 0.006463225 | 0.9132948 | 0.007076822 | 106.3162 | 158 |
We can visualize association rules by a scatterplot. The default plot
looks kind of like ggplot (because under the hood, it uses
similar styles and colours):
plot(association.rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
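Following the hint in the message, a one-line variation puts the points at their exact coordinates:

plot(association.rules, jitter = 0)  # no jitter added to the points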
But it is not hard to convert it to a data frame and plot with
ggplot() (e.g., if customization is needed):
association.rules %>%
tidy_rules() %>%
ggplot(aes(x = support, y = confidence,
color = lift, size = length)) +
geom_point(alpha = 0.8) +
scale_color_binned(
type = "gradient",
low = "#fee0d2",
high = "#a50f15",
n.breaks = 4
) +
scale_size_continuous(
breaks = c(2, 3, 4), # legend shows only these
range = c(1, 5), # visual size range
limits = c(1, 4) # optional: clamp to valid values
) +
theme_minimal(base_size = 13) +
labs(
title = paste(length(association.rules), "rules"),
x = "Support",
y = "Confidence",
color = "Lift (binned)",
size = "Rule length"
)
Some customization is available with the default plot, but it may be a little awkward. For example, we can extract the rules whose confidence exceeds a certain threshold:
subRules <- association.rules %>%
subset(confidence >= 0.8)
plot(subRules)
We can colour the scatterplot according to the rule length:
plot(subRules, method = "two-key plot")
We can plot a small set of rules as a directed graph, where items and rules are vertices, there is an edge from item \(A\) to rule \(R\) whenever \(A\) appears in the LHS of rule \(R\), and an edge from rule \(R\) to item \(B\) whenever \(B\) appears in the RHS of rule \(R\).
top5subRules <- head(subRules, n = 5, by = "confidence")
plot(top5subRules, method = "graph")
Produce an interactive version of the plot above by adding the option engine = "htmlwidget" to the plotting command.
# Write your code here
plot(top5subRules, method = "graph",
engine = "htmlwidget")