The aim of this lab is to become familiar with practical association rule mining for market basket analysis.
By the end of this lab session, students should be able to:
Transform long data of market basket transactions to wide data in a sparse matrix form.
Clean market basket data.
Extract association rules with Apriori Algorithm.
Visualize association rules.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or require modifying R code (modify the code right here). In either case, you can discuss your work with the lab instructor.
The data is taken from the following source:
http://archive.ics.uci.edu/ml/datasets/online+retail
This lab is inspired by the following online tutorial (there are some changes):
https://www.datacamp.com/community/tutorials/market-basket-analysis-r
Below we load the data into R, report its dimensions, and print a sample.
set.seed(64)
library(tidyverse)
library(readxl)
library(arules)
library(arulesViz)
library(RColorBrewer)
long_data <- read_excel("online_retail.xlsx", sheet = 1)
cat("Dimensions of the dataset are", dim(long_data), "\n")
## Dimensions of the dataset are 541909 8
cat("Below is a sample of the data:\n")
## Below is a sample of the data:
long_data %>% sample_n(10)
Plot the distribution of the number of invoices by country.
# If we simply plot the raw data, the same invoice number will be counted
# several times. To make this plot we therefore need to ignore repeated
# invoice numbers. Actually removing them from the data would, of course,
# render market basket analysis impossible, so instead we filter the
# repeated invoice numbers out on the fly and plot non-repeated invoices only.
long_data %>%
filter(!duplicated(InvoiceNo)) %>%
ggplot(aes(x = Country)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
How many invoices are there from the country that has the largest number of invoices?
The largest number of invoices comes from the United Kingdom, with roughly 23.5K invoices. The exact number is:
long_data %>%
filter(!duplicated(InvoiceNo), Country == "United Kingdom") %>%
nrow
## [1] 23494
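Equivalently (a suggested alternative, not part of the original lab code), one can tabulate unique invoices per country and sort the counts:
# Count unique invoices per country, largest first.
long_data %>%
filter(!duplicated(InvoiceNo)) %>%
count(Country, sort = TRUE) %>%
head(5)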
Below is the number of missing values in each category:
sapply(long_data, function(x) sum(is.na(x)))
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 1454 0 0 0
## CustomerID Country
## 135080 0
While we don’t care about missing CustomerID values, a missing Description may be a problem, so we will simply delete all transactions with a missing description.
long_data <- long_data %>%
drop_na(Description)
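As a quick sanity check (a suggested addition), we can confirm that no missing descriptions remain:
# Expected to be zero after drop_na().
sum(is.na(long_data$Description))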
Below is the number of stock codes in the data:
long_data %>%
pull(StockCode) %>%
unique %>%
length
## [1] 3958
Below is the number of products in the data:
long_data %>%
pull(Description) %>%
unique %>%
length
## [1] 4211
The numbers are different, even though they are supposed to count the same thing. To find out the reason for this discrepancy, we will look at the stock codes that correspond to more than one description.
To do it, we will group the data by the stock code and the description and look at the possible combinations. Below is the number of possible combinations.
comb_stock_description <- long_data %>%
group_by(StockCode, Description) %>%
summarise(n = n())
## `summarise()` regrouping output by 'StockCode' (override with `.groups` argument)
nrow(comb_stock_description)
## [1] 4789
First, we find the stock codes appearing more than once:
duplicated_stock_codes <- table(comb_stock_description$StockCode) %>%
sort(decreasing = TRUE)
duplicated_stock_codes <- duplicated_stock_codes[duplicated_stock_codes >= 2]
cat("The number of duplicated stock codes is",
length(duplicated_stock_codes), "\n")
## The number of duplicated stock codes is 647
cat("Below are most common duplicated stock codes and the number of times they appear:\n")
## Below are most common duplicated stock codes and the number of times they appear:
head(duplicated_stock_codes, 10)
##
## 20713 23084 21830 85175 21181 23131 23343 72807A 85172 21621
## 8 7 6 6 5 5 5 5 5 4
And here are some duplicated stock codes together with their product descriptions.
comb_stock_description %>%
filter(StockCode %in% names(duplicated_stock_codes[1:5])) %>%
arrange(StockCode)
Now it becomes clear that the column StockCode probably represents the items that customers buy, i.e., exactly what we need for association analysis. However, stock codes are not convenient to work with, and it is better to get item names. A real store would, of course, have a separate spreadsheet matching stock codes to item names, but it is a good exercise to extract these names from the data we have at hand.
There can be several ways to approach this problem. The most thorough method is to manually go through all 647 stock codes that correspond to more than one description; it would certainly take some time, but is doable. A sketch of that route follows.
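For instance, one could export all ambiguous (stock code, description) pairs to a CSV file and fix the names by hand; the file name below is just an example:
# Sketch of the manual route: write the ambiguous pairs out for
# review in a spreadsheet editor.
comb_stock_description %>%
filter(StockCode %in% names(duplicated_stock_codes)) %>%
arrange(StockCode, desc(n)) %>%
write_csv("ambiguous_stock_codes.csv")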
We will do it differently. Specifically, we will assume that, most of the time, the description assigned to a stock code contains the actual item name rather than some comment. For every value of StockCode, we will take the Description corresponding to the largest number of transactions with that stock code. This will be our table of item names:
table_item_names <- comb_stock_description %>%
group_by(StockCode) %>%
summarise(ItemName = Description[which.max(n)])
## `summarise()` ungrouping output (override with `.groups` argument)
table_item_names
Now we will introduce a new variable, ItemName, by mapping the first column of the table above to the second column.
level_key <- table_item_names %>%
pull(ItemName) %>%
setNames(table_item_names$StockCode)
long_data <- long_data %>%
mutate(ItemName = recode_factor(StockCode, !!!level_key))
head(long_data)
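As a sanity check (a suggested addition), every stock code should now map to exactly one item name:
# The number of stock codes mapped to more than one item name
# should be zero after the recoding.
long_data %>%
distinct(StockCode, ItemName) %>%
count(StockCode) %>%
filter(n > 1) %>%
nrow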
A small problem with this approach is that there are rarely sold items for which the method described above does not let us uniquely identify the most common description for a given stock code.
Identify stock codes that have more than one equally common description assigned to them and print them together with their descriptions and the number of transactions under each pair (stock code, description).
codes_for_careful_check <- comb_stock_description %>%
filter(StockCode %in% names(duplicated_stock_codes)) %>%
group_by(StockCode) %>%
# zero variance of the counts means all descriptions of the code
# appear equally often, so which.max() cannot break the tie reliably
summarise(V = var(n)) %>%
ungroup %>%
filter(V == 0) %>%
pull(StockCode)
## `summarise()` ungrouping output (override with `.groups` argument)
cat("We need to pay special attention to codes: ",
codes_for_careful_check, "\n")
## We need to pay special attention to codes: 23595 35610B 35817P 35833G 79323W 84614A 84743C 84809B 90014C DCGS0069
comb_stock_description %>%
filter(StockCode %in% codes_for_careful_check)
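For these few codes, the ties could be resolved by hand. Below is a sketch with a hypothetical replacement name; in practice, you would pick the correct description after inspecting the output above and then re-run the recoding step:
# Hypothetical manual fix: "HAND-PICKED NAME" is a placeholder,
# not a real description from the data.
table_item_names <- table_item_names %>%
mutate(ItemName = if_else(StockCode == "23595",
"HAND-PICKED NAME", ItemName))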
We will now transform the data into matrix form, at the same time getting rid of the columns that we are not using for our analysis (time stamp, country, etc.). Note that we will only keep unique observations of the pair (InvoiceNo, ItemName). If we worked in an actual store, we would probably be able to figure out why the pair (InvoiceNo, ItemName) is sometimes not unique; a count of such repeats follows the sample below.
order_table <- long_data %>%
distinct(InvoiceNo, ItemName, .keep_all = TRUE) %>%
pivot_wider(id_cols = "InvoiceNo",
names_from = "ItemName",
values_from = "Quantity") %>%
replace(is.na(.), 0)
cat("Transaction data dim =", dim(order_table))
## Transaction data dim = 24446 3813
cat("Sample of the transaction data:\n")
## Sample of the transaction data:
order_table[1:10, 1:10]
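As an aside to the note above, we can count how many (InvoiceNo, ItemName) pairs occur more than once in the long data (a suggested check):
# These repeated pairs are the ones collapsed by distinct() above.
long_data %>%
count(InvoiceNo, ItemName) %>%
filter(n > 1) %>%
nrow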
Note that currently our transaction table has class “data frame”:
class(order_table)
## [1] "tbl_df" "tbl" "data.frame"
The first thing that we will do is convert the matrix of transactions into a special matrix class from the library “arules”. Note that this operation will change item counts to just zeroes and ones (the Apriori algorithm doesn’t care about the number of items in each transaction).
T <- order_table %>%
select(-InvoiceNo) %>%
as.matrix() %>%
as("itemMatrix")
## Warning in asMethod(object): matrix contains values other than 0 and 1! Setting
## all entries != 0 to 1.
cat("\nBelow is class of T:\n")
##
## Below is class of T:
class(T)
## [1] "itemMatrix"
## attr(,"package")
## [1] "arules"
Here is the distribution of the most frequently bought items, produced by a special function in the library arules (this is, actually, the only thing we need the conversion to itemMatrix for; association rules can be extracted directly from a usual matrix, as sketched after the plot):
itemFrequencyPlot(T,topN=20,type="relative",
col = brewer.pal(8, 'Accent'),
main = "Support of most common items")
Now we generate rules with minimum support 0.01, minimum confidence 0.6 and maximum length 5.
association.rules <- apriori(T, parameter = list(supp=0.01, conf=0.6, maxlen=5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 244
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3812 item(s), 24446 transaction(s)] done [0.35s].
## sorting and recoding items ... [645 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.06s].
## writing ... [231 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
summary(association.rules)
## set of 231 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 72 140 19
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.771 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01002 Min. :0.6014 Min. :0.01113 Min. : 6.988
## 1st Qu.:0.01080 1st Qu.:0.6455 1st Qu.:0.01534 1st Qu.: 9.782
## Median :0.01203 Median :0.6965 Median :0.01726 Median :16.666
## Mean :0.01331 Mean :0.7081 Mean :0.01908 Mean :18.790
## 3rd Qu.:0.01374 3rd Qu.:0.7581 3rd Qu.:0.02080 3rd Qu.:21.279
## Max. :0.03408 Max. :0.9485 Max. :0.05036 Max. :62.380
## count
## Min. :245.0
## 1st Qu.:264.0
## Median :294.0
## Mean :325.4
## 3rd Qu.:336.0
## Max. :833.0
##
## mining info:
## data ntransactions support confidence
## T 24446 0.01 0.6
Note the class of association.rules:
class(association.rules)
## [1] "rules"
## attr(,"package")
## [1] "arules"
There is a function to convert association rules to a data frame. Here is how it works (we print the rules with the highest lift):
ar <- DATAFRAME(association.rules)
ar %>%
arrange(-lift) %>%
head
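Alternatively (a suggested equivalent), arules can sort and print rules directly, without converting them to a data frame:
# Print the six rules with the highest lift using arules itself.
inspect(head(sort(association.rules, by = "lift"), 6))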
Find rules with minimum support 0.005, minimum confidence 0.7, and maximum length 10. How many of them are there? How many of them have length 5? How many rules of confidence 1 are there? What is the rule with the largest lift among all the rules with minimum support 0.005 and minimum confidence 0.7?
First, we generate the rules:
arules_2 <- apriori(T, parameter = list(supp=0.005, conf=0.7, maxlen=10))
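As a quick check (a suggested addition), the total number of rules can be obtained directly, since length() works on rules objects; it should agree with the summary below:
# Total number of mined rules.
length(arules_2)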
To find the number of rules of length 5, we look at the summary: the rule length distribution shows that there are 412 rules of length 5.
summary(arules_2)
## set of 3567 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 115 1502 1510 412 28
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.646 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005032 Min. :0.7000 Min. :0.005072 Min. : 7.484
## 1st Qu.:0.005318 1st Qu.:0.7430 1st Qu.:0.006709 1st Qu.: 14.733
## Median :0.005809 Median :0.7886 Median :0.007322 Median : 20.589
## Mean :0.006282 Mean :0.8045 Mean :0.007883 Mean : 26.300
## 3rd Qu.:0.006668 3rd Qu.:0.8571 3rd Qu.:0.008345 3rd Qu.: 28.241
## Max. :0.032071 Max. :1.0000 Max. :0.045815 Max. :106.316
## count
## Min. :123.0
## 1st Qu.:130.0
## Median :142.0
## Mean :153.6
## 3rd Qu.:163.0
## Max. :784.0
##
## mining info:
## data ntransactions support confidence
## T 24446 0.005 0.7
Now we find the number of rules of confidence 1:
ar <- DATAFRAME(arules_2)
ar %>%
filter(confidence == 1) %>%
nrow
## [1] 15
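An equivalent check (a sketch using arules’ subset() instead of the data frame):
# Number of rules whose confidence is exactly 1.
length(subset(arules_2, confidence == 1))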
Now we find the rule with the largest lift:
ar %>%
arrange(-lift) %>%
head(1)
We can visualize association rules by a scatterplot as follows:
plot(association.rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
We can extract the rules whose confidence exceeds a given threshold:
subRules <- association.rules %>%
subset(confidence >= 0.8)
plot(subRules)
We can colour the scatterplot according to the rule length:
plot(subRules, method = "two-key plot")
We can plot a small set of rules as a directed graph, where items and rules are vertices: there is an edge from item \(A\) to rule \(R\) whenever \(A\) appears in the LHS of rule \(R\), and an edge from rule \(R\) to item \(B\) whenever \(B\) appears in the RHS of rule \(R\).
top5subRules <- head(subRules, n = 5, by = "confidence")
plot(top5subRules, method = "graph")
Produce an interactive version of the plot above by adding the option engine = "htmlwidget" to the plotting command.
plot(top5subRules, method = "graph",
engine = "htmlwidget")
Here are the answers: