The aim of this lab is to become familiar with practical association rule mining for market basket analysis.
By the end of this lab session, students should be able to:

- Transform long data of market basket transactions to wide data in a sparse matrix form.
- Clean market basket data.
- Extract association rules with the Apriori algorithm.
- Visualize association rules.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer (type it right in this document); others require modifying R code (modify the code right here). In either case, you can discuss your work with the lab instructor.
The data is taken from the following source:
http://archive.ics.uci.edu/ml/datasets/online+retail
This lab is inspired by the following online tutorial (there are some changes):
https://www.datacamp.com/community/tutorials/market-basket-analysis-r
Below we load the data into R, report the data dimensions, and print a sample.
set.seed(64)
library(tidyverse)
library(janitor)
library(gt)
library(broom)
library(readxl)
library(arules)
library(arulesViz)
library(RColorBrewer)
long_data <- read_excel("online_retail.xlsx", sheet = 1)
cat("Dimensions of the dataset are", dim(long_data), "\n")
## Dimensions of the dataset are 541909 8
cat("Below is a sample of the data:\n")
## Below is a sample of the data:
long_data %>% sample_n(10) %>% gt()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|
| 579524 | 84951B | SET OF 4 BLACK LOVEBIRD COASTERS | 24 | 2011-11-30 08:17:00 | 0.79 | 16295 | United Kingdom |
| 540848 | 90214K | LETTER "K" BLING KEY RING | 1 | 2011-01-12 09:26:00 | 0.85 | NA | United Kingdom |
| 568949 | 22534 | MAGIC DRAWING SLATE SPACEBOY | 1 | 2011-09-29 15:13:00 | 0.83 | NA | United Kingdom |
| 536783 | 22492 | MINI PAINT SET VINTAGE | 144 | 2010-12-02 15:19:00 | 0.55 | 15061 | United Kingdom |
| 578757 | 85014A | BLACK/BLUE POLKADOT UMBRELLA | 1 | 2011-11-25 11:41:00 | 5.95 | 12748 | United Kingdom |
| 580384 | 23355 | HOT WATER BOTTLE KEEP CALM | 2 | 2011-12-04 10:10:00 | 4.95 | 17243 | United Kingdom |
| 540646 | 85170C | SET/6 EAU DE NIL BIRD T-LIGHTS | 1 | 2011-01-10 14:32:00 | 2.51 | NA | United Kingdom |
| 543399 | 20752 | BLUE POLKADOT WASHING UP GLOVES | 1 | 2011-02-08 11:07:00 | 4.13 | NA | United Kingdom |
| 578995 | 22265 | EASTER DECORATION NATURAL CHICK | 2 | 2011-11-27 15:59:00 | 0.65 | 15861 | United Kingdom |
| 573113 | 22940 | FELTCRAFT CHRISTMAS FAIRY | 4 | 2011-10-27 15:36:00 | 4.25 | 18261 | United Kingdom |
The distribution of the number of invoices by country:
long_data %>% tabyl(Country)
Plot the distribution of the number of invoices by country.
# The problem is that if we simply plot the raw data, then the same invoice will be repeated several times. Thus, to do this plot, we need to remove repeated invoice numbers. Actually removing them would, of course, render market basket analysis impossible. Thus, instead of removing repeated invoice numbers from the data, we filter them out on the fly and plot the non-repeated invoices only.
long_data %>%
filter(!duplicated(InvoiceNo)) %>%
ggplot(aes(x = Country)) +
geom_bar() +
scale_y_log10() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
How many invoices are there from the country that has the largest number of invoices?
The largest number of invoices comes from the United Kingdom, and it is about 23K. The exact number is
long_data %>%
filter(!duplicated(InvoiceNo)) %>%
count(Country) %>%
arrange(desc(n)) %>%
head(1)
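As a side note, the same row can be obtained with dplyr's slice_max() instead of arrange() plus head() (a small sketch):

long_data %>%
  filter(!duplicated(InvoiceNo)) %>%  # keep one row per invoice
  count(Country) %>%                  # number of invoices per country
  slice_max(n, n = 1)                 # row with the largest count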
Below is the number of missing values in each column:
long_data %>%
summarise_all(~sum(is.na(.))) %>%
gt()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1454 | 0 | 0 | 0 | 135080 | 0 |
While we don’t care about missing CustomerID values, a missing description may be a problem, so we will simply delete all transactions with a missing description.
long_data <- long_data %>%
drop_na(Description)
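A quick check (a small sketch) that the deletion worked:

# Should run silently: drop_na() removed all rows with NA descriptions
stopifnot(!anyNA(long_data$Description))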
Below is the number of unique stock codes in the data:
long_data %>%
pull(StockCode) %>%
unique %>%
length
## [1] 3958
Below is the number of unique product descriptions in the data:
long_data %>%
pull(Description) %>%
unique %>%
length
## [1] 4211
The numbers are different, even though stock codes and product descriptions are supposed to be in one-to-one correspondence. To find out the reason for this discrepancy, we will look at stock codes that correspond to more than one description.
To do it, we will group the data by the stock code and the description and look at the possible combinations. Below is the number of possible combinations.
comb_stock_description <- long_data %>%
count(StockCode, Description)
nrow(comb_stock_description)
## [1] 4789
First, we find stock codes that appear with more than one description:
duplicated_stock_codes <- comb_stock_description %>%
count(StockCode) %>%
arrange(desc(n)) %>%
filter(n > 1)
cat("The number of duplicated stock codes is",
nrow(duplicated_stock_codes), "\n")
## The number of duplicated stock codes is 647
cat("Below are most common duplicated stock codes and the number of times they appear:\n")
## Below are most common duplicated stock codes and the number of times they appear:
head(duplicated_stock_codes, 10)
And here are some duplicated stock codes together with their product descriptions.
comb_stock_description %>%
filter(StockCode %in% duplicated_stock_codes$StockCode[1:5]) %>%
arrange(StockCode)
Now it becomes clear that the column StockCode probably represents the items that customers buy, i.e., what we need to do association analysis with. However, stock codes are not convenient to work with, and it is better to get item names. A real store would, of course, have a separate spreadsheet matching stock codes with item names, but it is a good exercise to extract these names from the data that we have at hand.
There can be several ways to approach this problem. The most thorough method is to manually go through all 647 stock codes that correspond to more than one description; this would certainly take some time, but it is doable.
We will do it differently. Specifically, we will assume that most of the time, the description assigned to a stock code contains the actual item name and not some comment. For every value of StockCode, we will take the Description corresponding to the largest number of transactions with that stock code. This will be our table of item names:
table_item_names <- comb_stock_description %>%
group_by(StockCode) %>%
summarise(ItemName = Description[which.max(n)])
table_item_names
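As a quick sanity check (a small sketch), the table should contain exactly one item name per stock code:

# One row per stock code means the mapping is well defined
stopifnot(nrow(table_item_names) == n_distinct(long_data$StockCode))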
Now we will introduce a new variable, ItemName, by mapping the first column of the table above to the second column.
long_data %>%
left_join(table_item_names)
## Joining with `by = join_by(StockCode)`
A little problem with this approach is that there are rarely sold items for which the method described above does not let us uniquely identify the most common description for a given stock code.
Identify stock codes that have more than one equally common description assigned to them and print them together with their descriptions and the number of transactions under each pair (stock code, description).
Hint: you will need to explore functions such as
group_by(), max(), filter(), and
add_count().
## WRITE YOUR CODE HERE
codes_for_careful_check <- comb_stock_description %>%
group_by(StockCode) %>%
filter(n == max(n)) %>%
add_count(StockCode) %>%
filter(nn > 1)
## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.
cat("We need to pay special attention to codes:\n")
## We need to pay special attention to codes:
codes_for_careful_check %>%
pull(StockCode) %>%
unique()
## [1] "23595" "35610B" "35817P" "35833G" "79323W" "84614A"
## [7] "84743C" "84809B" "90014C" "DCGS0069"
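Note that which.max() in the code above silently breaks such ties by taking the first of the equally common descriptions. If we wanted the tie-breaking to be explicit and deterministic, one possible (purely illustrative) rule is to keep the alphabetically first description:

# Hypothetical tie-break: among the equally common descriptions of a
# stock code, keep the alphabetically first one
table_item_names_alt <- comb_stock_description %>%
  group_by(StockCode) %>%
  filter(n == max(n)) %>%                  # most common description(s) only
  summarise(ItemName = min(Description))   # alphabetical tie-break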
We will now transform the data into a wide matrix form, at the same time getting rid of the columns that we are not using for our analysis (timestamp, country, etc.). Note that we will only keep unique observations of the pair (InvoiceNo, ItemName). If we worked in an actual store, we would probably be able to figure out why the pair (InvoiceNo, ItemName) is sometimes not unique.
order_table <- long_data %>%
left_join(table_item_names) %>%
distinct(InvoiceNo, ItemName, .keep_all = TRUE) %>%
pivot_wider(id_cols = "InvoiceNo",
names_from = "ItemName",
values_from = "Quantity") %>%
replace(is.na(.) , 0)
## Joining with `by = join_by(StockCode)`
cat("Transaction data dim =", dim(order_table))
## Transaction data dim = 24446 3815
cat("Sample of the transaction data:\n")
## Sample of the transaction data:
order_table[1:10, 1:10] %>% gt()
| InvoiceNo | WHITE HANGING HEART T-LIGHT HOLDER | WHITE METAL LANTERN | CREAM CUPID HEARTS COAT HANGER | KNITTED UNION FLAG HOT WATER BOTTLE | RED WOOLLY HOTTIE WHITE HEART. | SET 7 BABUSHKA NESTING BOXES | GLASS STAR FROSTED T-LIGHT HOLDER | HAND WARMER UNION JACK | HAND WARMER RED RETROSPOT |
|---|---|---|---|---|---|---|---|---|---|
| 536365 | 6 | 6 | 8 | 6 | 6 | 2 | 6 | 0 | 0 |
| 536366 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 6 |
| 536367 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536368 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536369 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536371 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536372 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 6 |
| 536373 | 6 | 6 | 8 | 6 | 6 | 2 | 6 | 0 | 0 |
| 536374 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note that currently our transaction table is a tibble or
data.frame:
class(order_table)
## [1] "tbl_df" "tbl" "data.frame"
The first thing that we will do is convert the table of transactions into a special matrix class from the library "arules". Note that this operation will change item counts to just zeroes and ones (the Apriori algorithm does not care how many units of an item a transaction contains, only whether the item is present).
X <- order_table %>%
select(-InvoiceNo) %>%
as.matrix() %>%
as("itemMatrix")
## Warning in asMethod(object): matrix contains values other than 0 and 1! Setting
## all entries != 0 to 1.
cat("\nBelow is class of X:\n")
##
## Below is class of X:
class(X)
## [1] "itemMatrix"
## attr(,"package")
## [1] "arules"
Here is the distribution of the most frequently bought items, produced by a special function from the library arules (this plot is, actually, the only thing that we need the itemMatrix conversion for; association rules can be extracted directly from an ordinary matrix):
itemFrequencyPlot(X, topN = 20, type="relative",
col = brewer.pal(8, 'Accent'),
main = "Support of most common items")
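Before generating the rules, it is worth recalling the standard quality measures (these are textbook definitions, stated here for reference). For an itemset \(S\), its support \(\mathrm{supp}(S)\) is the fraction of the \(N\) transactions that contain it, and for a rule \(A \Rightarrow B\),

\[
\mathrm{supp}(A \Rightarrow B) = \mathrm{supp}(A \cup B), \qquad
\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}, \qquad
\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}.
\]

Confidence estimates \(P(B \mid A)\), and lift compares this estimate to what we would expect if \(A\) and \(B\) were bought independently, so a lift above 1 indicates a positive association.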
Now we generate rules with minimum support 0.01, minimum confidence 0.6 and maximum length 5.
association.rules <- apriori(X, parameter = list(supp=0.01, conf=0.6, maxlen=5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 244
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3814 item(s), 24446 transaction(s)] done [0.04s].
## sorting and recoding items ... [645 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [231 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(association.rules)
## set of 231 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 72 140 19
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.771 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01002 Min. :0.6014 Min. :0.01113 Min. : 6.988
## 1st Qu.:0.01080 1st Qu.:0.6455 1st Qu.:0.01534 1st Qu.: 9.782
## Median :0.01203 Median :0.6965 Median :0.01726 Median :16.666
## Mean :0.01331 Mean :0.7081 Mean :0.01908 Mean :18.790
## 3rd Qu.:0.01374 3rd Qu.:0.7581 3rd Qu.:0.02080 3rd Qu.:21.279
## Max. :0.03408 Max. :0.9485 Max. :0.05036 Max. :62.380
## count
## Min. :245.0
## 1st Qu.:264.0
## Median :294.0
## Mean :325.4
## 3rd Qu.:336.0
## Max. :833.0
##
## mining info:
## data ntransactions support confidence
## X 24446 0.01 0.6
## call
## apriori(data = X, parameter = list(supp = 0.01, conf = 0.6, maxlen = 5))
Note the class of association.rules:
class(association.rules)
## [1] "rules"
## attr(,"package")
## [1] "arules"
There is a function to convert association rules to a data frame. Here is how it works (we print the rules with the highest lift):
ar <- DATAFRAME(association.rules)
ar %>%
arrange(-lift) %>%
head
However, this built-in function drops some useful information, so here is an improved version written with the help of our friendly ChatGPT 5.0:
tidy_rules <- function(rules_object) {
  tibble(
    LHS = labels(lhs(rules_object)),  # left-hand side of each rule, as a string
    RHS = labels(rhs(rules_object)),  # right-hand side of each rule, as a string
    length = size(rules_object),      # total number of items in the rule
    quality(rules_object)             # support, confidence, coverage, lift, count
  )
}
association.rules %>%
tidy_rules() %>%
arrange(desc(lift)) %>%
head()
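For comparison, the same top rules can also be printed without converting to a data frame, using the native arules functions sort() and inspect() (a minimal sketch):

association.rules %>%
  sort(by = "lift") %>%  # order the rules by decreasing lift
  head(6) %>%
  inspect()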
Find rules with minimum support 0.005, minimum confidence 0.7, and maximum length 10. How many of them are there? How many of them have length 5? How many rules of confidence 1 are there? What is the rule with the largest lift among all the rules with minimum support 0.005 and minimum confidence 0.7?
First, we generate the rules
arules_2 <- X %>%
# control = list(verbose = FALSE) silences apriori's printout;
# suppressMessages() alone would not, since the printout is not an R message
apriori(parameter = list(supp = 0.005, conf = 0.7, maxlen = 10),
        control = list(verbose = FALSE)) %>%
tidy_rules()
Now the number of rules of different lengths:
arules_2 %>% tabyl(length)
Now we find the number of rules of confidence 1:
arules_2 %>%
filter(confidence == 1) %>%
nrow
## [1] 15
Now we find the rule with the largest lift:
arules_2 %>%
arrange(-lift) %>%
head(1) %>%
gt()
| LHS | RHS | length | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|
| {HERB MARKER MINT,HERB MARKER ROSEMARY,HERB MARKER BASIL,HERB MARKER PARSLEY,HERB MARKER THYME} | {HERB MARKER CHIVES} | 6 | 0.006463225 | 0.9132948 | 0.007076822 | 106.3162 | 158 |
We can visualize association rules by a scatterplot. The default plot
looks kind of like ggplot (because under the hood, it uses
similar styles and colours):
plot(association.rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
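Following the hint in the message, a one-line variation puts the points at their exact coordinates:

plot(association.rules, jitter = 0)  # no jitter added to the points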
But it is not hard to convert it to a data frame and plot with
ggplot() (e.g., if customization is needed):
association.rules %>%
tidy_rules() %>%
ggplot(aes(x = support, y = confidence,
color = lift, size = length)) +
geom_point(alpha = 0.8) +
scale_color_binned(
type = "gradient",
low = "#fee0d2",
high = "#a50f15",
n.breaks = 4
) +
scale_size_continuous(
breaks = c(2, 3, 4), # legend shows only these
range = c(1, 5), # visual size range
limits = c(1, 4) # optional: clamp to valid values
) +
theme_minimal(base_size = 13) +
labs(
title = paste(length(association.rules), "rules"),
x = "Support",
y = "Confidence",
color = "Lift (binned)",
size = "Rule length"
)
Some customization is available with the default plot, but it may be a little awkward. For example, we can extract the rules whose confidence exceeds a certain threshold:
subRules <- association.rules %>%
subset(confidence >= 0.8)
plot(subRules)
We can colour the scatterplot according to the rule length:
plot(subRules, method = "two-key plot")
We can plot a small set of rules as a directed graph, where items and rules are vertices, there is an edge from item \(A\) to rule \(R\) whenever \(A\) appears in the LHS of rule \(R\), and an edge from rule \(R\) to item \(B\) whenever \(B\) appears in the RHS of rule \(R\).
top5subRules <- head(subRules, n = 5, by = "confidence")
plot(top5subRules, method = "graph")
Produce an interactive version of the plot above by adding the option engine = "htmlwidget" to the plotting command.
# Write your code here
plot(top5subRules, method = "graph",
engine = "htmlwidget")