Info about the lab

Learning aim

The aim of this lab is to become familiar with practical association rule mining for market basket analysis.

Objectives

By the end of this lab session, students should be able to:

  1. Transform long data of market basket transactions to wide data in a sparse matrix form.

  2. Clean market basket data.

  3. Extract association rules with Apriori Algorithm.

  4. Visualize association rules.

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or require modifying the R code (modify it right here). In either case, you can discuss your work with the lab instructor.

Data

The data is taken from the following source:

http://archive.ics.uci.edu/ml/datasets/online+retail

This lab is inspired by the following online tutorial (with some changes):

https://www.datacamp.com/community/tutorials/market-basket-analysis-r

Libraries and data

Below we load the libraries and the data into R, report the data dimensions, and print a sample.

set.seed(64)

library(tidyverse)
library(readxl)
library(arules)
library(arulesViz)
library(RColorBrewer)

long_data <- read_excel("online_retail.xlsx", sheet = 1)

cat("Dimensions of the dataset are", dim(long_data), "\n")
## Dimensions of the dataset are 541909 8
cat("Below is a sample of the data:\n")
## Below is a sample of the data:
long_data %>% sample_n(10)

Question 1

Plot the distribution of the number of invoices by country.

# The problem is that if we simply plot the raw data, the same invoice will be counted several times. To make this plot we therefore need to drop repeated invoice numbers. Removing them from the dataset itself would, of course, render market basket analysis impossible, so instead of modifying the data we filter out repeated invoice numbers inside the pipe and plot only the first occurrence of each invoice.

long_data %>%
  filter(!duplicated(InvoiceNo)) %>%
  ggplot(aes(x = Country)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

How many invoices are there from the country that has the largest number of invoices?

The largest number of invoices comes from the United Kingdom, and from the plot it is about 23K. The exact number is:

long_data %>%
  filter(!duplicated(InvoiceNo), Country == "United Kingdom") %>%
  nrow
## [1] 23494
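
If we did not want to read the top country off the plot, a small sketch like the following (not part of the lab output) would find it directly:

# Count non-repeated invoices per country and keep the country with the most.
long_data %>%
  filter(!duplicated(InvoiceNo)) %>%
  count(Country, sort = TRUE) %>%
  head(1)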

Data cleaning

Below is the number of missing values in each column:

sapply(long_data, function(x) sum(is.na(x)))
##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0        1454           0           0           0 
##  CustomerID     Country 
##      135080           0

While we do not care about a missing CustomerID, a missing description may be a problem, so we will simply delete all transactions with a missing description.

long_data <- long_data %>%
  drop_na(Description)

Below is the number of distinct stock codes in the data:

long_data %>%
  pull(StockCode) %>%
  unique %>%
  length
## [1] 3958

Below is the number of distinct product descriptions in the data:

long_data %>%
  pull(Description) %>%
  unique %>%
  length
## [1] 4211

The numbers are different, although they are supposed to describe the same set of products. To find the reason for this discrepancy, we will look at stock codes that correspond to more than one description.

To do it, we will group the data by the stock code and the description and look at the possible combinations. Below is the number of possible combinations.

comb_stock_description <- long_data %>%
  group_by(StockCode, Description) %>%
  summarise(n = n()) 
## `summarise()` regrouping output by 'StockCode' (override with `.groups` argument)
nrow(comb_stock_description)
## [1] 4789

First, we find the stock codes that appear in this table more than once, i.e., those with more than one description:

duplicated_stock_codes <- (table(comb_stock_description$StockCode)) %>% 
  sort(decreasing = TRUE) 

duplicated_stock_codes <- duplicated_stock_codes[duplicated_stock_codes >= 2]

cat("The number of duplicated stock codes is", 
    length(duplicated_stock_codes), "\n")
## The number of duplicated stock codes is 647
cat("Below are most common duplicated stock codes and the number of times they appear:\n")
## Below are most common duplicated stock codes and the number of times they appear:
head(duplicated_stock_codes, 10)
## 
##  20713  23084  21830  85175  21181  23131  23343 72807A  85172  21621 
##      8      7      6      6      5      5      5      5      5      4
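
The same counts can also be obtained with dplyr instead of base R's table(); below is a minimal sketch (its output is not shown here):

# Number of distinct descriptions per stock code, keeping codes with two or more
comb_stock_description %>%
  group_by(StockCode) %>%
  summarise(n_descriptions = n(), .groups = "drop") %>%
  filter(n_descriptions >= 2) %>%
  arrange(desc(n_descriptions))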

And here are some duplicated stock codes together with their product descriptions.

comb_stock_description %>%
  filter(StockCode %in% names(duplicated_stock_codes[1:5])) %>%
  arrange(StockCode) 

Now it becomes clear that the column StockCode probably represents the items that customers buy, i.e., what we need for the association analysis. However, stock codes are not convenient to work with, and it is better to use item names. A real store would, of course, have a separate spreadsheet matching stock codes to item names, but it is a good exercise to extract these names from the data we have at hand.

There can be several ways to approach this problem. The most thorough method is to manually go through all 647 stock codes that correspond to more than one description; it would certainly take some time, but it is doable.

We will do it differently. Specifically, we will assume that most of the time the description assigned to a stock code contains the actual item name rather than some comment. For every value of StockCode, we will take the Description corresponding to the largest number of transactions with that stock code. This will be our table of item names:

table_item_names <- comb_stock_description %>%
  group_by(StockCode) %>%
  summarise(ItemName =  Description[which.max(n)]) 
## `summarise()` ungrouping output (override with `.groups` argument)
table_item_names

Now we will introduce a new variable, ItemName, by mapping the first column of the table above to the second column.

level_key <- table_item_names %>%
  pull(ItemName) %>%
  setNames(table_item_names$StockCode)

long_data <- long_data %>%
  mutate(ItemName = recode_factor(StockCode, !!!level_key))

head(long_data)

Question 2

A small problem with this approach is that for some rarely sold items the method described above does not let us uniquely identify the most common description for a given stock code.

Identify stock codes that have more than one equally common description assigned to them and print them together with their descriptions and the number of transactions under each pair (stock code, description).

codes_for_careful_check <- comb_stock_description %>%
  filter(StockCode %in% names(duplicated_stock_codes)) %>%
  group_by(StockCode) %>%
  summarise(V = var(n)) %>%
  ungroup %>%
  filter(V == 0) %>%
  pull(StockCode)
## `summarise()` ungrouping output (override with `.groups` argument)
cat("We need to pay special attention to codes: ",
    codes_for_careful_check, "\n")
## We need to pay special attention to codes:  23595 35610B 35817P 35833G 79323W 84614A 84743C 84809B 90014C DCGS0069
comb_stock_description %>%
  filter(StockCode %in% codes_for_careful_check) 
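
Note that which.max() used earlier silently picks the first description when there is a tie, and the variance-based check above finds codes whose descriptions are all equally common. A sketch of a more direct check, flagging any tie for the most common description, could look as follows (its output is not shown here):

# For each duplicated stock code, count how many descriptions attain the
# maximum transaction count; more than one means the most common description
# is not uniquely defined.
comb_stock_description %>%
  filter(StockCode %in% names(duplicated_stock_codes)) %>%
  group_by(StockCode) %>%
  summarise(n_at_max = sum(n == max(n)), .groups = "drop") %>%
  filter(n_at_max > 1)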

Data transformation

We will now transform the data into wide (matrix) form, at the same time getting rid of the columns that we are not using for our analysis (time stamp, country, etc.). Note that we will keep only unique observations of the pair (InvoiceNo, ItemName). If we worked in an actual store, we would probably be able to figure out why this pair is sometimes not unique.
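
Before collapsing the duplicates, we can take a quick look at how often the pair (InvoiceNo, ItemName) actually repeats; this is a small sketch whose output is not shown in the lab:

# Count how many rows each (invoice, item) pair occupies and show the
# pairs that occur more than once.
long_data %>%
  count(InvoiceNo, ItemName, name = "n_rows") %>%
  filter(n_rows > 1) %>%
  arrange(desc(n_rows)) %>%
  head()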

order_table <- long_data %>%
  distinct(InvoiceNo, ItemName, .keep_all = TRUE) %>%
  pivot_wider(id_cols = "InvoiceNo",
    names_from = "ItemName",
    values_from = "Quantity") %>%
  replace(is.na(.) , 0)

cat("Transaction data dim =", dim(order_table))
## Transaction data dim = 24446 3813
cat("Sample of the transaction data:\n")
## Sample of the transaction data:
order_table[1:10, 1:10]

Association rule mining

Data preparation

Note that currently our transaction table has class “data frame”:

class(order_table)
## [1] "tbl_df"     "tbl"        "data.frame"

The first thing that we will do is convert the table of transactions into a special sparse matrix class from the library arules. Note that this operation changes item counts to just zeros and ones (the Apriori algorithm does not care how many units of each item a transaction contains).

T <- order_table %>%
  select(-InvoiceNo) %>%
  as.matrix() %>%
  as("itemMatrix")
## Warning in asMethod(object): matrix contains values other than 0 and 1! Setting
## all entries != 0 to 1.
cat("\nBelow is class of T:\n")
## 
## Below is class of T:
class(T)
## [1] "itemMatrix"
## attr(,"package")
## [1] "arules"

Here is the distribution of the most frequently bought items, produced by a special function from the library arules (this plot is, in fact, the only thing we need the itemMatrix conversion for; association rules can be extracted directly from an ordinary matrix):

itemFrequencyPlot(T,topN=20,type="relative",
                  col = brewer.pal(8, 'Accent'),
                  main = "Support of most common items")

Association rules

Now we generate rules with minimum support 0.01, minimum confidence 0.6 and maximum length 5.

association.rules <- apriori(T, parameter = list(supp=0.01, conf=0.6, maxlen=5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 244 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3812 item(s), 24446 transaction(s)] done [0.35s].
## sorting and recoding items ... [645 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.06s].
## writing ... [231 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
summary(association.rules)
## set of 231 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
##  72 140  19 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.771   3.000   4.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01002   Min.   :0.6014   Min.   :0.01113   Min.   : 6.988  
##  1st Qu.:0.01080   1st Qu.:0.6455   1st Qu.:0.01534   1st Qu.: 9.782  
##  Median :0.01203   Median :0.6965   Median :0.01726   Median :16.666  
##  Mean   :0.01331   Mean   :0.7081   Mean   :0.01908   Mean   :18.790  
##  3rd Qu.:0.01374   3rd Qu.:0.7581   3rd Qu.:0.02080   3rd Qu.:21.279  
##  Max.   :0.03408   Max.   :0.9485   Max.   :0.05036   Max.   :62.380  
##      count      
##  Min.   :245.0  
##  1st Qu.:264.0  
##  Median :294.0  
##  Mean   :325.4  
##  3rd Qu.:336.0  
##  Max.   :833.0  
## 
## mining info:
##  data ntransactions support confidence
##     T         24446    0.01        0.6

Note the class of association.rules:

class(association.rules)
## [1] "rules"
## attr(,"package")
## [1] "arules"

There is a function to convert association rules to a data frame. Here is how it works (we print the rules with the highest lift):

ar <- DATAFRAME(association.rules)
ar %>%
  arrange(-lift) %>%
  head
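
Alternatively, the rules can be displayed without converting them to a data frame, using arules' inspect() together with head() sorted by a quality measure; a short sketch:

# Show the five rules with the highest lift directly from the rules object
inspect(head(association.rules, n = 5, by = "lift"))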

Question 3

Find rules with minimum support 0.005, minimum confidence 0.7, and maximum length 10. How many of them are there? How many of them have length 5? How many rules of confidence 1 are there? What is the rule with the largest lift among all the rules with minimum support 0.005 and minimum confidence 0.7?

First, we generate the rules

arules_2 <- apriori(T, parameter = list(supp=0.005, conf=0.7, maxlen=10))

The summary below shows that there are 3567 rules in total. From the rule length distribution, 412 of them have length 5.

summary(arules_2)
## set of 3567 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##  115 1502 1510  412   28 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.646   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.005032   Min.   :0.7000   Min.   :0.005072   Min.   :  7.484  
##  1st Qu.:0.005318   1st Qu.:0.7430   1st Qu.:0.006709   1st Qu.: 14.733  
##  Median :0.005809   Median :0.7886   Median :0.007322   Median : 20.589  
##  Mean   :0.006282   Mean   :0.8045   Mean   :0.007883   Mean   : 26.300  
##  3rd Qu.:0.006668   3rd Qu.:0.8571   3rd Qu.:0.008345   3rd Qu.: 28.241  
##  Max.   :0.032071   Max.   :1.0000   Max.   :0.045815   Max.   :106.316  
##      count      
##  Min.   :123.0  
##  1st Qu.:130.0  
##  Median :142.0  
##  Mean   :153.6  
##  3rd Qu.:163.0  
##  Max.   :784.0  
## 
## mining info:
##  data ntransactions support confidence
##     T         24446   0.005        0.7

Now we find the number of rules of confidence 1

ar <- DATAFRAME(arules_2)
ar %>%
  filter(confidence == 1) %>%
  nrow
## [1] 15

Now we find the rule with the largest lift

ar %>%
  arrange(-lift) %>%
  head(1)

Association rule visualization

We can visualize the association rules with a scatterplot as follows:

plot(association.rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

We can extract the rules whose confidence exceeds a certain threshold:

subRules <- association.rules %>%
  subset(confidence >= 0.8)

plot(subRules)

We can colour the scatterplot according to the rule length:

plot(subRules, method = "two-key plot")

We can plot a small set of rules as a directed graph, where items and rules are vertices; there is an edge from item \(A\) to rule \(R\) whenever \(A\) appears in the LHS of rule \(R\), and an edge from rule \(R\) to item \(B\) whenever \(B\) appears in the RHS of rule \(R\).

top5subRules <- head(subRules, n = 5, by = "confidence")
plot(top5subRules, method = "graph")

Question 4

Produce an interactive version of the plot above by adding the option engine = "htmlwidget" to the plotting command.

plot(top5subRules, method = "graph", 
     engine = "htmlwidget")

Survey

There is a link to a simple survey after lab 9:

Answers

Here are the answers: