DATA624 HW5: MBA

library(tidyverse)
library(kableExtra)
library(arules)
library(arulesViz)

Introduction

Imagine 1000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket – and therefore ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.

Here is the dataset = GroceryDataSet.csv (comma separated file).

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Data Prep

Visual inspection of the data identifies an unequal number of columns per row. Pulling the data into a dataframe and assigning each row a number to represent each basket, we pivot the data from wide to long for analysis.

url <- "https://raw.githubusercontent.com/klgriffen96/summer23_data624/main/hw_2/GroceryDataSet.csv"
data <- read.csv(url, header=FALSE)
data[data == ""] <- NA # replace empty cells with NA
data$tid <- as.integer(row.names(data)) # create transaction ID column using row indices

# pivot data
baskets <- data |> 
  pivot_longer(cols = starts_with("V"), 
               values_to = "item", 
               values_drop_na = TRUE) |>
  select(tid, item)

kable(head(baskets)) |> kable_styling()
tid item
1 citrus fruit
1 semi-finished bread
1 margarine
1 ready soups
2 tropical fruit
2 yogurt

We can visualize the frequency of items in our baskets,

baskets |>
  group_by(item) |>
  summarize(count = n()) |>
  arrange(desc(count)) |> 
  head(10) |>
  ggplot(aes(x=reorder(item,count), y=count)) +
  geom_col(fill="blue") +
  labs(x=NULL, y='Count') + ggtitle('Top 10 Items') +
  coord_flip() + 
  theme_classic()

And the distribution of items per transaction,

baskets |>
  group_by(tid) |>
  summarize(total_items = n()) |>
  ggplot(aes(x=total_items)) +
  geom_histogram(fill="blue", bins=30) +
  theme_classic()

We can see most of the baskets contain 1 or 2 items, the maximum basket size appears to be just about 30, we can check this:

baskets |>
  group_by(tid) |>
  summarize(total_items = n()) |>
  filter(total_items == max(total_items)) |>
  kable() |>
  kable_styling()
tid total_items
1217 32

Now we prepare the data for Market Basket Analysis, first we create a transaction object with the data,

# Transform tid into a factor
baskets$tid <- factor(baskets$tid) 

# split into groups 
baskets_list <- split(baskets$item, 
                      baskets$tid)

# transform to transactional dataset
baskets_trx <- as(baskets_list,"transactions")

# inspect transactions 
inspect(head(baskets_trx))
##     items                       transactionID
## [1] {citrus fruit,                           
##      margarine,                              
##      ready soups,                            
##      semi-finished bread}                   1
## [2] {coffee,                                 
##      tropical fruit,                         
##      yogurt}                                2
## [3] {whole milk}                            3
## [4] {cream cheese ,                          
##      meat spreads,                           
##      pip fruit,                              
##      yogurt}                                4
## [5] {condensed milk,                         
##      long life bakery product,               
##      other vegetables,                       
##      whole milk}                            5
## [6] {abrasive cleaner,                       
##      butter,                                 
##      rice,                                   
##      whole milk,                             
##      yogurt}                                6

We can also call the summary function on the transactions object,

summary(baskets_trx)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

All of these operations can be simplified by calling the read.transactions() function to read in the data as a transaction object.

basket_tr <- read.transactions(url, sep=",")
itemFrequencyPlot(basket_tr, 
                  topN=10, 
                  type="absolute", 
                  xlab="Count", 
                  ylab="",
                  col="blue", 
                  main="Top 10 items",
                  cex.names=0.8,
                  horiz=TRUE)

Market Basket Analysis

Association rule mining consists of two subtasks which provide insights into the relationships between items in transactional data:

  • Frequent itemset generation
  • Rule generation

Support

First we’ll take a look at the itemsets in the data to identify the itemsets that occur most frequently. An itemset is a combination of items that appear together in a set of transactions. The support measures the frequency or occurrence of an itemset in the data by dividing the number of transactions that contain a specific itemset by the total number of transactions. High support indicates the itemset appears frequently.

# Frequent itemsets for all items 
support_all <- apriori(basket_tr,
                       parameter = list(target="frequent itemsets",
                                        supp = 0.01,
                                        minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen            target  ext
##      10 frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [245 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# inspect(head(sort(support_all, by="support),10))

# 10 most frequent items
support_all |>
  as("data.frame") |>
  arrange(desc(support)) |>
  head(10) |>
  kable() |>
  kable_styling()
items support count
{other vegetables,whole milk} 0.0748348 736
{rolls/buns,whole milk} 0.0566345 557
{whole milk,yogurt} 0.0560244 551
{root vegetables,whole milk} 0.0489070 481
{other vegetables,root vegetables} 0.0473818 466
{other vegetables,yogurt} 0.0434164 427
{other vegetables,rolls/buns} 0.0426029 419
{tropical fruit,whole milk} 0.0422979 416
{soda,whole milk} 0.0400610 394
{rolls/buns,soda} 0.0383325 377

Interpreting these results, the support column indicates the proportion of transactions that contain each itemset: 7.5% of all transactions contain “other vegetables” and “whole milk”, 5.7% of all transactions contain “other vegetables”rolls/buns” and “whole milk”, and so on.

Rules

Extracting rules allows us to observe the support, confidence, and lift measures of frequent itemsets.

rules <- apriori(baskets_trx,
                 parameter= list(supp = 0.01, # minimum support
                                 conf = 0.4,  # minimum confidence
                                 minlen = 2)) # disallow empty sets
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [62 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

The arulesViz package contains a number of functions that enhance graphical display of association rules. The inspectDT function outputs an interactive display which can be sorted by column:

inspectDT(rules)

Plotting the rules with the engine = "plotly" argument creates an interactive visualization that allows the user to hover over data points in order to view association rule metrics,

plot(rules, jitter=2, engine = "plotly")

Confidence

Each itemset consists of antecedents and consequents. Antecedents are the items or itemsets on the left-hand side of the association rule that represents a set of items or conditions that act as the premise or condition for the rule. Consequents are the items or itemsets on the right-hand side of the association rule that represents the set of items or outcomes that are predicted or observed based on the presence of the antecent. Confidence measures the likelihood or probability of finding the consequent item(s) in a rule given the antecedent(s). It is the proportion of transactions containing the antecedent that also contain the consequent. High confidence suggests a strong association between the antecedent and consequent.

rules |>
  as("data.frame") |>
  arrange(desc(confidence)) |>
  head(10) |>
  kable() |>
  kable_styling()
rules support confidence coverage lift count
{citrus fruit,root vegetables} => {other vegetables} 0.0103711 0.5862069 0.0176919 3.029608 102
{root vegetables,tropical fruit} => {other vegetables} 0.0123030 0.5845411 0.0210473 3.020999 121
{curd,yogurt} => {whole milk} 0.0100661 0.5823529 0.0172852 2.279125 99
{butter,other vegetables} => {whole milk} 0.0114896 0.5736041 0.0200305 2.244885 113
{root vegetables,tropical fruit} => {whole milk} 0.0119980 0.5700483 0.0210473 2.230969 118
{root vegetables,yogurt} => {whole milk} 0.0145399 0.5629921 0.0258261 2.203354 143
{domestic eggs,other vegetables} => {whole milk} 0.0123030 0.5525114 0.0222674 2.162336 121
{whipped/sour cream,yogurt} => {whole milk} 0.0108795 0.5245098 0.0207422 2.052747 107
{rolls/buns,root vegetables} => {whole milk} 0.0127097 0.5230126 0.0243010 2.046888 125
{other vegetables,pip fruit} => {whole milk} 0.0135231 0.5175097 0.0261312 2.025351 133

We can interpret the confidence that 58% of customers that bought “citrus fruit” and “root vegetables” also bought “other vegetables”, and so on.

Lift

Lift measures the strength of association between the antecedent and consequent items compared to their individual occurrences. It is the ratio of the observed support to the expected support if the antecedent and consequent were independent. Lift greater than 1 indicates a positive association, while lift less than 1 indicates a negative association.

rules |>
  as("data.frame") |>
  arrange(desc(lift)) |>
  head(10) |>
  kable() |>
  kable_styling()
rules support confidence coverage lift count
{citrus fruit,root vegetables} => {other vegetables} 0.0103711 0.5862069 0.0176919 3.029608 102
{root vegetables,tropical fruit} => {other vegetables} 0.0123030 0.5845411 0.0210473 3.020999 121
{rolls/buns,root vegetables} => {other vegetables} 0.0122013 0.5020921 0.0243010 2.594890 120
{root vegetables,yogurt} => {other vegetables} 0.0129131 0.5000000 0.0258261 2.584078 127
{whipped/sour cream,yogurt} => {other vegetables} 0.0101678 0.4901961 0.0207422 2.533410 100
{root vegetables,whole milk} => {other vegetables} 0.0231825 0.4740125 0.0489070 2.449770 228
{onions} => {other vegetables} 0.0142349 0.4590164 0.0310117 2.372268 140
{pork,whole milk} => {other vegetables} 0.0101678 0.4587156 0.0221657 2.370714 100
{whipped/sour cream,whole milk} => {other vegetables} 0.0146416 0.4542587 0.0322318 2.347679 144
{pip fruit,whole milk} => {other vegetables} 0.0135231 0.4493243 0.0300966 2.322178 133

Again, looking at the first line, we can interpret the lift of 3.03 to represent the occurrence of the consequent (other vegetables) is 3.03 times likelier when the antecedent items (citrus fruit & root vegetables) are present, and so on.